PaperHub
Rating: 5.0 / 10 · Rejected (4 reviewers; lowest 3, highest 6, std 1.2)
Individual ratings: 5, 6, 6, 3
Confidence: 3.0
Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.8
ICLR 2025

Reformulating Strict Monotonic Probabilities with a Generative Cost Model

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-02-05
TL;DR

We propose a generative method that solves the monotonic modeling task.

Abstract

Keywords
monotonic model · variational inference · generative model

Reviews and Discussion

Review (Rating: 5)

The paper proposes a latent variable model capturing both a monotonic part and a part that need not be monotonic. The authors give some background and then propose a loss function for variational training with neural nets.

Strengths

It is an interesting problem and their construction is quite novel and creative.

Weaknesses

The objective function and implementation leave a lot of details missing and unexplained. The paper does not inspire confidence in the method.

Questions

Overview with some questions embedded:

Lines 054-059 could be clearer with a graphical model diagram. Edit: there is one later, nice.

Line 063 may not be quite right; p_theta(x,r) cannot be ignored in the evidence unless you drop the dependence on theta. Maybe rework the setup / explanation a bit, this shouldn’t be a big issue.

Line 101 equivalent to what?

Line 150 consider the machinery within https://arxiv.org/abs/2301.11695 and also the separate line of work on normalizing flows that all involve e.g. invertibility, monotone transformations, etc.

Line 210 (0,1) should be {0,1}

Line 189 should y and r have the same dimension?

Line 289 elementwise

Line 304 looks like it would suffer from high variance since pi(z) is fixed but p(z|x) depends on x.

Line 323 the combination of losses looks a little suspicious.

Line 323 Note that IWAE (Burda) is equal to ELBO (Jordan et al) for IWAE number of samples set to one; why is adding ELBO to IWAE sensible? Shouldn’t pi be something else and then no ELBO? Please expand, this is unconvincing.

Line 324 it seems like this doesn’t really match the beta vae setup; it would if you put a beta in front of the kl in line 318. What is this doing?

Line 378 how about some ablations of alpha and beta terms? Edit: there are some in appendix C.

Table 1 does not seem to match appendix C for any value of the parameters, what is missing? Please add details.

Line >= 378 how did you set all of the parameters?

Line 698 affect

Comment

Thank you for the comprehensive review and valuable suggestions. We have revised our paper as you requested; here are our replies (updated on 2024-11-29).

# Question 1: $p(x,r)$ is ignored in the evidence.

We ignore $p(x,r)$ because the variables $x$ and $r$ are always given, sampled from their prior $p(x,r)$. So sampling $(x,r,y)$ from $p(x,y,r)$ only requires sampling $y$ from $p(y|x,r)$ for the given $(x,r)$.

# Question 2: Should $y$ and $r$ have the same dimension?

No, it only requires $y|(x,r_1) \prec y|(x,r_2)$ for any $r_1 \prec r_2$. The definition of monotonic conditional probability is given in the background section (Section 2 - Monotonic Conditional Probability).

# Question 3: The high variance issue for importance sampling.

We modified the model for the case where $z$ is a categorical variable. In the original version, we assume that $z$ is a $K$-categorical variable with $z \in \{1,\cdots,K\}$ and take the sampling distribution $\pi(z)\equiv \frac{1}{K}$, which leads to the importance-sampling estimator of the evidence $p(y|x,r)\approx \frac{1}{N}\sum_{k=1}^{N}\frac{p(z_k|x)}{\pi(z_k)}\,p(y|z_k,r)$, where $z_k \sim \pi$. This is equivalent to the law of total probability when we take $K=N$ and $z_k=k$. As a result, we realized that we do not need the importance sampling technique at all. In the new revision, we handle the categorical latent variable exactly via the law of total probability. For complex latent distributions, we follow the reparameterization trick as in the VAE and use the recognition model $q_\phi(z|x,r,y)$ to sample $z$, which helps reduce the high variance of the evidence estimator.

# Question 4: The combination of losses.

Thanks for the suggestion. We rethought this and realized that the combination of $LL$ and $ELB$ is not really necessary. We also remove the KL divergence term $D_{KL}(p(z|x)\|p(z))$ from $\mathcal L_{GCM}$, and the loss functions of GCM and GCM-VI now become:

$$\mathcal L_{GCM}(\theta; x, r, y)= -\log \left[\frac{1}{K}\sum_{k=1}^{K}p_{\theta_2}\left(y|{x},{r},{z}_k\right)\right]\approx -LL,$$

where $z_k \sim p_{\theta_1}(z|x)$, and

$$\mathcal L_{GCM-VI}(\theta, \phi; x, r, y)= -\log \left[\frac{1}{K}\sum_{k=1}^K\frac{p_{\theta_2}(y| z_k, r)\,p_{\theta_1}(z_k| x)}{q_{\phi}(z_k| x, r, y)}\right]=-ELB\geq -LL,$$

where $z_k \sim q_\phi(z|x,r,y)$.

Therefore, in the new formulation, we no longer need the combination $\mathcal L=-LL-\alpha\, ELB + \beta\, KL(p(z|x)\|p(z))$, and the hyperparameters $\alpha$ and $\beta$ are dropped. We therefore remove the ablation study on $\alpha$ and $\beta$ and add an ablation study on the sample number $K$ and the latent dimension $D$ in Appendix E.

The experimental results under this new setting still show that GCM-VI performs best on three of the four public datasets, confirming that the combination of $LL$ and $ELB$ is not necessary.
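To make the revised objective concrete, below is a minimal PyTorch-style sketch of the GCM-VI loss above, assuming a Gaussian latent $z$ and a Bernoulli likelihood for $y$; the module names (prior_net, decoder, recognition_net) are placeholders rather than our released implementation.

```python
# Minimal sketch of the GCM-VI objective above; module names and the Bernoulli
# likelihood are illustrative assumptions, not the released code.
import torch

def gcm_vi_loss(prior_net, decoder, recognition_net, x, r, y, K=32):
    # q_phi(z|x,r,y): recognition model proposes K latent samples per example
    q_mu, q_logvar = recognition_net(x, r, y)                 # [B, D] each
    eps = torch.randn(K, *q_mu.shape)                         # [K, B, D]
    z = q_mu + (0.5 * q_logvar).exp() * eps                   # reparameterized samples

    # log q_phi(z_k|x,r,y) and log p_{theta_1}(z_k|x), both Gaussian here
    p_mu, p_logvar = prior_net(x)
    log_q = torch.distributions.Normal(q_mu, (0.5 * q_logvar).exp()).log_prob(z).sum(-1)
    log_p_z = torch.distributions.Normal(p_mu, (0.5 * p_logvar).exp()).log_prob(z).sum(-1)

    # log p_{theta_2}(y|z_k, r), here a Bernoulli head over (z, r)
    logits = decoder(z, r.expand(K, *r.shape))                # [K, B]
    log_p_y = torch.distributions.Bernoulli(logits=logits).log_prob(y.expand(K, *y.shape))

    # -log (1/K) sum_k p(y|z_k,r) p(z_k|x) / q(z_k|x,r,y), computed in log space
    log_w = log_p_y + log_p_z - log_q                         # [K, B]
    return -(torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(K)))).mean()
```

The log-sum-exp form is the same importance-weighted objective, computed in log space for numerical stability.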

# Question 5: About IWAE and ELBO.

In the latest revision, we follow the form of the original IWAE for the GCM-VI loss, as shown in our reply to Question 4; the experimental results have also been updated.

# Question 6: The loss does not match $\beta$-VAE.

To further simplify the model, we remove the additional KL divergence term $D_{KL}(p_\theta(z|x) \| p(z))$. As a result, the $\beta$ parameter is removed from the latest revision.

# Question 7: Table 4 (Table 1 in the original paper) does not match the ablation study.

This is because we used $\alpha=0.5$ in the experiment section but did not include that value in the ablation study (where we take $\alpha \in \{ 0, 0.2, 0.4, \cdots, 2.0 \}$), and we also used different random seeds in the experiment part and the ablation part. In the latest revision, the result of the experiment in Table 3 (Adult ACC $=0.8858\pm 0.0014$) is therefore not identical to the ablation result in Table 5 (Adult ACC $=0.8855\pm 0.0013$ with $D=4$ and $K=32$).

# Question 8: Parameter details.

We added the hyperparameter details in Section 5 (line 480): we take $D=4$ and $K=32$ for all experiments on the public datasets. We also uploaded the code to https://github.com/iclr-2025-4464/GCM, which contains full implementations of GCM, GCM-VI, and the baseline models, as well as the parameter settings.

Review (Rating: 6)

The paper proposes the Generative Cost Model (GCM) to enforce strict monotonicity by modeling a latent cost variable, using variational inference and importance sampling. This approach avoids architectural constraints and outperforms traditional methods on synthetic and real datasets.

Strengths

The paper seems to present a novel approach to formulating the problem of strict monotonic probability; it provides convincing theoretical analysis and empirical results.

Weaknesses

The paper lacks empirical validation across diverse, high-dimensional real-world datasets, limiting the demonstrated generalizability of the Generative Cost Model (GCM). Moreover, the code was not provided in the supplementary materials to assess the reproducibility of the results.

Questions

  1. Can the authors provide additional justification or empirical validation for the assumption of conditional independence between the latent variable z and revenue r given x, as this assumption may affect model flexibility in real-world applications?

  2. Could the authors elaborate on the computational efficiency of the Generative Cost Model (GCM) compared to traditional monotonic models, especially when applied to high-dimensional datasets? How does the computational time compare to the benchmarks?

Comment

Thank you for the reviews and valuable suggestions. We have revised our paper as you requested; here are our replies (updated on 2024-11-29).

# Weakness 1: Lack of validation.

We added two more real-world datasets, Diabetes and Blog Feedback: Diabetes is a large dataset with 253,680 examples, and Blog Feedback is a regression task with 8 monotonic features and 272 non-monotonic features. The experiments on both datasets show that our GCM method performs consistently well. You can find the details in Table 3 in the experiment section, with more detailed results in the tables in Appendix E.

# Weakness 2: Lack of code.

We uploaded the anonymous code to https://github.com/iclr-2025-4464/GCM.

# Question 1: The assumption of $z\perp r \mid x$ needs additional justification.

We reconsidered this assumption and found that it is not necessary. In Appendix D, we build another generative model that generates both the revenue variable $r$ and the cost variable $c$ from the latent variable $z$, which lets our model avoid the assumption $z\perp r \mid x$. We call this co-generation method GCRM (generative cost-revenue model). We also performed an experiment based on GCRM and found that it performs close to the original GCM on the four public datasets, which shows promising potential.

# Question 2: Elaborate on the computational efficiency.

We measured the time consumption of the different monotonic modeling methods; the detailed data are shown in Appendix F. The inference task requires estimating $p(y|x,r)$, where $x$ is fixed and $r$ varies over a set of candidates $\{r_1, \cdots, r_n\}$. GCM is slow when the candidate number $n$ is small; however, as $n$ grows, GCM overtakes the baseline methods. The efficiency of GCM for a large number of candidate $r$ values comes from the fact that GCM does not need to feed every pair $(x, r_i)$ through a deep neural network as traditional methods do, which multiplies the computation time by $n$. Instead, GCM only needs to pass $x$ through a deep neural network once to generate $c$; after obtaining $c|x$, it is cheap to compute $p(c \prec r_i)$ for $r_i \in \{r_1,\cdots,r_n\}$, since this step does not involve a DNN and can be solved directly with equation (10). As a result, when the number of candidate $r$ values reaches the extreme value of 1024, GCM saves up to 72% of the time cost compared with the fastest baseline model.
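As a rough illustration of this inference pattern, the sketch below assumes a scalar Gaussian cost $c|x \sim \mathcal N(\mu(x), \sigma(x)^2)$, so that $p(c < r_i)$ has a closed-form CDF; cost_net is a placeholder name rather than our released implementation (the actual model evaluates equation (10)).

```python
# Minimal sketch of the batched-inference idea: one DNN pass on x, then a cheap
# closed-form CDF evaluation for every candidate r_i. Assumes a scalar Gaussian
# cost; cost_net is a placeholder, not the released implementation.
import torch

def score_candidates(cost_net, x, r_candidates):
    mu, log_sigma = cost_net(x)                      # single forward pass, [B, 1] each
    normal = torch.distributions.Normal(mu, log_sigma.exp())
    # P(c < r_i | x) for all candidates at once: [B, n], monotone in r_i
    return normal.cdf(r_candidates.view(1, -1))
```

A traditional monotonic network would instead need one forward pass per pair $(x, r_i)$, i.e. $n$ passes.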

Review (Rating: 6)

This paper proposes a novel approach to model strict monotonic probabilities. Without loss of generality, it considers the binary classification formulation $y|x, r\sim {\rm Bernoulli}(y;G(x,r))$, where $G(x,r)$ is a function that is monotonic in $r$. The target is to learn the function $G$. Instead of directly learning $G$, the paper introduces a cost variable $c$ and reformulates $G(x,r)$ as the integral $\int_{c<r} p(c|x)\,dc$. The paper then introduces a generative cost model to approximate the conditional distribution $p(c|x)$.
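For reference, the monotonicity delivered by this reformulation is immediate; a minimal derivation under the setup summarized above: for any $r_1 < r_2$,

$$G(x,r_2)-G(x,r_1)=\int_{c<r_2} p(c|x)\,dc-\int_{c<r_1} p(c|x)\,dc=\int_{r_1\le c<r_2} p(c|x)\,dc \;\ge\; 0,$$

with strict inequality, and hence strict monotonicity, whenever $p(c|x)$ puts positive mass on $[r_1, r_2)$.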

Strengths

The paper tackles an interesting problem of modeling monotonic probabilities. The problem itself is important, and the reformulation proposed by the paper is unique. Extensive experiments are conducted to support this new method.

Weaknesses

The paper lacks theoretical results on the finite sample efficiency of the proposed algorithm. For the experiment design, an important setup that requires strict monotonicity is quantile regression, where the conditional quantile should be monotonic in the quantile argument. It would be interesting to see experiments designed for it and a comparison with the existing benchmarks.

Questions

No additional questions.

Comment

Thank you for the reviews and valuable suggestions. We have revised our paper as you requested; here are our replies (updated on 2024-11-29).

# Weakness 1: The paper lacks theoretical results on the finite sample efficiency of the proposed algorithm.

We encountered some difficulties in proving finite-sample efficiency. To prove finite-sample efficiency, we would need to verify that the variance of an estimator $T$ of the parameter $\theta$ attains the Cramér–Rao lower bound, i.e. that ${\rm var}(T) = \mathcal I_\theta^{-1}$ holds. However, we do not have an analytical estimator $T(x,y,r)$ of the parameters $\theta$ and $\phi$, since they are optimized with gradient descent. Furthermore, we are unable to compute the Fisher information $\mathcal I_\theta$, since integrating over a complex deep neural network is computationally intractable.
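For reference, the standard definitions involved (stated here only to make the obstacle explicit) are

$${\rm var}(T) \;\ge\; \mathcal I_\theta^{-1}, \qquad \mathcal I_\theta = \mathbb E_{p_\theta}\!\left[\left(\frac{\partial}{\partial \theta}\log p_\theta(y|x,r)\right)^{2}\right],$$

so verifying efficiency would require both an explicit estimator $T$ and a tractable $\mathcal I_\theta$, neither of which is available here.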

If we have misunderstood your question, please let us know, and we will try to provide a proper explanation as soon as possible.

# Weakness 2: Application in quantile regression.

This is an interesting application of monotonicity. We added an additional quantile regression experiment based on a simulation and put the details in Section 5.1. We show a comparison of the quantile curves of all methods in Figure 3, and the MAE of the $r$th quantile predicted by all methods for different settings of $r$. The test results show that GCM performs best on the MAE metric. We can also see that the traditional strict monotonic methods (MM, SMM, CMNN) suffer in approximation accuracy, as the strict monotonic structures (e.g. positive weight matrices and monotonic activations) weaken the universal approximation ability of neural networks. The non-monotonic (MLP) and weakly monotonic (Hint, PWL) methods have better approximation accuracy than the strict monotonic methods, but their $r$th quantile curves for different values of $r$ are not sufficiently separated, as shown in Figure 3, due to the lack of strict monotonic constraints.

Review (Rating: 3)

The paper models a conditional monotonic distribution as a marginalization over latent variables. The key idea is to modify the monotonic modeling problem into modeling an element-wise cumulative distribution function (over a cost variable C). To actually model the latter, the authors introduce a latent variable modeling problem and solve it via a standard importance-weighted likelihood estimate with or without an additional ELBO term. The paper presents improved results in two experiments over a variety of baselines.

Strengths

  • Clean reformulation of the monotonic problem into a classification over a latent variable.
  • Well-motivated techniques to solve the problem.
  • Comparison against a variety of baselines.

Weaknesses

No confidence intervals in the results tables. It's not clear that the improvement achieved by GCM is large enough (even over the simple MLP) to justify the much more complicated modeling procedure.

Questions

  • How wide are the confidence intervals for each of the metrics? Just do standard bootstrap if possible and report please.
  • Why introduce the extra prior $p_\theta(Z)$? Also, why is the prior different from $\pi(Z)$?


Comment

Thank you for the reviews and valuable suggestions. We have revised our paper as you requested; here are our replies (updated on 2024-11-29).

# Weakness 1 & question 1: No confidence intervals in the results tables.

We added 95% confidence intervals to the experimental results; the confidence intervals are estimated by repeating each experiment 10 times. The results with confidence intervals can be found in Tables 1 and 3 in Section 5 and in the tables in Appendix A, C, and E.

# Weakness 2: It's not clear that the improvement achieved by GCM is large enough.

With confidence intervals shown, the improvement of GCM over the baselines, including the MLP, is significant. The detailed experimental results are in Tables 1 and 3 in Section 5 and in Appendix E. As past studies on monotonic networks (e.g. papers 1 & 2) show, the improvements in AUC and ACC on the Adult and COMPAS datasets are not numerically large, due to limitations of the datasets (dataset size, random noise, etc.). In fact, a 0.003 improvement in the AUC metric is considered large in various scenarios.

We also added two more sets of experiments, on the Diabetes dataset and the Blog Feedback dataset; GCM-VI is still the best method on all datasets, and the results are reported in Appendix E.

Moreover, the MLP is not a monotonic model; we use it to see the effect of having no monotonic constraint on the neural network. Given the difficulty of training a strictly monotonic neural network (for example, all weight matrices must be positive in the classic Min-Max monotonic network, see paper 3; layer normalization cannot be applied because it is not monotonic; there is a lack of scalability, see paper 4; etc.), it is an achievement for a monotonic model to tie with the unconstrained MLP while preserving the advantage of giving strictly monotonic predictions. In fact, not all monotonic models can beat the MLP due to the rigorous monotonic constraints, as shown in Tables 1 and 3. However, in all sets of experiments, the GCM model surpasses the MLP and the baseline models, as shown in Tables 1 and 3, which demonstrates the ability of our model.

paper 1: https://arxiv.org/pdf/2306.01147

paper 2: https://arxiv.org/pdf/2011.10219

paper 3: https://proceedings.neurips.cc/paper/1997/file/83adc9225e4deb67d7ce42d58fe5157c-Paper.pdf

paper 4: https://openreview.net/pdf?id=DjIsNDEOYX

# Question 2: Why introduce the extra prior $p_\theta(z)$? Also, why is the prior different from $\pi(z)$?

In the original paper, we take the sampling distribution $\pi(z)$ to be the same as the prior $p(z)$. In the latest revision, we remove $\pi(z)$ and change the evidence estimator from the importance-sampling form $p_{\theta}(y|x, r)\approx \frac{1}{K}\sum_{j=1}^{K}\frac{p_{\theta_1}(z_j|x)}{\pi(z_j)}p_{\theta_2}(y|z_j, r)$ to the exact expression $p_{\theta}(y|x, r)=\sum_{k=1}^{K}p_{\theta_1}(z=k|x)\,p_{\theta_2}(y|z=k, r)$ for a categorical variable $z$, which is equivalent to importance sampling when $z$ is categorical and $\pi(z)\equiv {\rm constant}$. In the new revision, we also provide an estimator of the evidence for a Gaussian latent variable $z$ using the reparameterization trick instead of importance sampling, where $z_j=z(\theta_1, \epsilon_j) \sim p_{\theta_1}(z|x)$ and $p_{\theta}(y|x, r)\approx \frac{1}{K}\sum_{j=1}^{K}p_{\theta_2}(y|z_j, r)$, avoiding the high-variance issue of importance sampling. The experimental results in Section 5.1 have been updated using the Gaussian latent variable $z$. We also compare the performance of Gaussian and categorical latent variables in Appendix C.2, showing that they achieve similar performance on the four public datasets.

Comment

Dear reviewers, thank you for the comprehensive reviews and valuable suggestions. We have submitted the final revision; here are the modifications we made compared to the original paper.

1. We revised the latent distribution and simplified the loss functions of GCM and GCM-VI.

The latent distribution used in the original paper is categorical, and to handle it, we had to use importance sampling to estimate the evidence. However, if we traverse all possible values of $z$, this is equivalent to the law of total probability, which gives the exact evidence:

$$p_\theta(y|x,r)= \sum_{k=1}^{K}p_{\theta_1}(z=k|x)\,p_{\theta_2}(y|z=k,r).$$

In this case, we can directly optimize the evidence and we have the loss function:

$$\mathcal L_{GCMcate}(\theta; x, r, y)= -\log \left[\sum_{k=1}^{K}p_{\theta_1}(z=k|x)\,p_{\theta_2}(y|z=k,r)\right].$$

For complex latent variables, for example $z$ following a Gaussian distribution, we cannot use the law of total probability (integrating over $z$ is intractable) or importance sampling (high variance). We follow the classical reparameterization trick and reformulate the GCM loss as:

$$\mathcal L_{GCM}(\theta; x, r, y)= -\log \left[\frac{1}{K}\sum_{k=1}^{K}p_{\theta_2}\left(y|{x},{r},{z}_k\right)\right],$$

where $z_k \sim p_{\theta_1}(z|x)$. To obtain a better estimate of $z$, we adopt a recognition model $q_\phi(z|x,r,y)$; similar to IWAE, the variational version of GCM (GCM-VI) has the loss:

$$\mathcal L_{GCM-VI}(\theta, \phi; x, r, y)= -\log \left[\frac{1}{K}\sum_{k=1}^K\frac{p_{\theta_2}(y| z_k, r)\,p_{\theta_1}(z_k| x)}{q_{\phi}(z_k| x, r, y)}\right],$$

where $z_k \sim q_{\phi}(z|x,r,y)$. The three revised loss functions are simpler than in the original paper. We remove the hyperparameters $\alpha$ and $\beta$ of the original formulation $\mathcal L=-LL-\alpha\, ELB+\beta\, D_{KL}(p_\theta(z|x)\|p(z))$, so we now have fewer hyperparameters and a cleaner loss function without a redundant linear combination of $LL$ and $ELB$.
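To make the exact categorical case concrete, here is a minimal PyTorch-style sketch of $\mathcal L_{GCMcate}$, computed stably in log space; prior_logits_net and likelihood_logits are placeholder modules assuming a Bernoulli likelihood, not our released code.

```python
# Minimal sketch of the exact categorical-evidence loss L_GCMcate above.
# Module names and the Bernoulli likelihood are illustrative assumptions,
# not the released implementation.
import torch

def gcm_cate_loss(prior_logits_net, likelihood_logits, x, r, y):
    log_p_z = torch.log_softmax(prior_logits_net(x), dim=-1)     # [B, K], log p(z=k|x)
    logits_y = likelihood_logits(r)                               # [B, K], Bernoulli logits for each category k
    log_p_y = torch.distributions.Bernoulli(logits=logits_y).log_prob(y.unsqueeze(-1))  # [B, K]
    # -log sum_k p(z=k|x) p(y|z=k, r)
    return -torch.logsumexp(log_p_z + log_p_y, dim=-1).mean()
```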

We performed an ablation study on these three loss functions in Appendix C.2, and the results are as follows.

| Method | Adult ACC | COMPAS ACC | Diabetes ACC | BlogFeedback RMSE |
| --- | --- | --- | --- | --- |
| GCM-Categorical | 0.8850 ± 0.0013 | 0.6983 ± 0.0010 | 0.8443 ± 0.0003 | 0.0988 ± 0.0010 |
| GCM-Gaussian | 0.8854 ± 0.0013 | 0.6991 ± 0.0011 | 0.8441 ± 0.0001 | 0.0994 ± 0.0003 |
| GCM-Gaussian-VI | 0.8858 ± 0.0014 | 0.7011 ± 0.0011 | 0.8442 ± 0.0002 | 0.1005 ± 0.0004 |

We can see that GCM-VI and GCM-Categorical perform the best. This is consistent with their objectives, since GCM-Categorical is trained on the exact $LL$ and GCM-VI provides a better estimate of the latent $z$ than the original GCM.

Comment

2. We replaced the card-gambling experiments with the quantile regression experiment.

The original rules of the card-gambling experiment were hard to fully understand. Therefore, we adopt a simpler quantile regression task based on simulation in Section 5.1. The task is formulated as follows:

$${\rm sample}\ \ r \sim \mathcal U([0,1]),$$

$${\rm sample}\ \ \hat y_r \sim p_\theta(y|x,r),$$

$${\rm minimize}\ \ r(y - \hat y_r )_{+} +(1-r)(\hat y_r - y ) _{+}.$$

Here $\hat y_r$ is the prediction of the $r$th quantile of $y|x$, and the objective in the third line is the classical quantile regression (pinball) loss. Since the $r$th quantile is by definition monotonic with respect to $r$, the prediction $\hat y_r$ should also be monotonic with respect to $r$, so we can test the monotonic methods on this task. In our experiment, the training examples of $x$ and $y$ are generated by:

$${\rm sample}\ \ x \sim \mathcal U([-1.5,1.5]),$$

$${\rm sample}\ \ \epsilon \sim \mathcal U([0, 1]),$$

$$y= 0.3 \sin(2 (x + 0.8)) + 0.4 \sin(3 (x - 1.3)) + 0.3 \sin(5 x) + 0.4 (0.8 x^2+0.6)\,\epsilon.$$
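For concreteness, a minimal sketch of this simulation and of the quantile (pinball) loss is given below; `model(x, r)` is a placeholder for whichever monotonic or baseline network is being tested, not a specific implementation from the paper.

```python
# Minimal sketch of the quantile-regression simulation above: data generation
# and the pinball loss. The predictor `model` is a placeholder assumption.
import torch

def make_data(n):
    x = torch.rand(n) * 3.0 - 1.5                        # x ~ U([-1.5, 1.5])
    eps = torch.rand(n)                                  # eps ~ U([0, 1])
    y = (0.3 * torch.sin(2 * (x + 0.8))
         + 0.4 * torch.sin(3 * (x - 1.3))
         + 0.3 * torch.sin(5 * x)
         + 0.4 * (0.8 * x ** 2 + 0.6) * eps)
    return x, y

def pinball_loss(y, y_hat, r):
    # r * (y - y_hat)_+ + (1 - r) * (y_hat - y)_+
    return (r * torch.clamp(y - y_hat, min=0)
            + (1 - r) * torch.clamp(y_hat - y, min=0)).mean()

x, y = make_data(1024)
r = torch.rand(x.shape[0])                               # quantile level per example
# y_hat = model(x, r); loss = pinball_loss(y, y_hat, r)  # one training step
```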

The test results are shown in Figure 3 and Table 1 in Section 5.1 and demonstrate that GCM performs best among all methods. The MAE metrics are as follows.

| Method | r=0.1 | r=0.3 | r=0.5 | r=0.7 | r=0.9 |
| --- | --- | --- | --- | --- | --- |
| MLP | 0.1495 ± 0.0340 | 0.1157 ± 0.0283 | 0.1057 ± 0.0255 | 0.1230 ± 0.0309 | 0.1477 ± 0.0386 |
| MM | 0.2002 ± 0.0572 | 0.1103 ± 0.0320 | 0.0723 ± 0.0245 | 0.1067 ± 0.0346 | 0.1745 ± 0.0495 |
| SMM | 0.2345 ± 0.0693 | 0.1194 ± 0.0356 | 0.0812 ± 0.0246 | 0.1236 ± 0.0366 | 0.1919 ± 0.0556 |
| CMNN | 0.1768 ± 0.0340 | 0.1119 ± 0.0174 | 0.0823 ± 0.0161 | 0.1007 ± 0.0198 | 0.1480 ± 0.0332 |
| Hint | 0.1402 ± 0.0285 | 0.1137 ± 0.0263 | 0.1068 ± 0.0292 | 0.1154 ± 0.0368 | 0.1316 ± 0.0374 |
| PWL | 0.1793 ± 0.0282 | 0.1476 ± 0.0164 | 0.1394 ± 0.0193 | 0.1524 ± 0.0216 | 0.1698 ± 0.0207 |
| GCM | 0.0984 ± 0.0188 | 0.0777 ± 0.0119 | 0.0669 ± 0.0096 | 0.0759 ± 0.0127 | 0.0991 ± 0.0211 |

3. We updated the experiments on the public datasets.

We added two public datasets, so we now have four public datasets in the experiments:

| Dataset | Total examples | Dimension of x | Dimension of r | Target |
| --- | --- | --- | --- | --- |
| Adult | 48,842 | 33 | 4 | classification |
| COMPAS | 7,214 | 9 | 4 | classification |
| Diabetes | 253,680 | 105 | 4 | classification |
| BlogFeedback | 52,397 | 272 | 8 | regression |

They are tested in Section 5.2; GCM-VI achieves the best performance on Adult, COMPAS, and Diabetes, while GCM performs best on BlogFeedback.

| Method | Adult ACC | COMPAS ACC | Diabetes ACC | BlogFeedback RMSE |
| --- | --- | --- | --- | --- |
| MLP | 0.8837 ± 0.0012 | 0.6955 ± 0.0008 | 0.8431 ± 0.0004 | 0.1042 ± 0.0004 |
| MM | 0.8836 ± 0.0010 | 0.6949 ± 0.0021 | 0.8409 ± 0.0008 | 0.1100 ± 0.0018 |
| SMM | 0.8837 ± 0.0011 | 0.6955 ± 0.0020 | 0.8401 ± 0.0013 | 0.1114 ± 0.0008 |
| CMNN | 0.8832 ± 0.0013 | 0.6997 ± 0.0011 | 0.8393 ± 0.0015 | 0.1118 ± 0.0005 |
| Hint | 0.8846 ± 0.0011 | 0.6861 ± 0.0024 | 0.8407 ± 0.0005 | 0.1118 ± 0.0013 |
| PWL | 0.8835 ± 0.0012 | 0.6960 ± 0.0013 | 0.8417 ± 0.0003 | 0.1069 ± 0.0006 |
| GCM | 0.8854 ± 0.0013 | 0.6991 ± 0.0011 | 0.8441 ± 0.0001 | 0.0994 ± 0.0003 |
| GCM-VI | 0.8858 ± 0.0014 | 0.7011 ± 0.0011 | 0.8442 ± 0.0002 | 0.1005 ± 0.0004 |

4. We uploaded the code.

Our code is available at https://github.com/iclr-2025-4464/GCM; you are welcome to try it and give us feedback.

Finally, we sincerely appreciate your reviews and look forward to hearing from you.

AC Meta-Review

The paper addresses the problem of conditional monotonic distribution modeling by representing it as a marginalization over latent variables. The central idea is to reformulate the monotonic modeling challenge into modeling an element-wise cumulative distribution function (CDF) with respect to a cost variable $C$. To achieve this, the authors introduce a latent variable modeling framework, which is solved using a standard importance-weighted likelihood estimate, optionally incorporating an ELBO term. The original paper has many issues and errors; for example, the basic conditional independence assumption was found to be unnecessary during the rebuttal period. These major changes were not fully verified in the rebuttal phase, and it was hard for the reviewers to fully review the new version again. I do not think the current version is ready for acceptance.

Additional Comments on Reviewer Discussion

None of the reviewers engaged in discussions with the authors, likely because many of the concerns raised by the reviewers stem from errors made by the authors. While the authors attempted to address these issues in the revision, the extent of the changes is too substantial for a second review. Additionally, the significance of the results remains unconvincing. For instance, the authors argue that a 0.003 improvement in the AUC metric is highly significant in various scenarios, a claim that appears questionable.

Final Decision

Reject