PaperHub
Average rating: 5.8/10
Decision: Rejected · 4 reviewers
Ratings: 8, 3, 6, 6 (lowest 3, highest 8, standard deviation 1.8)
Confidence: 3.5
Correctness: 2.5
Contribution: 2.8
Presentation: 3.0
ICLR 2025

MissDiff: Training Diffusion Models on Tabular Data with Missing Values

Submitted: 2024-09-27 · Updated: 2025-02-05

Keywords: Diffusion Model, Missing Value, Tabular Data

Reviews and Discussion

Review (Rating: 8)

Motivated by missingness in tabular data, this paper proposes a way to learn diffusion models via score-matching with missing values. The paper includes experiments on both real and synthetic data demonstrating the advantages of this approach versus baseline approaches.

Strengths

  1. This paper offers a general, rigorous framework for learning diffusion models from data with missingness. I think the tabular ML subfield, and other ML subfields that face missingness will benefit from this analysis. Furthermore, future multimodal models that combine textual and tabular data will surely benefit from this framework.

  2. The paper comes with extensive experiments, for example with varying missingness rates, incomplete columns, and incomplete rows. And the experiments evaluate not only MCAR, but also the more realistic MAR and NMAR scenarios, showing benefits in these settings as well.

Weaknesses

  1. [UPDATE: addressed] Missing related work: There are works that utilize the ability of gradient-boosted decision trees (GBDTs) to handle missing features in order to avoid the impute-then-generate framework. Examples include [ForestDiffusion 1, DiffPuter 2] that use diffusion modeling and [UnmaskingTrees 3] that uses autoregression modeling. In particular, it is not clear whether the diffusion modeling approaches in [1, 2] can be seen as instances of the authors' proposed framework. (Granted, I realize that [1, 2, 3] may be parallel works, and that these rely on GBDTs for missingness, which is not a very general solution. So while I think these are relevant related works, I would like to emphasize that these do not diminish the potential impact of MissDiff, which is the first to enable the leap to neural networks, which will be important for modeling textual fields, and which is the first to be rigorously shown to not rely on assumptions like MCAR.)

  2. [UPDATE: addressed] It would be preferable to report how MissDiff performs on the imputation benchmark of [ForestDiffusion 1], for the following reasons. First, it has a larger and more diverse set of 27 datasets. Second, unlike Zheng & Charoenphakdee (2022)'s benchmark, this one shows that MissForest is better than GAIN, which is frankly more plausible. Third, without any further reimplementation work, it would enable one to compare MissDiff to the current claimed SotA diffusion [2] and autoregression [3] methods. Fourth, it should be simple to run MissDiff on this benchmark during the rebuttal period.

[1] Jolicoeur-Martineau, Alexia, Kilian Fatras, and Tal Kachman. "Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.

[2] Zhang, Hengrui, Liancheng Fang, and Philip S. Yu. "Unleashing the Potential of Diffusion Models for Incomplete Data Imputation." arXiv preprint arXiv:2405.20690 (2024).

[3] McCarter, Calvin. "Unmasking Trees for Tabular Data." arXiv preprint arXiv:2407.05593 (2024).

Questions

  • [UPDATE: addressed] Line 351 "MICE based on random forest (MissForest) (Stekhoven, 2015)". MissForest is not exactly following the MICE framework using random forest. Are you comparing to MICE-Forest (presumably the implementation of AnotherSamWilson/miceforest) or truly MissForest (presumably the implementation of epsilon-machine/missingpy)? Please clarify such details in the manuscript.

  • [UPDATE: addressed] While leaving the architectural details to the manuscript appendix is fine, it would be nice if the main text indicated that a neural network model architecture is used.

Comment

We thank the reviewer for the comments, and we appreciate the time you spent on the paper.

Q: Missing related work: There are works that utilize the ability of gradient-boosted decision trees (GBDTs) to handle missing features in order to avoid the impute-then-generate framework. Examples include [ForestDiffusion 1, DiffPuter 2] that use diffusion modeling and [UnmaskingTrees 3] that uses autoregression modeling. In particular, it is not clear whether the diffusion modeling approaches in [1, 2] can be seen as instances of the authors' proposed framework. (Granted, I realize that [1, 2, 3] may be parallel works, and that these rely on GBDTs for missingness, which is not a very general solution. So while I think these are relevant related works, I would like to emphasize that these do not diminish the potential impact of MissDiff, which is the first to enable the leap to neural networks, which will be important for modeling textual fields, and which is the first to be rigorously shown to not rely on assumptions like MCAR.)

A: Thank you for your comments. We include a discussion of the related work in lines 310-317. In particular, [ForestDiffusion 1] adopts XGBoost to estimate the score; [DiffPuter 2] leverages Expectation-Maximization, first learning the joint distribution of the observed and currently estimated missing data and then re-estimating the missing data from the conditional probability given the observed data; and [UnmaskingTrees 3] adopts tree-based autoregressive modeling of tabular data. These methods can be used for imputation as well as generation. However, we believe our work still offers a distinct takeaway: we propose the masked score matching loss and prove the theoretical soundness of this technique.

Comment

Q: It would be preferable to report how MissDiff performs on the imputation benchmark of [ForestDiffusion 1], for the following reasons. First, it has a larger and more diverse set of 27 datasets. Second, unlike Zheng & Charoenphakdee (2022)'s benchmark, this one shows that MissForest is better than GAIN, which is frankly more plausible. Third, without any further reimplementation work, it would enable one to compare MissDiff to the current claimed SotA diffusion [2] and autoregression [3] methods. Fourth, it should be simple to run MissDiff on this benchmark during the rebuttal period.

A: Thank you very much for your insightful suggestions. We appreciate your detailed reasoning and agree that a broader evaluation could provide additional insights. We provide the results on the imputation benchmark of [ForestDiffusion 1] as follows:

Tabular data imputation (27 datasets, 3 experiments per dataset, 10 imputations per experiment) with 20% missing values; raw score - mean (standard error)

| Method | MinMAE ↓ | AvgMAE ↓ | W_train ↓ | W_test ↓ | MAD ↑ | R²_imp ↑ | F1_imp ↑ | P_bias ↓ | Cov_rate ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KNN | 0.16 (0.03) | 0.16 (0.03) | 0.42 (0.08) | 1.89 (0.49) | 0.00 (0.00) | 0.59 (0.09) | 0.75 (0.04) | 1.27 (0.25) | 0.40 (0.11) |
| ICE | 0.10 (0.01) | 0.21 (0.03) | 0.52 (0.09) | 1.99 (0.49) | 0.69 (0.10) | 0.59 (0.09) | 0.74 (0.04) | 1.05 (0.29) | 0.39 (0.09) |
| MICE-Forest | 0.08 (0.02) | 0.13 (0.03) | 0.34 (0.07) | 1.86 (0.48) | 0.29 (0.08) | 0.61 (0.10) | 0.76 (0.04) | 0.61 (0.20) | 0.75 (0.11) |
| MissForest | 0.10 (0.03) | 0.12 (0.03) | 0.32 (0.07) | 1.85 (0.48) | 0.10 (0.03) | 0.61 (0.10) | 0.76 (0.04) | 0.62 (0.22) | 0.79 (0.08) |
| Softimpute | 0.22 (0.03) | 0.22 (0.03) | 0.53 (0.07) | 1.99 (0.48) | 0.00 (0.00) | 0.58 (0.09) | 0.74 (0.04) | 1.18 (0.34) | 0.31 (0.09) |
| OT | 0.14 (0.02) | 0.19 (0.03) | 0.56 (0.10) | 1.98 (0.49) | 0.28 (0.05) | 0.59 (0.10) | 0.75 (0.04) | 1.09 (0.27) | 0.39 (0.12) |
| GAIN | 0.16 (0.03) | 0.17 (0.03) | 0.49 (0.11) | 1.95 (0.51) | 0.01 (0.00) | 0.60 (0.10) | 0.75 (0.04) | 1.04 (0.25) | 0.54 (0.12) |
| Forest-VP | 0.14 (0.04) | 0.17 (0.03) | 0.55 (0.13) | 1.96 (0.50) | 0.25 (0.03) | 0.61 (0.10) | 0.74 (0.04) | 0.81 (0.25) | 0.57 (0.14) |
| MissDiff | 0.15 (0.03) | 0.18 (0.03) | 0.49 (0.12) | 1.96 (0.49) | 0.27 (0.04) | 0.62 (0.10) | 0.74 (0.04) | 0.93 (0.31) | 0.61 (0.11) |
| Oracle | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 1.87 (0.49) | 0.00 (0.00) | 0.64 (0.09) | 0.78 (0.04) | 0.00 (0.00) | 1.00 (0.00) |

Q: Line 351 "MICE based on random forest (MissForest) (Stekhoven, 2015)". MissForest is not exactly following the MICE framework using random forest. Are you comparing to MICE-Forest (presumably the implementation of AnotherSamWilson/miceforest) or truly MissForest (presumably the implementation of epsilon-machine/missingpy)? Please clarify such details in the manuscript.

A: Thank you very much for pointing this out. We use the true MissForest implementation from https://github.com/vanderschaarlab/hyperimpute. We have revised the description of MissForest in line 360.

Q: While leaving the architectural details to the manuscript appendix is fine, it would be nice if the main text indicated that a neural network model architecture is used.

A: Thank you very much for your suggestion. We have added a description of the architecture details in the revised manuscript (lines 425-427).

Comment

Thanks for the thorough responses. I have restored my score back to the original score of 8, and noted that the weaknesses in my original review are now all addressed. I am keeping my confidence level at 3, because I do not have the technical competence to judge the ongoing discussion with reviewers related to the theoretical grounding of the proposed method. Still, my overall assessment is that this is the first paper on neural methods for tabular data with missingness I have come across with a method that "makes sense" and has solid empirical results. I am particularly impressed with the performance on the Jolicoeur-Martineau et al benchmark, which has smaller datasets on which I would expect neural methods to do worse. So I think the ICLR community will benefit from the acceptance of this paper.

Comment

We sincerely thank the reviewer for the valuable feedback. We appreciate the time you spent on the paper.

Review (Rating: 3)

This paper introduces MissDiff, a novel imputation method designed to address missing values in tabular datasets. Motivated by recent advances in using generative models like VAEs and GANs for imputation, MissDiff incorporates a diffusion model to handle incomplete data. Experimental results demonstrate the effectiveness of this approach, highlighting its potential to improve imputation accuracy.

Strengths

See summary.

Weaknesses

  • The authors claim that the impute-then-generate method may lead to an inconsistent estimator. However, the proposed MissDiff fills 0 into the observed data (the $i$-th element of $x_{\text{obs}}$ equals 0 when the $i$-th element is missing) and uses this imputed data to train the model. The only difference is that the authors use a mask to remove the effect of the missing entries on the loss function. Since MissDiff is therefore also a kind of impute-then-generate method, why does MissDiff not suffer from the issues of imputation methods? Is there any intuition to explain the superiority of MissDiff?
  • The proof of Theorem 3.2 is not convincing. In footnote 8 on page 15, the authors claim that masking the Gaussian score computed on the partially observed data recovers the masked score of the fully observed data. However, this property does not hold for $s_{\theta}(x(t),t)$: if one imputes the $i$-th element of $x(t)$ with 0, then possibly every element of $s_{\theta}(x(t),t)$ will change. In that case, how can the equations at lines 773-778 be shown?
  • I do not see any challenge in extending the proposed method to scale up for addressing missing data in high-dimensional cases, such as image imputation. Could you provide relevant results to demonstrate the method's effectiveness in these scenarios?

Questions

See weakness.

Comment

Q: I did not see any challenge in extending the proposed method to scale up for addressing missing data issues in high-dimensional cases, such as image imputation. Could you provide relevant results to demonstrate the method's effectiveness in these scenarios?

A: Thank you very much for your question. Our paper mainly focuses on tabular data, and the existing literature [1-7] does not contain experiments on images. Nevertheless, to provide more evidence of the effectiveness of the proposed method, we conduct an imputation experiment against misGAN; the results are as follows:

The effectiveness of MissDiff on MNIST.

| Method | FID |
| --- | --- |
| misGAN | 1.21 |
| MissDiff | 0.93 |

[1]: Kim, J., Lee, C.E., & Park, N. (2023). STaSy: Score-based Tabular data Synthesis. International Conference on Learning Representations.

[2]: Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling Tabular data using Conditional GAN. Neural Information Processing Systems.

[3]: Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. International Conference on Machine Learning.

[4]: Zhang, H., Zhang, J., Srinivasan, B., Shen, Z., Qin, X., Faloutsos, C., Rangwala, H., & Karypis, G. (2024). Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space. International Conference on Learning Representations.

[5]: Jolicoeur-Martineau, A., Fatras, K., & Kachman, T. (2023). Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. International Conference on Artificial Intelligence and Statistics.

[6]: Zhang, H., Fang, L., & Yu, P.S. (2024). Unleashing the Potential of Diffusion Models for Incomplete Data Imputation. ArXiv, abs/2405.20690.

[7]: McCarter, C. (2024). Unmasking Trees for Tabular Data. ArXiv, abs/2407.05593.

Comment

To help the reviewer better understand the proof of Theorem 3.2, we summarize the main steps of the proof as follows:

  1. We show that the score on the incomplete data is equivalent to the score on the complete data after element-wise multiplication with the mask: for a Gaussian transition distribution with an isotropic covariance matrix, $\nabla_{\mathbf{x}^{\text{obs}}(t)} \log p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0)) \odot \mathbf{m} = \nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0)) \odot \mathbf{m}$.
  2. Then, the optimal solution of the Denoising Score Matching objective on missing data coincides with that of the Denoising Score Matching objective on complete data: $\underset{\boldsymbol{\theta}}{\arg\min}\ \mathbb{E}_{p(\mathbf{x}^{\text{obs}}(0),\mathbf{m})} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0))} [\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}(t), t)-\nabla_{\mathbf{x}^{\text{obs}}(t)} \log p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0))) \odot \mathbf{m}\|_2^2]$ has the same solution as $\underset{\boldsymbol{\theta}}{\arg\min}\ \mathbb{E}_{p(\mathbf{x}(0),\mathbf{m})} \mathbb{E}_{p(\mathbf{x}(t) \mid \mathbf{x}(0))} [\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}(t), t)-\nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0))) \odot \mathbf{m}\|_2^2]$. This is because the least-squares loss can be decomposed into three terms, two of which are the squares of their respective components and the remaining one a cross term; applying the equality in step 1 to the cross term gives the equality of the two objectives.
  3. Finally, we take the conditional expectation of the mask inside the least-squares loss: since $\mathbf{m}_i\in\{0,1\}$, we have $\mathbb{E}[\mathbf{m}_i^2]=\mathbb{E}[\mathbf{m}_i]$ for any distribution of $\mathbf{m}$.

Illustration: The derivation is correct no matter what value we use to impute the missing entries in $\mathbf{x}^{\text{obs}}(0)$. Let $\boldsymbol{\theta}^*_{\text{obs}}$ denote the optimal solution of the Denoising Score Matching objective on missing data, i.e., $\boldsymbol{\theta}^*_{\text{obs}} = \underset{\boldsymbol{\theta}}{\arg\min}\ \mathbb{E}_{p(\mathbf{x}^{\text{obs}}(0),\mathbf{m})} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0))} [\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}(t), t)-\nabla_{\mathbf{x}^{\text{obs}}(t)} \log p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0))) \odot \mathbf{m}\|_2^2]$, and let $\boldsymbol{\theta}^*$ denote the optimal solution of the Denoising Score Matching objective on complete data, i.e., $\boldsymbol{\theta}^* = \underset{\boldsymbol{\theta}}{\arg\min}\ \mathbb{E}_{p(\mathbf{x}(0),\mathbf{m})} \mathbb{E}_{p(\mathbf{x}(t) \mid \mathbf{x}(0))} [\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}(t), t)-\nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0))) \odot \mathbf{m}\|_2^2]$. We are not claiming that $\mathbf{s}_{\boldsymbol{\theta}^*_{\text{obs}}}(\mathbf{x}^{\text{obs}}(t), t)=\mathbf{s}_{\boldsymbol{\theta}^*}(\mathbf{x}(t), t)$; we show that $\mathbf{s}_{\boldsymbol{\theta}^*_{\text{obs}}}(\mathbf{x}(t), t)=\mathbf{s}_{\boldsymbol{\theta}^*}(\mathbf{x}(t), t)$.

Comment
  • In point 2, the two arg min learning objectives do not lead to the same solution.
  • In point 3, the last equation includes $x(t)$, which is unavailable in the training procedure, making this equation meaningless.
Comment

Then, combining with the first term $\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t)}[\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t) \odot \mathbf{m}\|_2^2]$, we get

$$
\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)}\big[\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t)-\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)) \odot \mathbf{m}\|_2^2\big]
=\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t)}\big[\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)) \odot \mathbf{m}\|_2^2\big]+C_2,
$$

where $C_2 = \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t)}[\|\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0) \odot \mathbf{m}\|_2^2 - \|\nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t) \odot \mathbf{m}\|_2^2]$ is independent of $\boldsymbol{\theta}$. Hence, our denoising score matching on observed data recovers the complete score.

After a careful review, we understand that the reviewer wants to point out that the input of the neural network $\mathbf{s}_{\boldsymbol{\theta}}$ is $\mathbf{x}^{\text{obs}}_t$ rather than $\mathbf{x}_t$. This difference is reflected by the ratio $\frac{p(\mathbf{x}^{\text{obs}}_t\mid \mathbf{x}_0 \odot \mathbf{m})}{p(\mathbf{x}_t\mid \mathbf{x}_0)}$ in the above analysis. When there is no missing data, the ratio is 1 and the proof goes through directly. When there is missing data and $\mathbf{s}_{\boldsymbol{\theta}}$ takes $\mathbf{x}^{\text{obs}}_t$ as input, we prove that $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t)$ can recover the score on observed data in Lemma A.2 (line 864 in the revised version). However, we need a stronger assumption to prove that $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t)$ can recover the score on complete data; for example, a sufficient assumption is the data assumption in [1], $p(x)=\Pi_{i=1}^{d}p(x_i)$ (the only difference from data setting D1 in [1] is that we do not consider the latent variable $z$). In this case, the score in each data dimension is independent of the values in the other data dimensions, so no matter which value we use to impute $\mathbf{x}^{\text{obs}}_0$, $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t)$ can recover the score on complete data. This assumption can be relaxed to allow the $d$ variables $x_1, \ldots, x_d$ in $\mathbf{x}$ to be partitioned into $c$ non-overlapping groups $\mathcal{X}_1, \ldots, \mathcal{X}_c$, where variables in different groups are independent and the missing mechanism is such that the variables in each group are either jointly missing or jointly observed.

In summary, because of Lemma A.2 and Theorem 3.2 under the stronger assumptions, we still believe the theoretical guarantee of our objective is sound enough to motivate the proposed framework and to interpret the effectiveness of our method on imputation and generation tasks.

[1]: Ma, C., & Zhang, C. (2021). Identifiable Generative Models for Missing Not at Random Data Imputation. NeurIPS.

Comment

We sincerely thank the reviewer for the valuable feedback. We appreciate the time you spent on the paper.

To take a careful look at the relationship of masked denoising score matching on observed data and score matching objective on complete data, we can do the following decomposition:

$$
\begin{aligned}
&\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)}\big[\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t)-\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)) \odot \mathbf{m}\|_2^2\big]\\
=\,&\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t)}\big[\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t) \odot \mathbf{m}\|_2^2\big] - 2\,\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)}\big[\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0) \odot \mathbf{m}\rangle\big] + C,
\end{aligned}
$$

where $C= \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t)}[\|\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0) \odot \mathbf{m}\|_2^2]$ is independent of $\boldsymbol{\theta}$. We mainly look at the second term. We first replace $\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0) \odot \mathbf{m}$ by $\nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0)) \odot \mathbf{m}$, and we model the observed data as $\mathbf{x}^{\text{obs}}_0=\mathbf{x}_0 \odot \mathbf{m}$; in this sense, we can build the connection with the score on complete data:

$$
\begin{aligned}
&\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)}\big[\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t\mid \mathbf{x}^{\text{obs}}_0) \odot \mathbf{m}\rangle\big]\\
=\,&\mathbb{E}_{p(\mathbf{x}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}_0 \odot \mathbf{m})}\big[\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{x}_0) \odot \mathbf{m}\rangle\big] \quad\text{(utilizing the equality above)}\\
=\,&\int\!\!\int p_0(\mathbf{x}_0)\, p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}_0 \odot \mathbf{m})\,\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t), \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{x}_0) \odot \mathbf{m}\rangle\, \mathrm{d}\mathbf{x}_t\, \mathrm{d} \mathbf{x}_0 \\
=\,&\int\!\!\int p_0(\mathbf{x}_0)\, \frac{p(\mathbf{x}^{\text{obs}}_t\mid \mathbf{x}_0 \odot \mathbf{m})}{p(\mathbf{x}_t \mid \mathbf{x}_0)}\,\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t), \nabla_{\mathbf{x}_t} p(\mathbf{x}_t \mid \mathbf{x}_0) \odot \mathbf{m}\rangle\, \mathrm{d}\mathbf{x}_t\, \mathrm{d} \mathbf{x}_0,
\end{aligned}
$$

where the ratio is $\frac{p(\mathbf{x}^{\text{obs}}_t\mid \mathbf{x}_0 \odot \mathbf{m})}{p(\mathbf{x}_t\mid \mathbf{x}_0)} = \frac{\exp(-\|\mathbf{x}^{\text{obs}}_t-\mathbf{x}_0 \odot \mathbf{m}\|_2^2)}{\exp(-\|\mathbf{x}_t -\mathbf{x}_0\|_2^2)}$. If the ratio is 1, we have

$$
\begin{aligned}
&\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)}\big[\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0) \odot \mathbf{m}\rangle \big]\\
=\,&\int \Big\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\ \frac{\mathrm{d} \int_{\mathbf{x}_0} p_0(\mathbf{x}_0)\, p(\mathbf{x}_t \mid \mathbf{x}_0)\, \mathrm{d} \mathbf{x}_0 }{\mathrm{d} \mathbf{x}_t} \odot \mathbf{m}\Big\rangle\, \mathrm{d}\mathbf{x}_t\\
=\,&\int \Big\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\ \frac{\mathrm{d}\, p_t(\mathbf{x}_t) }{\mathrm{d} \mathbf{x}_t} \odot \mathbf{m}\Big\rangle\, \mathrm{d}\mathbf{x}_t\\
=\,&\int p_t(\mathbf{x}_t)\, \Big\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\ \frac{\mathrm{d} \log p_t(\mathbf{x}_t) }{\mathrm{d} \mathbf{x}_t} \odot \mathbf{m} \Big\rangle\, \mathrm{d} \mathbf{x}_t\\
=\,&\ \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t)}\big[\langle\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t),\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) \odot \mathbf{m}\rangle\big].
\end{aligned}
$$

Comment

We thank the reviewer for the comments, and we appreciate the time you spent on the paper.

Q: The authors claim that the impute-then-generate method may lead to an inconsistent estimator. However, the proposed MissDiff fills 0 into the observed data (the $i$-th element of $x_{\text{obs}}$ equals 0 when the $i$-th element is missing) and uses this imputed data to train the model. The only difference is that the authors use a mask to remove the effect of the missing entries on the loss function. Since MissDiff is therefore also a kind of impute-then-generate method, why does MissDiff not suffer from the issues of imputation methods? Is there any intuition to explain the superiority of MissDiff?

A: Thank you very much for your question. We clarify this point below. First of all, we prove in Remark 3.1 that the impute-then-generate method is biased: this line of methods no longer captures the data variability, since $\underbrace{\mathbb{E}_{p_0(\mathbf{x}^{\text{miss}}\mid \mathbf{x}^{\text{obs}})}[p_{\boldsymbol{\phi}}(\mathbf{x}^{\text{obs}}, \mathbf{x}^{\text{miss}})]}_{\text{unbiased likelihood}}\neq \underbrace{p_{\boldsymbol{\phi}}(\mathbf{x}^{\text{obs}}, \mathbb{E}_{p_0(\mathbf{x}^{\text{miss}}\mid \mathbf{x}^{\text{obs}})}[\mathbf{x}^{\text{miss}}])}_{\text{impute-then-generate likelihood}}$. However, MissDiff does not maximize the likelihood of the imputed data $p_{\boldsymbol{\phi}}(\mathbf{x}^{\text{obs}}, \mathbb{E}_{p_0(\mathbf{x}^{\text{miss}}\mid \mathbf{x}^{\text{obs}})}[\mathbf{x}^{\text{miss}}])$. Instead, we propose a masked score matching objective to learn the score function on the observed data. In Theorem 3.3, we prove that our objective upper bounds the negative log-likelihood of the observed data rather than the likelihood of the imputed data (which is what impute-then-generate methods target). Therefore, we do not regard MissDiff as a kind of impute-then-generate method. Moreover, we verify the rationality of MissDiff in Theorem 3.2: our masked score matching objective can recover the oracle score on complete data, which leads to an unbiased method for generation tasks.
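To make the inequality above concrete, here is a small numerical check (our illustration, not from the paper): for a correlated bivariate Gaussian with one coordinate missing, averaging the joint density over the conditional distribution of the missing coordinate differs noticeably from evaluating the density at the conditional-mean imputation, which is exactly the gap between the unbiased and the impute-then-generate likelihoods.

```python
# Toy check of E[p(x_obs, x_miss)] != p(x_obs, E[x_miss]); values are
# illustrative only and not taken from the paper.
import numpy as np
from scipy.stats import multivariate_normal

rho = 0.8
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

x_obs = 1.0                                                # observed coordinate
cond_mean, cond_std = rho * x_obs, np.sqrt(1.0 - rho ** 2)  # p(x_miss | x_obs)

rng = np.random.default_rng(0)
x_miss = rng.normal(cond_mean, cond_std, size=200_000)

# unbiased likelihood: average the joint density over p(x_miss | x_obs)
lhs = joint.pdf(np.column_stack([np.full_like(x_miss, x_obs), x_miss])).mean()
# impute-then-generate likelihood: plug in the conditional-mean imputation
rhs = joint.pdf([x_obs, cond_mean])

print(f"E[p(x_obs, x_miss)] ~ {lhs:.3f}")   # approx. 0.11
print(f"p(x_obs, E[x_miss]) ~ {rhs:.3f}")   # approx. 0.16
```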

Q: The proof of Theorem 3.2 is not convincing. In footnote 8, page 15, the authors claim that masking the Gaussian score computed on the partially observed data recovers the masked score of the fully observed data. However, this property does not hold for $s_\theta(x(t), t)$: if one imputes the $i$-th element of $x(t)$ with 0, then possibly every element of $s_\theta(x(t), t)$ will change. In that case, how can the equations at lines 773-778 be shown?

A: Thank you very much for your question. We clarify this point below. First of all, $\mathbf{x}(t)$ denotes the complete (unobservable) data at time $t$ and $\mathbf{x}^{\text{obs}}(t)$ denotes the observed data at time $t$. To keep $\mathbf{x}^{\text{obs}}(t)$ at the same dimension across different missingness patterns, we use 0 to replace $\mathrm{na}$ for continuous variables in $\mathbf{x}^{\text{obs}}(0)$ and a new category to represent $\mathrm{na}$ for discrete variables. We now show that this operation does not affect the correctness of the proof of Theorem 3.2.

  • For footnote 8, page 15: since the diffusion models used in practice have a Gaussian transition distribution with an isotropic covariance matrix, we have $p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0))=\mathcal{N}(\mathbf{x}^{\text{obs}}(t);\boldsymbol{\mu}^{\text{obs}},\Sigma)$ and $p(\mathbf{x}(t) \mid \mathbf{x}(0))=\mathcal{N}(\mathbf{x}(t);\boldsymbol{\mu},\Sigma)$, with $\Sigma=(1-\bar{\alpha}_t)\mathbb{I}$ and $\boldsymbol{\mu}^{\text{obs}} =\boldsymbol{\mu} \odot \mathbf{m}$. Then $\nabla_{\mathbf{x}^{\text{obs}}(t)} \log p(\mathbf{x}^{\text{obs}}(t) \mid \mathbf{x}^{\text{obs}}(0)) \odot \mathbf{m} = - \frac{1}{1-\bar{\alpha}_t}(\mathbf{x}^{\text{obs}}(t)-\boldsymbol{\mu}^{\text{obs}})\odot \mathbf{m} = -\frac{1}{1-\bar{\alpha}_t}(\mathbf{x}(t)-\boldsymbol{\mu})\odot \mathbf{m} = \nabla_{\mathbf{x}(t)} \log p(\mathbf{x}(t) \mid \mathbf{x}(0)) \odot \mathbf{m}$. This equality holds no matter what value we use to impute the missing entries in $\mathbf{x}^{\text{obs}}(0)$ (see the numerical sketch after this list).
  • For lines 773-778 (775-777 in the revised version): we take the conditional expectation of the binary mask inside the least-squares loss.
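The following toy NumPy check illustrates the masked-score equality numerically (a sketch under the stated Gaussian-transition assumption, with an arbitrary noise level and dimension; it is not the authors' code). The key point is that the observed-data and complete-data noisy samples coincide on the observed coordinates when they share the same forward noise, so the masked scores agree.

```python
# Numerical check of the footnote-8 identity: masking the score of the
# observed-data Gaussian kernel equals masking the score of the
# complete-data kernel. alpha_bar and d are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d = 6
alpha_bar = 0.35                       # hypothetical noise level at time t
var = 1.0 - alpha_bar

x0 = rng.normal(size=d)                # complete data x(0) (unobservable)
m = rng.integers(0, 2, size=d).astype(float)   # mask, 1 = observed
x0_obs = x0 * m                        # missing entries replaced by 0

eps = rng.normal(size=d)               # shared forward-process noise
mu = np.sqrt(alpha_bar) * x0           # mean of p(x(t) | x(0))
mu_obs = np.sqrt(alpha_bar) * x0_obs   # mean of p(x_obs(t) | x_obs(0)) = mu * m
x_t = mu + np.sqrt(var) * eps
x_t_obs = mu_obs + np.sqrt(var) * eps

score_complete = -(x_t - mu) / var           # grad log N(x(t); mu, var I)
score_observed = -(x_t_obs - mu_obs) / var   # grad log N(x_obs(t); mu_obs, var I)

assert np.allclose(score_observed * m, score_complete * m)
print("masked scores agree on the observed coordinates")
```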
Review (Rating: 6)

The authors present MissDiff, a diffusion-based model to handle datasets with missing data. In addition to the generation task, it can also impute missing values. Theoretical results are provided showing the connection between the training objective of MissDiff and the maximum likelihood objective on observed data. Experiments on multiple tabular datasets demonstrate its usefulness.

Strengths


  1. The paper is well-written and motivated.
  2. Experimental results demonstrate MissDiff's performance advantages over existing approaches.
  3. Authors also provide promising simulation results.

Weaknesses


  1. The paper's theoretical claims appear overstated, particularly regarding its potential as a "general framework" for handling incomplete data. The authors claim "our method and theoretical guarantees aim to provide a general framework for learning on incomplete data and generate complete data." Missing data assumptions are crucial for statistical efficiency, yet the theoretical results presented in the paper ignore these considerations. Ignoring observations with missing elements may lead to invalid inference (imputation) and loss of statistical efficiency. As a result, MissDiff might show better experimental results but cannot derive identifiability results. My questions are:
  • Could the authors clarify how the theoretical guarantees change under different missing data mechanisms (e.g., MCAR, MAR, MNAR)? It seems to me MissDiff is a general imputation framework; what is the strong assumption under the hood?
  • Can any connections be built between (partial) identifiability and diffusion? If not, can the authors demonstrate why the "general framework" considered in the paper is better than assuming missing data mechanisms (from a theoretical and utility perspective)? For example, when we do not have a user-defined assumption, how can MissDiff benefit users?
  2. The technical novelty is limited. The algorithm is not that much different from score-based diffusion. This is a minor concern.

Overall, W1 is my main concern. While MissDiff shows promising empirical results, its theoretical foundations and technical innovations deserve further development. It would be very exciting to see an identifiable diffusion-based model for imputation, e.g., [1], or the

[1] Identifiable Generative Models for Missing Not at Random Data Imputation, NeurIPS 2021

Questions

See above, also

  1. Have you considered how ignoring observations with missing elements impacts statistical efficiency and identifiability of the model? Could you provide analysis on this?

  2. Results in Table 4 and Appendix C2 provide some analysis related to Q1. But why is MICE not compared? From my experience, MICE is a pretty strong baseline for these two scenarios (MAR and NMAR).

Comment

Q: Can any connections be built between (partial) identifiability and diffusion? If not, can authors demonstrate why the "general framework" considered in the paper is better than assuming missing data mechanisms (from a theoretical and utility perspective)? For example, when we don't have a user-defined assumption, how MissDiff can benefit users?

A: Thank you very much for your insightful question. We acknowledge that we did not explicitly consider the issue of identifiability in our work. To the best of our knowledge, the identifiability of diffusion model parameters remains an open problem without a definitive answer, primarily because the architecture used in the diffusion model maps $\mathbb{R}^d$ to $\mathbb{R}^d$. We also admit that we are not experts on identifiability, but we offer some relevant analysis:

To analyze the identifiability of the diffusion model without missing data, we decompose it into two parts:

  1. the connection between the data distribution $p_\theta(x)$ and its score $s_\theta(x)$;

  2. the connection between $s_\theta(x)$ and $\theta$; in other words, whether $\theta^*=\underset{\theta}{\arg\min} \frac{T}{2}\mathbb{E}_t\big\{\lambda(t)\, \mathbb{E}_{x(0)} \mathbb{E}_{x(t) \mid x(0)} [\|\mathbf{s}_{\theta}(x(t), t)- \mathbf{s}^*(x(t), t)\|_2^2]\big\}$ is uniquely determined.

    1. Point 1 can be easily proved: assume there are two probability distributions $p_{\theta_1}(x)$ and $q_{\theta_2}(x)$ with the same score function, i.e., $s(x)=\nabla_x \log p_{\theta_1}(x)=\nabla_x \log q_{\theta_2}(x)$, so that $\log p_{\theta_1}(x)=\log q_{\theta_2}(x)+C$.

Since the integrals over the domain must both equal 1, we have $p_{\theta_1}(x)=q_{\theta_2}(x)$ (the full chain of implications is written out after this list).

  • For point 2, under the same data setting as in [1], $p(x)=\Pi_{i=1}^{d}p(x_i)$ (the only difference from data setting D1 in [1] is that we do not consider the latent variable $z$), an immediate conclusion is that the score factorizes as $s(x) = [s_1(x_1), \ldots, s_d(x_d)]$. The objective then becomes element-wise: $\underset{\theta}{\arg\min} \frac{T}{2} \mathbb{E}_t \big\{ \lambda(t)\, \mathbb{E}_{x(0)} \mathbb{E}_{x(t) \mid x(0)} \big[ \sum_{i=1}^{d} \| \mathbf{s}_{\theta_i}(x_i(t), t)- \mathbf{s}^*_i(x_i(t), t)\|_2^2 \big] \big\}$. If the oracle score $s^*(x)$ is unique, then $\theta^*$ is unique.
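For completeness, the standard argument behind point 1 can be written out in full (a routine derivation we add here, assuming a common connected support so the integration constant is global; it is not quoted from the paper):

$$
\nabla_x \log p_{\theta_1}(x) = \nabla_x \log q_{\theta_2}(x)\ \ \forall x
\;\Longrightarrow\; \log p_{\theta_1}(x) = \log q_{\theta_2}(x) + C
\;\Longrightarrow\; p_{\theta_1}(x) = e^{C}\, q_{\theta_2}(x),
$$

and integrating both sides over the domain gives $1 = \int p_{\theta_1}(x)\,\mathrm{d}x = e^{C}\int q_{\theta_2}(x)\,\mathrm{d}x = e^{C}$, so $C=0$ and $p_{\theta_1} \equiv q_{\theta_2}$.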

Therefore, we find that diffusion models have good properties and are identifiable when there is no missing data. When there is missing data, we prove in Theorem 3.2 that the score network learned from our objective can recover the oracle score of complete data; this establishes the identifiability of the score function. To establish the connection between $s_\theta(x)$ and $\theta$, we aim to prove that $\theta^*=\underset{\theta}{\arg\min} \frac{T}{2}\mathbb{E}_t\big\{\lambda(t)\, \mathbb{E}_{x^{\text{obs}}(0)} \mathbb{E}_{x^{\text{obs}}(t) \mid x^{\text{obs}}(0)} [\|(\mathbf{s}_{\theta}(x^{\text{obs}}(t), t)- \mathbf{s}^*(x^{\text{obs}}(t), t)) \odot \mathbf{m}\|_2^2]\big\}$.

Under the same data setting, we can also decompose the score function element-wise, i.e., $\theta^*=\underset{\theta}{\arg\min} \frac{T}{2} \mathbb{E}_t \big\{ \lambda(t)\, \mathbb{E}_{x^{\text{obs}}(0)} \mathbb{E}_{x^{\text{obs}}(t) \mid x^{\text{obs}}(0)} \big[ \sum_{i=1}^d \|(\mathbf{s}_{\theta_i}(x_i^{\text{obs}}(t), t)- \mathbf{s}^*_i(x_i^{\text{obs}}(t), t)) \odot \mathbf{m} \|_2^2\big]\big\}$. If the oracle score $s^*(x)$ is unique and the supremum of the missing rates satisfies $\rho_{\text{max}}<1$, then $\theta^*$ is unique.

Q: Have you considered how ignoring observations with missing elements impacts statistical efficiency and identifiability of the model? Could you provide analysis on this?

A: Thank you very much for your suggestion. While we did not provide a specific analysis on this aspect, we believe the impact would heavily depend on the underlying missing data mechanism and the missing ratio. The performance is likely to degrade due to the significant reduction in the size of the training data. For example, if the missing mechanism results in very few complete observations, there would be insufficient training data to effectively train the diffusion model.

Comment

Thanks to the authors for the detailed response. It has addressed some of my concerns, and I will keep my score.

Comment

We sincerely thank the reviewer for the valuable feedback. We appreciate the time you spent on the paper.

Comment

We thank the reviewer for the comments, and we appreciate the time you spent on the paper.

Q: Missing data assumptions are crucial for statistical efficiency, yet the theoretical results presented in the paper ignore these considerations. Ignoring observations with missing elements may lead to invalid inference (imputation) and loss of statistical efficiency. As a result, MissDiff might show better experimental results but cannot derive identifiability results. Could the authors clarify how the theoretical guarantees change under different missing data mechanisms (e.g., MCAR, MAR, MNAR)? It seems to me MissDiff is a general imputation framework; what is the strong assumption under the hood?

A: Thank you very much for your insightful question. In Theorem 3.2, we prove that if the supremum of the missing rates $\rho_{\text{max}}:=\max_{i=1,\ldots,d} \sup_{\mathbf{x}} \rho_i(\mathbf{x})<1$, then our objective can recover the score on complete data. This assumption is quite different from the missing data mechanisms (e.g., MCAR, MAR, MNAR). Even under MCAR, one dimension of the data could be completely missing, which violates our assumption $\rho_{\text{max}}<1$; conversely, even under MNAR, $\rho_{\text{max}}<1$ may not be too strong an assumption.

After learning the score of the complete data, we use it for imputation, following the same procedure as in [1] (page 5, last paragraph): the Bayesian inference problem $p_\theta(x^{\text{miss}}\mid x^{\text{obs}}) \propto p_\theta(x^{\text{miss}},x^{\text{obs}})$. Therefore, MissDiff requires $\rho_{\text{max}}<1$ to guarantee good performance.

Q: Results in table 4 and appendix C2 provide some analysis on Q1. But why MICE is not compared? From my experience, MICE is a pretty strong baseline for these two scenarios (MAR and NMAR).

A: Thank you very much for raising this question. We believe the experimental results in Table 1, which follow the benchmark used in previous imputation work, are convincing evidence of the effectiveness of the proposed method. To further demonstrate the effectiveness against existing diffusion-based methods, we change the missing mechanism and provide the experimental results in Appendix C2. Additionally, we include comparisons with MICE below; we believe these additional results further substantiate the competitiveness of our approach. However, Table 4 compares the utility of the different methods for generating new samples, and MICE cannot be directly applied to generate new samples; therefore, we did not add MICE to Table 4.

The effectiveness of MissDiff on imputation tasks under MAR and NMAR.

| Method | MAR | NMAR |
| --- | --- | --- |
| MICE | 0.1621 (0.002) | 0.1683 (0.002) |
| CSDI_T | 0.1205 (0.004) | 0.1274 (0.005) |
| MissDiff | 0.1053 (0.005) | 0.1092 (0.006) |
Review (Rating: 6)

In this paper, the authors propose a diffusion generative model that learns the complete score function from missing observations. The proposed method not only imputes missing components but also generates new samples from the data distribution in "one shot". The method is justified through minimizing a likelihood upper bound on the observed data. Experiments demonstrate that the proposed method outperforms state-of-the-art methods on tabular datasets.

Strengths

The proposed method is novel to the best of my knowledge.

It has nice intuition and theoretical justification (minimizing an upper bound of $\mathrm{KL}[p_0 \,\|\, p_\theta]$).

Indeed, some existing methods do have to perform two-stage inference to generate new samples. The one-shot approach is not only algorithmically simpler but also avoids biases introduced in the imputation step.

The proposed method avoids unstable adversarial training which is what many existing works (such as Yoon et al., 2018 and Li et al., 2019) rely on.

Weaknesses

This paper lacks a proper motivation for generating new samples in addition to imputing missing data. In the second paragraph in the Introduction, there is one sentence on "deep generative models can be used to... enhance the performance of image classification tasks". However, I suggest authors discuss why imputation alone is not enough, and why additional samples need to be generated and back it up with specific examples.

The authors do not explain why other generator based imputation algorithm, such as GAIN or misGAN, could not be straightforwardly extended to generate new samples while imputing missing values. Both GAIN and misGAN train a generator G which outputs a complete sample. The generator is then "tweaked" to impute missing data (similar to the tweaking authors have done in line 9, algorithm 2). If they can be straightforwardly modified to generate new images, I think authors should also compare them in the experiments or at least acknowledge them as earlier unified imputation/generation algorithms.

The idea of misGAN is also very similar to the proposed algorithm. At a high level, it tries to generate samples that look like the observed data after masking. It trains a generator by minimizing a divergence between the observed data ($p_{\text{obs}}$) and the generated samples ($p_\theta$). The KL divergence between $p_{\text{obs}}$ and $p_\theta$ is the same as the cross entropy in Theorem 3.3 up to a constant. I would appreciate discussions on the similarities and differences between the proposed method and misGAN.

  • There is another work that similarly maximizes the observed likelihood using normalizing flow: https://arxiv.org/pdf/2003.12628 which I think could also be straightforwardly extended to generating new samples.

Questions

  1. Why does the imputation work?

Algorithm 2 describes the imputation algorithm, which is similar to the reverse process described in Eq. 2 using a joint score model. However, the training process only adds noise to the observed coordinates, and the unobserved coordinates do not contribute to the loss (due to the "$\odot M$"), so the learned score model seems to be a "marginal score".

This is different from training a conditional score model then using it to sample from a conditional probability distribution. What is the justification of authors' imputation process?

Could the authors please write down:

What is the optimal solution of Equation 5? What is the corresponding forward process and backward process that give rise to the imputation algorithm?

  2. Equation (12) in Li et al., 2019 defines the imputer of misGAN. I can just set m = [0, 0, 0, ...] in (12) to generate a new sample, right?

  3. The same question as above: equations (2) and (3) in Yoon et al., 2018 define the imputer of GAIN. Can I just set m = [0, 0, 0, ...] to generate a new sample straightforwardly from GAIN?

  4. If yes to questions 2 and 3, what is the unique advantage of the proposed unified framework for imputing/generating samples compared to misGAN and GAIN?


After reading the final rebuttal from the authors, I am convinced that the imputation procedure is mathematically correct.

Ethics Concerns Details

N/A

Comment

Q: The idea of misGAN is also very similar to the proposed algorithm. At a high level, it tries to generate samples that look like the observed data after masking. It trains a generator by minimizing a divergence between the observed data ($p_{\text{obs}}$) and the generated samples ($p_\theta$). The KL divergence between $p_{\text{obs}}$ and $p_\theta$ is the same as the cross entropy in Theorem 3.3 up to a constant. I would appreciate discussions on the similarities and differences between the proposed method and misGAN.

A: Thank you for your question. At a high level, MissDiff minimizes the KL divergence between $p_{\text{obs}}$ and $p_\theta$, while misGAN minimizes the Jensen-Shannon (JS) divergence between $p_{\text{obs}}$ and $p_\theta$, which reflects the similarity between MissDiff and misGAN. At a lower level, we approximate the divergence with different modeling choices: MissDiff builds a trajectory from the noise distribution to the data distribution, so the KL divergence can be computed from the $\ell_2$ norm of the velocity field (score function) along the trajectory, while misGAN estimates the JS divergence using a domain classifier.

MissDiff builds upon diffusion models, which offer two major advantages over GAN-based generative models: (1) the ability to produce higher-quality synthetic data, as demonstrated by Dhariwal and Nichol (2021); and (2) a more stable training objective, which makes the training process significantly easier compared to GANs.

Ref: Dhariwal, Prafulla, and Alexander Nichol. "Diffusion models beat gans on image synthesis." Advances in neural information processing systems 34 (2021): 8780-8794.

Q: There is another work that similarly maximizes the observed likelihood using normalizing flow: https://arxiv.org/pdf/2003.12628 which I think could also be straightforwardly extended to generating new samples.

A: Thank you for pointing out the reference. We have added this reference to related work in lines 299-300. This work is based on normalizing flow-type generative models. We would like to briefly note that flow-based methods typically fall short in data quality compared to the GAN- and diffusion-based generative models employed in our work.

Q: Why does the imputation work? Algorithm 2 describes the imputation algorithm which is similar to the reverse process described in eq 2 using a joint score model. However, the training process only add noise to the observed coordinates, and the unobserved coordinates does not contribute to the loss (due to "odot M"), so the learned score model seems to be a "marginal score". This is different from training a conditional score model then using it to sample from a conditional probability distribution. What is the justification of authors' imputation process? Could authors please write down: What is the optimal solution of equation 5? what is the corresponding forward process and backward process that give rise to the imputation algorithm?

A: Thank you for your question. For each sample, our learning objective learns the "marginal score" for that sample's particular missingness pattern. In the population sense, our learning objective learns the "joint score", which is proven in Theorem 3.2. We rely on the assumption that the supremum of the missing rates satisfies $\rho_{\text{max}}<1$; then the optimal solution of our objective recovers the score on complete data. Therefore, by leveraging the existing literature that uses a score function estimated on complete data to impute data [1,2], we can verify the rationality of the imputation process (a schematic sketch of such an imputation loop is given after the references below).

Ref: [1]: Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Gool, L.V. (2022). RePaint: Inpainting using Denoising Diffusion Probabilistic Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11451-11461.

[2]: Rout, L., Parulekar, A., Caramanis, C., & Shakkottai, S. (2023). A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. ArXiv, abs/2302.01217.
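For intuition only, below is a schematic NumPy sketch of a RePaint-style imputation loop with a learned joint score; `score_model`, the DDPM-style noise schedule `betas`, and the step count are placeholders, and this is not claimed to be the paper's exact Algorithm 2.

```python
# Schematic RePaint-style imputation with a learned joint score (a sketch,
# not the paper's Algorithm 2): at every reverse step the observed
# coordinates are overwritten by a freshly noised copy of the observed data,
# while the missing coordinates follow the reverse diffusion.
import numpy as np

def impute(x_obs, mask, score_model, betas, rng):
    """x_obs: observed data with zeros at missing entries; mask: 1 = observed."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.normal(size=x_obs.shape)                 # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        # reverse mean for all coordinates, driven by the joint score estimate
        x = (x + betas[t] * score_model(x, t)) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=x.shape)
            # re-inject the observed coordinates at the matching noise level
            x_obs_t = (np.sqrt(alpha_bars[t - 1]) * x_obs
                       + np.sqrt(1.0 - alpha_bars[t - 1]) * rng.normal(size=x.shape))
            x = mask * x_obs_t + (1.0 - mask) * x
    return mask * x_obs + (1.0 - mask) * x           # keep observed entries exact
```

Setting `mask` to all zeros reduces the same loop to unconditional generation, which is the sense in which a single learned score can serve both tasks.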

Comment

Q: equation (12) in Li et al., 2019 defines the imputer of misGAN. I can just set m = [0, 0, 0... ] in (12) to generate new sample, right? The same question above. Equation (2) and (3) in Yoon et al., 2018 defines the imputer of GAIN. Can I just set m = [0, 0, 0, ...] to generate new sample straightforwardly from GAIN? If yes to question 2 and 3, what is the unique advantage of the proposed unified framework for imputing/generating samples comparing to misGAN and GAIN?

A: Thank you for your question. Yes, one can set m = [0, 0, ..., 0] to use misGAN and GAIN for generating new samples. (We believe there is a small difference between misGAN and GAIN: GAIN takes both $\tilde{x}$ and $m$ as input, which might make it hard to generalize, since during training the network inputs $\tilde{x}$ and $m$ will never be all zero. misGAN is more similar to our work in that it transports an initial value $z$ / $x_T$ to the output $x$ / $x_0$. The smaller the missing ratio, the smaller this discrepancy will be for misGAN and our method, while the discrepancy will be larger for GAIN.) The main advantage we claim in the paper is that our proposed method has no additional parameters to deal with missing values, while misGAN needs to train the additional networks $G_m$ and $G_x$. There are two further advantages of using diffusion models to model data distributions:

Stability in Training: Diffusion models avoid adversarial training, which is a common source of instability in GAN-based models. The absence of adversarial interactions between a generator and a discriminator simplifies the training process and often leads to more stable convergence.

Modeling Uncertainty: Diffusion models inherently capture the uncertainty of generated data due to their stochastic nature. Since the generation process is governed by a stochastic differential equation, multiple generations can naturally reflect the underlying variability of the data distribution. In contrast, GAN-based methods are deterministic, lacking the ability to directly model this uncertainty.

Comment

We thank the reviewer for the comments and suggestions. We appreciate the time you spent on the paper. Below we address the concerns and comments that you have provided.

Q: This paper lacks a proper motivation for generating new samples in addition to imputing missing data. In the second paragraph in the Introduction, there is one sentence on "deep generative models can be used to... enhance the performance of image classification tasks". However, I suggest authors discuss why imputation alone is not enough, and why additional samples need to be generated and back it up with specific examples.

A: Thank you for your insightful suggestion. In the revised manuscript, we have expanded the discussion on the importance of generating additional samples alongside imputing missing data, which can be found in lines 40-42. We also provide the motivation in the following.

Two primary motivations for generating synthetic tabular data are: (1) protecting the privacy of original tabular data by releasing fake tabular data, and (2) augmenting the original tabular data with fake data for better training machine learning models [1]. These advantages have been demonstrated in recent generative models on tabular data, such as [1-4]. However, these methods are not designed to handle missing values, a common and long-standing challenge in various domains.

In contrast, most existing research on missing data focuses on imputation tasks. Although some methods are able to generate new samples, they often incur additional computational costs, due to two-stage inference processes or the need to train additional networks, as detailed in Appendix B.1.

To address these limitations, our main objective in this work is to provide a unified framework (our learning objective is the same for imputation and generation). By modeling the score of the complete data distribution, we can handle both imputation and generation tasks efficiently, by avoiding two-stage inference problems or additional network training.

Ref: [1]: Kim, J., Lee, C.E., & Park, N. (2023). STaSy: Score-based Tabular data Synthesis. International Conference on Learning Representations.

[2]: Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling Tabular data using Conditional GAN. Neural Information Processing Systems.

[3]: Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. International Conference on Machine Learning.

[4]: Zhang, H., Zhang, J., Srinivasan, B., Shen, Z., Qin, X., Faloutsos, C., Rangwala, H., & Karypis, G. (2024). Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space. International Conference on Learning Representations.

Q: The authors do not explain why other generator based imputation algorithm, such as GAIN or misGAN, could not be straightforwardly extended to generate new samples while imputing missing values. Both GAIN and misGAN train a generator G which outputs a complete sample. The generator is then "tweaked" to impute missing data (similar to the tweaking authors have done in line 9, algorithm 2). If they can be straightforwardly modified to generate new images, I think authors should also compare them in the experiments or at least acknowledge them as earlier unified imputation/generation algorithms.

A: Thank you for your suggestion. We provide the discussion about misGAN in Appendix B.1 in the previous version (GAN-based approaches can also be used for generation tasks, while misGAN trains two generator-discriminator pairs for the masks and data respectively, which increases the computational cost). In the revised manuscript, we have expanded the discussion of GAIN in lines 922-924.

Utility (classification accuracy) evaluation of MissDiff on Census dataset.

| Missing pattern | MissDiff | Diff-delete | Diff-mean | STaSy-delete | STaSy-mean | MisGAN | GAIN | CSDI_T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Row Missing | 79.48% | - | 78.45% | - | 70.79% | 76.29% | 72.36% | 79.15% |
| Column Missing | 71.68% | 72.89% | 79.60% | 68.96% | 74.47% | 75.42% | 70.32% | 80.31% |
| Independent Missing | 79.49% | 75.39% | 75.96% | 78.36% | 77.34% | 76.12% | 73.17% | 79.12% |
Comment

Thanks for the detailed reply.

For each sample, our learning objective is to learn the "marginal score" for the particular missingness. In the population sense, our learning objective can learn the "joint score", which is proven in Theorem 3.2.

  1. I do not understand why the objective learns the marginal score and the joint score at the same time. Do you mean that, for each missing pattern, the neural network can learn a marginal score? Given $2^d$ possible missing patterns, this would require the neural network to have an incredible amount of capacity.

  2. I am still not sure why Theorem 3.2 justifies that the objective can learn the joint score? Theorem 3.2 is an upper bound of the likelihood. It is unclear how tight the bound is. Could the authors please elaborate?

Comment

We sincerely thank the reviewer for the valuable feedback. We appreciate the time you spent on the paper.

Q: I do not understand why the objective learn marginal score and the joint score at the same time. Do you mean, for each missing pattern, the neural network can learn a marginal score? Given 2^d possible missing patterns, it requires the neural network to have incredible amount of capacity.

A: Thank you for your question. The masked denoising score matching objective proposed in this paper learns the score function under different missing patterns, i.e., $\mathbb{E}_{p(\mathbf{x}^{\text{obs}}_0)} \mathbb{E}_{p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)}[\|(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}^{\text{obs}}_t, t)-\nabla_{\mathbf{x}^{\text{obs}}_t} \log p(\mathbf{x}^{\text{obs}}_t \mid \mathbf{x}^{\text{obs}}_0)) \odot \mathbf{m}\|_2^2]$. For each sample with its own missing pattern $\mathbf{m}$, the learning objective only cares about the estimated score on the observed dimensions (although the network output has the same dimension as the complete data). Because different missing patterns appear across the population, we expect the trained neural network to learn the complete score and thus generalize to unseen missing patterns in imputation and generation tasks. We believe our method requires the same capacity as the current literature that learns a conditional distribution for imputation [MisGAN, GAIN, CSDI_T].
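As a concrete reading of this objective, one training step could look like the following sketch (assumed PyTorch, a VP-style noise level, and a generic score network; this is our illustration, not the paper's implementation):

```python
# Minimal sketch of the masked denoising score-matching loss described above.
# score_net is any network mapping (x_t, t) to a d-dimensional score estimate.
import torch

def masked_dsm_loss(score_net, x0_obs, mask, alpha_bar_t, t):
    """x0_obs: data with placeholder values at missing entries (mask == 0)."""
    eps = torch.randn_like(x0_obs)
    xt_obs = (alpha_bar_t ** 0.5) * x0_obs + ((1 - alpha_bar_t) ** 0.5) * eps
    # score of the Gaussian transition kernel: -(x_t - mean) / variance
    target = -eps / ((1 - alpha_bar_t) ** 0.5)
    pred = score_net(xt_obs, t)
    # only the observed coordinates contribute to the loss
    return ((pred - target) * mask).pow(2).sum(dim=-1).mean()
```

Missing entries influence the forward noising only through their placeholder values, and their residuals are zeroed out by the mask, matching the objective written above.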

Q: I am still not sure why Theorem 3.2 justifies that the objective can learn the joint score? Theorem 3.2 is an upper bound of the likelihood. It is unclear how tight the bound is. Could the authors please elaborate?

A: Thank you for your question. Theorem 3.2 in our paper (lines 263-268 in the revised manuscript) proves that the optimal solution of our proposed objective is the complete score. In Theorem 3.3, we further verify the effectiveness of our training objective by proving that it upper bounds the negative log-likelihood of the observed data. The inequality only comes from line 833 and line 857 in the revised manuscript. The first inequality says that the KL divergence between the real data distribution and the synthetic data distribution can be upper bounded by the KL divergence between the path measures. Theorem 2 in Song et al. (2021a) provides a sufficient condition for the bound in line 833 to be tight, which essentially requires the learned score function to exactly match the true score function of the reverse-time diffusion process with initial distribution $\pi$. This represents a relatively ideal scenario that assumes perfect training and a sufficiently large capacity of the score network; in practice, when the score network is trained reasonably well, the bound in line 833 is typically not overly loose and is often close to tight. The second inequality becomes an equality if the data are missing completely at random and the missing rates of the different variables are the same (homogeneous); in that case $\mathbb{E}[\mathbf{m}]\equiv 1-\rho_{\max}$ and the inequality becomes an equality. Therefore, the bound will not be overly loose when the missing ratio is balanced across the data dimensions.

Song et al. (2021a): Song, Y., Durkan, C., Murray, I., & Ermon, S. (2021). Maximum Likelihood Training of Score-Based Diffusion Models. Neural Information Processing Systems.

Public Comment

Hi,

I recently came across your paper and found it to be very interesting. However, I noticed another paper presented at ICML 2023 titled "MissDiff: Training Diffusion Models on Tabular Data with Missing Values," which seems to have similar content.

Here is the link to the ICML 2023 paper:

https://openreview.net/pdf?id=S435pkeAdT

Could you please clarify any differences between your paper and the ICML 2023 publication?

Thank you.

AC Meta-Review

This paper proposes a diffusion-based framework—MissDiff—for simultaneously handling missing-value imputation and data generation. Reviewers found its unification of the two tasks promising and noted improvements over the state of the art on various tabular datasets. However, even after the rebuttal, they remained concerned that the paper’s theoretical foundations do not fully mesh with the proposed method in practice. Moreover, they suggested that MissDiff is similar to earlier work by Jolicoeur-Martineau et al., with the main distinction being that MissDiff’s objective is adapted for neural networks.

Additional Comments from the Reviewer Discussion

The reviewers actively engaged in the discussion, and while several issues were addressed, they still expressed concerns about the approach’s novelty and a disconnect between its theoretical framework and practical implementation.

Final Decision

Reject