PaperHub
5.0
/10
Poster5 位审稿人
最低3最高6标准差1.1
6
5
5
3
6
3.2
置信度
正确性2.8
贡献度2.6
表达2.8
NeurIPS 2024

Rethinking the Diffusion Models for Missing Data Imputation: A Gradient Flow Perspective

OpenReviewPDF
提交: 2024-05-05更新: 2024-11-06
TL;DR

We propose a novel, easy-to-implement, numerical tabular data imputation approach based on joint wasserstein gradient flow.

摘要

Diffusion models have demonstrated competitive performance in missing data imputation (MDI) task. However, directly applying diffusion models to MDI produces suboptimal performance due to two primary defects. First, the sample diversity promoted by diffusion models hinders the accurate inference of missing values. Second, data masking reduces observable indices for model training, obstructing imputation performance. To address these challenges, we introduce $\underline{\text{N}}$egative $\underline{\text{E}}$ntropy-regularized $\underline{\text{W}}$asserstein gradient flow for $\underline{\text{Imp}}$utation (NewImp), enhancing diffusion models for MDI from a gradient flow perspective. To handle the first defect, we incorporate a negative entropy regularization term into the cost functional to suppress diversity and improve accuracy. To handle the second defect, we demonstrate that the imputation procedure of NewImp, induced by the conditional distribution-related cost functional, can equivalently be replaced by that induced by the joint distribution, thereby naturally eliminating the need for data masking. Extensive experiments validate the effectiveness of our method. Code is available at [https://github.com/JustusvLiebig/NewImp](https://github.com/JustusvLiebig/NewImp).
关键词
Missing Data ImputationGradient FlowReproducing Kernel Hilbert SpaceFunctional Optimization

评审与讨论

审稿意见
6

This paper addresses two primary issues in Missing Data Imputation (MDI) using Diffusion Models (DMs): inaccurate imputation due to sample diversification and difficult training caused by the complexity of designing the mask matrix. The authors propose a novel approach, Kernelized Negative Entropy-regularized Wasserstein Gradient Flow Imputation (KnewImp), which aims to resolve these issues for numerical tabular datasets.

优点

  • The paper addresses two issues in the domain of DM-based MDI—sample diversification leading to inaccurate imputation and the complex training process due to mask matrix design. By identifying and explicitly addressing these issues, the paper provides a fresh perspective on improving MDI techniques. The introduction of the negative entropy-regularized cost functional is a creative and innovative approach to discourage diversification, aligning the generative model’s objectives more closely with the needs of MDI tasks. The integration of the WGF framework with RKHS to derive an imputation procedure is an original and elegant solution.
  • The paper provides thorough theoretical analyses and proofs, ensuring the soundness of the proposed approach. The extensive experiments conducted on multiple real-world datasets from the UCI repository underscore the robustness and effectiveness of the KnewImp approach. The detailed ablation studies and sensitivity analyses enhance the quality of the research by thoroughly examining the contributions of different components and the impact of key hyperparameters.
  • The paper is well-organized, with a clear delineation of problems, proposed solutions, theoretical foundations, experimental setup, and results. Each section logically flows into the next, making it easy to follow the authors’ arguments.

缺点

  • The choice of kernel in the Reproducing Kernel Hilbert Space (RKHS) can significantly impact the performance of the method. The paper primarily uses the radial basis function (RBF) kernel but does not explore the effects of using different types of kernels or the rationale behind choosing the RBF kernel. A detailed analysis of how different kernel choices affect the imputation quality and computational efficiency would strengthen the technical rigor of the study.
  • The theoretical foundations of KnewImp rely on certain assumptions about the data distribution, such as the smoothness of the underlying density functions. The robustness of the method to violations of these assumptions is not thoroughly investigated. Including experiments that test the method’s performance on datasets with varying statistical properties (e.g., heavy-tailed distributions, multimodal distributions) would provide a more comprehensive assessment of its robustness.

问题

  1. How robust is KnewImp to variations in data distributions, such as heavy-tailed, multimodal, or skewed distributions?
  2. Why was the radial basis function (RBF) kernel chosen for the RKHS, and how does the choice of kernel affect the performance of KnewImp?

局限性

Yes. Discussed in Appendix F.

作者回复

Thank you for your comments, our rebuttal organized point to point is posted as follows:

W1 & Q2: Why we use RBF Kernel Function:

  • The selection of the RBF kernel function was strategically driven by the need to satisfy the following condition: X(miss)[u(X(miss),τ)r(X(miss))]dX(miss)=r(X(miss))_X(miss)[u(X(miss),τ)]dX(miss)_E_r(X(miss))[_X(miss)[u(X(miss),τ)]]+u(X(miss),τ)X(miss)[r(X(miss))]dX(miss)=0\int{\nabla_{\boldsymbol{X}^{(miss)}}[u(\boldsymbol{X}^{(miss)} ,\tau) r(\boldsymbol{X}^{(miss)})]\mathrm{d}\boldsymbol{X}^{(miss)}}=\underbrace{\int{r(\boldsymbol{X}^{(miss)})\nabla\_{\boldsymbol{X}^{(miss)}}[u(\boldsymbol{X}^{(miss)} ,\tau) ]\mathrm{d}\boldsymbol{X}^{(miss)}}}\_{\mathbb{E}\_{r(\boldsymbol{X}^{(miss)})}[ \nabla\_{\boldsymbol{X}^{(miss)}}[u(\boldsymbol{X}^{(miss)} ,\tau) ] ]} + \int{u(\boldsymbol{X}^{(miss)} ,\tau)^\top\nabla_{\boldsymbol{X}^{(miss)}}[r(\boldsymbol{X}^{(miss)}) ]\mathrm{d}\boldsymbol{X}^{(miss)}} =0. This condition is pivotal as it allows us to circumvent the direct, explicit estimation of r(X(miss))r(\boldsymbol{X}^{(miss)}) during the imputation procedure.

  • A sufficient condition for X(miss)[u(X(miss),τ)r(X(miss))]dX(miss)=0\int{\nabla_{\boldsymbol{X}^{(miss)}}[u(\boldsymbol{X}^{(miss)} ,\tau) r(\boldsymbol{X}^{(miss)})]\mathrm{d}\boldsymbol{X}^{(miss)}}=0 is that "r(X(miss))r(\boldsymbol{X}^{(miss)}) is bounded and limX(miss)u(X(miss),τ)=0\lim_{\Vert \boldsymbol{X}^{(miss)} \Vert\rightarrow \infty}u(\boldsymbol{X}^{(miss)} ,\tau)=0".

  • Consequently, the key to choosing kernel function is to validate the condition as follows: For xx, kernel function K(x,x)\mathcal{K}(x,x') should satisfy the boundary condition: limxK(x,x)=0\lim_{\Vert x \Vert\rightarrow \infty}\mathcal{K}(x,x')=0.

  • Conventional kernel functions, like linear kernel, sine kernel, polynomial kernel, cosine similarity kernel... cannot satisfy this condition. Thus we choose the RBF kernel in our manuscript similar to previous work [1].

  • Furthermore, we propose the following experimental results at 0.3 missing rate (to consist with the table in main context, results for MNAR scenario are not posted due to space limit): |Scenario|Kernel|BT-MAE|BT-Wass|BCD-MAE|BCD-Wass|CC-MAE|CC-Wass|CBV-MAE|CBV-Wass|IS-MAE|IS-Wass|PK-MAE|PK-Wass|QB-MAE|QB-Wass|WQW-MAE|WQW-Wass| |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-| |MAR|linear|0.77|0.73|0.83*|3.36*|0.85*|0.77*|0.83*|0.97*|0.72*|2.42*|0.8*|2.5*|0.7*|4.58*|0.76*|1.02*| ||polynomial|1.12*|1.4*|1.44*|8.36*|1.06*|1.16*|1.08*|1.55*|1.07*|5.93*|1.33*|5.93*|1.28*|13.63*|1.02*|1.76*| ||sigmoid|0.89*|1.02*|1.44*|10.8*|0.82|0.74*|0.81*|0.93*|1.17*|12.57|1.3*|9.87*|1.04*|7.42*|0.77*|1.04*| ||cosine similarity|0.68|0.53|0.8*|3.14*|0.82*|0.73*|0.81*|0.92*|0.71*|2.37*|0.78*|2.37*|0.74|4.56|0.74*|0.98*| ||sine|8.62*|95.45|8.74*|290.79*|14.83*|281.68*|17.93*|492.97*|10.56*|456.27*|10.25*|306.76*|8*|313.9*|15.14*|386.73*| ||RBF|0.52|0.38|0.34|0.82|0.35|0.25|0.31|0.2|0.39|1.31|0.44|1.21|0.45|3.5|0.46|0.55| |MCAR|linear|0.71*|0.45*|0.87*|5.87*|0.83*|0.84*|0.81*|1.28*|0.81*|5.65*|0.83*|4.14*|0.62*|6.31*|0.76*|1.25*| ||polynomial|0.94*|0.88*|1.31*|12.19*|0.98*|1.21*|0.99*|1.85*|1.11*|9.94*|1.27*|8.64*|1.11*|18.71*|0.92*|1.95*| ||sigmoid|0.74*|0.48*|0.97*|8*|0.81*|0.82*|0.8*|1.23*|0.92*|7.36*|0.94*|6.54*|0.77*|7.74*|0.77*|1.28*| ||cosine similarity|0.7*|0.42*|0.84*|5.51*|0.81*|0.81*|0.8*|1.24*|0.8*|5.48*|0.81*|3.96*|0.63*|6.01*|0.74*|1.2*| ||sine|9.76*|95*|7.77*|434.72*|13.53*|331.59*|11.78*|352.67*|8.27*|542.27*|8.62*|384.97*|7.36*|468.07*|10.61*|289.41*| ||RBF|0.48|0.18|0.25|0.8|0.47|0.34|0.42|0.44|0.44|3.05|0.32|1.01|0.34|3.66|0.53|0.76|

    From the results presented in the table, it is evident that the RBF kernel outperforms other kernels in scenarios where the boundary condition is not met. This observation empirically substantiates the effectiveness of the RBF kernel, thereby validating the justification for using RBF kernel.

  • We decide to add the abovementioned contents to explain why we mainly consider RBF kernel in our revised manuscript.

W2 & Q1: Data Distribution Property

Thank you for your comments, our responses are listed as follows:

  • We have expanded the experimental results in the attached PDF to include KnewImp performance across various data distributions, such as heavy-tailed, multimodal, and skewed distributions. Please refer to Section 2.2 in the supplementary PDF provided in the common rebuttal chat window.
  • The additional results demonstrate that KnewImp maintains consistent performance when transitioning from Gaussian to Skewed-Gaussian, Student's-t, and Gaussian Mixture distributions. This underscores the robustness of KnewImp across diverse data distributions.
  • We will incorporate these findings into the revised manuscript to enhance its comprehensiveness and clarity.

Thank you for reading our response, we hope the above discussion fully addressed your concerns about our work, and we would really appreciate it if you could be generous in raising your score.


Refs
[1]. "Stein variational gradient descent: A general purpose bayesian inference algorithm." NeurIPS'16.

评论

Thank you for your detailed response. I have read through most of the comments by other reviewers, as well as the rebuttal. I decide to increase my score to support this paper to be accepted.

评论

Dear Reviewer [9XXU],

Thank you very much for your supportive feedback and for taking the time to thoroughly review both our manuscript and the additional materials provided during the rebuttal process. We are grateful for your decision to increase your score and appreciate your endorsement of our paper.

Warm regards,
Authors

审稿意见
5

This paper presents KnewImp, a kernelized negative entropy-regularized Wasserstein gradient flow imputation approach to numerical tabular data imputation. The authors argue that existing missing data imputation frameworks based on diffusion models suffer from two major limitations. Firstly, diffusion models primarily focus on sample diversification rather than accuracy, which results in discrepancy between the training objective of the diffusion models and the aim of tabular data imputation. Secondly, existing approaches are trained by masking parts of the observed data and then predicting the masked entries. This results in training difficulty due to the need of designing complex mask matrix selection mechanisms. To address the limitations of existing models, the authors propose a Wasserstein gradient flow based framework, which employs a novel cost functional with diversification-discouraging negative entropy as regularization. KnewImp is derived within the Wasserstein gradient flow framework, reproducing the kernel Hilbert space. To bypass the need of the mask matrix and to make the model easier to train, the authors further develop a novel cost functional based on joint distribution. Experimental results on a variety of real-world datasets show that KnewImp achieves state-of-the-art, outperforming a number of state-of-the-art baseline alternatives.

优点

  • KnewImp is theoretically sound and achieves state-of-the-art results on a number of real-world datasets.

  • The paper is in general well-written and easy-to-follow.

缺点

  • My main concern is that, the authors only presented results in terms of MAE and Wass, but did not show results on downstream tasks. It is therefore unclear whether KnewImp would be effective in real-world scenarios. I would consider raising my score if the authors can provide the result of KnewImp on downstream tasks. The authors can refer to Figure 5 of the TDM paper in terms of downstream results.

  • From the ablation study, I think the majority of the performance improvement comes from modeling the joint distribution directly. The diversification-discouraging negative entropy regularization only marginally improves the performance. The main novelty of KnewImp is from the NER objective. However, ablation study suggests that it does not add much to the final performance.

问题

  • In line 150, the authors claim that "This diversification is fundamentally at odds with the precision required in MDI tasks." I understand that the need of discouraging diversification in favor of precision is the main motivation of KnewImp. However, I wonder if the authors can provide a more intuition explanation about this claim, or if the discrepancy between accuracy and diversification can be somehow quantified.

  • Why did the authors primarily focus on MAR and MCAR rather than MNAR? Is this because of the setting of KnewImp is not particularly suitable for MNAR?

  • In Table 2, not all results in bold are best results.

局限性

The authors have adequately addressed the limitations and potential negative societal impact of their work, and have provided sufficient justification.

作者回复

Thank you for your comments, our rebuttal is posted as follows:

W1: Downstream Tasks

According to your suggestions, we added the performance on downstream classification task similar to Fig. 5 in the TDM paper [1]. Please see Table 1 located in Section 2.1 from the pdf attached in common rebuttal chat window.

W2: Ablation Study

Thank you for your feedback on our ablation study.

  • There seems to be some misunderstanding about the contributions of KnewImp. Our method primarily introduces WGF to analyze and enhance DM-based MDI. This introduces two main innovations: NER and Joint Distribution Modeling.
  • The function of NER term is that it introduces a functionally effective modification to the existing DM -based MD—Optimizing the NER functional aligns well with optimizing the joint/conditional log-likelihood objective.
  • Based on this, the ablation study aimed to show that including NER maintains an effective lower bound for MDI tasks—meaning it does not degrade model performance in general.
  • Besides, we can also turn to Table E.4, when we add NER term into the model, the standard deviation of KnewImp with NER is generally smaller than those without NER term. These results further confirm that NER, although providing marginal performance gains, while enabling the KnewImp has a smaller standard deviation, is crucial for the model's theoretical robustness.

Q1: Motivation

  • To clarify the trade-off between diversification and accuracy, which is a central theme of our method KnewImp, let's consider a practical example. Suppose we denote the true value by xx and the imputed value by x~ \tilde{x}. The goal in terms of accuracy is to minimize the discrepancy Dis(x,x~)Dis(x, \tilde{x}), where DisDis is discrepancy metric.
  • Diversification tends to increase either the variance Var(x~)\text{Var}(\tilde{x}) or the entropy H(x~)\mathbb{H}(\tilde{x}). These measures do not directly involve ground truth xx.
  • In DMs, where entropy is used as a term to encourage diversification (our proposition 3.1), the outcome may converge towards a uniform distribution [2]. In such a distribution, every potential value within the support is equally probable as the imputed value, which is undesirable for MDI.

Q2: Scenario Restriction

Thank you for your question regarding our focus on MAR and MCAR rather than MNAR.

  • We prioritize MAR and MCAR because MNAR involves complexities due to its dependency on unobserved data, requiring in-depth knowledge of the missingness mechanism which is often challenging to determine [3,4].
  • For instance, in privacy-sensitive studies like diabetes medication usage, the non-response is directly linked to the privacy nature of the participants, typical of MNAR scenarios.
  • Given these challenges, we focus on MAR and MCAR for their more straightforward assumptions and applicability.
  • Nevertheless, to ensure a comprehensive analysis, we include findings related to MNAR in Tables E.1 and E.3 of our manuscript, demonstrating our approach's applicability under these conditions as well.

Q3: Marking Mistake

Thank you for pointing out our problems, we will revise this table in our revised manuscript.


Thank you for reading our rebuttal, we hope our response alleviates your problem and we would really appreciate it if you could be generous in raising your score.


Refs
[1]. "Transformed distribution matching for missing value imputation." ICML'23.
[2]. "Nonlinear Stein Variational Gradient Descent for Learning Diversified Mixture Models" ICML'19
[3]. "StableDR: Stabilized doubly robust learning for recommendation on data missing not at random." ICLR'23
[4]. "not-MIWAE: Deep generative modelling with missing not at random data" ICLR'21

评论

I would like to thank the authors for the detailed response and for providing additional experiment results. The downstream classification results suggest that the imputations generated by KnewImp can potentially lead to better downstream performance. I have therefore raised my score accordingly.

I am also curious if the authors have tried applying KnewImp over downstream regression tasks? In addition, what is its performance on real-world datasets with missing values, rather than datasets that you synthesize missing values on your own? Although the authors have experimented with three different scenarios, I think there still exists discrepancy between real-world datasets with inherent missing values.

评论

Dear Reviewer [fS2D]:

Thank you for your encouraging feedback and for raising your score, which greatly supports our work.

Regarding your additional inquiries:

  • Application to Regression Tasks:

    • Currently, our focus has been primarily on classification tasks, as suggested by the framework outlined in Figure 5 of the TDM paper [1], primarily due to our time and resource constraints.
    • However, we recognize the importance of exploring the utility of KnewImp in regression scenarios. We aim to include preliminary results on this aspect in another comment chat window within the discussion period set by the NeurIPS'24 committee as much as possible.
    • Additionally, we will ensure to incorporate a detailed evaluation on downstream regression tasks in the revised version of our manuscript.
  • Performance on Real-World Datasets with Inherent Missing Values:

    • Currently, due to platform constraints, we have not yet implemented our algorithm in real-world industrial settings, such as recommender systems [2], where various metrics foucus on business value, area under the curve, return on investment [3], are typically employed.
    • However, we are actively seeking opportunities to apply our methodology in industrial scenarios and plan to explore this in future work.
    • We will outline these plans and the potential for real-world applications in the future research directions section of our revised manuscript.

We appreciate your insightful questions, which guide our ongoing and future research efforts.


Refs
[1]. "Transformed distribution matching for missing value imputation." ICML'23.
[2]. "StableDR: Stabilized doubly robust learning for recommendation on data missing not at random." ICLR'23
[3]. "Kalman Filtering Attention for User Behavior Modeling in CTR Prediction" NeurIPS'20


Sincerely,
Authors

评论

Dear Reviewer [fS2D]:

Thank you for your insightful suggestions regarding the downstream regression task. In response to your feedback, we have conducted additional experiments focusing on downstream regression task. We utilized Mean Square Error (MSE) and Mean Absolute Error (MAE) as our evaluation metrics according to references [1] on the CC dataset, specifically designed for regression analyses. The results for missing rate at 0.3 are posted as follows:

Dataset-ScenarioCC-MARCC-MARCC-MCARCC-MCARCC-MNARCC-MNAR
Model/MetricMSEMAEMSEMAEMSEMAE
CSDI_T3.07E+02*1.41E+01*3.07E+02*1.41E+01*3.07E+02*1.41E+01*
MissDiff2.31E+02*1.25E+01*2.44E+02*1.27E+01*2.45E+02*1.27E+01*
GAIN2.24E+021.23E+012.40E+02*1.26E+01*2.38E+021.68E+01*
MIRACLE3.51E+02*1.59E+01*3.71E+02*1.60E+01*3.80E+02*1.64E+01*
MIWAE2.23E+02*1.23E+012.43E+02*1.27E+01*2.42E+021.27E+01
Sink3.48E+02*1.58E+01*3.85E+02*1.65E+01*3.87E+02*1.66E+01*
TDM2.19E+021.58E+012.38E+02*1.26E+01*2.37E+021.66E+01
ReMasker3.44E+02*1.58E+01*3.69E+02*1.62E+01*3.72E+02*1.64E+01*
KnewImp2.20E+021.22E+012.33E+021.24E+012.37E+021.26E+01
Ground Truth1.57E+021.01E+011.55E+029.83E+001.68E+021.04E+01

Refs:
[1]. "Deep Time Series Models: A Comprehensive Survey and Benchmark"


Thank you for reading our comments, we hope our response answer your problem. Given your busy schedule, please do not feel obliged to respond to this message.

Sincerely,

Authors

评论

Dear Reviewer fS2D,

​​Once again, we are grateful for your time and effort for reviewing our paper!

Since the discussion period will end in a few hours, we are very eager to get your feedback on our response. We understand that you are very busy, but we would highly appreciate it if you could take into account our response when updating your final rating and having a discussion with AC and other reviewers.

Thanks for your time,

Authors of Submission 1850

审稿意见
5

This paper considered tackling the Missing Data Imputation (MDI) problem via diffusion models, which treats MDI as an generative problem. As DM-based methods focus on sample diversification rather than accuracy, which is the primary evaluation metric for MDI, the authors proposed one cost functional to discourage diversification in sample generation based on the Wasserstein Gradient Flow framework. Moreover, given that the true values of the missing data are unknown, the authors proposed to replace the joint distribution with the conditional distribution throughout the learning procedure.

优点

This paper focuses on two important questions faced by the diffusion model based solver of the missing data imputation problem. Extensive numerical experiments are provided to validate the effectiveness of the proposed methodology.

缺点

The main issue here is that incorporating Wasserstein Gradient Flow (WGF) with generative modeling doesn't seem to be a new idea. However, it seems that the authors didn't include any related references in section 2.3, which provides a brief review of WGF. For instance, it might be necessary to cite and discuss the following articles [1-6].

问题

It seems to the reviewer that the authors have made a strong assumption throughout this paper. Specifically, the assumption in Proposition 3.4 says that the joint distribution r(Xjoint)r(X^{\text{joint}}) can be factorized as r(X(miss))r(Xobs)r(X^{(\text{miss})})r(X^{\text{obs}}), where rr denotes the probability density function. Would it be possible for the authors to discuss whether such assumption is realistic or not? Intuitively, it seems that the missing and observed entries can't be utterly uncorrelated.

局限性

It seems to the reviewer that there are too many typos in the current version of this paper, especially for the theoretical derivation part in the appendix. For instance, derivation of the equations from line 583 to line 584 seem to have lots of issues - the first step should contain some dot product?

References:

[1] Ansari, A. F., Ang, M. L., & Soh, H. (2020). Refining deep generative models via discriminator gradient flow. arXiv preprint arXiv:2012.00780.

[2] Cheng, X., Lu, J., Tan, Y., & Xie, Y. (2024). Convergence of flow-based generative models via proximal gradient descent in Wasserstein space. IEEE Transactions on Information Theory.

[3] Choi, J., Choi, J., & Kang, M. (2024). Scalable Wasserstein Gradient Flow for Generative Modeling through Unbalanced Optimal Transport. arXiv preprint arXiv:2402.05443.

[4] Gao, Y., Jiao, Y., Wang, Y., Wang, Y., Yang, C., & Zhang, S. (2019, May). Deep generative learning via variational gradient flow. In International Conference on Machine Learning (pp. 2093-2101). PMLR.

[5] Heng, A., Ansari, A. F., & Soh, H. Deep generative Wasserstein gradient flows.

[6] Xu, C., Cheng, X., & Xie, Y. (2024). Normalizing flow neural networks by JKO scheme. Advances in Neural Information Processing Systems, 36.

作者回复

Thank you for your comments.

W1: Contributions & Novelty of This Paper

  • Our primary focus is on analyzing Diffusion Model (DM)-based Missing Data Imputation (MDI) using Wasserstein Gradient Flow (WGF, initially designed for functional optimization), not merely integrating WGF into a generative model.
  • The central contribution of this work is leveraging WGF as a tool to analyze and demonstrate the limitations of Diffusion Model-based MDI. This led us to redesign a novel functional and develop a unique computational approach for MDI.
  • To the best of our knowledge, Proposition 3.1 has not been documented in previous literature. Specifically, for the VP-SDE model, the functional of the Ornstein–Uhlenbeck process incorporates a variance term Er[X(miss)][X(miss)]\mathbb{E}_{r}[\boldsymbol{X}^{(miss)}]^\top[\boldsymbol{X}^{(miss)}] to promote diversification. This feature is notably absent in traditional gradient field-based (GF-based) generative models as referenced in [1-6].
  • In our manuscript, we want to convey the concept that: MDI should not be treated as a generative problem (in our manuscript, we want to discourage diversification, but the generative models will encourage entropy function as per reference [1]: The differential entropy term H(r)\mathbb{H}(r) improves diversity and expressiveness when the gradient flow is simulated for finite time-steps. Similar entropy terms can be found in Eq. (15) in [1], Eq. (16) in [2], Eq. (3) in [3], Eq (1) in [4], Eq. (4) in [5], and Eq. (5) in [6]).
  • Hence the references [1-6] you recommended about using GF to improve generative models were not considered during the preparation of our manuscript, and our related works subsections located in Section 5 are chiefly organized from the perspective of DM's application in MDI and the application of WGF works which model the conditional distribution by joint distribution.
  • We plan to add another subsection to discuss the application of WGF in improving generative models to demonstrate the applicability of GF to improve generative model-related tasks and cite these references.

W2: Justification of Decomposition:

  • As for the decomposition, it is not a strong assumption for MDI.
  • As we mentioned in our manuscript and common rebuttal chat window, our task is to find the missing value X(miss)\boldsymbol{X}^{{(miss)}} and the X(obs)\boldsymbol{X}^{{(obs)}} will not be changed (p(X(obs))p(\boldsymbol{X}^{{(obs)}}) is a constant measure).
  • Based on this, when we want to sample X(miss)\boldsymbol{X}^{{(miss)}} from distribution r(X(miss))r({\boldsymbol{X}^{{(miss)}}}), the results are the same as sampling from r(X(joint))r({\boldsymbol{X}^{{(joint)}}}) since r(X(obs))=p(X(obs))r(\boldsymbol{X}^{{(obs)}})=p(\boldsymbol{X}^{{(obs)}}) is a constant measure according to reference [7,8], which is unchanged.
  • The key is that p(X(joint))p(X(miss))p(X(obs))p({\boldsymbol{X}^{{(joint)}}})\neq p({\boldsymbol{X}^{{(miss)}}}) p({\boldsymbol{X}^{{(obs)}}}), but r(X(joint))=r(X(miss))p(X(obs))r({\boldsymbol{X}^{{(joint)}}})= r({\boldsymbol{X}^{{(miss)}}}) p({\boldsymbol{X}^{{(obs)}}}) is justified.
  • Furthermore, similar assumption on the rr, the 'ansatz'/variational distribution/approximate distribution/proposal distribution, can be found in mean-filed variational inference represented by reference [9].

L1: Typos on Eq. (B.2):

We regret any oversight regarding the omission of the \nabla\cdot operator (divergence operator), and velocity term vτv_\tau in our description of the continuity equation.

  • This equation is given as r(X(miss))τ=[vτ(X(miss))r(X(miss))]\frac{\partial r(\boldsymbol{X}^{{(miss)}})}{\partial \tau}=-\nabla\cdot[v_{\tau}(\boldsymbol{X}^{{(miss)}})r(\boldsymbol{X}^{{(miss)}})], the second equality can be obtained similar to the derivation of Fokker-Planck-Kolmogorov equation, where vτ(X(miss))=logp(X(miss)X(obs))λlogr(X(miss))v_{\tau}(\boldsymbol{X}^{{(miss)}})=-\nabla{\log{p(\boldsymbol{X}^{{(miss)}}\vert \boldsymbol{X}^{{(obs)}})}}-\lambda\nabla{\log{r(\boldsymbol{X}^{{(miss)}})}} and [(logr(X(miss)))r(X(miss))]=[r(X(miss)))r(X(miss))r(X(miss)]=r(X(miss))\nabla\cdot[(\nabla{\log{r(\boldsymbol{X}^{{(miss)}})}})r(\boldsymbol{X}^{{(miss)}})] =\nabla\cdot[\frac{\nabla {{r(\boldsymbol{X}^{{(miss)}})}})}{r(\boldsymbol{X}^{{(miss)}})}r(\boldsymbol{X}^{{(miss)}}] =\nabla\cdot\nabla r(\boldsymbol{X}^{{(miss)}}).
  • We commit to revising our manuscript to rectify any ambiguities and ensure clearer presentation.

Thank you for reading our rebuttal. Given the above infos, we hope that these points could be kindly considered in the evaluation of our work, and we would really appreciate it if you could be generous in raising your score.


Refs
[1]. "Refining deep generative models via discriminator gradient flow." ICLR'21
[2]. "Convergence of flow-based generative models via proximal gradient descent in Wasserstein space." IEEE TIT
[3]. "Scalable Wasserstein Gradient Flow for Generative Modeling through Unbalanced Optimal Transport." ICML'24
[4]. "Deep generative learning via variational gradient flow". ICML'19
[5]. "Deep generative Wasserstein gradient flows"
[6]. "Normalizing flow neural networks by JKO scheme" NeurIPS'23
[7]. "Posterior Sampling Based on Gradient Flows of the MMD with Negative Distance Kernel." ICLR'24.
[8]. "Nonparametric generative modeling with conditional sliced-Wasserstein flows." ICML'23.
[9]. "Variational algorithms for approximate Bayesian inference", Doctoral Thesis'03.

评论

Dear Reviewer [833r]:
Thank you for your response and further inquiries. Here is a refined version of our rebuttal addressing the concerns raised:

  1. Mean-Field Assumption: The mean-field approximation described on Page 52 of reference [1], titled The Mean Field Approximation, involves conditionally independent factorization of the approximation distribution, updated iteratively. This approach validates our factorization of rr, aligning with established variational methods where each factor is updated independently. This operation has also been included in the text book [2] in Page 464 to 465, for variational inference. (again, r(X(obs))r(\boldsymbol{X}^{(obs)}) is a constant, since X(obs)\boldsymbol{X}^{(obs)} remains unchange)

  2. Modeling Strategy Justification: We have to emphasize that, we have not used posterior distribution in our manuscript. Thus, we think you may doubt that how we can simulate the conditional distribution modeling by joint distribution modeling about distribution pp (not rr):

    • Reference [3] provides empirical validation in Appendix C of this strategy.
    • Discrepancy measurements between conditional and joint distributions are thoroughly examined in reference [4] including ff-divergence, where supplementary material includes detailed derivations supporting our approach.
    • Reference [5] further substantiates that our modeling strategy of conditional-by-joint can be effectively applied within the WGF framework, as detailed in Remark 7 and Theorem 11 of the paper.
    • We elaborate on these validations in Section 5.2 of our manuscript (Lines 319 to 320) and provide detailed proofs on Lines 658 to 688. More specifically, our approach utilizes a discrepancy metric akin to Kullback-Leibler (KL) divergence, expressed as r(x)logr(x)p(x)dx-\int r(x) \frac{\log r(x)}{p(x)} \mathrm{d}x. Unlike the traditional use of KL divergence which incorporates diversification-encouraging positive entropy H[r(x)]\mathbb{H}[r(x)], our study employs diversification-discouraging negative entropy H[r(x)]-\mathbb{H}[r(x)]. We think our manuscript further extend this modeling strategy in a way theoretically.
  3. Extra Example why rr's factorization is justified: To further clarify, let's consider an example from SVGD, which may help elucidate why, in WGF-based approaches that approximate p(zx)p(z|x) with q(z)q(z) (represented by rr in our study), the process does not explicitly involve the input xx or the posterior p(zx)p(z|x).

    • Refer to Figure 4 in reference [6] for an illustrative example (it can approximate the ground truth posterior without explicitly computing posterior). In SVGD, q(z)q(z) is represented as a group of particles. These particles are not directly influenced by the input xx or the explicit form of the posterior p(zx)p(z|x); instead, they are guided by the velocity field determined by the evidence p(xz)p(x|z) and the prior p(z)p(z), with p(x)p(x) effectively being 0 under the gradient operator.
    • This setup allows q(z)q(z) to approximate p(zx)p(z|x) effectively with the help of velocity filed vv. Similarly, in our case with rr, there is no necessity to consider it as an input of X(obs)\boldsymbol{X}^{(obs)}.
    • Instead, rr is guided by the velocity field determined by p(X(miss),X(obs))p(\boldsymbol{X}^{(miss)}, \boldsymbol{X}^{(obs)}), which inherently includes observation information (recall rτ=(rv)\frac{\partial r}{\partial \tau}=-\nabla\cdot(rv), where r(X(miss))r(\boldsymbol{X}^{(miss)}) is shaped by velocity filed vv, r(X(obs))r(\boldsymbol{X}^{(obs)}) remains unchange, the velocity term for r(X(obs))r(\boldsymbol{X}^{(obs)}) is zero. Notably, and v=X(miss)δF_jointNERδr(X(miss))v=-\nabla_{\boldsymbol{X}^{(miss)}}\frac{\delta \mathcal{F}\_{joint-NER}}{\delta r(\boldsymbol{X}^{(miss)})}, the term F_jointNER \mathcal{F}\_{joint-NER} contains information about X(obs)\boldsymbol{X}^{(obs)}).

References:
[1] "Variational algorithms for approximate Bayesian inference", Doctoral Thesis '03.
[2] "Pattern Recognition and Machine Learning", Text Book'06
[3] "Nonparametric generative modeling with conditional sliced-Wasserstein flows", ICML '23.
[4] "Conditional Wasserstein Generator", IEEE TPAMI '23.
[5] "Posterior sampling based on gradient flows of the MMD with negative distance kernel", ICLR '24.
[6] "VAE learning via Stein variational gradient descent", NeurIPS'17.

We appreciate your detailed feedback and look forward to further discussions.

Best regards, Authors

评论

Dear authors,

Thank you so much for your detailed rebuttal and global response. However, regarding the assumption on the distribution r, the reviewer still finds the explanation offered by the authors to be a bit untransparent. Would it be possible for the authors to write the derivations in terms of conditional probability and posterior distributions? Alternatively, would it be fine for the authors to specify which lemmas/theorems in the papers [1,2,3] lead to the desired claims?

Thanks in advance!

References:

[1] Hagemann, P., Hertrich, J., Altekrüger, F., Beinert, R., Chemseddine, J., & Steidl, G. (2023). Posterior sampling based on gradient flows of the MMD with negative distance kernel. arXiv preprint arXiv:2310.03054.

[2] Du, C., Li, T., Pang, T., Yan, S., & Lin, M. (2023). Nonparametric generative modeling with conditional sliced-Wasserstein flows. arXiv preprint arXiv:2305.02164.

[3] Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. University of London, University College London (United Kingdom).

Best regards,

Reviewer 833r

评论

Before reading our response, we should come to following agreements, which are the settings of MDI task:

  • Throughout the imputation procedure, X(obs)\boldsymbol{X}^{({obs})} remains invariant regardless of any modifications to X(miss)\boldsymbol{X}^{(miss)}.
  • Given this invariance, it is accurate to state that r(X(obs))r(\boldsymbol{X}^{({obs})}) is constant, and consequently, r(X(obs)X(miss))=r(X(obs))r(\boldsymbol{X}^{(obs)}|\boldsymbol{X}^{(miss)}) = r(\boldsymbol{X}^{(obs)}), reflecting the independence of X(obs)\boldsymbol{X}^{(obs)} from X(miss)\boldsymbol{X}^{(miss)}.

Based on this, according to the requirement of Reviewers [833r] and [EhUg], we should factorize the rr as r(X(joint))=r(X(obs),X(miss))=r(X(miss)X(obs))r(X(obs))r(\boldsymbol{X}^{(joint)})=r(\boldsymbol{X}^{(obs)},\boldsymbol{X}^{(miss)})=r(\boldsymbol{X}^{(miss)}|\boldsymbol{X}^{(obs)}) r(\boldsymbol{X}^{(obs)}). Now let's analyze the left-hand-side of continuity equation: r(X(obs),X(miss))τ\frac{\partial r(\boldsymbol{X}^{(obs)},\boldsymbol{X}^{(miss)})}{\partial \tau}.

  • First, we can get: r(X(obs),X(miss))τ=r(X(miss)X(obs))r(X(obs))τ\frac{\partial r(\boldsymbol{X}^{(obs)},\boldsymbol{X}^{(miss)})}{\partial \tau} =\frac{\partial r(\boldsymbol{X}^{(miss)}|\boldsymbol{X}^{(obs)}) r(\boldsymbol{X}^{(obs)}) }{\partial \tau} ,
  • Then, expand the product on the right-hand-side: r(X(obs))r(X(miss)X(obs))τ_r(X(miss)X(obs))=r(X(obs)X(miss))r(X(miss))r(X(obs))+r(X(miss)X(obs))r(X(obs))τ_0\underbrace{r(\boldsymbol{X}^{(obs)})\frac{\partial r(\boldsymbol{X}^{(miss)}|\boldsymbol{X}^{(obs)}) }{\partial \tau} }\_{ r(\boldsymbol{X}^{(miss)}|\boldsymbol{X}^{(obs)}) = \frac{r(\boldsymbol{X}^{(obs)}|\boldsymbol{X}^{(miss)})r(\boldsymbol{X}^{(miss)})}{r(\boldsymbol{X}^{(obs)})} }+ \underbrace{r(\boldsymbol{X}^{(miss)}|\boldsymbol{X}^{(obs)}) \frac{\partial r(\boldsymbol{X}^{(obs)}) }{\partial \tau} }\_{0} , where the first underbrace is the Bayesian formula, the second underbrace indicates that r(X(obs))r(\boldsymbol{X}^{(obs)}) remains unchanged.
  • Now expand the first underbrace, we get r(X(obs))r(X(obs))r(X(obs)X(miss))r(X(miss))τr(X(obs)X(miss))=r(X(obs))\underbrace{\frac{r(\boldsymbol{X}^{(obs)})}{r(\boldsymbol{X}^{(obs)})}\frac{\partial r(\boldsymbol{X}^{(obs)}|\boldsymbol{X}^{(miss)})r(\boldsymbol{X}^{(miss)}) }{\partial \tau}}_{r(\boldsymbol{X}^{(obs)}|\boldsymbol{X}^{(miss)}) = r(\boldsymbol{X}^{(obs)}) } . The first underbrace is based on the abovementioned agreement.
  • Finally, we get: r(X(obs))r(X(miss))τ r(\boldsymbol{X}^{(obs)}) \frac{\partial r(\boldsymbol{X}^{(miss)})}{\partial \tau}. i.e. r(X(obs),X(miss))τ=r(X(obs))r(X(miss))τ \frac{\partial r(\boldsymbol{X}^{(obs)},\boldsymbol{X}^{(miss)})}{\partial \tau} = r(\boldsymbol{X}^{(obs)})\frac{\partial r(\boldsymbol{X}^{(miss)})}{\partial \tau}. Notably, the factorization r(X(obs),X(miss))=r(X(miss))r(X(obs)) r(\boldsymbol{X}^{(obs)},\boldsymbol{X}^{(miss)})=r(\boldsymbol{X}^{(miss)})r(\boldsymbol{X}^{(obs)}) can also have the same result given that r(X(obs))r(\boldsymbol{X}^{(obs)}) is a constant according to the agreement.

In summary, we can see that within WGF, the operation is r(X(obs),X(miss))=r(X(miss))r(X(obs)) r(\boldsymbol{X}^{(obs)},\boldsymbol{X}^{(miss)})=r(\boldsymbol{X}^{(miss)})r(\boldsymbol{X}^{(obs)}) is justified for MDI task. The influence of X(obs)\boldsymbol{X}^{(obs)} is hidden in the velocity filed vv, where the continuity equation: r(X(miss))τ=[r(X(miss))v]\frac{\partial r(\boldsymbol{X}^{(miss)})}{\partial \tau}=-\nabla\cdot[ r(\boldsymbol{X}^{(miss)})v] shapes the "actor" r(X(miss))r(\boldsymbol{X}^{(miss)})'s performance by the "comments" vv (the velocity field), given by the "critic" p(X(miss)X(obs))p(\boldsymbol{X}^{(miss)}|\boldsymbol{X}^{(obs)}).

We plan to add the abovementioned derivation in our revised manuscript to increase uphold the rigor of our manuscript.


We hope the above discussion will fully address your concerns about our work, and we would really appreciate it if these responses could meet with your approval. We look forward to your insightful and constructive responses to improve this work. Thank you very much!

评论

Dear Reviewer [833r]:

Thank you for your constructive comments, which have significantly contributed to enhancing our manuscript. In response to your concerns, we would like to outline and summary how we have addressed your problem each point in detailed:

Universality of the Mean-Field Assumption on rr's Decomposition would it be fine for the authors to specify which lemmas/theorems in the papers [1,2,3] lead to the desired claims?

  • Following your query, we have referenced specific pages in the literature to demonstrate how this assumption is commonly applied in general mean-field variational inference approaches.
  • In light of your request, we have provided detailed explanations in the specific pages on the evolution of the strategy that model the joint distribution using conditional distribution, where they treat the velocity filed of the observation part "vanishes", which supports the justification of this assumption within WGF framework.

Theoretical Justification of rr's Decomposition within WGF framework Would it be possible for the authors to write the derivations in terms of conditional probability and posterior distributions?

  • We have included a comprehensive, step-by-step derivation to substantiate the mean-field assumption of rr within the WGF framework, as prompted by your insightful query.

Finally, we would like to conclude with a metaphor to further illustrate the plausibility of this factorization r(X(joint))=r(X(miss))r(X(obs))r(\boldsymbol{X}^{(joint)})=r(\boldsymbol{X}^{(miss)})r(\boldsymbol{X}^{(obs)}):

  • Consider rr as an actor in a play, capable of being molded and shaped. Initially, the actor may not fully embody the role, akin to r(X(miss))r(\boldsymbol{X}^{(miss)}) not containing information about X(obs)\boldsymbol{X}^{(obs)}.
  • However, just as a director shapes an actor's performance through guidance and rehearsal, all we need to do is ensure that rr is appropriately molded by the directorial guidance (mirrors the continuity equation rτ=(vr)\frac{\partial r}{\partial \tau}=-\nabla\cdot(vr)) of the velocity field vv and the script provided by the critic p(X(obs)X(miss))p(\boldsymbol{X}^{(obs)} | \boldsymbol{X}^{(miss)})/p(X(obs),X(miss))p(\boldsymbol{X}^{(obs)} , \boldsymbol{X}^{(miss)}).
  • As long as rr can adapt based on this feedback (akin to the WGF framework), it can overcome the limitations of its initial portrayal (akin to r(X(joint))=r(X(miss))r(X(obs))r(\boldsymbol{X}^{(joint)})=r(\boldsymbol{X}^{(miss)})r(\boldsymbol{X}^{(obs)})).

We remain committed to providing rigorous revisions following your suggestions irrespective of what decision you make. Given your busy schedule, please do not feel obliged to respond to this message.

Warm regards,
Authors

评论

Dear authors,

Thank you for the detailed response. I will take a further look.

Best regards,

Reviewer 833r

评论

Dear Reviewer [833r],

Thank you very much for your encouraging feedback and for acknowledging the clarifications provided in our response. We are grateful for your decision to increase the score and support the publication of our paper.

We also appreciate your attention to detail and your advice regarding the typos in the appendix. We assure you that we will rigorously review the manuscript again to correct all typographical errors and enhance its readability.

Thank you once again for your insightful contributions to the refinement of our work.

Best regards,
Authors

评论

I would like to thank the authors for their detailed response, which have resolved almost all questions. I now think it is fine to support the paper to be published, so I will be increasing my score from 3 to 5. However, please make sure to correct all typos you can find in the appendix to make the whole manuscript more readable.

审稿意见
3

The paper proposes a new algorithm for data imputation. The idea is to estimate the score function corresponding to the posterior p(x_miss/x_obs) using DSM and then infer the missing values using a WGF equivalence argued in this paper itself. These alternating steps are repeated until convergence. Simulations on benchmark datasets illustrate the efficacy of the proposal.

优点

  1. In empirical comparisons, the proposal seems to beat existing baselines.

缺点

  1. There are some technical concerns I have raised in the next section

  2. I found description in sec34 very cryptic. It would have been nice if important portions of related appendix were moved to the main section. Also, will the proposed alternating style algorithm converge? Do we know anything property of this converged solution? Some discussion around the final algorithm would have helped in understanding the methodology better.

问题

  1. I am not sure how the equivalence in (2) follows. It seems the objective in RHS is independent of X_miss as it is integrated out. objective in RHS is a function of r, p, X_obs. Whereas objective in LHS is a function of X_miss, does not involve r. It will be helpful if this equivalence is clarified. Wrt. to the optimization variable the objective in RHS seems to be a constant. Is there any notation that I am missing? Since it is a basic step and subsequent derivations depend on this critically, I was unable to check the correctness of soem of the steps.

  2. Morevoer, (4) seems to clearly show that the regularizer is not a function of X_miss. Then I am not sure how this regularizer, which is essentially a constant, matter.

  3. Reading line 158 gives an impression that r is the unknown. However, r does not seem to appear in LHS of (2) . If it does, when what is the relation with p in the objective of (2) ?

  4. is the assumptions of prop 3.4 meaningful? If X_Obs are X_miss are independent, then what information will the observations provide for imputation? will the imputation problem remain meaningful ? Please clarify this.

局限性

na

作者回复

Thank you for your comments. Before reading our response, we think we should come to the following agreements:

  1. Optimizing the instances xx from distribution r(x)r(x) is optimizing this distribution r(x)r(x), which is the basis of particle-based variational inference like Stein Variational Gradient Descent [1].
  2. The velocity filed vv drives the optimization of the cost functional (the 'function' of function), according to Eq. (A.10), is changing and hard to be 0, until we reach the equilibrium point (nearly impossible at the beginning of the imputation procedure).
  3. During imputation procedure, r(X(miss))dX(miss)=1\int{r(\boldsymbol{X}^{(miss)})\mathrm{d}\boldsymbol{X}^{(miss)}}=1 is a constant (normalization constraint), but r(X(miss))r(\boldsymbol{X}^{(miss)}) and H[r(X(miss))]r(X(miss))logr(X(miss))dX(miss)\mathbb{H}[r(\boldsymbol{X}^{(miss)})]\coloneqq-\int{r(\boldsymbol{X}^{(miss)})\log{r(\boldsymbol{X}^{(miss)})}\mathrm{d}\boldsymbol{X}^{(miss)}} are not constant, since X(miss)\boldsymbol{X}^{(miss)} is changing
  4. Furthermore, logp(X(miss)X(obs))\log{p(\boldsymbol{X}^{(miss)}\vert \boldsymbol{X}^{(obs)} )} is not a constant when we change X(miss)\boldsymbol{X}^{(miss)} unless it is a uniform distribution, which is also nearly impossible in practice.

Weakness:

W1: Technical Concerns

Please see next part for detailed information.

W2: Convergence

  • Our convergence discussions have already been listed in Appendix E.3 theoretically and empirically.
  • For the final solution, all we can know is that it may reach the equilibrium point, where X(miss)δFδr(X(miss))=0\nabla_{\boldsymbol{X}^{(miss)}}\frac{\delta \mathcal{F}}{\delta r(X^{(miss)})}=0 holds.

Questions:

Q1: Understanding of Eq. (2) It seems the objective in RHS is independent of X_miss ... of r, p, X_obs. Whereas objective in LHS ... X_miss, does not involve r.

The LHS is the content at the left side of \Rightarrow, and RHS is the content at the right side of \Rightarrow. Based on this, let's start to investigate Eq. (2) based on Agreement 1:

  • The LHS indicates that we are finding some X(miss)\boldsymbol{X}^{{(miss)}} from some unknown distribution r(X(miss))r(\boldsymbol{X}^{{(miss)}}) (X(miss)r(X(miss))\boldsymbol{X}^{{(miss)}}\sim r(\boldsymbol{X}^{{(miss)}})), such that we can maximum the likelihood function logp(X(miss)X(obs))\log{p(\boldsymbol{X}^{{(miss)}}\vert \boldsymbol{X}^{{(obs)}})}.
  • For optimizers, to the best of our knowledge whether GAMS solvers like GuROBi or Neural Network Solvers like Adam, can only handle scalar learning objectives.
  • Consequently, we can convert it to 1Mi=1Mlogp(X(miss)_iX(obs)_i)\frac{1}{M} \sum_{i=1}^{M}{\log{p(\boldsymbol{X}^{{(miss)}}\_i|\boldsymbol{X}^{{(obs)}}\_i)}}, where MM is the missing value size. This is the Monte Carlo (MC) Estimation of term Er[logp(Xi(miss))p(X(obs))]\mathbb{E}_{r}[\log{p(\boldsymbol{X}^{{(miss)}}_i)\vert p(\boldsymbol{X}^{{(obs)}})}] (to our understanding, the integrated out you mentioned is MC integration), which is the RHS.
  • Conversion in Eq. (2) is widely used in Quantum MC [2], where they sample some instances from an optimizable distribution and optimize the concerning functional based on these instances.

Q1: Constant Question to the optimization variable the objective in RHS seems to be a constant.

The functional H[r(X(miss))]\mathbb{H}[r(\boldsymbol{X}^{{(miss)}})] is changing, not a constant since we mentioned in Eq. (A.10) rτ=(rv)\frac{\partial r}{\partial \tau}=-\nabla\cdot(r v), with velocity vv unless v=0v=0 based on Agreement 4.

Q2: r and X(miss)\boldsymbol{X}^{{(miss)}}

Based on Agreement 1, this question is answered.

Q2: Is NER a constant?

Based on Agreements 2 to 4, this question is answered.

Q3: rr is missed at the LHS of Eq. (2)

The exact expression of rr is not our concern, the vital component is X(miss)\boldsymbol{X}^{(miss)}, the imputed value (Agreement 1). By investigating funcional related to rr and optimizing concerning functional (The X(miss)r(X(miss))\boldsymbol{X}^{{(miss)}}\sim r(\boldsymbol{X}^{{(miss)}}) in LHS indicates that rr occurs in the LHS of Eq. (2)), analyze why DM-based MDI approaches do not take effect, and propose improvements is the novelty of this manuscript.

Q3: rr is unkown:

It is hard to compute pp (existence of normalization constant) and rr, but we can still progressively increase the value of Er(X(miss))[logp(X(miss)X(obs))]λH[r(X(miss))]\mathbb{E}_{r(\boldsymbol{X}^{(miss)})}[\log{p(\boldsymbol{X}^{(miss)}\vert \boldsymbol{X}^{(obs)} )}] -\lambda \mathbb{H}[r(\boldsymbol{X}^{(miss)})] and realize MDI task (with input X(miss)\boldsymbol{X}^{(miss)}) throughout WGF, that's what we did in Section 3.2 and 3.3.

Q3: Relationship between rr and pp

  • Based on Agreement 1, we are representing X(miss)\boldsymbol{X}^{(miss)} by rr.
  • rr is 'some' proposal distribution, and pp is the 'evaluator', which 'evaluates wheter r(X(miss))r(\boldsymbol{X}^{(miss)})/X(miss)\boldsymbol{X}^{(miss)} is suitable', and 'reshape rr (realize by reshaping X(miss)\boldsymbol{X}^{(miss)}) to make it appropriate' based on cost functional Er[logp]λH(r)\mathbb{E}_r[\log{p}]-\lambda \mathbb{H}(r).
  • rr is akin to an actor, and pp is akin to a critic. The imputation procedure is akin to improving the actors (rr/X(miss)\boldsymbol{X}^{(miss)}) with guidance (WGF) from the critics (pp).

Q4: Assumption r(X(joint))r(X(miss))p(X(obs))r({\boldsymbol{X}^{{(joint)}}})\coloneqq r({\boldsymbol{X}^{{(miss)}}}) p({\boldsymbol{X}^{{(obs)}}})

Please see point 3) in the common rebuttal chat window, where this decomposition is widely applied in mean-field variational inference [3].


Thank you for reading our rebuttal! We hope the above discussion will fully address your concerns about our work, and we would really appreciate it if you could be generous in raising your score.


Refs
[1]. "Stein variational gradient descent: A general purpose bayesian inference algorithm." NeurIPS 16.
[2]. "Ab initio solution of the many-electron Schrödinger equation with deep neural networks." Physcial Reviews Research
[3]. "Variational algorithms for approximate Bayesian inference", Doctoral Thesis'03.

评论

During our interaction with Reviewer [833r], we further elaborated on the theoretical justification for our factorization of r(X(joint))=p(X(obs))r(X(miss))r(\boldsymbol{X}^{(joint)}) = p(\boldsymbol{X}^{(obs)})r(\boldsymbol{X}^{(miss)}). For a detailed explanation, please refer to our comments in the comment chat window addressed to Reviewer [833r], titled Theoretical Justification for Mean-Field Factorization of $r$ within the WGF Framework. We hope this clarification also addresses your concerns effectively.

评论

Dear Reviewer [EhUg],

Thank you for your constructive feedback, which has significantly enhanced our manuscript. Below, we address your concerns point by point with detailed explanations:

Q1: It would have been nice if important portions of related appendix were moved to the main section.

  • Due to page constraints, we cannot incorporate basic knowledge sections into the main content. However, we aim to provide a comprehensive background for understanding.
  • To this end, we have included detailed explanations of WGF, the MDI task, and the derivation of concerning proofs in the appendix.

Q1: Also, will the proposed alternating style algorithm converge? Do we know anything about the properties of this converged solution? Some discussion around the final algorithm would have helped in understanding the methodology better.

  • We have addressed the convergence properties of our proposed KnewImp approach both theoretically and empirically in Appendix E.3.
  • Detailed information about the convergence behavior and properties is provided in our individual rebuttal to you, where we may reach to the equilibrium point for X(miss)δFδr=0\nabla_{\boldsymbol{X}^{(miss)}}\frac{\delta \mathcal{F}}{\delta r} = 0.

W1: I am not sure how the equivalence in (2) follows.

  • Equivalence of r(X(miss))r(\boldsymbol{X}^{(miss)}) and X(miss)\boldsymbol{X}^{(miss)} in (2): Our analysis of the DM-based MDI task through the WGF framework explains that optimizing the instances from sample xx is essentially optimizing the sample distribution r(x)r(x).
  • An example for DMs:
    • Consider the initial DMs, which are used for data generation at the beginning. Starting with a group of samples drawn from white noise, we aim to progressively refine these samples until they closely approximate a target data distribution, such as an image of a dog.
    • Throughout this transformation, the data distribution evolves from random noise to the "true distribution" represented by the DM, by optimizing the sample points.
    • Consequently, optimizing individual instances from a sample xx can be understood as refining the sample's probablity density function r(x)r(x), which encapsulates the distribution of xx. For DMs' inference, the goal is to align r(x)r(x) as closely as possible with the true data distribution.
  • Our transformation is supported by Monte Carlo Estimation, where E_r(x)[f(x)]1Mi=1M[f(xi)],xir(x)\mathbb{E}\_{r(x)}[f(x)] \approx \frac{1}{M} \sum_{i=1}^{M}[f(x_i)],x_i\sim r(x), applied from left to right is used in particle variational inference [1] and Quantum MC [2]. But for theoretical analysis, it is applied from right to left (scenario in our manuscript).

W2: Then I am not sure how this regularizer, which is essentially a constant, matters.

  • Please note that, the constant regularization will not effect the optimized results since X(miss)Constant=0\nabla_{\boldsymbol{X}^{(miss)}}{\text{Constant}}=0.
  • Theoretically, in our model, the entropy regularization H[r]\mathbb{H}[r] is dynamic and not constant during the optimization process, unless the velocity filed vv is zero or rr becomes uniform distribution.
  • Practically, evidence in Section 1.2 of our attached PDF demonstrates that changing the regularization strength from negative to positive significantly alters the optimal values, confirming that entropy regularization is not a constant and does influence the optimized results.

W3: Reading line 158 gives an impression that $r$ is the unknown. However, $r$ does not seem to appear in LHS of (2). If it does, then what is the relationship with $p$ in the objective of (2)?

  • rr, represented by the particles/samples X(miss)\boldsymbol{X}^{(miss)}, and representing rr by X(miss)\boldsymbol{X}^{(miss)} is the quintessence of particle variational inferene approaches [1].
  • Obtaining the samples rather than the detailed expressions for rr matters.
  • For a detailed explanation of the relationship between rr and pp, please refer to the metaphor in our rebuttal chat with Reviewer [833r], titled Summary of the Response to Reviewer [833r].

W4: Is the assumptions of prop 3.4 meaningful?

  • Detailed justification for the mean-field factorization of rr within the WGF framework is provided in our discussions with Reviewer [833r], including literature reviews and theoretical derivations in the rebuttal chat entitled Response to [833r]'s additional question about $r$'s factorization and Theoretical Justification for Mean-field Factorization of $r$ within WGF framework.

Refs:
[1]. "Stein variational gradient descent: A general purpose bayesian inference algorithm.".
[2]. "Ab initio solution of the many-electron Schrödinger equation with deep neural networks."


Thank you for taking the time to read our rebuttal. We are committed to implementing thorough revisions based on your suggestions. Considering your busy schedule, please feel no obligation to respond to this message.

Best regards,
Authors

Comment

In addition, we believe your inquiry, "Also, will the proposed alternating style algorithm converge?", concerns the convergence of Algorithm 4.

First of all, we would like to thank you for your inquiry regarding the convergence of the proposed alternating-style algorithm, particularly in relation to Algorithm 4. In response, we have expanded Appendix E.3, which initially focused on the convergence of the "impute" part of Algorithm 4. In this comment, we present a proof of convergence for the "estimate" part of Algorithm 4, focusing on the loss function $\mathcal{L}_{\text{DSM}}$, parameterized by the neural network parameters $\theta$. We divide our proof into two main parts:

  • Monotonic decrease of $\mathcal{L}_{\text{DSM}}$:
    • Analyzing the evolution of $\mathcal{L}_{\text{DSM}}$ over time $\tau$, we consider the differential equation $\frac{\mathrm{d}\mathcal{L}_{\text{DSM}}}{\mathrm{d}\tau} = \langle \nabla_{\theta}\mathcal{L}_{\text{DSM}}, \frac{\mathrm{d}\theta}{\mathrm{d}\tau} \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product.
    • The parameter $\theta$ is updated via a gradient-descent-like algorithm: $\theta_{t+1} = \theta_t - lr\,\nabla_{\theta}\mathcal{L}_{\text{DSM}}$. Taking the limit as $lr$ approaches zero, we obtain $\lim_{lr \rightarrow 0} \frac{\theta_{t+1} - \theta_t}{lr} = \frac{\mathrm{d}\theta}{\mathrm{d}\tau} = -\nabla_{\theta}\mathcal{L}_{\text{DSM}}$.
    • Substituting $\frac{\mathrm{d}\theta}{\mathrm{d}\tau} = -\nabla_{\theta}\mathcal{L}_{\text{DSM}}$ into the differential equation above yields $\frac{\mathrm{d}\mathcal{L}_{\text{DSM}}}{\mathrm{d}\tau} = -\langle \nabla_{\theta}\mathcal{L}_{\text{DSM}}, \nabla_{\theta}\mathcal{L}_{\text{DSM}} \rangle \leq 0$, indicating that $\mathcal{L}_{\text{DSM}}$ monotonically decreases over time.
    • In summary, $\mathcal{L}_{\text{DSM}}$ is monotonically decreasing along time $\tau$.
  • Lower-bounded property of $\mathcal{L}_{\text{DSM}}$:
    Reflecting on the definition of $\mathcal{L}_{\text{DSM}}$, namely $\mathcal{L}_{\text{DSM}} \coloneqq \frac{1}{2}\mathbb{E}_{q_{\sigma}(\hat{\boldsymbol{X}}^{(joint)}\vert \boldsymbol{X}^{(joint)})}\big[\Vert \nabla_{\hat{\boldsymbol{X}}^{(joint)}}\log\hat{p}(\hat{\boldsymbol{X}}^{(joint)}) - \nabla_{\hat{\boldsymbol{X}}^{(joint)}} \log q_{\sigma}(\hat{\boldsymbol{X}}^{(joint)}\vert \boldsymbol{X}^{(joint)}) \Vert^2\big]$, we confirm that $\mathcal{L}_{\text{DSM}} \geq 0$.

In conclusion, similar to Proposition E.1 in Section E.3 of the appendix, when the learning rate $lr$ is sufficiently small, the "estimate" part may converge. Furthermore, the optimal parameter may reach an equilibrium point where $\nabla_{\theta}\mathcal{L}_{\text{DSM}} = 0$. A small numerical illustration of this argument is given below.
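To complement the continuous-time argument, here is a toy numerical check (our illustration, using a generic smooth, lower-bounded loss as a stand-in for $\mathcal{L}_{\text{DSM}}$): with a small learning rate, the discrete gradient-descent updates decrease the loss monotonically, mirroring the derivation above.

```python
# Toy check (illustrative; a quadratic loss stands in for L_DSM): with a small
# learning rate, gradient descent on a smooth, lower-bounded loss never increases it.
import numpy as np

def loss(theta):                      # smooth and bounded below by 0
    return 0.5 * np.sum((theta - 1.0) ** 2)

def grad(theta):
    return theta - 1.0

theta = np.array([5.0, -3.0])
lr = 1e-2                             # small lr ~ discretized gradient flow
history = [loss(theta)]
for _ in range(1000):
    theta = theta - lr * grad(theta)  # theta_{t+1} = theta_t - lr * grad
    history.append(loss(theta))

assert all(a >= b for a, b in zip(history, history[1:]))  # monotone decrease
print(history[0], history[-1])        # the loss shrinks toward its lower bound 0
```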


We will add the above contents to our revised manuscript. Thank you once again for your comments; we hope this derivation further addresses your concern about the convergence of Algorithm 4.

Comment

Dear reviewer EhUg,

Since the discussion period will end in a few hours, we will be online waiting for your feedback on our rebuttal, which we believe has fully addressed your concerns.

We would highly appreciate it if you could take into account our response when updating the rating and having discussions with AC and other reviewers.

Thank you so much for your time and efforts. Sorry for our repetitive messages, but we're eager to ensure everything is addressed.

Authors of Submission 1850

Review
6

This paper introduces Kernelized Negative Entropy-regularized Wasserstein Gradient Flow Imputation (KnewImp), a novel approach for imputing missing data in numerical tabular datasets. The proposed method addresses two significant challenges in diffusion model-based missing data imputation (MDI): inaccurate imputation and difficult training. KnewImp integrates the Wasserstein gradient flow (WGF) framework with a negative entropy-regularized (NER) cost functional to enhance imputation accuracy and simplify the training process by eliminating the need for complex mask matrix designs. The method's efficacy is demonstrated through extensive experiments, showing superior performance compared to state-of-the-art imputation techniques.

Strengths

  1. The paper presents a unique integration of diffusion models with the Wasserstein gradient flow framework, incorporating a novel negative entropy regularization to address specific challenges in missing data imputation.

  2. The work is grounded in solid theoretical foundations, providing clear proofs and propositions that establish the effectiveness and validity of the proposed approach.

  3. Extensive experiments on real-world datasets validate the method's superiority, with significant improvements in both mean absolute error (MAE) and Wasserstein distance (Wass) metrics.

Weaknesses

  1. The theoretical concepts and mathematical formulations presented are quite dense and may be challenging for readers not well-versed in advanced optimization and diffusion models. Simplifying these explanations or providing more intuitive descriptions could improve accessibility.

  2. While the method is compared to several models, including a wider range of baseline methods, particularly more recent advancements in diffusion-based MDI (e.g., [1,2]), would provide a more comprehensive evaluation.

  3. The paper could benefit from a more detailed discussion on the convergence properties of the proposed algorithm, including potential limitations and scenarios where the method might struggle.

[1] Du, Tianyu, Luca Melis, and Ting Wang. "ReMasker: Imputing Tabular Data with Masked Autoencoding." International Conference on Learning Representations, 2024.

[2] Zheng, Shuhan, and Nontawat Charoenphakdee. "Diffusion models for missing value imputation in tabular data." NeurIPS 2022 First Table Representation Workshop.

Questions

  1. Can the authors elaborate on how the proposed method could be adapted or extended to handle other types of data, such as categorical or mixed-type datasets?

  2. What are the computational requirements for implementing KnewImp in practice, and how does it scale with larger datasets?

Limitations

The authors have adequately addressed the limitations.

Author Response

Thank you for your insightful advice and valuable questions, we will respond to your concerns point by point.

Weaknesses

W1: Dense Mathematical Formulations

  • We acknowledge that the theoretical concepts and mathematical formulations in our manuscript could be challenging for readers not extensively familiar with advanced optimization and diffusion models.
  • To enhance the accessibility and readability of our work, we will include a new subsection in the appendix of our revised manuscript. This subsection will detail the derivation and meaning of each equation, ensuring that the mathematical underpinnings are more comprehensible.

W2: Extra Baselines

  • For CSDI_T [1]:
    • Its primary contribution involves the introduction of one-hot encoding, analog bits encoding, and feature tokenization to manage categorical variables in MDI tasks on tabular data. Our research, however, specifically focuses on numerical tabular data and assumes the absence of categorical data. Due to this distinction, we did not consider CSDI_T as a relevant baseline for our study.
    • Nevertheless, the categorical feature extraction module in CSDI_T, once removed, essentially transforms it into the CSDI model, which we have indeed utilized and referenced in our manuscript.
  • For ReMasker [2]:
    • We have included an additional baseline model named ReMasker in our comparisons. Please see the common rebuttal chat window. We will add the results of this baseline model in the revised manuscript.

W3: Convergence

  • We have included discussions on convergence in Appendix E.3, with theoretical proofs and empirical validations on all datasets in our manuscript.
  • We will add a footnote in our revised manuscript to highlight this point.

Questions:

Q1: Mixed-Type Data

  • The key to applying our approach to such datasets involves decomposing the missing data distribution $r(\boldsymbol{X}^{(miss)})$ into a product of distributions for the dense and categorical components under a mean-field assumption common in variational inference [3]. Specifically, we express it as $r(\boldsymbol{X}^{(miss)}) = r(\boldsymbol{X}^{(miss, dense)})\,r(\boldsymbol{X}^{(miss, cate)})$.
  • For the dense data component, we continue to employ our KnewImp method.
  • For the categorical part, we can initially model it using a Dirichlet distribution, which naturally supports simplex spaces; the steps are summarized as follows.
    • First, we can implement mirror descent [4] with the operator $\nabla_x\psi(x) = \log{x}$, which maps the distribution's support from the simplex $\Delta^{\text{C}-1}$ (where $\text{C}$ represents the number of categories) onto $\mathbb{R}^{\text{C}}$.
    • Subsequently, we apply KnewImp in this transformed space ($\mathbb{R}^{\text{C}}$).
    • After that, we can revert the distribution back to the simplex using the inverse operator $[\nabla_x\psi(x)]^{-1} = \text{Softmax}(x)$ (a minimal sketch of this round trip follows this list).
  • Besides, the example in Section 1.2 of the attached PDF already realizes this scheme, where the variables are constrained to a three-dimensional standard simplex.
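For concreteness, here is a minimal sketch of the simplex-to-$\mathbb{R}^{\text{C}}$ round trip described above (our illustration; the update in the mirrored space is a random placeholder standing in for a KnewImp step):

```python
# Sketch of the mirror-descent round trip for one categorical coordinate (illustrative only).
# Map a simplex point to R^C via the mirror map log(x), apply a placeholder update there,
# then map back onto the simplex via softmax.
import numpy as np

def softmax(z):
    z = z - z.max()                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

p = np.array([0.2, 0.5, 0.3])         # point on the simplex Delta^{C-1}, C = 3
z = np.log(p)                         # mirror map: grad psi(x) = log(x)
z = z + 0.1 * np.random.default_rng(0).standard_normal(3)   # placeholder for a KnewImp update
p_new = softmax(z)                    # inverse map back onto the simplex

print(p_new, p_new.sum())             # still a valid categorical distribution (sums to 1)
```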

Q2: Computational Requirements

  • Our computational setup is given in lines 743 to 746: all experiments are conducted on a workstation equipped with an Intel Xeon E5 processor with four cores, eight Nvidia GTX 1080 GPUs, and 128 GB of RAM. We found that this configuration is sufficient for the WQW dataset with 4,898 items.
  • In addition, we provide a detailed time-complexity analysis in Appendix E.2, both theoretical and empirical, which reflects the method's scalability to larger datasets.

Thank you for reading our rebuttal! We hope the above discussion will fully address your concerns about our work, and we would really appreciate it if you could be generous in raising your score.


Refs
[1] "Diffusion models for missing value imputation in tabular data." NeurIPS'22 First Table Representation Workshop.
[2] "ReMasker: Imputing Tabular Data with Masked Autoencoding." ICLR'24.
[3] "Variational algorithms for approximate Bayesian inference", Doctoral Thesis'03.
[4] "Sampling with Mirrored Stein Operators" ICLR'22

Comment

Thank you for addressing my concerns. Since my issues have been resolved, I will maintain my positive score.

Comment

Dear Reviewer [Gtbe],

Thank you for your valuable feedback and encouraging comments, which have greatly motivated us throughout the rebuttal process. Out of respect for your review efforts, we would like to summarize how we have addressed your inquiries:

Clarifying the Paper:

  • In response to insights gained from discussions with other reviewers, we recognize that some of our mathematical formulations may appear dense. We plan to include metaphors to better illustrate concepts like our $r$ and $p$ dynamics [EhUg, 833r], explain transformations such as Monte Carlo estimation [EhUg], and discuss the selection of RKHS [9XXU] to enhance the manuscript's readability.

Incorporating Additional Baseline Models:

Following your recommendation, we have added the baseline model ReMasker. We will detail its integration and relevance to the DMs used in our study, tailored to our specific scenarios (numerical tabular data).

Enhanced Discussion on Convergence, Limitations, and Scenarios:

  • We will augment our manuscript with a detailed proof of the convergence for the "estimate" part, furthering our discussions with Reviewer [EhUg].

  • Additional experiments and analyses on toy case datasets will be included, especially considering data properties like multi-modality, heavy tails, and skewed distributions as suggested by Reviewer [9XXU].

Handling Mixed-Type Data

We plan to outline strategies for managing mixed-type data in our future research directions, building on our discussions with you.

Computational Resources

We will clearly indicate the computational resources used in our study in the revised manuscript to ensure transparency according to your guidance.


Thank you once again for your supportive and constructive review. We are grateful for your continued positive assessment. Given your busy schedule, please do not feel obliged to respond to this message.

Sincerely,

Authors

Author Response

Overall Response

We are encouraged by the reviewers' acknowledgment of the strengths in our paper, such as its robust performance [Gtbe] [EhUg] [833r] [fs2D] [9XXU], comprehensive experimentation [Gtbe] [fs2D] [9XXU], and clear, concise presentation [Gtbe] [fs2D] [9XXU]. However, we also recognize that there are common concerns raised by some reviewers regarding 1) the motivation for using Gradient Flow (GF) to discourage diversification [833r] [fs2D], 2) the assumptions related to the decomposition $r(\boldsymbol{X}^{(joint)})\coloneqq r(\boldsymbol{X}^{(miss)})\,p(\boldsymbol{X}^{(obs)})$ [EhUg] [833r], and 3) the adequacy of our experimental validation [Gtbe] [fs2D] [9XXU]. To address these concerns, we offer the following clarifications:

Motivations & Contributions:

The motivation behind KnewImp is to address specific challenges in diffusion model (DM)-based missing data imputation (MDI) tasks:

  • Inconsistency between DM's goals and MDI objectives: As generative models, DMs inherently aim to diversify data with implicit regularization terms, which intuitively conflicts with MDI requirements that often demand precise values.
  • Design of Mask Matrix for Model Training: DMs require a mask matrix to formulate conditional distributions [1], and the design of mask matrix is crucial to the imputation accuracy.

Based on this, our major contributions are summarized as follows:

  • Introduction of GF in DM-based MDI: We conceptualize MDI as an optimization problem and employ GF, initially designed for functional optimization [2], to elucidate the shortcomings of DMs in MDI, particularly how DMs inadvertently promote diversification through terms like entropy and variance (Section 3.1).
  • Novel and Effective Cost Functional: We introduce an effective cost functional that incorporates the negative regularization term, with a rigorously derived implementation strategy (Section 3.2).
  • Sidestepping Mask Matrix Design through Joint Distribution Modeling: We demonstrate that within the GF framework, it is possible to circumvent the traditional mask matrix design and instead utilize a joint distribution modeling approach (Section 3.3).

Weaknesses & Questions Response:

1). Diversification Discouraging [833r] [fs2D]:

  • In Section 1.1 of our attached PDF, we analyze two distributions: uniform and normal. The uniform distribution exhibits higher entropy, aligning with the generative models' goal that each value within the support is equally probable (maximum entropy may result in a uniform distribution [3]). This characteristic, however, does not align with the objectives of MDI, where specific values are often required (see the sketch after this list).
  • Building on this analysis, Section 1.2 of our PDF compares KnewImp's performance when optimizing a cost functional related to a specified Dirichlet distribution. By gradually adjusting the weight of the negative entropy term ($\lambda$) of KnewImp from negative to positive, we demonstrate that increasing accuracy in MDI tasks may require a reduction in diversification.
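As a small, self-contained illustration of the first point (ours, not from the attached PDF): on the same bounded support, the uniform distribution attains the highest differential entropy, while a more concentrated distribution (here Beta(5, 5), chosen only as an example) has lower entropy.

```python
# Illustration (not from the attached PDF): on the support [0, 1], the uniform
# distribution maximizes differential entropy; a concentrated Beta(5, 5) has less.
from scipy import stats

h_uniform = stats.uniform(0, 1).entropy()   # 0.0, the maximum on [0, 1]
h_peaked = stats.beta(5, 5).entropy()       # negative, i.e. lower than the uniform

print(h_uniform, h_peaked)
```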

2). Assumption of $r(X^{(joint)}) \coloneqq r(X^{(miss)})\,p(X^{(obs)})$ [EhUg] [833r]:

  • We treat $r$ as a proposal distribution akin to the approximation distribution in the variational inference context, i.e., an ansatz meant to be optimized based on a functional related to the distribution $p(X^{(joint)})$.
  • Thus, what needs to be verified is $p(X^{(joint)}) \stackrel{?}{=} p(X^{(miss)})\, p(X^{(obs)})$, rather than $r(X^{(joint)}) \stackrel{?}{=} r(X^{(miss)})\, r(X^{(obs)})$. Fortunately, KnewImp does not assume $p(X^{(joint)}) = p(X^{(miss)})\, p(X^{(obs)})$ (i.e., it allows $p(X^{(joint)}) \neq p(X^{(miss)})\, p(X^{(obs)})$). Moreover, the mean-field assumption in variational inference, where $r(X^{(joint)}) = r(X^{(miss)})\, r(X^{(obs)})$, is practical according to references such as [4].
  • Finally, $X^{(obs)}$ remains unchanged in MDI, which indicates that $r(X^{(obs)})$ and $p(X^{(obs)})$ are identical and constant; thus we can replace $r(X^{(obs)})$ with $p(X^{(obs)})$.

3). Extra Experiments [Gtbe] [fs2D][9XXU]:

  • We added results from the ReMasker model [5] as suggested by reviewer [Gtbe]:

| Dataset & Metric | BT MAE | BT Wass | BCD MAE | BCD Wass | CC MAE | CC Wass | CBV MAE | CBV Wass | IS MAE | IS Wass | PK MAE | PK Wass | QB MAE | QB Wass | WQW MAE | WQW Wass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MAR | 0.55 | 0.43* | 0.50* | 1.53* | 0.60* | 0.43* | 0.50* | 0.40* | 0.58* | 2.02* | 0.53 | 1.31 | 0.64* | 3.75* | 0.53 | 0.59 |
| MCAR | 0.44 | 0.15 | 0.37* | 1.56* | 0.55* | 0.37 | 0.56* | 0.63* | 0.55* | 4.10* | 0.47* | 1.51 | 0.47* | 4.14* | 0.56* | 0.78 |
| MNAR | 0.53 | 0.26 | 0.42* | 2.08* | 0.54* | 0.39* | 0.58* | 0.66* | 0.50* | 3.57* | 0.56* | 2.59 | 0.50 | 5.53* | 0.58 | 0.82 |
  • We have included extra experimental results in Section 2.1 of our attached PDF as suggested by reviewer [fs2D], focusing on a downstream classification task similar to the experiments depicted in Fig. 5 of the TDM paper [6], using the cross_val_score function from the sklearn package (the remaining settings are the same as in the TDM paper); a minimal sketch of this evaluation follows this list.
  • We have incorporated additional experimental results concerning the performance of KnewImp across various data distributions in Section 2.2 of the attached PDF, as suggested by reviewer [9XXU].
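For reference, a minimal sketch of the downstream evaluation (our illustration; `X_imputed` and `y` are synthetic placeholders standing in for an imputed feature matrix and its labels, not the actual experiment code):

```python
# Sketch of the downstream classification check with sklearn's cross_val_score
# (illustrative only; the data below are random placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_imputed = rng.standard_normal((200, 5))   # placeholder for an imputed feature matrix
y = (X_imputed[:, 0] + 0.1 * rng.standard_normal(200) > 0).astype(int)  # placeholder labels

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X_imputed, y, cv=5)   # cross-validated downstream accuracy
print(scores.mean())
```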

In closing, thank you for considering our responses, and we look forward to any further feedback that might help refine our work.


Refs
[1]. "CSDI: Conditional score-based diffusion models for probabilistic time series imputation." NeurIPS'21
[2]. "{Euclidean, metric, and Wasserstein} gradient flows: an overview." Bulletin of Mathematical Sciences
[3]. "Nonlinear Stein variational gradient descent for learning diversified mixture models" ICML'19
[4]. "Variational algorithms for approximate Bayesian inference", Doctoral Thesis'03.
[5]. "ReMasker: Imputing Tabular Data with Masked Autoencoding." ICLR'24.
[6]. "Transformed distribution matching for missing value imputation." ICML'23.

Comment

Dear Reviewers, AC, SAC, and PC,

We would like to begin by expressing our sincere gratitude for your engagement throughout the rebuttal process, which has significantly enhanced the quality of our paper.

Our work introduces the Wasserstein Gradient Flow to offer a fresh perspective on Diffusion Model-based Missing Data Imputation (DM-based MDI) tasks for numerical tabular data. We have explored why DMs may not be effective for MDI, rigorously outlined strategies to mitigate these issues, and developed methods to train more effective models (with higher imputation accuracy) without the use of a mask matrix.

In response to discussions with reviewers, we have revised the following contents in our manuscript:

  • Clarification and Detailing: We have added detailed explanations for the equations in our derivations [Gtbe, EhUg, 833r, 9XXU] and provided intuitive numerical experimental results [833r, fs2D].
  • Additional Experiments: We have included extra experiments to showcase the applicability of KnewImp across various baseline models [Gtbe], improvements in downstream tasks [fs2D], reasons for choosing RKHS [9XXU], and analyses of different data distributions [9XXU].
  • Corrections: We have revised typos and addressed specific details as kindly pointed out [833r], including corrections in the derivation of concerning propositions.
  • Data Type Extensions: Detailed strategies have been added to extend KnewImp's application to data with categorical variables [Gtbe].
  • Literature Review Expansion: A more comprehensive description of related works has been provided, including all suggested citations [833r].
  • Theoretical Proofs: Complete convergence proofs and complexity analyses have been included [Gtbe, EhUg].

We are immensely grateful for your feedback and suggestions for improving our manuscript. We kindly request your generous consideration of these points in the final evaluation of our paper.

Sincerely,

Authors of Submission 1850

Final Decision

This paper discusses the applicability of diffusion models to tabular data. It makes the case that existing models are not effective for imputation in numerical tabular data. The paper suggests a novel regularization term based on negative entropy and discusses a procedure for imputing missing values. The proposal seems sound and empirically outperforms existing methods. The paper should be of interest to NeurIPS.