TabNAT: A Continuous-Discrete Joint Generative Framework for Tabular Data
Abstract
Reviews and Discussion
This paper proposes a model named TabNAT for tabular data synthesis (and imputation). The core idea is to use a diffusion model to generate all columns (of a row) with continuous values together, and a transformer to generate the categorical columns. Experimental results on multiple benchmark datasets show the effectiveness of the proposed model.
Questions for Authors
The author(s) may want to comment on the choice of experimental datasets.
Claims and Evidence
Yes. The claimed advantages of the proposed model in terms of statistical fidelity, data utility, and privacy protection are shown with experimental results on 10 benchmark datasets.
Methods and Evaluation Criteria
The proposed model is intuitive given the setup of the problem.
The chosen evaluation metrics have been used in the literature and seem suitable for the claims.
Theoretical Claims
There are no theorems or proofs in the paper.
Experimental Design and Analyses
An issue with the experiments is the choice of datasets. While the proposed model is quite relevant to DP-TBART/Tab-MT, many of the datasets used in the DP-TBART/Tab-MT papers are not used in the experiments of this paper. It is unclear why this has been the case.
For the other closely related work, TabSyn, the datasets used in the TabSyn paper are also used in this one, which is good for comparison. However, the proposed model TabNAT is often outperformed by (or very close to) TabSyn on (some of) these datasets, as shown in Tables 8 to 14.
Instead of claiming an overall better model, perhaps the paper could focus on the datasets where TabNAT shows better performance and analyze why TabNAT achieved strong results on those datasets.
Supplementary Material
Yes. I have gone through the appendix.
Relation to Broader Scientific Literature
This paper builds upon the idea of tabular data generation models using transformers (auto-regressive; DP-TBART/Tab-MT) and diffusion models (TabDDPM/TabSyn). Instead of using either model alone, it combines the two into a different design while showing that it is just as effective.
Essential References Not Discussed
None that I'm aware of.
Other Strengths and Weaknesses
Overall this paper proposes a new heuristic structure of tabular data synthesis model based on diffusion models and transformer. Both the technical novelty and depth are somewhat limited (there is no theoretical analysis on the effectiveness of the model design). On the other hand, the proposed model has shown strong performance on some of the experimental datasets. It would make a more interesting paper if the strong performance could be supported by a theoretical analysis.
That TabNAT can use a fixed set of hyperparameters for all datasets is an advantage.
=========================
Update after rebuttal: I'm happy with the additional experimental results and am raising my score from "Weak reject" to "Weak accept".
Other Comments or Suggestions
Typo: "After TabNAT is well-trained" => "After TabNAT is well trained"; "TabNAT ’s training speed" (extra whitespace)
We sincerely thank the reviewer for the thoughtful and constructive feedback on our manuscript. We appreciate the time and effort invested in providing these valuable comments, which will help improve the quality of our paper. Below, we address the points raised:
Choice of Experimental Datasets
The reviewer raised concerns about our dataset selection, particularly noting differences from those used in DP-TBART/Tab-MT papers. We would like to clarify that our experimental design primarily followed TabSyn's pipeline, as this was our initial baseline for comparison. Accordingly, we incorporated the six heterogeneous datasets used in the TabSyn paper to enable direct performance benchmarking.
To demonstrate TabNAT's versatility across different data types, we deliberately expanded our evaluation by adding two datasets containing only continuous columns and two datasets with only discrete columns. This comprehensive approach allows us to assess TabNAT's generalizability across diverse tabular data structures.
We acknowledge the reviewer's valid point regarding datasets used in Tab-MT and DP-TBART. We have conducted additional experiments on five additional datasets used in these papers, with results presented in rebuttal.pdf at the anonymous link: https://anonymous.4open.science/r/ICML-TabNAT. These supplementary experiments further validate TabNAT's effectiveness across a broader range of benchmark datasets.
Performance Comparison with TabSyn
We appreciate the reviewer's insightful observation regarding TabNAT's performance relative to TabSyn across different datasets. Our detailed analysis reveals several important patterns in performance differences:
First, TabNAT consistently outperforms TabSyn on both continuous-only and discrete-only datasets. This advantage stems from TabNAT's architecture, which employs specialized generation models optimally suited for each data type. Unlike TabSyn's approach of using a VAE encoder to create a latent space, TabNAT processes different feature types more directly and naturally, minimizing information loss that typically occurs during VAE encoding.
For mixed-type datasets, our analysis reveals that TabNAT demonstrates clear advantages when the proportion of discrete features is higher, as evidenced by superior performance on the Adult and Beijing datasets. This performance differential can be attributed to TabNAT's transformer-based categorical generation component, which better captures the complex dependencies between categorical variables through its self-attention mechanism. Additionally, TabNAT's architecture enables more effective modeling of interactions between continuous and discrete features, particularly when discrete features play a dominant role in the underlying data distribution.
Importantly, even in datasets where TabNAT and TabSyn demonstrate comparable statistical performance, TabNAT offers significant practical advantages that TabSyn cannot match. Specifically, TabSyn's VAE-based architecture fundamentally limits its ability to perform flexible conditional generation. In contrast, TabNAT's arbitrary-order autoregressive generation capability supports any form of conditional generation, including missing data imputation. This flexibility stems from TabNAT's bidirectional masking strategy and specialized components for different data types, allowing users to specify any subset of features as conditions while generating the remaining features—a crucial capability for many real-world applications.
Theoretical Analysis
We thank the reviewer for suggesting a theoretical analysis to support TabNAT's empirical performance. Our approach is grounded in fundamental probability theory, specifically the decomposition of joint distributions as expressed in Equations (1) and (2). TabNAT's key theoretical contribution lies in transforming the tabular data generation problem into a series of conditional distribution modeling tasks through our bidirectional Masked Transformer architecture.
This decomposition allows us to leverage the most appropriate generative modeling techniques for different variable types: diffusion models for continuous variables and autoregressive models for discrete variables. Both approaches have strong theoretical foundations in distribution modeling. Diffusion models offer provable convergence to the target distribution through a gradual denoising process, while autoregressive models represent discrete distributions exactly through the chain-rule factorization of the joint distribution into conditionals.
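For concreteness, the decomposition we have in mind can be written as (a schematic reconstruction; the exact notation of Equations (1) and (2) in the paper may differ):

$$p(\mathbf{x}) = p(\mathbf{x}^{\mathrm{num}})\, p(\mathbf{x}^{\mathrm{cat}} \mid \mathbf{x}^{\mathrm{num}}), \qquad p(\mathbf{x}^{\mathrm{cat}} \mid \mathbf{x}^{\mathrm{num}}) = \prod_{i=1}^{L} p\big(x^{\mathrm{cat}}_{\sigma(i)} \,\big|\, \mathbf{x}^{\mathrm{num}},\, x^{\mathrm{cat}}_{\sigma(1)}, \ldots, x^{\mathrm{cat}}_{\sigma(i-1)}\big)$$

for an arbitrary permutation $\sigma$ of the $L$ discrete columns. The first factor is modeled by the diffusion component, and each conditional in the product by the bidirectional Masked Transformer.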
Minor Corrections
We have corrected the noted typographical errors:
- "After TabNAT is well-trained" → "After TabNAT is well trained"
- Removed the extra whitespace in "TabNAT's training speed"
Additionally, we have thoroughly reviewed the manuscript to ensure consistency in terminology and formatting throughout.
The paper proposes a Non-Autoregressive transformer-based generative model for tabular data (TabNAT). It can handle both continuous and discrete columns. It uses a (non-causal) transformer to encode the input and masks to perform any-order training. The overall modeling paradigm is quite similar to TabDDPM (Kotelnikov et al. (2023)) and TabMT (Gulati et al. (2023)), with the key distinctions being how the continuous values are modeled (using continuous diffusion) and the use of the transformer encoder. The proposed approach shows promising results for the three standard evaluation criteria: statistical fidelity, data utility, and privacy protection. Ablations are also performed to demonstrate the impact of using the diffusion model for continuous values and the impact of any-order training.
Questions for Authors
- Why do you need position-specific mask embeddings when separate position embeddings are already applied?
- MLE task: What is the performance of the classifiers and regressors trained on the original data? Can you describe, quantitatively, how many synthetic samples were used to get the results reported in the last two columns of Table 1?
- What is the exact definition of the DCR metric? I could not find an expression for it in the paper or in the reference provided (Zhang et al. (2024b)).
- Why is the missing data imputation task set to predict missing values in the training set? Previous works (Du et al. (2024), Jarrett et al. (2022)) seem to use the standard held-out set for imputation.
Claims and Evidence
While the claims of superior statistical fidelity, utility, and privacy preservation are well supported, the imputation experiments look a bit weak. Please see the "Questions" section for details.
Methods and Evaluation Criteria
The evaluation criteria used in the paper follow the established metrics in the literature.
Theoretical Claims
NA
Experimental Design and Analyses
Please see the "Questions section".
Supplementary Material
I just reviewed the definitions of all the metrics used in the evaluation (Appendix C).
Relation to Broader Scientific Literature
The paper generally follows a line of works on neural generative models for tabular data (Assefa et al., 2021; Zheng & Charoenphakdee, 2022; Hernandez et al., 2022), and is most closely related to the most recent works that use discrete and continuous diffusion as well as masking based training (Castellon et al., 2023; Gulati & Roysdon, 2023; Kotelnikov et al., 2023; Kim et al., 2023; Lee et al., 2023; Zhang et al., 2024b). The overall modeling paradigm is quite similar to TabDDPM (Kotelnikov et al. (2023)) and TabMT (Gulati et al. (2023)), with the key distinctions being how the continuous values are modeled (using continuous diffusion) and the use of the transformer encoder.
Essential References Not Discussed
None
Other Strengths and Weaknesses
The paper is well written and easy to follow, especially the model description and Figure 3. The reported results also look promising, except for the imputation task (see the questions section below).
Other Comments or Suggestions
I don't know if Figures 1 and 2 add much value to the paper. In fact, looking at these figures before Figure 3 caused some confusion for me. You might consider moving these figures to the appendix or placing them after Figure 3, which sets up the context much better.
We sincerely thank the reviewer for their thoughtful assessment of our manuscript. We appreciate the recognition of our paper's strengths, including its clear writing, promising results, and well-supported claims regarding statistical fidelity, utility, and privacy preservation. Below, we address each of the reviewer's questions and concerns.
Response to Comments on Figures
We thank the reviewer for the suggestion regarding Figures 1 and 2. We agree that Figure 3 provides better context and will consider restructuring the presentation of figures in the revised manuscript to improve clarity.
Position-specific mask embedding and position embeddings
Intuitively, the position-specific mask embedding serves a different purpose than the position embeddings. While position embeddings encode the absolute position of each token in the sequence, the position-specific mask embedding is designed to capture information about which specific attributes are masked during the any-order training process. This allows the model to better understand the relationship between masked and unmasked attributes in different positions, enhancing its ability to handle partial information scenarios. In fact, we found that the empirical performance difference between the two designs is not substantial (with position-specific mask embedding performing slightly better). Considering that the additional parameters introduced by this design are negligible, we adopted this approach in our final model.
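As a minimal illustration of the difference between the two designs (a hypothetical PyTorch sketch, not our exact implementation; all names and sizes are placeholders):

```python
import torch
import torch.nn as nn

n_columns, d_model = 8, 64  # hypothetical sizes

# Shared design: a single mask embedding reused at every masked position;
# only the added position embedding tells the model where the mask sits.
shared_mask = nn.Parameter(torch.randn(d_model))

# Position-specific design: one learnable mask embedding per column, letting
# the model encode *which* attribute is missing, not just *that* something is
# missing at that position. Extra parameters: (n_columns - 1) * d_model.
position_mask = nn.Parameter(torch.randn(n_columns, d_model))

position_emb = nn.Parameter(torch.randn(n_columns, d_model))

def apply_masking(tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """tokens: (B, n_columns, d_model); mask: (B, n_columns) boolean."""
    filled = position_mask.expand_as(tokens)  # or shared_mask.expand_as(tokens)
    out = torch.where(mask[..., None], filled, tokens)
    return out + position_emb  # position embeddings are added in both designs
```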
MLE task performance clarification
Thank you for raising this important point. We apologize for not including these results in Table 1. The complete results for classifiers and regressors trained on the original data can be found in the Appendix, page 20, Table 14. We commit to revising Table 1 to include this information in the main text for better clarity.
Regarding the number of synthetic samples used, the results reported in the last two columns of Table 1 were obtained using the same number of synthetic examples as in the training set, which varies according to the size of each dataset. This 1:1 ratio ensures a fair comparison across different methods. This information is provided in Appendix C.7.5.
DCR metric definition
We sincerely apologize for only providing the experimental setup for DCR without explaining how this metric is calculated.
DCR (Distance to Closest Record) is a commonly used metric for evaluating the privacy protection performance of synthetic data. For each synthetic example, we calculate its minimum distance to any example in the training set.
Formally, assume we have training examples $\{x_1, x_2, \ldots, x_n\}$ and synthetic examples $\{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m\}$. For each synthetic example $\hat{x}_i$, we compute:

$$\mathrm{DCR}(\hat{x}_i) = \min_{1 \le j \le n} d(\hat{x}_i, x_j),$$

where $d(\cdot,\cdot)$ represents the normalized Euclidean distance between two records after appropriate feature normalization.

Similarly, for holdout examples $\{x'_1, x'_2, \ldots, x'_k\}$, we can compute their DCRs $\mathrm{DCR}(x'_i) = \min_{1 \le j \le n} d(x'_i, x_j)$.

Since the training set and holdout set are i.i.d. samples from the same distribution, if the synthetic samples correctly learn the underlying distribution from the training set, the distributions of $\mathrm{DCR}(\hat{x}_i)$ and $\mathrm{DCR}(x'_i)$ should be similar. Otherwise, if the synthetic examples are copied from the training data, the $\mathrm{DCR}$ of each synthetic example will be close to zero.
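A minimal NumPy sketch of this computation (illustrative only; it assumes records are already numerically encoded and feature-normalized):

```python
import numpy as np

def dcr(queries: np.ndarray, train: np.ndarray) -> np.ndarray:
    """Distance to Closest Record: for each query row, the minimum
    Euclidean distance to any row in the training set."""
    # (m, n) matrix of pairwise distances between query and training rows
    dists = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    return dists.min(axis=1)

# Toy usage: compare DCR distributions of synthetic vs. holdout records.
rng = np.random.default_rng(0)
train = rng.random((1000, 5))
synthetic = rng.random((500, 5))
holdout = rng.random((500, 5))

dcr_syn, dcr_hold = dcr(synthetic, train), dcr(holdout, train)
# Privacy heuristic: the two distributions should look similar; a spike of
# near-zero DCRs for synthetic data suggests memorized/copied training rows.
print(np.median(dcr_syn), np.median(dcr_hold))
```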
Setting of missing data imputation
Thank you for raising this point about our imputation task setup. We would like to clarify that in real-world scenarios, we often face in-sample missing data imputation tasks, such as during the data preprocessing phase of large data science projects.
Both in-sample and out-of-sample imputation approaches have been explored in previous research [1,2]. Methods like HyperImpute adopt out-of-sample evaluation primarily because, as a hybrid approach, it requires supervised labels for model selection. Our method does not have this requirement. We also note that works like Remasker seem to adopt the in-sample setting as well.
That said, we agree that out-of-sample imputation is equally important. Our model is naturally suitable for out-of-sample settings, and we have conducted additional experiments to demonstrate its effectiveness in this context. The results are included in rebuttal.pdf at the anonymous link: https://anonymous.4open.science/r/ICML-TabNAT, providing a more comprehensive evaluation of our method's imputation capabilities across both settings.
The paper focuses on unconditional tabular data generation and conditional missing data imputation tasks. Considering the heterogeneous nature of tabular data, which includes both discrete and continuous variables, the proposed method utilizes a diffusion model to parameterize the conditional distributions for continuous columns. For discrete columns, it employs next-token prediction with KL divergence minimization. Additionally, a masked Transformer with bi-directional attention is used to support order-invariant generation. Experiments conducted on ten datasets with diverse attributes demonstrate the effectiveness of the proposed TabNAT method.
Questions for Authors
Although both continuous columns in tabular data and images consist of continuous values, their overall continuity differs, which may warrant a more in-depth analysis.
Claims and Evidence
The paper provides a detailed analysis of diffusion modeling and autoregressive modeling, employing a diffusion model to model continuous columns using Diffusion Loss and using next-token prediction for discrete columns with Cross Entropy Loss. While this claim appears reasonable at first glance, the continuous data in the table consists of individual continuous values, where the dependencies between data points are not as strong as in images, where contextual tokens exhibit high continuity. From this perspective, diffusion modeling may not necessarily be more suitable than autoregressive prediction for modeling continuous columns in tables. Therefore, the paper’s proposition of "modeling continuous columns in tables using the diffusion process" warrants further scrutiny.
Methods and Evaluation Criteria
The paper evaluates the effectiveness of the proposed model on multiple tabular datasets, demonstrating that the proposed methods outperform other approaches. For the task of Synthetic Tabular Data Generation, the results are assessed using metrics such as Statistical Fidelity and Machine Learning Efficiency (MLE). In the Missing Data Imputation task, evaluation is conducted using Average MAE for continuous features and Average Accuracy for discrete features. The chosen evaluation metrics are appropriate.
Theoretical Claims
The paper's approach to modeling and model utilization is well-founded. Continuous data is normalized to the [0,1] range and encoded using an MLP-based encoder, while discrete data is label-encoded and embedded. To account for the column invariance of tabular data, position encoding is applied at the column level. A Bi-directional Transformer is employed to simultaneously predict both continuous and discrete data, utilizing Diffusion Loss and Cross Entropy Loss, respectively.
Experimental Design and Analyses
• For the task of Synthetic Tabular Data Generation, the model is evaluated on ten datasets, including two with only continuous features, two with only discrete features, and six heterogeneous datasets. It is compared against six major categories of methods, including VAE, GAN, LLM, and Diffusion-based approaches. The results are assessed using metrics such as Statistical Fidelity and Machine Learning Efficiency (MLE), which are appropriate for this task. Additionally, corresponding ablation studies are conducted.
• For the Missing Data Imputation task, experiments are performed on five heterogeneous datasets. However, no ablation study is conducted for this task, and the compared baselines appear to be relatively weak.
Supplementary Material
The Supplementary Material includes pseudocode for key components, detailed experimental setups, descriptions of the model architectures, evaluation metrics, and additional experimental results.
Relation to Broader Scientific Literature
• This paper explores modeling tabular data by applying the diffusion process to continuous columns and using masked generative modeling for discrete columns. However, the applicability of diffusion loss to continuous tabular data requires further in-depth analysis.
• The paper provides an overview of various existing approaches for tabular data generation, including VAE, GAN, and autoregressive modeling. However, its analysis of diffusion loss remains insufficient. In image modeling, diffusion loss involves multiple iterative predictions of the overall image distribution, effectively capturing correlations between tokens. In contrast, discrete tabular data differs from images in that image data exhibits strong global continuity beyond individual tokens. In comparison, individual values in tabular data do not exhibit the same level of continuity. Therefore, the paper should further analyze and discuss related work on diffusion models in this context.
Essential References Not Discussed
• The paper lacks sufficient citations on Diffusion Loss, including its application to autoregressive modeling for continuous data [Tian K et al. '24], [Li T et al. '24].
• Tian K, Jiang Y, Yuan Z, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction[J]. Advances in neural information processing systems, 2024, 37: 84839-84865.
• Li T, Tian Y, Li H, et al. Autoregressive image generation without vector quantization[J]. Advances in Neural Information Processing Systems, 2024, 37: 56424-56445.
Other Strengths and Weaknesses
Strengths:
• The method and model architecture are clearly presented.
• The analysis of diffusion modeling and autoregressive modeling is reasonable.
• A substantial number of experiments have been conducted, yielding promising results.
Weaknesses:
• The differences between continuous columns in tabular data and images are not thoroughly analyzed; diffusion loss is simply applied to continuous columns without deeper justification.
• The second experiment lacks ablation studies, and the chosen baselines appear to be relatively weak.
Other Comments or Suggestions
The overall work is complete, and the method is explained clearly.
Ethics Review Issues
n/a
We sincerely thank the reviewer for the thorough review and constructive feedback on our manuscript. We have addressed the raised concerns as follows:
The suitability of diffusion modeling for continuous columns in tabular data
We agree that this is an important consideration that deserves clarification. Our choice of diffusion modeling for continuous columns is supported by empirical evidence from recent SOTA work:
- TabDDPM was the first model to successfully apply diffusion models to tabular data synthesis, using traditional DDPM for continuous columns and discrete diffusion with multinomial distribution for discrete columns. Its strong performance validated the effectiveness of diffusion models for capturing the complex distributions in continuous (multi-column) tabular data.
- TabSyn further advanced this approach by mapping both discrete and continuous columns to a continuous embedding space, enabling the application of standard diffusion models to the entire table. TabSyn has achieved state-of-the-art results in tabular data generation, demonstrating the exceptional capability of diffusion models to capture the distributions of continuous tabular data.
- Our empirical experiments align with these findings. It's important to note that direct autoregressive generative modeling of continuous values is not feasible without discretization. Baseline methods such as Tab-MT and DP-TBART both employ this discretization-plus-autoregressive approach, and our experiments demonstrate that they consistently underperform compared to our method across all metrics, particularly in Statistical Fidelity scores.
On the differences between continuous columns in tabular data and images
We agree with the reviewer that a deeper analysis of the differences between continuous values in tabular data and images would strengthen our paper. These differences are fundamental to our methodological choices:
- Tabular data columns are fundamentally more independent than image pixels, with each column typically representing a distinct feature with its own semantic meaning. In contrast, image data pixels have strong spatial continuity, with neighboring pixels highly correlated and collectively representing coherent visual elements.
- This structural difference necessitates different denoising neural network architectures. For images, CNN-based UNet models have become the standard architecture for diffusion models because they effectively leverage the local spatial correlations through convolutional operations and capture multi-scale features through their encoder-decoder structure with skip connections.
- For tabular data, where columns have distinct meanings without inherent spatial relationships, traditional MLP architectures are more appropriate as denoising neural networks. In our approach, we use an MLP for denoising continuous data, while the Transformer is employed solely to generate conditional vectors. This design choice aligns with other successful tabular diffusion models, such as TabDDPM and TabSyn, which also utilize MLPs as their denoising neural networks.
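To make the last point concrete, here is a minimal sketch (hypothetical PyTorch, not the paper's exact architecture; names and sizes are placeholders) of an MLP denoiser for the continuous columns, conditioned on a per-row vector z produced by the Transformer, trained with a standard DDPM-style noise-prediction loss:

```python
import torch
import torch.nn as nn

class MLPDenoiser(nn.Module):
    """Noise-prediction MLP for the continuous columns, conditioned on the
    per-row vector z from the Transformer (no convolutions needed, since
    tabular columns have no spatial structure)."""
    def __init__(self, n_num: int, d_cond: int, d_hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_num + 1 + d_cond, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, n_num),
        )

    def forward(self, x_noisy, t, z):
        # t is appended as a scalar feature for brevity; real implementations
        # typically use sinusoidal timestep embeddings instead.
        return self.net(torch.cat([x_noisy, t[:, None], z], dim=-1))

def diffusion_loss(denoiser, x_num, z, alpha_bar):
    """One DDPM-style training step: noise the continuous columns at a random
    timestep and regress the added noise, conditioned on z."""
    t = torch.randint(0, len(alpha_bar), (x_num.size(0),))
    a = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x_num)
    x_noisy = a.sqrt() * x_num + (1.0 - a).sqrt() * eps
    return ((denoiser(x_noisy, t.float(), z) - eps) ** 2).mean()
```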
On the missing data imputation experiment
We would like to clarify that the baselines we compared against (Remasker, HyperImpute, and GRAPE) are indeed SOTA methods for tabular missing data imputation. These same methods were used as the top-performing baselines in a recent ICLR 2025 Spotlight paper on tabular data imputation [1]. Our selection of baselines is therefore aligned with the current standards in the field. We will also include this new paper as an additional baseline in our revised manuscript to ensure our comparisons remain comprehensive and up-to-date.
Due to page limitations, we were unable to include ablation studies for the missing data imputation task in the original submission. We have now conducted these additional experiments, and the results are presented in Figures 15 and 16 in rebuttal.pdf at the anonymous link: https://anonymous.4open.science/r/ICML-TabNAT. The findings are summarized as follows:
- Simple MSE loss is as effective as the diffusion loss in missing data imputation, since we care more about the expectation of the missing entry than about its distribution. The imputation of discrete columns is sensitive to the order, and the random-order sampling used in our paper is beneficial.
- The proposed TabNAT is very robust to the model depth and width in the missing data imputation task.
These results further validate our design choices and demonstrate that each component of our model contributes significantly to its superior performance in the missing data imputation task.
Missing references
We thank the reviewer for pointing out the related works. We will cite these works and add a discussion about the application of diffusion losses in autoregressive image modeling.
References:
[1] Zhang et al. DiffPuter: Empowering Diffusion Models for Missing Data Imputation. In ICLR 2025.
The paper tackles the task of generating tabular data. It claims that the application of current autoregressive models for generating tabular data is limited due to two challenges:
- tabular data contains heterogeneous types, whereas autoregressive next-token (distribution) prediction is designed for discrete data
- tabular data is column permutation-invariant, requiring flexible generation orders.
In order to alleviate these issues, it proposes using a transformer which encodes the continuous data in each row (each row is a single data sample) into a vector at position 1 of the sequence, while the rest of the (discrete) data in that row is seen as the rest of the sequence, where each entry (corresponding to one column) is represented as a vector with learnable entries. As such, each row can be seen as a sequence of (L+1) vectors of a common embedding dimension. Uniform random masking is applied before feeding the sequence to the transformer, and the outputs are then used to predict the masked (missing) discrete data, and also to condition the generation of the continuous data.
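In code, the construction described above might look roughly like this (a hypothetical sketch; all sizes and names are placeholders):

```python
import torch
import torch.nn as nn

B, C, L, d = 32, 4, 6, 64  # batch size, continuous cols, discrete cols, model dim

cont_encoder = nn.Linear(C, d)  # all continuous columns -> one vector at position 1
cat_embed = nn.ModuleList([nn.Embedding(10, d) for _ in range(L)])  # one table per column
mask_tok = nn.Parameter(torch.randn(L, d))     # learnable mask embedding per position
pos_emb = nn.Parameter(torch.randn(L + 1, d))  # column-level position embeddings

x_num = torch.rand(B, C)              # normalized continuous values
x_cat = torch.randint(0, 10, (B, L))  # label-encoded discrete values

tokens = torch.stack([emb(x_cat[:, j]) for j, emb in enumerate(cat_embed)], dim=1)
mask = torch.rand(B, L) < torch.rand(B, 1)  # uniform random masking ratio per row
tokens = torch.where(mask[..., None], mask_tok.expand(B, L, d), tokens)

seq = torch.cat([cont_encoder(x_num)[:, None, :], tokens], dim=1) + pos_emb
# `seq` (B, L+1, d) is fed to the bidirectional transformer; outputs at masked
# positions predict the discrete columns, and the output at position 1 (index 0)
# conditions the diffusion model that generates the continuous columns.
```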
The proposed method is then compared to other methods on multiple datasets and assessed using multiple metrics. In addition, ablation studies are performed to see how the two claimed challenges above influence the performance.
Questions for Authors
- Why was a stochastic generation approach used instead of the ODE formulation, or flow matching, which could speed up generation?
- Why are masks encoded differently at each position when we already add positional encodings?
Claims and Evidence
The paper does not make any theoretical claims. Regarding performance claims, based on the provided results, the proposed method appears to outperform the other methods in most of the cases. However, it should be noted that my main areas of expertise are diffusion models and autoregressive models (mostly applied to language); therefore I cannot fully evaluate the significance of these results. My review will mostly focus on the merits of the design of modelling and the application of diffusion and other discrete modelling approaches, rather than on performance.
Methods and Evaluation Criteria
It should be noted that my main areas of expertise are diffusion models and autoregressive models (mostly applied to language); therefore I cannot fully evaluate the significance of these results. My review will mostly focus on the merits of the design of modelling and the application of diffusion and other discrete modelling approaches, rather than on performance.
That being said, the criteria appear to be sound and cover multiple aspects.
Theoretical Claims
The paper makes no novel theoretical claims.
Experimental Design and Analyses
The experiment designs appear to be sound and cover multiple aspects.
Supplementary Material
I reviewed Appendix A.
Relation to Broader Scientific Literature
I am not familiar with other tabular data generation methods.
The paper makes use of continuous diffusion models and a BERT-like masking architecture. However, it does not include categorical diffusion models (which are broadly known as discrete diffusion models).
Essential References Not Discussed
The main essential references missing, in my opinion, are discrete diffusion model papers (see for example [1,2,3,4,5]). The authors use a masking schedule similar to BERT, but do not attempt to model discrete data with discrete diffusion, which has been shown to perform well in modelling discrete data (such as language). In the context of this paper, it could be even more useful, as the data lacks the inherent left-to-right bias. The choice not to include them, on the grounds that they do not explicitly model (I believe this means analytically, via the pdf) dependencies between dimensions, is not properly supported, considering that the goal of the paper is data generation, not density/probability modelling.
[1] Structured Denoising Diffusion Models in Discrete State-Spaces, Austin et al., 2021.
[2] A Continuous Time Framework for Discrete Denoising Models, Campbell et al., 2022.
[3] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, Lou et al., 2024.
[4] Simple and Effective Masked Diffusion Language Models, Sahoo et al., 2024.
[5] Discrete Flow Matching, Gat et al., 2024.
Other Strengths and Weaknesses
Strengths
- The claims made in this paper are quite interesting, in particular the attempt not to model discrete tabular data with autoregressive models but to utilize full attention; the results show the latter is superior.
- The section which explains the method is very clearly written and illustrated.
- Based on the provided results, the proposed method improves upon the existing ones in the vast majority of cases.
Main weaknesses
- The main weakness in this paper is the modelling of continuous data. The paper suggests, in Equations (1) and (2), modelling the continuous data separately and then learning the discrete data conditioned on the continuous data. I believe this is the right approach, and the inference explanation in lines 270-274 reflects it. During actual training, however, the model does not learn the whole continuous distribution, but a family of conditional ones, conditioned on a masked sequence. As such, a natural order (a rule for choosing unmasking positions) is missing, which discrete diffusion mitigates. The main missing part in this paper is a comparison of this approach against the following:
Encode the perturbed continuous dimensions into a higher-dimensional vector and then split it into multiple vectors, which are placed at positions 1 to K of the sequence. If L is the number of discrete columns and C the number of continuous columns, then K/L should be roughly C/L. The discrete tokens come after, so the sequence so far has length K+L. Finally, the last K positions are reserved for the (clean) continuous data so that the discrete tokens can attend to it. That is, the first K tokens attend only to themselves; in the (K, L, K) sequence, the (K, ·, ·) block attends to the (K, ·, ·) block, the L tokens (·, L, ·) attend to (·, L, K), and finally (·, ·, K) attends to (·, ·, K) (a sketch of the corresponding block attention mask is given after this list). The last K tokens do not produce an output at the end; they are there only so that the L discrete tokens can condition on them. In this way the transformer could model both the continuous and the (discrete | continuous) distribution simultaneously. This would fit the objective proposed in Equations (1) and (2) of the paper. Otherwise, the section containing these equations should be modified to more properly set the tone for the proposed method.
- The work does not include advances in modelling discrete distributions, which have been developing since at least 2021. More recently, discrete diffusion modelling with absorbing (masked) dynamics has shown very competitive results in text modelling. It is likely that such models would perform very well on data without a left-to-right bias. In addition, they offer a natural unmasking schedule. Applying this approach to the (·, L, ·) discrete tokens is straightforward.
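For illustration, a hypothetical sketch of the block attention mask implied by the scheme above (my reading of the (K, L, K) description; `True` marks allowed attention):

```python
import torch

def block_attention_mask(K: int, L: int) -> torch.Tensor:
    """Boolean (2K+L, 2K+L) mask for a [perturbed-cont | discrete | clean-cont]
    sequence; entry (i, j) is True iff query position i may attend to key j."""
    n = 2 * K + L
    allow = torch.zeros(n, n, dtype=torch.bool)
    allow[:K, :K] = True          # perturbed continuous tokens attend to themselves
    allow[K:K + L, K:] = True     # discrete tokens attend to discrete + clean continuous
    allow[K + L:, K + L:] = True  # clean continuous tokens attend to themselves
    return allow

print(block_attention_mask(2, 3).int())  # inspect the (7, 7) pattern
```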
Minor Weaknesses
- There are some small issues with notation and grammar:
a. Line 110, right column, I believe it should be .
b. Line 111, right column, I believe it should be
c. Line 200, left column, I believe it should be
d. Line 303, left column. For each model.
- Figures 1 and 2 are unclear. The rectangles have varying sizes and no labeling, which makes it difficult to interpret the figures.
- While not a weakness, the results of comparing MSE against diffusion are not surprising.
Other Comments or Suggestions
The two main weaknesses are simultaneously suggestions, which I believe would improve the paper. Overall, I am on the borderline regarding acceptance. I am looking forward to reading the authors' responses, and am willing to reevaluate the paper based on their replies, as well as based on other reviews which could highlight something I might have missed.
We sincerely thank the reviewer for their thoughtful and constructive feedback on our manuscript. We appreciate the time and effort spent reviewing our work, and we believe the suggestions will significantly improve the quality of our paper. Below, we address each point raised by the reviewer.
Modeling of Continuous Data
Your observation about our approach to modeling continuous data is insightful. We agree with your assessment that our method doesn't directly model the full continuous distribution $p(\mathbf{x}^{\mathrm{num}})$, but rather learns a conditional distribution $p(\mathbf{x}^{\mathrm{num}} \mid \mathbf{m})$, where the constant vector $\mathbf{m}$ represents the case where all discrete columns are masked.
While this is conceptually different from modeling the complete joint distribution, in practice, the two approaches yield equivalent results for generation purposes. This is because when all discrete columns are masked, the model effectively sees only a generic placeholder with no specific information, forcing it to generate continuous values based solely on the learned marginal distribution of those values. The mask tokens in this case serve merely as positional indicators without conveying any actual discrete data information.
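In other words, writing the all-mask condition $\mathbf{m}$ explicitly:

$$p_\theta(\mathbf{x}^{\mathrm{num}} \mid \mathbf{m}) = p_\theta(\mathbf{x}^{\mathrm{num}}),$$

since $\mathbf{m}$ is a constant input carrying no information about the discrete columns, conditioning on it is equivalent to modeling the marginal distribution of the continuous columns.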
Your suggested approach of encoding perturbed continuous dimensions into vectors at positions 1 to K is elegant and theoretically sound. Following your suggestion, we implemented and tested this approach. Our experiments confirmed your intuition—the results were indeed comparable to our original method in terms of generation quality.
However, we identified a practical limitation: this approach significantly extends the sequence length, approximately tripling the training computational cost. Given the similar performance outcomes but increased resource requirements, we opted for our original approach as a more efficient solution. Nevertheless, we appreciate this valuable suggestion and will discuss this alternative formulation and its trade-offs in our revised paper.
Discrete Diffusion Models
Thank you for highlighting the recent advances in discrete diffusion models that we overlooked. While we did include basic approaches like multinomial diffusion in our baselines (which showed limited effectiveness), we missed the opportunity to explore more recent developments in this area.
The papers you referenced indeed offer promising directions for modeling discrete data without left-to-right bias. In our revised paper, we'll discuss these advancements and explore how they might complement our approach for mixed-type tabular data generation. We're particularly intrigued by masked diffusion approaches and their potential in heterogeneous tabular data generation.
Choice of Stochastic Generation Approach
We thank the reviewer for this insightful suggestion. Our primary focus was to effectively integrate diffusion models with autoregressive approaches for tabular data generation. Regarding the choice of sampling method, we acknowledge that different approaches offer various trade-offs between quality and efficiency. Following the reviewer's suggestion, we conducted additional experiments with flow matching. While this approach did improve sampling speed (requiring only 20 steps), we observed a slight decrease in generation quality. It's worth noting that our current sampling method requires only 50 sampling steps—significantly fewer than the 1000 steps typically needed in traditional DDPM approaches—thus already providing a good balance between quality and efficiency.
Position-Specific Masking Encoding
Intuitively, the position-specific mask embedding serves a different purpose than the position embeddings. While position embeddings encode the absolute position of each token in the sequence, the position-specific mask embedding is designed to capture information about which specific attributes are masked during the any-order training process. This allows the model to better understand the relationship between masked and unmasked attributes in different positions, enhancing its ability to handle partial information scenarios. In fact, we found that the empirical performance difference between the two designs is not substantial (with position-specific mask embedding performing slightly better). Considering that the additional parameters introduced by this design are negligible, we adopted this approach in our final model.
Notation and Figure Clarity
We thank the reviewer for pointing out the notation issues and concerns about figure clarity. We will correct all notation errors and improve the clarity of Figures 1 and 2 in the revised version of our manuscript.
I thank the authors for the additional information, and for performing the requested experiment. While I would have preferred to see the precise architecture and experimental setting you implemented, as well as the results (via an anonymized link), based on the explanation, I now lean accept regarding the submission. This evaluation now matches the official score I gave in the review.
There is consensus among reviewers that this paper should be accepted. Concerns about experimental setup appear to have been adequately addressed by the authors. This paper will represent a solid contribution to the ICML program.