Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
Abstract
Reviews and Discussion
- The authors introduce TabCutMix, a data augmentation strategy to mitigate memorization in tabular diffusion models.
- TabCutMix operates by combining samples that belong to the same label class. The authors claim that “The feature swap within the same class increases data diversity and reduces the likelihood of memorization while preserving the integrity of the label.”
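In pseudocode terms, the claimed operation reduces to something like the following sketch (the function name, the choice of how many columns to swap, and the uniform column sampling are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np
import pandas as pd

def tabcutmix_augment(df: pd.DataFrame, label_col: str, n_aug: int, seed: int = 0) -> pd.DataFrame:
    """Create n_aug new rows by swapping random feature columns between same-class pairs."""
    rng = np.random.default_rng(seed)
    features = [c for c in df.columns if c != label_col]
    rows = []
    for _ in range(n_aug):
        cls = rng.choice(df[label_col].unique())           # pick a class
        pool = df.index[df[label_col] == cls].to_numpy()
        i, j = rng.choice(pool, size=2, replace=False)     # two rows of that class
        new_row = df.loc[i].copy()
        k = int(rng.integers(1, len(features)))            # how many columns to swap
        swap_cols = list(rng.choice(features, size=k, replace=False))
        new_row[swap_cols] = df.loc[j, swap_cols]          # label stays untouched
        rows.append(new_row)
    return pd.DataFrame(rows).reset_index(drop=True)
```

The augmented rows are then appended to the real training set before fitting the diffusion model.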
Strengths
Originality & Significance
The paper tackles the memorization issue in tabular diffusion models, which has been underrepresented in recent research.
Quality
The visualizations are helpful for understanding the experiments.
Clarity
The paper is clear and well-written.
Weaknesses
- L233: Figure 4 — Which features are you removing specifically for the examples? The features you remove can potentially affect the memorization ratio. For instance, in InterpreTabNet, salient features contribute more towards predictions. Thus, features that are salient could potentially have a larger impact on the memorization ratio. Another example could be looking at the correlation between features. Would removing highly correlated features be more impactful on the memorization ratio than less correlated features?
- L235: The theoretical analysis in this section and the referenced appendix is nice but trivial. This information could be inferred from the EDM paper, which is what TabSyn uses. Additionally, the EDM paper is not cited.
- L286: The methodology is naive. One example is when there is a violation of feature dependencies — in the Adult dataset, Education and Education-Num are related. Swapping one without the other can create inconsistent samples.
- L319: The datasets seem very limited by current standards, with only 4 included. The referenced TabSyn and TabDDPM papers included 6 and 16 datasets in total, respectively.
- The overall methodology seems to be lacking in terms of contribution, proposing a trivial data augmentation technique that reduces memorization via an adaptation of CutMix from the image domain to tabular data.
- There are also numerous established data augmentation techniques for tabular data (e.g., SMOTE, noise injection) that serve similar purposes. The paper does not clearly differentiate TabCutMix from these methods or demonstrate significant advantages.
- It seems that TabCutMix can also be applied to other forms of generative models. I am unable to determine the reason that it is only generalizable to diffusion models.
Questions
Please see the weaknesses.
[W3] The methodology is naive. One example is when there is a violation of feature dependencies — in the Adult dataset, Education and Education-Num are related. Swapping one without the other can create inconsistent samples.
A3: Thank you for raising this insightful comment. We have thoroughly addressed this issue from three perspectives.
- Hyperparameter Control in TabCutMix for Balancing OOD Risk and Data Utility. TabCutMix provides a hyperparameter, the augmented ratio, to control the number of augmented samples included in training. This allows for careful tuning to balance between reducing memorization and preserving the utility of the augmented data.
- Proposed TabCutMixPlus with Adaptive Feature Exchange. To further mitigate the risk of creating OOD data, we developed TabCutMixPlus, which incorporates an adaptive feature exchange strategy. Specifically, we calculate feature correlations using metrics like Pearson’s correlation coefficient (for numerical features), Cramér’s V (for categorical features), and the squared eta coefficient (for numerical-categorical pairs). These measures allow us to cluster highly correlated features, ensuring that features within the same cluster are exchanged together during augmentation (see the sketch after this list). This process preserves the structural integrity of the data and significantly reduces the likelihood of generating OOD samples.
- OOD Detection Experiments for Quantifying OOD Risk. To assess the OOD risk introduced by TabCutMix and TabCutMixPlus, we conducted dedicated OOD detection experiments. Following [1], the OOD detection experiments framed the problem as a classification task, treating normal samples as negative and OOD samples as positive, where the positive OOD samples were synthesized by scaling a randomly selected numerical feature by a fixed factor or choosing a random value for categorical features from existing categories. A multi-layer perceptron (MLP) was trained on the original data and evaluated on augmented samples generated by TabCutMix and TabCutMixPlus to assess the proportion of samples classified as OOD (a code sketch of this protocol follows the table below). As shown in Table 4, TabCutMixPlus consistently reduces the OOD ratios compared to TabCutMix across all datasets. For example, on the Adult dataset, the OOD ratio is reduced from 2.06% (TabCutMix) to 0.36% (TabCutMixPlus). These results demonstrate that TabCutMixPlus significantly mitigates OOD risks while maintaining robust classification capabilities, reinforcing its utility in synthetic data augmentation workflows.
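For illustration, a minimal sketch of the adaptive grouping behind TabCutMixPlus (second point above) is given below. The pairwise association estimators follow their standard definitions, and the clustering is a simple threshold-based merging; the 0.5 threshold and the function names are illustrative assumptions, not our exact implementation:

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical columns."""
    ct = pd.crosstab(x, y).to_numpy().astype(float)
    n = ct.sum()
    expected = np.outer(ct.sum(axis=1), ct.sum(axis=0)) / n
    chi2 = ((ct - expected) ** 2 / expected).sum()
    r, k = ct.shape
    return float(np.sqrt(chi2 / (n * max(min(r, k) - 1, 1))))

def eta_squared(num: pd.Series, cat: pd.Series) -> float:
    """Squared eta coefficient between a numerical and a categorical column."""
    grand = num.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for _, g in num.groupby(cat))
    ss_total = float(((num - grand) ** 2).sum())
    return ss_between / ss_total if ss_total > 0 else 0.0

def feature_groups(df: pd.DataFrame, cat_cols: set, threshold: float = 0.5) -> list:
    """Merge features whose pairwise association exceeds the threshold (union-find)."""
    cols = list(df.columns)
    parent = {c: c for c in cols}
    def find(c):
        while parent[c] != c:
            c = parent[c]
        return c
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in cat_cols and b in cat_cols:
                score = cramers_v(df[a], df[b])
            elif a not in cat_cols and b not in cat_cols:
                score = abs(df[a].corr(df[b]))              # Pearson for num-num pairs
            else:
                num, cat = (a, b) if a not in cat_cols else (b, a)
                score = eta_squared(df[num], df[cat])
            if score > threshold:
                parent[find(a)] = find(b)                   # merge the two clusters
    groups = {}
    for c in cols:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())
```

During augmentation, a swap then operates on whole groups, so that, for example, Education and Education-Num end up in one group and are always exchanged together.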
The table below shows the F1 scores and OOD ratios (with standard deviations) for TabCutMix and TabCutMixPlus across various datasets.
| Dataset | Metric | TabCutMix | TabCutMixPlus |
|---|---|---|---|
| Adult | F1 | 92.67 ± 0.22 | 92.63 ± 0.20 |
| | OOD (%) | 2.06 ± 1.10 | 0.36 ± 0.27 |
| Default | F1 | 71.42 ± 1.32 | 71.39 ± 0.94 |
| | OOD (%) | 39.47 ± 6.70 | 25.44 ± 2.81 |
| Shoppers | F1 | 82.47 ± 0.35 | 82.28 ± 0.39 |
| | OOD (%) | 1.58 ± 0.76 | 0.70 ± 0.39 |
| Magic | F1 | 99.27 ± 0.07 | 99.19 ± 0.05 |
| | OOD (%) | 0.61 ± 0.03 | 0.43 ± 0.25 |
| Cardio | F1 | 60.33 ± 0.25 | 60.39 ± 0.17 |
| | OOD (%) | 4.83 ± 1.39 | 3.88 ± 0.19 |
| Churn Modeling | F1 | 97.94 ± 0.13 | 97.97 ± 0.02 |
| | OOD (%) | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Wilt | F1 | 99.94 ± 0.01 | 99.95 ± 0.03 |
| | OOD (%) | 0.00 ± 0.00 | 0.00 ± 0.00 |
TabCutMixPlus shows a significant reduction in OOD ratios compared to TabCutMix, particularly in datasets such as Default and Cardio, while consistently maintaining high F1 scores.
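A hedged sketch of the detection protocol for the numerical-feature case follows; the 3x scaling factor is our assumption here, since the exact factor is elided above, and a categorical perturbation would instead replace a value with a random existing category:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def synthesize_ood(X: np.ndarray, rng: np.random.Generator, scale: float = 3.0) -> np.ndarray:
    """Make OOD positives by scaling one randomly chosen numerical feature per row."""
    X_ood = X.copy()
    cols = rng.integers(0, X.shape[1], size=len(X))
    X_ood[np.arange(len(X)), cols] *= scale
    return X_ood

def ood_ratio(X_train: np.ndarray, X_aug: np.ndarray, seed: int = 0) -> float:
    """Train a normal-vs-OOD MLP on the original data; report fraction of augmented rows flagged OOD."""
    rng = np.random.default_rng(seed)
    X_pos = synthesize_ood(X_train, rng)
    X = np.vstack([X_train, X_pos])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_pos))])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=seed).fit(X, y)
    return float(clf.predict(X_aug).mean())
```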
[W4] Limited Datasets.
A4: The proposed TabCutMix method relies on exchanging feature samples within the same class label, which inherently limits its applicability to classification tasks. Consequently, we are unable to validate the method on regression tasks due to this methodological constraint.
To address your concern and broaden the scope of evaluation, we have added three additional classification datasets to the benchmark: Churn, Cardio, and Wilt. Please see Appendix E.4 for additional results on these datasets.
[W5] Please compare with numerous established data augmentation techniques for tabular data (e.g., SMOTE, noise injection) that serve similar purposes.
A5: We have implemented the baseline methods, SMOTE and Mixup, and provided their evaluation results in Table 1 and Table 5. The results demonstrate that our proposed TabCutMix and TabCutMixPlus achieve more effective memorization ratio mitigation in most cases while offering a superior trade-off between data generation quality and memorization control.
[W6] It seems that TabCutMix can also be applied to other forms of generative models. I am unable to determine the reason that it is only generalizable to diffusion models.
A6: We agree that, conceptually, TabCutMix's data augmentation strategy could be adapted to various generative modeling frameworks beyond diffusion models. However, the primary focus of this work is to address memorization issues specifically in diffusion models for tabular data, which represent the current state-of-the-art in generative modeling. Diffusion models have unique training dynamics and sampling processes, making them particularly suitable for high-quality data generation, but also prone to memorization.
While TabCutMix might be extended to other generative architectures, such as GANs or VAEs, those models come with their own distinct challenges, such as mode collapse or posterior approximation issues, which may require significant adaptations to the TabCutMix framework. Exploring such extensions is an interesting direction for future work, but it is beyond the scope of this paper.
I agree with reviewer DwHs that "rewriting the paper by focusing on TabCutMixPlus would make it even more beneficial to the community". Additionally, my concerns have only partially been addressed: Specifically additional experiments on other forms of generative models. Therefore, I will maintain my score.
We thank the reviewer for the thoughtful feedback.
[Q1] I agree with reviewer DwHs that "rewriting the paper by focusing on TabCutMixPlus would make it even more beneficial to the community".
A1: We have thoroughly revised our manuscript in the PDF, including the abstract, introduction, methodology, and experiments, with a strong emphasis on TabCutMixPlus. We also updated the citation format to ensure consistency with \citep. While we acknowledge your point that TabCutMixPlus could potentially warrant its own dedicated paper, we believe its inclusion here, alongside TabCutMix, provides important context for understanding the progression from a simple yet effective baseline toward mitigating memorization in tabular diffusion models.
[Q2] My concerns have only partially been addressed: Specifically additional experiments on other forms of generative models.
A2: We respectfully acknowledge the reviewer’s feedback and the suggestion to focus mainly on the memorization mitigation method. However, we believe that this evaluation does not fully consider the primary contribution and scope of our work. Our paper is primarily focused on pioneering the investigation of memorization in tabular diffusion models, a critical yet previously unexplored issue. This foundational contribution, along with the development of TabCutMix and its enhancement in TabCutMixPlus, is a pioneering step forward for the field. Our work provides a self-contained, rigorous analysis of memorization in tabular diffusion models and demonstrates the effectiveness of our methods with solid experiments and validations specific to this scope.
To further address your concern about additional experiments on other forms of generative models, we conducted additional experiments with CTGAN and TVAE as additional generative model baselines. Please see the detailed results in Appendix E.8. If there are any remaining points, we would be happy to address them promptly. If you believe our updated paper meets your expectations, we hope you might consider reflecting that in your evaluation.
Thank you once again for your valuable feedback.
We thank the reviewer for the constructive comments.
[W1] Which features are you removing specifically for the examples? The features you remove can potentially affect the memorization ratio. Would removing highly correlated features be more impactful on the memorization ratio than less correlated features?
A1: Thank you for raising this important question about the impact of feature removal on memorization ratio. In the current experiments, features were removed randomly without considering their correlation or importance. While this approach simplifies the analysis, we acknowledge that removing highly correlated or influential features could have a more pronounced effect on memorization ratios. Features with strong correlations often capture key patterns in the dataset, and their removal might significantly alter the data manifold, potentially reducing memorization further.
[W2] Theoretical Analysis in this section and the referred appendix is nice but trivial. The following information could be inferred from the EDM paper which is what TabSyn uses. Additionally, there is no citation of the EDM paper either.
A2: We want to clarify that our analysis is independent of the EDM paper and is specifically designed to address memorization phenomena in tabular diffusion models, which is a distinct research focus. While the EDM paper [1] explores the design space of diffusion models for image generation and emphasizes efficiency and quality improvements, it does not investigate memorization issues, particularly in the context of tabular data.
Our theoretical contribution is unique in explaining why memorization occurs in tabular diffusion models, providing insights into mitigating these issues. These aspects are unrelated to the scope of the EDM paper, which focuses on architectural and sampling improvements in diffusion models for vision tasks.
To avoid confusion, we will cite the EDM paper and add more discussion of the differences.
[1] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in neural information processing systems 35 (2022): 26565-26577.
Thank you for the experiments. I have raised my score from a 3 to a 5. However, I still remain conservative as the method from a theoretical perspective is trivial based on existing diffusion papers such as EDM.
We appreciate the reviewer’s acknowledgment of the additional experiments and the subsequent score adjustment!
[Q] I still remain conservative as the method from a theoretical perspective is trivial based on existing diffusion papers such as EDM.
A: While we respect the reviewer’s perspective, we strongly disagree with the continued characterization of our work as “trivial” from a theoretical standpoint. This assessment overlooks the distinctiveness and importance of our contributions, which are entirely independent of the EDM paper.
Our work addresses a critical and underexplored issue: memorization in tabular diffusion models. This focus is fundamentally different from that of the EDM paper, which centers on architectural and sampling improvements for diffusion models in the context of image generation. To the best of our knowledge, the EDM paper [1] neither studies nor offers insights into memorization phenomena in diffusion models, let alone in the tabular data domain.
We also emphasize that our theoretical contributions go beyond surface-level observations. Specifically, we provide a rigorous explanation of why memorization occurs in tabular diffusion models and actionable strategies to mitigate it. These contributions are both novel and practically impactful, particularly for applications where privacy and generalization are paramount. The comparison to the EDM paper is misplaced, as its goals and contributions are orthogonal to ours.
Furthermore, we underscore that the primary contributions of our work lie in identifying and addressing memorization phenomena in tabular diffusion models—a significant problem that has been largely overlooked in existing literature. This focus fills a critical gap in research and extends the applicability of diffusion models to new, high-stakes domains.
In summary, our contributions are far from trivial. They open up new research directions in generative modeling for tabular data and establish a theoretical and practical foundation for addressing memorization, a key challenge in this space.
[1] Karras, Tero, et al. "Elucidating the design space of diffusion-based generative models." Advances in neural information processing systems 35 (2022): 26565-26577.
This paper studies the memorization issue in diffusion models for tabular data generation. Memorization is defined as present when the distance between a synthetic data sample and the closest sample in the training dataset is less than one-third of the distance to the second-closest sample in the training dataset. Using this definition of memorization, the authors find that TabSyn exhibits different levels of memorization across different datasets. To reduce memorization, this paper proposes TabCutMix, which first produces new samples by randomly swapping features between two training samples with the same target label, then appends the new samples to the original training dataset.
Strengths
- This paper is in general well-written and easy to read.
- This paper studies an important but under-explored potential issue in applying the diffusion model for tabular data generation.
- The theoretical result, which shows that perfectly trained diffusion models will generate memorized latent representations, is interesting. Although, due to randomness in the sampling process, this theory does not prove that memorization must happen, it reveals sufficient potential for memorization.
Weaknesses
- Flawed method: The proposed method TabCutMix produces new samples by swapping features between two randomly selected training samples with the same target label. My biggest concern is that this procedure will contaminate the pair-wise correlation in the training dataset. Therefore, I am skeptical about the Trend Score in Table 1, which shows applying TabCutMix has very little effect on the pairwise correlation, which is pretty counter-intuitive. I am willing to raise my score if the authors can clarify this issue.
- Missing discussion of previous memorization metrics: In TabSyn, the authors actually also study a memorization metric: Distance to Closest Records (DCR), which measures the distance of the generated samples w.r.t. the training and a held-out dataset. It is necessary to compare and discuss the difference and connection between the proposed memorization metric and DCR. Also, the newly proposed memorization metric uses a pre-fixed 1/3 as the threshold; although it has been used in previous works on generating images, it may not be reasonable to directly adapt it to tabular data, which is a different modality. Fig 5, which plots the distribution of the ratio, contains more information than Figs 2, 3, and 4, which use a fixed threshold. However, as shown in Fig 5, the improvement from TabCutMix looks very marginal, again raising doubts about the effectiveness of the proposed method.
- Weak experiment:
  - Datasets: This paper considers TabSyn and TabDDPM as the base diffusion models to reduce memorization. In TabSyn, the experiments are conducted on 7 datasets, and in TabDDPM, 15 datasets are used. However, in this paper, experiments are done on only 4 datasets, which significantly weakens the evidence.
  - Fidelity metric: The C2ST metric used in TabSyn is not included in this paper.
Reference: Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space, ICLR 2023
Questions
See the weakness section.
We thank the reviewer for the constructive comments.
[W1] TabCutMix contaminates the pair-wise correlation in the training dataset. Therefore, I am skeptical about the Trend Score in Table 1, which shows applying TabCutMix has very little effect on the pairwise correlation, which is pretty counter-intuitive.
A1. TabCutMix provides a hyperparameter, the augmented ratio, to control the number of augmented samples included in training. This allows for careful tuning to balance between reducing memorization and preserving the utility of the augmented data.
Additionally, we designed OOD detection experiments for quantifying OOD risk. To assess the OOD risk introduced by TabCutMix and TabCutMixPlus, we conducted dedicated OOD detection experiments. Following [1], the OOD detection experiments framed the problem as a classification task, treating normal samples as negative and OOD samples as positive, where the positive OOD samples were synthesized by scaling a randomly selected numerical feature by a fixed factor or choosing a random value for categorical features from existing categories. A multi-layer perceptron (MLP) was trained on the original data and evaluated on augmented samples generated by TabCutMix and TabCutMixPlus to assess the proportion of samples classified as OOD. As shown in Table 4, TabCutMixPlus consistently reduces the OOD ratios compared to TabCutMix across all datasets. For example, on the Adult dataset, the OOD ratio is reduced from 2.06% (TabCutMix) to 0.36% (TabCutMixPlus). These results demonstrate that TabCutMixPlus significantly mitigates OOD risks while maintaining robust classification capabilities, reinforcing its utility in synthetic data augmentation workflows.
The table below shows the F1 scores and OOD ratios (with standard deviations) for TabCutMix and TabCutMixPlus across various datasets.
| Dataset | Metric | TabCutMix | TabCutMixPlus |
|---|---|---|---|
| Adult | F1 | 92.67 ± 0.22 | 92.63 ± 0.20 |
| | OOD (%) | 2.06 ± 1.10 | 0.36 ± 0.27 |
| Default | F1 | 71.42 ± 1.32 | 71.39 ± 0.94 |
| | OOD (%) | 39.47 ± 6.70 | 25.44 ± 2.81 |
| Shoppers | F1 | 82.47 ± 0.35 | 82.28 ± 0.39 |
| | OOD (%) | 1.58 ± 0.76 | 0.70 ± 0.39 |
| Magic | F1 | 99.27 ± 0.07 | 99.19 ± 0.05 |
| | OOD (%) | 0.61 ± 0.03 | 0.43 ± 0.25 |
| Cardio | F1 | 60.33 ± 0.25 | 60.39 ± 0.17 |
| | OOD (%) | 4.83 ± 1.39 | 3.88 ± 0.19 |
| Churn Modeling | F1 | 97.94 ± 0.13 | 97.97 ± 0.02 |
| | OOD (%) | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Wilt | F1 | 99.94 ± 0.01 | 99.95 ± 0.03 |
| | OOD (%) | 0.00 ± 0.00 | 0.00 ± 0.00 |
TabCutMixPlus shows a significant reduction in OOD ratios compared to TabCutMix, particularly in datasets such as Default and Cardio, while consistently maintaining high F1 scores.
[W2] It is necessary to compare and discuss the difference and connection between the proposed memorization metric with DCR. Also, the new proposed memorization metric uses a pre-fixed 1/3 as the threshold, although it has been used in previous works on generating images. It may not be reasonable to directly adapt it to tabular data, which is a different modality. The improvement from TabCutMix in Fig. 5 looks very marginal.
A2: The DCR metric measures the (closest) distance of each synthetic sample to both the training and holdout sets, providing insights into potential privacy concerns by assessing how closely synthetic data points resemble real training data. In other words, the DCR metric is highly dependent on the holdout set, which hampers the robustness of the evaluation result. To mitigate such dependency, we use a distance ratio to determine the memorization metric, focusing on whether a generated sample is a memorized replica of the training data, specifically identifying samples whose closest-neighbor distances are disproportionately small compared to their second-closest neighbors. The connection between the two metrics lies in their shared reliance on distance-based measures, but they serve distinct purposes: DCR assesses privacy risks, while our metric detects memorization within generative outputs.
Regarding the use of a fixed threshold for the relative distance ratio, we acknowledge that this threshold was inspired by prior work in image generation and may not perfectly translate to tabular data, given the differences in modality. However, the threshold provides a practical starting point, and its adoption enables comparability with existing methods. We also recognize that adapting the threshold to the specific characteristics of tabular datasets (e.g., feature types and distributions) could further refine the metric's accuracy. This is an area that warrants further investigation and will be addressed in future work.
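To make the criterion concrete, a minimal sketch of the ratio-based metric is given below; it uses plain Euclidean distance for brevity, whereas our actual metric substitutes the mixed-type distance when categorical features are present:

```python
import numpy as np

def memorization_ratio(X_syn: np.ndarray, X_train: np.ndarray, threshold: float = 1 / 3) -> float:
    """Fraction of synthetic rows whose closest training neighbor is disproportionately close."""
    memorized = 0
    for x in X_syn:
        d = np.linalg.norm(X_train - x, axis=1)
        d1, d2 = np.partition(d, 1)[:2]        # closest and second-closest distances
        if d2 > 0 and d1 / d2 < threshold:
            memorized += 1
    return memorized / len(X_syn)
```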
We respectfully disagree with the reviewer's comment regarding the improvement of TabCutMix. Our proposed method demonstrates a significant reduction in memorization ratio across multiple datasets; for example, with TabSyn, the memorization ratio is substantially reduced on the Magic dataset, and there is a clear relative improvement in memorization ratio on the Adult dataset. Furthermore, as shown in Figure 5, the nearest-neighbor distance ratio probability around 1 for TabCutMix is significantly higher compared to the vanilla method, indicating improved diversity, while it is noticeably lower for other distance ratio values, highlighting reduced memorization.
[W3] In this paper, the experiments are done on only 4 datasets, which significantly weakens the evidence.
A3: The proposed TabCutMix method relies on exchanging feature samples within the same class label, which inherently limits its applicability to classification tasks. Consequently, we are unable to validate the method on regression tasks due to this methodological constraint.
To address your concern and broaden the scope of evaluation, we have added three additional classification datasets to the benchmark: Churn, Cardio, and Wilt. Please see Appendix E.4 for additional results on these datasets.
[W4] Fidelity metric: C2ST metric used in TabSyn is not included in this paper.
A4: We have included the C2ST and DCR metrics in Tables 1 and 5. It can be seen that our proposed methods TabCutMix and TabCutMixPlus still achieve better performance compared with baselines across various datasets and tabular diffusion models.
Dear Reviewer 2Ann,
Thank you once again for your constructive feedback and valuable suggestions.
As we near the end of the discussion phase, we kindly ask if our responses have adequately addressed your concerns. If there are any remaining questions or points requiring clarification, please let us know—we would be happy to provide further details.
We deeply appreciate the time and effort you have dedicated to reviewing our submission. If our revisions and responses have resolved your concerns, we hope you might consider adjusting your score accordingly.
Thank you for your thoughtful consideration and time.
Sorry for my late response. I appreciate the authors' effort in addressing my concerns. While some of my initial questions have been clarified, some issues remain:
- About data contamination.
TabCutMix provides a hyperparameter, the augmented ratio, to control the number of augmented samples included in training. This allows for careful tuning to balance between reducing memorization and preserving the utility of the augmented data.
I agree that the newly proposed TabMixPlus can intuitively mitigate the data contamination issue by swapping only less correlated columns. However, this solution is not fully satisfactory in my opinion. As it introduces additional hyperparameters (num of clusters, types of correlation during preprocessing, etc). It can be expensive/difficult for the user to identify the correct configuration, to do the proper pre-processing. For example, the user may not know what type of correlation they should keep, as the data and the downstream tasks jointly determine it.
- About threshold.
Thanks for the explanation. I think we basically agree that taking 1/3 as the threshold may not be optimal for tabular data. I think it may be beneficial to define the memorization on a uniform grid from a range of thresholds, say 1/2-1/3. By doing this, the definition becomes more robust.
- Experiments on more datasets.
The proposed TabCutMix method relies on exchanging feature samples within the same class label, which inherently limits its applicability to classification tasks. Consequently, we are unable to validate the method on regression tasks due to this methodological constraint.
I appreciate the authors' efforts in adding three new datasets. However, limiting the framework to datasets with classification target columns constrains its broader applicability. Expanding TabMix to support regression and other target variable types would significantly increase its practical utility across diverse real-world scenarios.
Summary
I think this paper will benefit from a more thorough study based on TabMixPlus (as Reviewer DwHs suggested), which should address the aforementioned weak points. Therefore, I will keep my score.
We thank the reviewer for the thoughtful feedback.
[Q1] Data Contamination. TabCutMixPlus introduces additional hyperparameters.
A1: We acknowledge the OOD risks associated with TabCutMix and proposed TabCutMixPlus as a mitigation strategy. The key idea of TabCutMixPlus is to reduce contamination by swapping only less correlated columns, which intuitively addresses the issue. While we agree this approach introduces additional hyperparameters (e.g., number of clusters, types of correlations), it is a step towards improving data utility without overly complicating the memorization performance.
However, we respectfully argue that if a simple but effective solution works well for a relatively new task — which is the case for TabCutMix (and TabCutMixPlus) — it may not be necessary or well-motivated to pursue a more complex or "perfect" solution at this stage. Instead, such work sets the foundation for follow-up research to further refine and enhance performance in this emerging area.
Besides, our extensive experiments demonstrate that TabCutMixPlus is both practical and effective across 7 datasets and 5 generative models, against 3 baselines, using 8 evaluation metrics. These comprehensive results underscore the broad applicability of TabCutMixPlus and its ability to address contamination concerns in many real-world scenarios. As Reviewer DwHs pointed out, while not perfect, TabCutMixPlus sufficiently addresses the issue for numerous datasets, making it practical for many use cases.
Overall, we emphasize that TabCutMix and TabCutMixPlus are not only self-contained but also effective in laying the groundwork to mitigate memorization. These methods provide a practical starting point, and we hope they inspire future work to refine and extend these ideas.
[Q2] It may be beneficial to define the memorization on a uniform grid from a range of thresholds, say 1/2-1/3. By doing this, the definition becomes more robust.
A2: We appreciate the reviewer’s suggestion to define memorization on a uniform grid across a range of thresholds, such as 1/3 to 1/2, to enhance robustness. To address this, we propose and employ the Memorization Area Under Curve (Mem-AUC) metric, which aggregates the memorization ratio across a continuous range of thresholds. This approach inherently captures variations in memorization intensity over a range of thresholds, making the analysis more robust compared to single-point estimations.
We conduct experiments to demonstrate the high consistency between the memorization ratio and Mem-AUC in Appendix E.9. Specifically, we present both the memorization ratio and Mem-AUC results for various methods, datasets, and backbones. As shown in Table 9, the values of Mem-AUC and memorization ratio are consistent and align well across different configurations, demonstrating the reliability of Mem-AUC as a comprehensive metric. To further validate the robustness of Mem-AUC, we conducted a correlation analysis between the memorization ratio and Mem-AUC for different methods. The results, illustrated in the accompanying scatterplots, indicate very high correlation coefficients across all methods, e.g., TabCutMixPlus: 0.998; TabCutMix: 0.943; Mixup: 0.998. These high correlation coefficients suggest that while Mem-AUC provides a richer, threshold-agnostic view of memorization, it is strongly aligned with the simpler memorization ratio metric. This alignment demonstrates that Mem-AUC effectively generalizes the concept of memorization ratio over a range of thresholds.
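For illustration, a minimal sketch of Mem-AUC under these definitions follows; the threshold grid endpoints of 0 and 1 are an assumption here, and Appendix E.9 specifies the exact range used:

```python
import numpy as np

def mem_auc(d1: np.ndarray, d2: np.ndarray, grid=None) -> float:
    """Integrate the memorization ratio over a grid of thresholds.

    d1, d2: per-sample distances to the closest and second-closest training rows.
    """
    grid = np.linspace(0.0, 1.0, 101) if grid is None else np.asarray(grid)
    rel = d1 / np.maximum(d2, 1e-12)                       # per-sample distance ratio
    curve = np.array([(rel < t).mean() for t in grid])     # memorization ratio at each threshold
    return float(np.sum((curve[1:] + curve[:-1]) / 2 * np.diff(grid)))  # trapezoid rule
```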
[Q3] Limiting the framework to datasets with classification target columns constrains its broader applicability. Expanding TabMix to support regression and other target variable types would significantly increase its practical utility across diverse real-world scenarios.
A3: We appreciate the reviewer’s comment on expanding TabCutMix to support regression and other target types. While we acknowledge this limitation, we believe it is not a critical concern for our work.
- Our framework is designed specifically for classification tasks, which are highly impactful in real-world scenarios (e.g., fraud detection, medical diagnosis).
- By addressing classification tasks, our work establishes a robust foundation for mitigating challenges like memorization and data contamination in tabular data generation. This foundational approach can inspire future research to generalize the framework to other tasks like regression. Expanding the scope is a logical follow-up step but is not the primary focus of this paper.
- To validate the practicality and effectiveness of TabCutMix, we performed extensive experiments on 7 diverse datasets, 5 generative models, 8 evaluation metrics, and 3 baseline methods, all focused on classification tasks. These rigorous evaluations demonstrate the utility of our approach in its intended scope, emphasizing its current value.
While we acknowledge the limited applicability to classification tasks, this is a deliberate and well-justified choice, given the high impact and importance of these tasks. We view our work as a foundational step that lays the groundwork for future generalizations to other target variable types, such as regression, as part of the broader research trajectory.
The paper investigates the phenomenon of data memorization in diffusion models for tabular data generation, highlighting how memorization can lead to privacy issues and reduced generalization in models. The study introduces a criterion based on nearest-neighbor distance ratios to quantify memorization, revealing that diffusion models, such as TabSyn and TabDDPM, tend to memorize training data, especially as training epochs increase. This memorization is influenced by dataset size and feature dimensions, with smaller datasets often leading to higher memorization levels. The paper further provides theoretical insights into the mechanisms behind memorization in tabular diffusion models, showing how specific model configurations and training processes contribute to this effect.
To mitigate memorization, the authors propose TabCutMix, a post-processing technique inspired by the CutMix approach in image processing. TabCutMix randomly swaps feature segments between samples within the same class, preserving label integrity while introducing diversity in the synthetic data. This approach effectively disrupts memorization tendencies by reducing the exact resemblance between generated samples and training data. Experimental results demonstrate that TabCutMix significantly reduces memorization across multiple datasets and model configurations, while also maintaining key aspects of data quality, including fidelity and the overall statistical distribution of features. The approach achieves a balance between mitigating memorization and preserving data utility for downstream tasks.
Strengths
- The paper addresses a critical issue of data memorization in deep generative models for tabular data, where privacy risks from memorization could pose greater harm compared to the image and language domains.
- It offers a comprehensive examination of the latest state-of-the-art generators, filling a gap in prior work that lacks focus on tabular data generation models.
- The paper provides clear motivation and detailed descriptions of its memorization metrics and the proposed post-processing technique, TabCutMix, enhancing understanding and applicability of the methods introduced.
Weaknesses
- Choice of Memorization Metric: The tabular synthesis field has widely adopted distance-based metrics to assess privacy leakage, with impactful works such as Platzer et al. using the ratio of synthetic examples closer to the training set than to the test set. The proposed method uses a variant of an l2 distance-based approach but does not sufficiently discuss why this particular metric is superior to the standard l2 distance. The proposed metric may still suffer from issues such as sensitivity to outliers, a limitation common to other distance-based measures.
- Limited Dataset Benchmarking: The small number of datasets used for evaluation may be insufficient to conclude consistent patterns in memorization.
- Effectiveness of TabCutMix Post-Processing: In tabular data synthesis, the trade-off between utility and privacy (or memorization reduction) is well established, with increased perturbation leading to lower memorization but reduced data fidelity. To validate TabCutMix’s usefulness, the paper should compare it to other established “shallow” perturbation techniques, such as SMOTE or Mixup, to better illustrate its advantages.
Reference: Platzer, Michael, and Thomas Reutterer. "Holdout-based empirical assessment of mixed-type synthetic data." Frontiers in big Data 4 (2021): 679939.
Questions
- Please compare the proposed mixed distance against the l2 distance with one-hot encoding for categorical variables, and explain the essential difference between the two metrics.
- In the experiment for Table 1, test post-processing using SMOTE[1] and Mixup for tabular data[2] in addition to TabCutMix.
- In the case study: does the resulting sample of TabSyn + TabCutMix now closely resemble another real example? Or is the distance to the closest real example increased?
- The tested datasets have 10k+ samples and are still fairly large with 10% subsampling. Do the same behaviors hold for more realistic small-scale datasets with <= 200 samples? Consider repeating the analysis in Section 3 on the following benchmarking datasets: Insurance, Indian Liver Patient, Titanic, Obesity. You can also try smaller subset percentages such as 0.1% and 1%.
Reference: [1] Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of artificial intelligence research 16 (2002): 321-357. [2]Takase, Tomoumi. "Feature combination mixup: novel mixup method using feature combination for neural networks." Neural Computing and Applications 35.17 (2023): 12763-12774.
We thank the reviewer for the constructive comments.
[W1&Q1] Please compare the proposed mixed-distance against l2 distance with one-hot encoding for categorical variable, and explain the essential difference between the two metrics. The proposed metric may still suffer from issues such as sensitivity to outliers, a limitation common to other distance-based measures.
A1: Thank you for raising the need to compare the proposed mixed-distance metric with the l2 distance using one-hot encoding for categorical variables, and for highlighting the issue of sensitivity to outliers in distance-based measures. To clarify, the l2 distance with one-hot encoding for categorical variables is mathematically equivalent to the indicator distance in our mixed-distance metric, differing only by a constant factor. Specifically, the l2 distance between two one-hot encoded vectors equals √2 when they differ and 0 when they are the same, which aligns with the binary indicator distance. The key distinction of our approach lies in combining the indicator distance for categorical features with a normalized l2 distance for numerical features, ensuring compatibility across mixed-type tabular datasets. Additionally, our metric normalizes distances for all features into the same range to prevent any single feature type from dominating the overall distance calculation. This normalization is essential for ensuring balanced contributions from both numerical and categorical variables during distance computations.
We agree that sensitivity to outliers is a limitation shared by most distance-based metrics, including the proposed mixed-distance metric. This is a valid concern that warrants further research. Future work could explore robust scaling techniques or distance-weighted adjustments to mitigate the impact of outliers, particularly in mixed-type tabular datasets. However, we would like to highlight that the primary contribution of this work is to deepen the understanding of memorization in tabular diffusion models and to propose methods for mitigating it. The mixed-distance metric is introduced as a practical tool within this scope, rather than as a general-purpose distance metric. We acknowledge the importance of outlier sensitivity as a promising direction for future investigation.
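For concreteness, one plausible reading of the mixed-type distance described above, assuming min-max normalized numerical features, is sketched below; the exact weighting in the paper may differ, and note that each categorical mismatch contributes 1 here versus 2 under squared l2 over one-hot encodings:

```python
import numpy as np

def mixed_distance(a_num, b_num, a_cat, b_cat) -> float:
    """Normalized l2 over numerical columns plus an indicator term per categorical column."""
    num_part = np.sum((np.asarray(a_num, dtype=float) - np.asarray(b_num, dtype=float)) ** 2)
    cat_part = np.sum(np.asarray(a_cat) != np.asarray(b_cat))   # 0/1 indicator distance
    return float(np.sqrt(num_part + cat_part))
```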
[W2] Limited Dataset Benchmarking. More datasets are needed to make a convincing conclusion.
A2: The proposed TabCutMix method relies on exchanging feature samples within the same class label, which inherently limits its applicability to classification tasks. Consequently, we are unable to validate the method on regression tasks due to this methodological constraint.
To address your concern and broaden the scope of evaluation, we have added three additional classification datasets to the benchmark: Churn, Cardio, and Wilt. Please see Appendix E.4 for additional results on these datasets.
[W3] More Baselines: To validate TabCutMix’s usefulness, the paper should compare it to other established “shallow” perturbation techniques, such as SMOTE or Mixup, to better illustrate its advantages.
A3: We have implemented the baseline methods, SMOTE and Mixup, and provided their evaluation results in Table 1 and Table 5. The results demonstrate that our proposed TabCutMix and TabCutMixPlus achieve more effective memorization ratio mitigation in most cases while offering a superior trade-off between data generation quality and memorization control.
[Q3] In the case study: does the resulting sample of TabSyn + TabCutMix now closely resemble another real example? Or is the distance to close real example increased?
A4: Based on Table 2, the sample generated by TabSyn + TabCutMix does not closely resemble another specific real example. Instead, it increases the distance to the nearest real sample by altering features like age (e.g., changing from 47.0 to 36.0) while preserving key categorical relationships such as workclass ("Private") and marital status ("Divorced"). This demonstrates that TabCutMix introduces diversity and reduces the risk of memorization.
[Q4] The tested dataset have 10k+ samples and are still fairly large with 10% subsampling. Does the same behaviors hold for more realistic small-scale dataset with <= 200 samples? Consider repeating the analysis in section 3 on the following benchmarking datasets: Insurance, Indian Liver Patient, Titanic, Obesity. You can also try smaller subset percentages such as 0.1%, 1%.
A5: We have conducted more experiments on 3 additional datasets and smaller subset percentages of 0.1% and 1%. After conducting experiments on these smaller subsets, we revised our observation in Section 3.3 and find consistently higher memorization ratios for all datasets. Please see Figure 3 and Table 5 for detailed experimental results.
Dear Reviewer j6Xu,
Thank you once again for your constructive feedback and valuable suggestions.
As we near the end of the discussion phase, we kindly ask if our responses have adequately addressed your concerns. If there are any remaining questions or points requiring clarification, please let us know—we would be happy to provide further details.
We deeply appreciate the time and effort you have dedicated to reviewing our submission. If our revisions and responses have resolved your concerns, we hope you might consider adjusting your score accordingly.
Thank you for your thoughtful consideration and time.
We sincerely thank the reviewer for their constructive comments and for raising their score. We greatly appreciate your acknowledgment of our efforts to address the evaluation principles, the effectiveness of TabCutMix, and its advantages over SMOTE in balancing memorization reduction and generation quality.
Regarding your remaining concerns about the completeness of memorization metrics, we acknowledge that standard distance-based metrics, including ours, can face challenges such as sensitivity to outliers and the curse of dimensionality. These are indeed important issues, and we agree that incorporating inference attack-based methods could further enhance the benchmarking of memorization. We added the discussion in future work (Appendix G).
However, we respectfully disagree with the assertion that our mixed-type distance metric closely resembles standard DCR metrics except for a normalizing constant. As detailed in Appendix D.5, DCR’s dependence on the holdout set limits its robustness, as the results can vary with changes in the holdout set composition. Additionally, the mixed-type distance metric we use is equivalent to the l2 distance applied to one-hot encodings for categorical features, up to a constant factor. This approach ensures that our metric handles categorical and numerical features in a unified manner while maintaining interpretability. Unlike DCR, our memorization metric directly assesses overfitting within the generative process and is independent of holdout set composition, addressing a key limitation of DCR.
Once again, we thank the reviewer for the thoughtful suggestions and the revised rating. We are encouraged by the feedback and will continue to refine and enhance our approach to address these open challenges in future research.
I appreciate the detailed and thoughtful responses, and I apologize for the delay in providing my feedback. My concerns regarding the evaluation principles and the effectiveness of TabCutMix have been addressed. The additional comparison with SMOTE demonstrates that TabCutMix achieves a better balance between reducing memorization and preserving generation quality.
That said, my concerns with the completeness of memorization metrics remain. Based on the details provided by the authors, the mixed-type distance metrics closely resemble standard DCR metrics, with the exception of a normalizing constant. Memorization tests in synthetic data generation that rely solely on standard distance-based metrics are known to be susceptible to issues such as the curse of dimensionality and sensitivity to outliers. Incorporating inference attack-based methods could further enhance the quality and reliability of benchmarking.
Taking all responses into account, I am revising my rating from 5 to 6.
This paper focuses on the problem of memorization (overfitting) in diffusion models used on tabular data.
The paper is organized as a sequence of questions, with experiments to answer them, namely:
- Does memorization occur in tabular diffusion models, and if so, how can it be effectively mitigated?
- Effect of diffusion model, impact of dataset size.
- Feature dimension.
The paper demonstrates that:
- Memorization occurs at similar levels regardless of the algorithm used, which suggests the origin lies in the diffusion process itself. This is confirmed by Proposition 3.1.
- Surprisingly, dataset size has no impact on this, or a counterintuitive one.
- Feature dimension is important, having influence in both directions depending on the dataset.
The authors propose a simple augmentation technique (TabCutMix) that mixes the column features of two examples to create a new one. Empirically, this technique reduces memorization. The authors measure Precision/Recall/Shape and Trend scores of the diffusion model trained on augmented data, and show that this does not have too detrimental an effect on these metrics, which suggests the practical efficiency of the method. A visual sanity check is performed using t-SNE.
Strengths
Simplicity
The method is extremely simple, since it is essentially just a data-augmentation technique done at the pipeline level. The contribution could even look "too simple" if it weren't for all the experiments and sanity checks, which are convincing.
Methodology and clarity
The paper is structured as a set of questions, experiments to answer these questions, and empirical observations. Not only does it improve clarity, but it also makes the paper impactful and useful on its own, outside the technical contribution of the data augmentation.
Sanity checks
The authors monitor several metrics to ensure that the diffusion model trained with the data augmentation is still faithful to the training data, which is crucial to ensure the relevance of the diffusion model in this context. It is not hard to ensure diversity of a generative model; the difficulty lies in balancing this diversity with the recall of the true distribution.
Theoretical results
The theoretical results give valuable insights into the experiments, creating a consistent narrative.
Weaknesses
Limitations
I feel like a discussion of limitations is lacking. This data augmentation changes the input distribution, like any data augmentation. In the image space, data augmentations like symmetries come from a prior over the structure of the input space; other ones like CutMix produce OOD data that act as regularization.
TabCutMix makes an implicit hypothesis about the structure of the data manifold.
Let me resurrect the infamous "XOR problem" of neural networks. Assume we are given a dataset with two numerical features x1, x2 and one categorical feature y. Assume that the "class" is y = sign(x1 · x2), i.e., the class will be positive if the two features have the same sign, and negative otherwise. This classification task effectively splits the dataset into four quadrants around the origin (0, 0), with a XOR pattern. Applying TabCutMix on this problem will mix the two distributions p(x1, x2 | y = +1) and p(x1, x2 | y = -1), and they will completely overlap.
Therefore, this method is at high risk of creating OOD data that might overlap the categories. If such diffusion model is used for downstream tasks, this can be problematic as it will not reflect the real data distribution.
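For what it's worth, this failure mode is easy to reproduce numerically. The following self-contained sketch shows that after within-class column swaps, roughly half of the augmented points land in quadrants of the opposite class:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
y = np.sign(x[:, 0] * x[:, 1])            # XOR-style label over the four quadrants

aug = x.copy()
for cls in (-1, 1):
    idx = np.where(y == cls)[0]
    perm = rng.permutation(idx)
    aug[idx, 1] = x[perm, 1]              # TabCutMix-style swap of feature 2 within the class

consistent = np.mean(np.sign(aug[:, 0] * aug[:, 1]) == y)
print(f"augmented samples still in a correct quadrant: {consistent:.2%}")  # roughly 50%
```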
Questions
- I was surprised by Obs 2 of Sec 3.2. Do you have an idea of why the memorization ratio is independent of the diffusion model used?
- I was puzzled by Obs 2 of Sec 3.3. I would expect smaller datasets to yield higher memorization rates (easier overfitting). Why is it not changing on some datasets? I suspect that the definition you use in l230 is biased in this regard: since it relies on a distance ratio, smaller datasets lead to a more sparsely populated space, so absolute distances are increased but maybe relative distances remain unchanged. I am not sure how it plays with the curse of dimensionality though, as I would expect any criterion based on Euclidean distance to become irrelevant in high dimension.
- Can the authors comment on the implicit hypothesis TabCutMix makes about the data manifold, explicit cases in which it might "fail" by creating OOD data, and explicit cases in which they expect it to perform well? Even better, a practical example of a dataset on which the method failed.
[Q3] What is the implicit hypothesis TabCutMix makes about the data manifold, explicit cases in which it might "fail" by creating OOD data, and explicit cases in which they expect it to perform well? Even better, a practical example of a dataset on which the method failed.
A5: Thank you for raising this insightful question regarding the implicit hypothesis underlying TabCutMix and the scenarios where it may succeed or fail.
Implicit Hypothesis of TabCutMix: The underlying assumption of TabCutMix is that the data manifold of tabular data can be effectively approximated through random feature-level exchanges between samples. Specifically, TabCutMix assumes that the resulting augmented samples will remain close to the original data manifold if features are independently meaningful and can be combined without disrupting their inherent dependencies. This hypothesis aligns well with tabular datasets where features are relatively independent or loosely correlated. We have added this limitation discussion in Appendix F.
Explicit Cases Where TabCutMix May Fail: TabCutMix may fail when the assumptions about feature independence or loose correlation do not hold. For example: (a) Highly Correlated Features: If features are tightly coupled (e.g., a dataset with strongly dependent numerical and categorical attributes), swapping individual features without accounting for correlations may disrupt the structural relationships in the data, potentially leading to OOD samples. (b) Imbalanced Datasets: In datasets with severe class imbalance, TabCutMix may inadvertently create augmented samples that disproportionately represent minority classes, introducing noise or unrealistic distributions. (c) Sensitive Domains: In sensitive applications such as medical data, feature interactions may carry domain-specific meanings (e.g., age and diagnosis), and arbitrary exchanges may result in implausible or nonsensical combinations.
Explicit Cases Where TabCutMix Performs Well: TabCutMix performs well on datasets where features are either independent or exhibit weak correlations.
To address the failure cases mentioned above, we introduced TabCutMixPlus, which adapts feature exchanges based on feature correlations. By clustering highly correlated features and swapping them together, TabCutMixPlus mitigates the risk of creating OOD samples in datasets with strong feature dependencies, as demonstrated in the reduction of OOD ratios across all datasets.
Thank you very much for this detailed answer and additional experiments.
[W1] Add more limitation discussion.
Thank you.
[W2] TabCutMix is at high risk of creating OOD data that might overlap the categories
I am satisfied by the new TabCutMixPlus strategy.
[Q1]. Why the memorization ratio is independent of the diffusion model used?
Thank you for your answer, and for your additional experiments.
smaller datasets leads to more sparsely populated space, so absolute distances are increased but maybe relative distances remain unchanged [...] We would like to clarify that the memorization ratio in our framework is based on relative distance ratios rather than absolute distances
I'm glad you agreed with my reasoning on the effect of using relative distances.
[Q3] What's the implicit hypothesis TabCutMix about data manifold, [...]
Thank you for the additional discussion. I think giving an explicit example of failure mode in the paper would strengthen it more than an abstract discussion. It would also help you sell TabCutMixPlus even better.
Baseline
TabCutMix makes the implicit assumption that features are independent. It is possible to learn generative models (not only diffusion models) based on this hypothesis. For example:
- you can train a generative model p_i(x_i) for each feature x_i,
- and finally obtain a full generative model p(x) = p_1(x_1) × ... × p_d(x_d).
I expect this "joint family of models" to produce similar results to the ones of TabCutMix. This approach should be a baseline to compare against.
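For concreteness, such a per-feature baseline could be as simple as the following sketch (the Gaussian choice for numerical columns and the function name are my own assumptions):

```python
import numpy as np
import pandas as pd

def sample_independent(df: pd.DataFrame, cat_cols: set, n: int, seed: int = 0) -> pd.DataFrame:
    """Fit one trivial generative model per column and sample each column independently."""
    rng = np.random.default_rng(seed)
    out = {}
    for col in df.columns:
        if col in cat_cols:
            vals, counts = np.unique(df[col].to_numpy(), return_counts=True)
            out[col] = rng.choice(vals, size=n, p=counts / counts.sum())  # empirical marginal
        else:
            out[col] = rng.normal(df[col].mean(), df[col].std(), size=n)  # Gaussian fit
    return pd.DataFrame(out)
```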
Minor
Ensure to use \citep instead of just \cite or \citet in the new paragraphs you added, for better formatting.
Summary
While I agree with other reviewers such as 8dDK that the method is a bit naive because of the issue with feature dependency, I believe that TabCutMixPlus addresses partially the issue. While not perfect, this looks sufficient for numerous datasets as shown in experiments, therefore I believe this trick is useful and can benefit the practitioners. The paper could be accepted in its current form with no harm for the scientific community.
However, I think TabCutMixPlus deserves its own paper to precisely distinguish it from TabCutMix - I also believe that the baseline I proposed against TabCutMix is important. Therefore I find it difficult to push my score higher than 6. I am confident that rewriting the paper by focusing on TabCutMixPlus would make it even more beneficial to the community, by insisting even more on the practical implications of assuming (or not) independent features.
We thank the reviewer for their thorough and thoughtful feedback. We appreciate your acknowledgment that most of your concerns have been addressed. We also have conducted additional experiments and revised our paper significantly to tackle the remaining issues. Before delving into the detailed responses, we would like to emphasize that the core contribution of this work is to pioneer the investigation of memorization in tabular diffusion models, a critical yet previously unexplored issue in this domain. Below, we summarize our key contributions:
- Discovery of Memorization in Tabular Diffusion Models. A major contribution of our work is uncovering and analyzing memorization phenomena in tabular diffusion models, an issue largely overlooked in this domain. We conduct a comprehensive study of memorization in tabular diffusion models, providing the tabular generative modeling community with a new memorization perspective and stimulating many potential follow-up works.
- Simple yet Effective TabCutMix and TabCutMixPlus to mitigate memorization. We propose TabCutMix as a lightweight and effective data augmentation method that serves as an initial step toward addressing memorization in tabular diffusion models. In response to concerns about OOD risks, we introduce TabCutMixPlus, which incorporates adaptive feature exchange based on feature correlations. Our experiments demonstrate that TabCutMixPlus significantly mitigates OOD risks while preserving data utility.
- We conduct comprehensive experiments, including 7 datasets, 4 baselines (vanilla, SMOTE, Mixup, IJF), 4 generative models (3 SOTA diffusion and CTGAN), and 8 evaluation metrics, to validate the effectiveness of proposed methods TabCutMix and TabCutMixPlus.
[Q1] An explicit example of failure mode in TabCutMix would strengthen this paper.
A1: We thank the reviewer for pointing out the importance of explicitly showcasing failure modes in TabCutMix. In Appendix E.5, we present a detailed case study using the Magic dataset, as summarized in the table below. The results highlight a clear failure mode of TabCutMix: it disrupts feature correlations, leading to unrealistic relationships in the generated data. For example, TabCutMix produces samples where the Length is smaller than the Width, which contradicts the inherent structure of the dataset.
This failure occurs because TabCutMix performs random feature exchanges without considering inter-feature dependencies, resulting in unrealistic or inconsistent feature combinations. In contrast, TabCutMixPlus addresses this issue by clustering correlated features and swapping them within the same cluster. This approach preserves feature coherence, as evident from the results where TabCutMixPlus generates more realistic and consistent samples, better maintaining data quality and utility.
| Samples | Length | Width | Size | Conc | Conc1 | Asym | M3Long | M3Trans | Alpha | Dist | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TabSyn+TCM | 24.72 | 32.12 | 3.35 | 0.15 | 0.09 | 150.69 | -40.91 | -21.80 | 5.11 | 205.42 | g |
| TabSyn+TCMP | 17.56 | 11.20 | 2.29 | 0.58 | 0.37 | 2.29 | 17.12 | 2.40 | 22.44 | 211.00 | g |
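For illustration, here is a minimal sketch of the same-class feature swap at the core of TabCutMix. The uniform sampling of the swap fraction and the boolean-mask mechanics are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def tabcutmix(x_a, x_b):
    """Swap a random subset of features from x_b into x_a.
    x_a and x_b share the same class label, so the augmented
    sample keeps that shared label."""
    d = len(x_a)
    lam = rng.uniform()                       # fraction of x_a features kept
    keep = rng.permutation(d) < int(lam * d)  # random boolean feature mask
    return np.where(keep, x_a, x_b)
```

For mixed-type rows, `x_a` and `x_b` can be object arrays so categorical and numerical features are swapped uniformly; nothing in this sketch respects inter-feature dependencies, which is exactly the failure mode shown above.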
[Q2] There is an additional baseline based on independent features (referred to as "joint family of models") to generate augmented samples.
A2: We conducted experiments on an additional baseline named Independent Joint Family (IJF). Assuming feature independence, IJF generates augmented samples by concatenating feature-wise generative models, where each model fits an individual feature distribution: empirical frequency distributions for categorical features and parametric distributions for numerical ones. Unlike VAEs or GANs, which are designed for generating high-dimensional data, IJF avoids their complexity and computational overhead, making it a simpler and more efficient choice for modeling single-feature distributions. Overall, while IJF achieves comparable performance to TabCutMix, it demonstrates lower data utility. For more details, please refer to Appendix E.7.
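A rough sketch of this baseline is shown below. The Gaussian fit for numerical columns is one illustrative parametric choice; the actual per-feature distribution family used in the experiments may differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def sample_ijf(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Sample every column independently of all the others:
    categorical columns from their empirical frequencies,
    numerical columns from a fitted Gaussian."""
    out = {}
    for col in df.columns:
        if df[col].dtype == object:          # categorical feature
            vals, counts = np.unique(df[col].to_numpy(), return_counts=True)
            out[col] = rng.choice(vals, size=n, p=counts / counts.sum())
        else:                                # numerical feature
            out[col] = rng.normal(df[col].mean(), df[col].std(), size=n)
    return pd.DataFrame(out)
```

Because every column is sampled independently, this baseline destroys all inter-feature correlations by construction, which is consistent with its lower data utility.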
[Q3] TabCutMixPlus deserves its own paper to precisely distinguish it from TabCutMix - I also believe that the baseline I proposed against TabCutMix is important. I am confident that rewriting the paper by focusing on TabCutMixPlus would make it even more beneficial to the community. Ensure to use \citep instead of \cite.
A3: We have thoroughly revised our manuscript (see the updated PDF), including the abstract, introduction, methodology, and experiments, with a strong emphasis on TabCutMixPlus. We also updated the citation format to consistently use \citep. While we acknowledge your point that TabCutMixPlus could potentially warrant its own dedicated paper, we believe its inclusion here, alongside TabCutMix, provides important context for understanding the progression from a simple yet effective baseline to a refined method for mitigating memorization in tabular diffusion models.
We greatly appreciate your supportive remark that "The paper could be accepted in its current form with no harm to the scientific community." We hope our extensive revisions and efforts to improve clarity and impact will elevate your confidence in this work and improve your score.
[Q1] Why is the memorization ratio independent of the diffusion model used?
A1: While our results demonstrate that both TabSyn and TabDDPM converge to similar memorization ratios for the same dataset, this observation is based on experimental findings. It suggests that the memorization ratio may be predominantly influenced by data-centric factors, such as dataset complexity, sparsity, and feature redundancy, rather than model-centric factors, such as the specific diffusion model architecture. These data-centric considerations warrant further investigation to better understand the relationship between dataset properties and memorization behavior in tabular diffusion models.
We acknowledge that this question lies beyond the primary scope of this paper, which focuses on evaluating existing tabular diffusion models and proposing methods to reduce memorization risks. To address this, we have included a discussion of this promising direction in the future work section (Appendix F). We believe such investigations could provide valuable insights into the interplay between dataset characteristics and generative model behavior in the context of tabular diffusion.
[Q2] Obs. 2 of Sec. 3.3. I would expect smaller datasets to yield higher memorization rates (easier overfitting). (a) Why is it not changing on some datasets? (b) I suspect that the definition you use in l230 is biased in this regard: since it relies on a distance ratio, smaller datasets lead to a more sparsely populated space, so absolute distances are increased but maybe relative distances remain unchanged.
A2: Thank you for the insightful observation regarding Obs. 2 in Section 3.3. We have carefully examined both parts of your comment.
Regarding (a): Smaller datasets and memorization ratio trends. We agree with your expectation that smaller datasets are more prone to higher memorization due to easier overfitting. To investigate further, we conducted experiments with extremely small training sizes (1% and 0.1% of the original data) and observed that smaller datasets indeed result in higher memorization ratios across all four datasets, as shown in Figure 3. This confirms the expected trend and is now explicitly discussed in the revised Section 3.3 of the manuscript.
Regarding (b): Potential bias in the memorization ratio definition. In short: no bias. We would like to clarify that the memorization ratio in our framework is based on relative distance ratios rather than absolute distances. While it is true that smaller datasets lead to increased absolute distances due to a sparser space, the memorization ratio metric depends solely on the relative distance ratio. This ensures consistency across different dataset sizes, as the relative distances remain comparable regardless of the dataset's "absolute density". Therefore, the definition of the memorization ratio is not biased with respect to dataset size.
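To make the relative-distance point concrete, here is a minimal sketch of a ratio-based memorization check. The 1/3 threshold is the value commonly used in the image-memorization literature and is illustrative here; the paper's exact definition is the one given at l230:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def memorization_ratio(X_train, X_gen, threshold=1/3):
    """A generated sample counts as memorized when its distance to the
    nearest training sample is below `threshold` times its distance to
    the second-nearest one -- a relative criterion that is insensitive
    to the absolute density of the training set."""
    nn = NearestNeighbors(n_neighbors=2).fit(X_train)
    dists, _ = nn.kneighbors(X_gen)           # (n_gen, 2): two nearest
    ratios = dists[:, 0] / (dists[:, 1] + 1e-12)
    return float(np.mean(ratios < threshold))
```

Scaling all distances by a constant (as happens when a space becomes uniformly sparser) leaves `ratios` unchanged, which is the sense in which the metric is size-invariant.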
[W2] TabCutMix is at high risk of creating OOD data that might overlap the categories
A2: Thank you for raising this insightful comment. We have thoroughly addressed this issue through three perspectives.
- Hyperparameter Control in TabCutMix for Balancing OOD Risk and Data Utility. TabCutMix provides a hyperparameter, the augmented ratio, to control the number of augmented samples included in training. This allows for careful tuning to balance between reducing memorization and preserving the utility of the augmented data.
- Proposed TabCutMixPlus with Adaptive Feature Exchange. To further mitigate the risk of creating OOD data, we developed TabCutMixPlus, which incorporates an adaptive feature exchange strategy. Specifically, we calculate feature correlations using metrics like Pearson's correlation coefficient (for numerical features), Cramér's V (for categorical features), and the squared eta coefficient (for numerical-categorical pairs). These measures allow us to cluster highly correlated features, ensuring that features within the same cluster are exchanged together during augmentation (a sketch of the clustering step follows this list). This process preserves the structural integrity of the data and significantly reduces the likelihood of generating OOD samples.
- OOD Detection Experiments for Quantifying OOD Risk. To assess the OOD risk introduced by TabCutMix and TabCutMixPlus, we conducted dedicated OOD detection experiments. Following [1], we framed the problem as a classification task, treating normal samples as negative and OOD samples as positive, where positive OOD samples were synthesized by scaling a randomly selected numerical feature by a multiplicative factor or by choosing a random value for categorical features from the existing categories. A multi-layer perceptron (MLP) was trained on the original data and evaluated on augmented samples generated by TabCutMix and TabCutMixPlus to assess the proportion of samples classified as OOD. As shown in Table 4, TabCutMixPlus consistently reduces the OOD ratio compared to TabCutMix across all datasets. For example, on the Adult dataset, the OOD ratio drops from 2.06% (TabCutMix) to 0.36% (TabCutMixPlus). These results demonstrate that TabCutMixPlus significantly mitigates OOD risks while maintaining robust classification capabilities, reinforcing its utility in synthetic data augmentation workflows.
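A minimal sketch of the clustering step for numerical features is given below. It uses Pearson correlation only; in the full method, Cramér's V and the squared eta coefficient extend this to categorical and mixed pairs, and the 0.5 correlation threshold here is illustrative:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_numerical_features(df_num: pd.DataFrame, threshold=0.5):
    """Assign each numerical feature to a cluster such that any pair
    with absolute Pearson correlation above `threshold` lands in the
    same cluster (single linkage guarantees this); TabCutMixPlus then
    swaps whole clusters rather than individual columns."""
    corr = df_num.corr().abs().to_numpy()
    dist = 1.0 - corr                         # correlation -> distance
    np.fill_diagonal(dist, 0.0)
    condensed = dist[np.triu_indices_from(dist, k=1)]
    Z = linkage(condensed, method="single")
    return fcluster(Z, t=1.0 - threshold, criterion="distance")
```

Swapping clusters instead of single columns is what keeps pairs like Education and Education-Num consistent after augmentation.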
The table below shows the F1 scores and OOD ratios (with standard deviations) for TabCutMix and TabCutMixPlus across various datasets.
| Dataset | Metric | TabCutMix | TabCutMixPlus |
|---|---|---|---|
| Adult | F1 | 92.67 ± 0.22 | 92.63 ± 0.20 |
| | OOD (%) | 2.06 ± 1.10 | 0.36 ± 0.27 |
| Default | F1 | 71.42 ± 1.32 | 71.39 ± 0.94 |
| | OOD (%) | 39.47 ± 6.70 | 25.44 ± 2.81 |
| Shoppers | F1 | 82.47 ± 0.35 | 82.28 ± 0.39 |
| | OOD (%) | 1.58 ± 0.76 | 0.70 ± 0.39 |
| Magic | F1 | 99.27 ± 0.07 | 99.19 ± 0.05 |
| | OOD (%) | 0.61 ± 0.03 | 0.43 ± 0.25 |
| Cardio | F1 | 60.33 ± 0.25 | 60.39 ± 0.17 |
| | OOD (%) | 4.83 ± 1.39 | 3.88 ± 0.19 |
| Churn Modeling | F1 | 97.94 ± 0.13 | 97.97 ± 0.02 |
| | OOD (%) | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Wilt | F1 | 99.94 ± 0.01 | 99.95 ± 0.03 |
| | OOD (%) | 0.00 ± 0.00 | 0.00 ± 0.00 |
TabCutMixPlus shows a significant reduction in OOD ratios compared to TabCutMix, particularly in datasets such as Default and Cardio, while consistently maintaining high F1 scores.
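For reference, a minimal sketch of this OOD-detection protocol follows, in the spirit of [1]. The scaling factor of 2 and the MLP size are illustrative assumptions (the exact factor is not specified above), and the categorical perturbation is omitted for brevity:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def ood_ratio(X_real, X_aug, num_cols, scale=2.0):
    """Train an MLP to separate real rows (label 0) from synthetic OOD
    rows (label 1, built by rescaling one random numerical feature per
    row), then report the fraction of augmented rows flagged as OOD."""
    X_ood = X_real.copy()
    cols = rng.choice(num_cols, size=len(X_ood))   # one numeric column/row
    X_ood[np.arange(len(X_ood)), cols] *= scale
    X = np.vstack([X_real, X_ood])
    y = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_ood))])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    clf.fit(X, y)
    return float(clf.predict(X_aug).mean())
```

Here `num_cols` is the list of numerical column indices, and the returned value corresponds to the OOD (%) entries reported in the table above.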
[1] Mohammad Azizmalayeri, Ameen Abu-Hanna, and Giovanni Cinà. Unmasking the chameleons: A benchmark for out-of-distribution detection in medical tabular data. arXiv preprint arXiv:2309.16220, 2023.
Reviewer DwHs
We thank the reviewer for the constructive comments.
[W1] Add more limitation discussion. The data augmentation TabCutMix changes the input distribution and produces OOD data that acts like regularization.
A1: We have added the limitation discussion as follows:
- OOD generation. TabCutMix may disrupt meaningful feature correlations by swapping subsets of features between samples. This disruption could lead to out-of-distribution (OOD) issues and reduce the utility of the generated data, especially in applications where feature dependencies are critical.
- Lack of Theoretical Analysis: While the method shows empirical success in mitigating memorization, the underlying reasons for its effectiveness are not fully explored. A theoretical framework explaining why TabCutMix works and its impact on the learned representations would strengthen its scientific contribution.
Time for discussion, as the author feedback is in. I encourage all the reviewers to reply. You should treat the paper that you're reviewing the same way you'd like your own submission to be treated :)
We thank all reviewers for your valuable time and comments. We have incorporated all concerns and revised the manuscript (see the updated PDF). We are glad that many reviewers highlighted the following strengths:
- Critical problem in tabular generative models.
  - j6Xu: "The paper addresses a critical issue of data memorization in deep generative models for tabular data."
  - 2Ann: "This paper studies an important but under-explored potential issue in applying the diffusion model for tabular data generation."
  - 8dDK: "The paper tackles the memorization issue in tabular diffusion models, which has been underrepresented in recent research."
- Simplicity of the proposed method.
  - DwHs: "The contribution could even look 'too simple' if it wasn't for all the experiments and sanity checks, which are convincing."
- Methodology and clarity.
  - DwHs: "The paper is structured around key questions, experiments, and empirical observations, enhancing clarity and impact beyond its technical contribution to data augmentation."
  - j6Xu: "The paper provides clear motivation and detailed descriptions of its memorization metrics and the proposed post-processing technique."
  - 2Ann: "This paper is in general well-written and easy to read."
  - 8dDK: "The paper is clear and well-written."
- Comprehensive experiments.
  - DwHs: "The authors monitor metrics to ensure the augmented diffusion model remains faithful to the training data, balancing diversity with accurate recall of the true distribution."
  - j6Xu: "It offers a comprehensive examination of the latest state-of-the-art generators."
  - 8dDK: "Visualizations are nice to understand the experiments."
- Theoretical insights.
  - DwHs: "The theoretical results give valuable insights on the experiments, creating a consistent narrative."
  - 2Ann: "The theoretical result, which shows that perfectly trained diffusion models will generate memorized latent representation, is interesting."
On the other hand, aside from some cosmetic suggestions, the reviewers have brought up the following major concerns:
- OOD generation risk, by Reviewers DwHs, 2Ann, and 8dDK.
  - We tackle this issue by (a) clarifying that the augmented-ratio hyperparameter can control the tradeoff between reducing memorization and preserving the utility of the augmented data; (b) proposing an enhanced TabCutMix (called TabCutMixPlus) that can mitigate OOD risk; and (c) conducting OOD detection experiments to quantify OOD risk.
  - Please see the details in Sections 4 and 5 and Appendices D.4.6 and E.1.
- More dataset benchmarking, by Reviewers j6Xu, 2Ann, and 8dDK.
  - We conduct experiments on three additional datasets: Churn, Cardio, and Wilt.
  - Please see the details in Appendix E.6.
- More baselines such as SMOTE or Mixup, by Reviewers j6Xu and 8dDK.
  - We conducted three additional baselines: SMOTE, Mixup, and IJF.
  - Please see the details in Section 4 and Appendix E.7.
- More generative models, by Reviewer 8dDK.
  - We conduct experiments on CTGAN and TVAE.
  - Please see the details in Appendix E.8.
- More evaluation metrics, by Reviewer 2Ann.
  - We included two additional evaluation metrics, C2ST and DCR.
  - Please see the details in Section 5 and Appendices D.4, D.5, D.6, and E.6~E.8.
- Implicit hypothesis TabCutMix makes about the data manifold, by Reviewer DwHs.
  - We discuss the implicit hypothesis of TabCutMix and explicit cases where TabCutMix may fail or perform well.
  - Please see the details in Section 4 and Appendices E.5 and F.
- Limitations discussion, by Reviewer DwHs.
  - We have added a limitation discussion covering the independent-feature assumption, sensitivity to outliers, and lack of data-centric insights.
  - Please see the details in Appendix F.
- Choice of memorization metric, by Reviewer j6Xu.
  - We clarify that our metric is based on normalized distance and is equivalent to distance over one-hot encodings. We also acknowledge the limitation of this metric.
  - Please see the details in Appendix F.
- More discussion of previous memorization metrics, by Reviewer 2Ann.
  - We have added a discussion of the relationship between the memorization ratio and DCR.
  - Please see the details in Appendix D.5.
We believe all the above suggestions are factually rooted; we have responded to them point by point and believe we have addressed them all. However, the following concern is rather a question of preference, and we would like to highlight it fairly by quoting our reviewer:
- Reviewer 8dDK: "The methodology is naive."
We acknowledge the OOD risk associated with TabCutMix, and we further propose TabCutMixPlus as a mitigation strategy. More broadly, we respectfully argue that if a simple but effective solution works well for a relatively new task (which is the case for TabCutMix and TabCutMixPlus), it may not be necessary or well-motivated to pursue a more complex or "perfect" solution at this stage. Instead, such work sets the foundation for follow-up research to further refine and enhance performance in this emerging area.
During the rebuttal period, we have made significant improvements to our work, incorporating the valuable feedback provided by the reviewers. We respectfully argue that our contributions exceed the bar of venues like ICLR, and we kindly request that the reviewer reconsider their evaluation. Again, while we agree with the reviewer that our technical solution could be "more perfect", such technical refinements and customizations are natural directions for future research. Our primary contributions lie in:
- Identifying the critical issue of memorization in tabular diffusion models,
- Analyzing the underlying causes of this phenomenon, and
- Proposing a simple yet effective solution to address it, while paving the way for future advancements.
We believe these contributions are highly impactful and represent a meaningful step forward in the field of tabular generative models.
This paper analyses the memorisation problem of tabular data diffusion models and proposes a data augmentation approach to mitigate this issue. Specifically, a number of experiments and a mathematical argument are provided to first analyse the memorisation problem. Then TabCutMix, an extension of the CutMix data augmentation, is proposed and applied to mitigate this issue.
While reviewers commended the overall clarity of the writing, they raised questions about TabCutMix contaminating the training data (e.g., changing the correlations between features) and about the metrics used to investigate memorisation. The author rebuttal did not fully address these concerns.
Apart from the issues raised by reviewers, I understand that both the analysis and the mitigation techniques are, to a large degree, a direct translation of methods and insights from image diffusion models. Thus, beyond the reviews, another consideration for me regarding acceptance is: does this paper provide further insight that is unique to tabular data (since image data and tabular data have very different structures)? If the authors applied both the analysis and mitigation methods to another data modality (e.g., molecule generation, where diffusion models are also popular), would they achieve similar results?
I recommend the authors to either reconsider the unique properties of tabular data and repurpose the story, or, if they would like to keep this work in its current form, submit this work to more suitable venues, e.g., TMLR.
Additional Comments from Reviewer Discussion
Reviewers raised their scores after the rebuttal; however, major concerns remain unresolved.
Reject