PaperHub
Score: 5.0 / 10
Rejected · 4 reviewers
Min 3 · Max 8 · Std 2.1
Ratings: 6, 3, 3, 8
Confidence: 3.5
Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.0
ICLR 2025

Unmasking Trees for Tabular Data

Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We propose tree-based autoregression for tabular data with SotA results on imputation; our hierarchical partitioning-based modeling of each feature also offers benefits for probabilistic prediction tasks.

Abstract

Keywords

tabular data, imputation, missing, generative modeling, tabular, probabilistic prediction, tabular ML

Reviews and Discussion

Review (Rating: 6)

The paper introduces UnmaskingTrees, a novel method for tabular data imputation and generation, leveraging gradient-boosted decision trees. The approach focuses on sequentially “unmasking” individual features using tree models, enabling state-of-the-art imputation performance and competitive generation results. The paper includes experimental results on 27 datasets, showcasing the superior performance of UnmaskingTrees in imputation tasks compared to established methods like MissForest and ForestDiffusion. Additionally, the paper explores the potential of TabPFN as a base learner within the UnmaskingTrees framework, demonstrating its flexibility for probabilistic prediction and generative modeling.

Strengths

  1. UnmaskingTrees achieves superior imputation performance, outperforming several popular benchmarks such as MissForest and ForestDiffusion on various metrics.

  2. The autoregressive unmasking approach is innovative and avoids the pitfalls of traditional diffusion models for tabular data; the use of TabPFN for regression within the framework is also notable.

  3. The proposed model reduces training and inference time by only requiring predictions along a single path in the meta-tree, compared to multiple models in diffusion-based approaches.

  4. Extensive benchmarks across 27 datasets with multiple metrics validate the model’s effectiveness across various imputation and generation tasks. The authors also provide code for the experiments.

Weaknesses

  1. The training and inference procedure is not clearly described. Can the authors provide a more precise algorithm?
  2. The paper does limited ablation to show how the hyperparameters, such as tree depth or duplication factor, affect the results, especially for large datasets.
  3. The results for each dataset in tabular data generation should be shown in the appendix.
  4. For tabular datasets the results are close to each other in value, so a comparison of the mean rank of the different methods would show the results more clearly.

Questions

See weaknesses.

Comment (Authors)

The training and inference procedure is not clearly described. Can the authors provide a more precise algorithm?

Thanks for the suggestion. We have added algorithms to the Appendix for further clarity.
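
For concreteness in this thread, here is a minimal sketch of the training-pair construction and imputation loop. The function names, the NaN masking convention, and the per-feature `models[j].sample(...)` interface are illustrative stand-ins, not the exact code in our repo or the algorithms in Appendix A.

```python
# Minimal sketch of UnmaskingTrees-style training-pair construction and imputation.
# Assumptions: NaN marks a masked entry; models[j] is a per-feature conditional
# sampler (e.g., an XGBoost classifier for a categorical feature, a BaltoBot-style
# density for a continuous one). Illustrative only.
import numpy as np

def make_training_pairs(X, K, rng):
    """Duplicate each row K times; for each copy, mask features in a random
    order so that every feature appears as a prediction target conditioned
    on a random subset of the other features."""
    n_rows, n_feats = X.shape
    pairs = []  # (masked_input_row, target_feature_index, target_value)
    for _ in range(K):
        for row in X:
            order = rng.permutation(n_feats)
            masked = row.astype(float).copy()
            for j in order[::-1]:    # targets from last to first in the order
                masked[j] = np.nan   # mask the target; it stays masked for earlier steps
                pairs.append((masked.copy(), j, row[j]))
    return pairs

def impute_row(row, models, rng):
    """Fill missing entries one feature at a time, in a random order,
    conditioning each draw on everything observed or already imputed."""
    row = row.astype(float).copy()
    for j in rng.permutation(np.flatnonzero(np.isnan(row))):
        row[j] = models[j].sample(row)  # conditional draw for feature j
    return row
```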

The paper does limited ablation to show how the hyperparameters, such as tree depth or duplication factor, affect the results, especially for large datasets.

Given that the benchmark of Jolicoeur-Martineau et al. uses default hyperparameters for the other methods, we thought it would be unfair to tune our hyperparameters to do well on this benchmark. Thus, we chose hyperparameters that looked good on the Two Moons toy dataset (shown in Figure 1), and then ran on the Jolicoeur-Martineau et al. benchmark without further tuning. We can try to add further ablations, either for the camera-ready if accepted or for a resubmission if rejected, but we were unfortunately unable to do this for the rebuttal, given our limited computing resources (a single 2015 iMac). Philosophically, though, we think that a good method is one that doesn't require per-dataset hyperparameter tuning, and better yet, one that works well without even carefully tuning the default hyperparameters.

The results for each dataset in tabular data generation should be shown in the appendix. For tabular datasets the results are close to each other in value, so a comparison of the mean rank of the different methods would show the results more clearly.

We've added results for each dataset to the Appendix. Please let us know if there's anything else you were hoping to see, and we can include it in a comment.

Comment (Reviewer)

My concerns are addressed, and I will raise my score to 6.

Review (Rating: 3)

This paper proposes to generate tabular data using discriminative models by posing generation as sampling from regressors/classifiers one feature at a time, using multiple permutations to allow features to be conditioned on different sets of other features.

EDIT: I have read the authors' response and stay with my score. Thanks for pointing out the baselines that I missed and for pointing out that simplicity really is not a downside, which I fully agree with. What I wanted to say is that the proposed method is not very novel and is, to my understanding, what people already do when generating data with discriminative models. The tree-based handling of regression is of course novel, but I believe that models that are not purely discriminative will win out in the medium to long run.

Strengths

  • Simple to understand

Weaknesses

  • Unclear how the proposed hierarchical handling of regression sampling is better than a simple discretized approach with well-chosen bins
  • The idea is very simple and low-effort in terms of putting it to work. EDIT: I want to take this weakness back.
  • The results are not great. The tables feel a little too designed: bolding non-standard things in the table, i.e., the winner of a subset of the methods or the two best methods.

I think if one wants to make a scientific contribution at the ICLR level, the right thing to do would be to train a model that generates data for infilling in one forward pass, instead of relying on traditional discriminative models only. Maybe this is worse, but then it would be an important ablation.

Questions

How is your approach to regression sampling different from using a well-chosen discretization in combination with a single classification model?

Comment (Authors)

Unclear how the proposed hierarchical handling of regression sampling is better than a simple discretized approach with well-chosen bins

We include an ablation analysis in Table 2, showing that our proposed approach performs better than vanilla quantization, either with bin boundaries chosen via KMeans ("UTrees-kMeans") as was done by TabMT (Gulati & Roysdon, 2024), or with bin boundaries chosen via KDI ("UTrees-KDI").
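
For readers following along, the vanilla-quantization baselines in that ablation roughly amount to the sketch below (our illustration; the bin count, KMeans settings, and reconstructing a value as the bin centroid are assumptions, not the exact setup of TabMT or of our UTrees-kMeans/UTrees-KDI variants).

```python
# Sketch of a KMeans-quantization baseline for one continuous target:
# bin the target with KMeans, fit a single multi-class classifier over bin
# indices, and sample by drawing a bin and returning its centroid.
# Assumes every bin is non-empty so labels are contiguous 0..n_bins-1.
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBClassifier

def fit_quantized_conditional(X_cond, y_continuous, n_bins=32, seed=0):
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=seed)
    bins = km.fit_predict(y_continuous.reshape(-1, 1))
    clf = XGBClassifier(objective="multi:softprob").fit(X_cond, bins)
    return clf, km.cluster_centers_.ravel()

def sample_quantized(clf, centroids, x_cond, rng):
    probs = clf.predict_proba(x_cond.reshape(1, -1))[0]
    return centroids[rng.choice(len(probs), p=probs)]
```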

The idea is very simple and low-effort in terms of putting it to work

That's not a weakness -- it's a strength! To quote Ilya Sutskever from his NeurIPS 2024 Oral (https://www.youtube.com/watch?v=-uyXE7dY5H0) on "Sequence to Sequence Learning with Neural Networks", which had just won a test-of-time award: "We use minimum innovation for maximum results." Frankly, the preference of ML conference reviewers for complexity and high effort has set our field back by a decade. This preference, driven not by practicality or promise but by the desire to appear smart to each other with fancy equations, explains why autoregressive methods have been underexplored for tabular data. We make no apologies for our taste for simple, low-effort, elegant minimalism.

The results are not great. The tables feel a little too designed: bolding non-standard things in the table, i.e., the winner of a subset of the methods or the two best methods.

This is incorrect. We highlight the best of ALL methods (not a subset of the methods), to show which method is best. And we bold the best of the XGBoost-based methods (ForestVP vs. ForestFlow vs. ours) as a natural ablation, to control for the type of model and focus on diffusion vs. flow-matching vs. autoregression.

I think if one wants to make a scientific contribution at ICLR level it would be the right thing to train a model that generates data for filling in in one forward pass, instead of relying on traditional discriminative models only. Maybe this is worse, but then it would be an important ablation.

I'm not sure how to interpret this.

If you are referring to our autoregressive joint modeling approach: by this ideology, research on LLMs (which are autoregressive and which involve training a token discriminator) does not belong in ICLR. Really?

If you are referring to our BaltoBot hierarchical partitioning conditional modeling approach: as we stated above, we already include an ablation experiment. Are you suggesting that this idea should be a priori rejected by ICLR for ideological reasons, regardless of its surprising empirical effectiveness?

With all due respect, avoiding this type of ideological, closed-minded reviewing is why ICLR was founded in the first place. We hope the reviewer will reconsider their perspective, but if not, we hope this review will be disregarded by the Area Chair and Program Committee.

Review (Rating: 3)

This paper introduces UnmaskingTrees, an approach for tabular data imputation and generation using gradient-boosted decision trees with an autoregressive unmasking objective. The authors also present BaltoBot, a tree-based probabilistic prediction method. Their method is more computationally efficient than diffusion-based approaches and offers practical advantages like closed-form density estimation.

Strengths

  • Problem relevance: There is a clear gap in the literature for methods that can handle tabular data imputations effectively, especially given that traditional methods like MissForest outperform newer deep learning approaches.

  • Computational efficiency: The method offers faster training and inference compared to diffusion-based approaches, which is valuable for practical applications.

  • Novel combination: While the individual components aren't new, combining gradient-boosted trees with autoregressive unmasking for tabular data is an interesting direction.

  • Numerous baseline models: The authors compare their method with several baseline models, including MissForest, MICE, and GAIN, providing a comprehensive evaluation.

  • Existing implementation: The authors provide the source code implementation that promotes reproducibility and practical adoption. Although the code for the case studies and their proposed method was available, the code for the experiments on the 27 datasets was missing.

Weaknesses

  • Incomplete technical analysis: The paper fails to properly analyze how the method performs under different missingness mechanisms (MAR, MCAR, MNAR). This is a critical oversight for a paper focused on missing data.

  • Superficial approach: Although the method is a novel combination of existing techniques, it also lacks substantial theoretical innovation or justification.

  • Oversold and unjustified claims: The method is presented as a general solution, but its limitations and assumptions are not adequately discussed. The paper also lacks a clear discussion of how the method handles discrete variables and different data types.

  • Poor presentation of the methodology: The method is poorly formalized and difficult to follow. The description needs more rigorous mathematical formulation and clear algorithmic steps.

  • Writing issues: There are some editorial issues (e.g., line 91 has a missing variable definition somewhere in between "process" and "times," or in line 112, it is unclear what Δ in depth-Δ refers to). These issues make the technical part of the paper especially hard to follow.

  • Datasets assumptions: Even though there are 27 datasets, Jolicoeur-Martineau et al. (2024b) carried out pre-processing steps, such as encoding the categorical features as one-hot vectors, that have implications for the experimental results. Therefore, an overview of the datasets in the appendix with this information would have been helpful.

  • Poor abstract: The abstract fails to clearly communicate the paper's contributions and methodology.

Questions

  • How does the method specifically handle MAR, MCAR, and MNAR cases? What are the theoretical guarantees for each case?
  • What are the method's limitations when dealing with discrete variables? The same question applies to mixed data types.
  • How does the choice of masking order affect the results? Is there any theoretical justification for random ordering?
  • What are the computational complexity implications for high-dimensional data?
  • Was the experiment code (for the 27 datasets) available, or did I miss something? Additionally, here are some minor remarks on the code: (1) there were no unit tests, so pytest in your README.md is unnecessary, and (2) I advise the authors to add ForestDiffusion to the requirements.txt required for iris-demo.ipynb.
Comment (Authors)

The paper fails to properly analyze how the method performs under different missingness mechanisms (MAR, MCAR, MNAR).

Our experiments are all on MCAR missingness. We note that the GAIN (Yoon et al., ICML 2018), ForestDiffusion (Jolicoeur-Martineau et al., AISTATS 2024), and TabMT (Gulati & Roysdon, NeurIPS 2024) papers only evaluated on MCAR. So we vigorously disagree with the reviewer demanding this as a condition for acceptance.

Although the method is a novel combination of existing techniques, it also lacks substantial theoretical innovation or justification.

This is blatantly false. Hierarchical partitioning (BaltoBot) is a novel technique for conditional generative modeling. We gently request the reviewer to reread Section 2.2.

The method is presented as a general solution, but its limitations and assumptions are not adequately discussed. The paper also lacks a clear discussion of how the method handles discrete variables and different data types.

Our paper has a full yet short Limitations section (Section 4). This section is short because our method has few limitations. If the reviewer thinks otherwise, they are welcome to say precisely what they think those limitations are!

With regards to discrete variables, our method has no limitations, though we inadvertently failed to mention this in the original manuscript. We have clarified this in Section 2.1: "We model categorical features via softmax-based classification with cross-entropy loss; our approach for continuous features is described in Section 2.2." This is also clarified in Algorithm 1 in Appendix A.

The description needs more rigorous mathematical formulation and clear algorithmic steps.

We have added an Equation to Section 2.1 to formalize UnmaskingTrees, and have added algorithms to Appendix A.
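
For readers of this thread without the revised PDF, one standard way to write such an objective (our notation here; the equation added to Section 2.1 may differ in details) is the order-agnostic autoregressive log-likelihood, averaged over random feature permutations:

```latex
% x is a D-dimensional row, \pi a uniformly random permutation of {1, ..., D},
% and each conditional p_\theta is fit with gradient-boosted trees.
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\,
    \mathbb{E}_{\pi \sim \mathrm{Unif}(S_D)}
    \left[ \sum_{d=1}^{D} \log p_\theta\!\left(x_{\pi(d)} \mid x_{\pi(1)}, \ldots, x_{\pi(d-1)}\right) \right].
```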

Writing issues

We have addressed these. Note that we are now using the variables of H (for height of the meta-tree) and h (for height of a node in the meta-tree).

Datasets assumptions

We follow Jolicoeur-Martineau et al.'s exact setup and even their experimental scripts. Anyone who wants these details can read their paper.

The abstract fails to clearly communicate the paper's contributions and methodology.

Without any concrete details about what is allegedly lacking or missing in the abstract, this criticism is too vague to respond to. Can you clarify what you mean?

How does the choice of masking order affect the results? Is there any theoretical justification for random ordering?

There is typically no inherent ordering to tabular features. Random ordering exposes the model to as many orders as possible during training. This is especially useful for imputation, where any subset of the features may need to be conditioned on at inference time. The theoretical justification for this is formalized and discussed extensively in the case of tabular data in TabMT (Gulati & Roysdon, 2024) and DP-TBART (Castellon et al., 2023) (we added a citation to the latter in our revised manuscript), and, more generally, for joint modeling in (Kitouni et al., 2024).

What are the computational complexity implications for high-dimensional data?

This was already fully discussed in Section 2.3 (Computational Complexity). Please read and let us know if you have remaining questions.

We did add a discussion of empirical runtime to Appendix C, noting that our method is actually comparatively fast, especially on the biggest datasets in the benchmark:

"Timing results are in Table 9, and depicted in Figure 5. Our method is relatively efficient at both imputation and generation. The datasets on which we are slowest for imputation are Libras (1976 seconds) and Bean (1929 seconds), on our ancient 2015 iMac with 16Gb RAM. On Libras, ForestVP imputation took 12439 seconds (without RePaint) and 14715 seconds (with RePaint); on Bean, ForestVP took 898 seconds (without RePaint) and 1318 seconds (with RePaint), on their cluster of 10-20 CPUs with 64-256Gb of RAM. The datasets on which we are slowest for generation are also Libras (2987 seconds) and Bean (4346 seconds). On Libras, ForestFlow generation took 9481 seconds and ForestVP took 9042 seconds; on Bean, ForestFlow took 869 seconds and ForestVP took 947 seconds, once again on their much more powerful computing cluster."

What are the method's limitations when dealing with discrete variables?

As stated above, there are no limitations.

Was the experiment code (for the 27 datasets) available

We have added them to the repo; these are from the ForestDiffusion public repo, with additional function calls for our approach.

there were no unit tests

Our deanonymized repo has a variety of things including unit tests that were stripped from our anonymized repo. Frankly, we find this complaint rather nitpicky and unnecessary.

add ForestDiffusion to the requirements.txt

Done.

Review (Rating: 8)

The authors propose leveraging the autoregressive framework of [1] for generative modeling of tabular data, replacing Transformers with XGBoost.

Because this requires learning conditional distributions over numerical variables, and XGBoost regression models only provide point estimates, the authors propose a second contribution in the form of a non-parametric conditional density estimator which they call BaltoBot.

The authors also explore replacing XGBoost with TabPFN which, despite not achieving comparable results to XGBoost, shows the overall flexibility of the framework where any classification model can be used as an autoregressive model to generate tabular data.

[1] Manbir S. Gulati and Paul F. Roysdon. TabMT: Generating Tabular data with Masked Transformers. In NeurIPS 2023.

Strengths

  • The main idea is simple and sensible.
  • The algorithm proposed for conditional density estimation over numerical variables, despite not being the main focus of the paper, shows promising results. I will note that while this algorithm might seem expensive, requiring training multiple binary classification models for predicting a single variable, it is in fact not so dissimilar to how a naive treatment as a multi-class categorical variable would be handled in XGBoost, requiring training one tree per categorical value. If one were to quantize a numerical variable into an equivalent 2^Δ bins and fit a multi-class XGBoost model, it would also require fitting 2^Δ trees at each boosting round, therefore leading to a similar "duplication" of models. Furthermore, as noted by the authors, training a BaltoBot model actually seems advantageous in this instance as each XGBoost model not at the base level only sees a fraction of the training examples (see the back-of-the-envelope comparison after this list).
  • As an autoregressive model, the method proposed here allows for density evaluation, which is an advantage that is somewhat understated. This would enable interesting use cases that other generative methods cannot support, such as out-of-distribution detection. It would also allow one to compute likelihoods under the model, which would be a gold standard for evaluating a generative model.
  • Source code is provided.
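
To make the cost comparison in the second point above concrete, here is a back-of-the-envelope calculation under illustrative assumptions: a balanced meta-tree of depth Δ and roughly even splits of the N training rows.

```latex
% Multi-class XGBoost over 2^\Delta quantization bins:
% 2^\Delta trees per boosting round, each fit on all N rows.
\text{work per round} \approx 2^{\Delta} N .

% BaltoBot meta-tree: 2^\Delta - 1 binary classifiers; the 2^h nodes at
% meta-depth h each see roughly N / 2^h rows.
\text{work per round} \approx \sum_{h=0}^{\Delta-1} 2^{h} \cdot \frac{N}{2^{h}} = \Delta N .
```

Under these assumptions the per-round data volume drops from exponential to linear in Δ, consistent with the observation that nodes below the base level see only a fraction of the examples.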

Weaknesses

In my opinion this is a good paper with good ideas but certain aspects could be improved (roughly in order of importance):

  • The biggest downside of the proposed method would seem to be the need for duplicating the training set K*D times, which could make it impractical for large datasets (and require selecting a lower K lest the training set not fit in memory). The paper doesn't explore this regime and the trade-offs that would have to be made to scale to datasets with more examples and/or more features, since it relies on a benchmark composed only of relatively small datasets.

  • The benchmark proposed by [1] also does not carry out any hyperparameter tuning, as far as I understand. It could be more relevant to a practitioner, who would tune the hyperparameters of their chosen generative model, to know the performance that each method can achieve with some reasonable hyperparameter tuning protocol. However, I do sympathize with the authors that there are too many degrees of freedom to such an analysis, and it is not fun.

  • The BaltoBot evaluation could be further developed. As a non-parametric conditional estimation method for numerical variables it could have wider applicability as it seems competitive to other proposed methods for uncertainty estimation. However, evaluation on more than one real-world scenario would be required to draw meaningful conclusions.

  • Please consider providing the full table of results and standard errors for every metric in an appendix. This would allow an interested reader to assess the magnitude of the differences in a metric of interest between different methods.

  • The explanation of BaltoBot could be improved. Namely, it could be clearer that the input space to every node is partitioned into two with the splitting point determined by KDI.

  • Fix missing K in line 092: "(...), we repeat this process times with K different random permutations" -> "(...), we repeat this process K times with K different random permutations"

Other interesting analyses

The following are not really weaknesses of the paper but rather analyses that I would be interested in but I don't think should be expected of the paper.

  • The density estimation aspect could be explored further. Namely, comparing with [2] as an example of another tree-based model for density estimation on NLL directly.

  • Since errors in auto-regressive generation could accumulate faster given certain feature permutations than others it could be interesting to explore doing multiple rounds of imputation with different random permutations. This could be an easy way to improve results at the cost of more computation.

[1] Alexia Jolicoeur-Martineau et al. Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. In AISTATS 2024.

[2] David S. Watson et al. Adversarial random forests for density estimation and generative modeling. In AISTATS 2023.

Questions

  • Would similar strategies to ForestDiffusion using a data iterator to avoid materializing the entire duplicated training set in memory be a viable strategy?
Comment (Authors)

"The biggest downside of the proposed method would seem to be the need for duplicating the training set K*D times..."

True, while [1]'s benchmark includes Bean (N=13611, D=16) and Libras (N=360, D=90), it is mainly focused on small datasets. Note that we had no difficulty applying UnmaskingTrees to the benchmark on our ancient 2015 iMac with 16 GB RAM. We have clarified in the updated manuscript that our approach is especially geared towards smaller datasets. Nevertheless, our method is sometimes faster than its diffusion-based (ForestDiffusion) and flow-based (ForestFlow) counterparts. See Appendix C in our revised Supplemental:

"Full imputation results are in Table 7. Full generation results are in Table 8. Timing results are in Table 9, and depicted in Figure 5. Our method is relatively efficient at both imputation and generation. The datasets on which we are slowest for imputation are Libras (1976 seconds) and Bean (1929 seconds), on our ancient 2015 iMac with 16Gb RAM. On Libras, ForestVP imputation took 12439 seconds (without RePaint) and 14715 seconds (with RePaint); on Bean, ForestVP took 898 seconds (without RePaint) and 1318 seconds (with RePaint), on their cluster of 10-20 CPUs with 64-256Gb of RAM. The datasets on which we are slowest for generation are also Libras (2987 seconds) and Bean (4346 seconds). On Libras, ForestFlow generation took 9481 seconds and ForestVP took 9042 seconds; on Bean, ForestFlow took 869 seconds and ForestVP took 947 seconds, once again on their much more powerful computing cluster."

The benchmark proposed by [1] also does not carry out any hyperparameter tuning as far as I understand. ... However, I do sympathize with the authors that there are too many degrees of freedom to such an analysis and it is not fun.

Correct, [1]'s benchmark doesn't tune hyperparameters. But we believe their choice was appropriate for such small datasets, where, as you mention, one wants to avoid too many degrees of freedom. And we note that hyperparameter tuning is even less fun for practitioners: they may lack the data, time, and skills to do it, so a method like ours that doesn't require hyperparameter tuning but works well by default is genuinely valuable to them.

Please consider providing the full table of results and standard errors for every metric in an appendix.

We have added the raw data to the Appendix C and to the code repo.

The explanation of BaltoBot could be improved...

Thanks for this feedback! We have updated the manuscript accordingly: "Thus, the input space to every node is partitioned into two with the splitting point determined by KDI."
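
For concreteness, a minimal sketch of that recursive structure follows. The class and method names, the use of the median as a stand-in for the KDI-chosen boundary, and the leaf sampling rule are illustrative assumptions, not the exact BaltoBot implementation.

```python
# Sketch of a BaltoBot-style node: the target's support is split in two at a
# chosen boundary; each internal node trains a binary classifier that, given
# the conditioning features, predicts which side the target falls on.
# Sampling walks one root-to-leaf path, then draws a value at the leaf.
import numpy as np
from xgboost import XGBClassifier

class BaltoBotNode:
    def __init__(self, depth=0, max_depth=4):
        self.depth, self.max_depth = depth, max_depth
        self.split = None            # boundary (median here, as a stand-in for KDI)
        self.clf = None              # binary left/right classifier
        self.left = self.right = None
        self.leaf_values = None      # empirical target values stored at a leaf

    def fit(self, X_cond, y):
        if self.depth == self.max_depth or len(np.unique(y)) < 2:
            self.leaf_values = y
            return self
        self.split = np.median(y)
        go_right = (y > self.split).astype(int)
        if go_right.min() == go_right.max():   # degenerate split -> make a leaf
            self.leaf_values = y
            return self
        self.clf = XGBClassifier(n_estimators=50).fit(X_cond, go_right)
        self.left = BaltoBotNode(self.depth + 1, self.max_depth).fit(
            X_cond[go_right == 0], y[go_right == 0])
        self.right = BaltoBotNode(self.depth + 1, self.max_depth).fit(
            X_cond[go_right == 1], y[go_right == 1])
        return self

    def sample(self, x_cond, rng):
        if self.leaf_values is not None:
            return rng.choice(self.leaf_values)   # or a smoothed draw within the leaf
        p_right = self.clf.predict_proba(x_cond.reshape(1, -1))[0, 1]
        child = self.right if rng.random() < p_right else self.left
        return child.sample(x_cond, rng)
```

Density evaluation follows the same path: the probability assigned to a value is the product of the branch probabilities down to its leaf, times the within-leaf density.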

Fix missing K in line 092

Thanks; we have fixed this.

The BaltoBot evaluation could be further developed. As a non-parametric conditional estimation method for numerical variables it could have wider applicability as it seems competitive to other proposed methods for uncertainty estimation. However, evaluation on more than one real-world scenario would be required to draw meaningful conclusions.

Indeed, it would probably be possible to write an entire paper around the BaltoBot idea. However, such experiments are probably out of scope for this conference paper. Here, we think it suffices to show that we have superior results (including the Table 2 ablation comparing it with vanilla KMeans quantization and vanilla KDI quantization) using it as a subroutine for the task our paper addresses.

"Would similar strategies to ForestDiffusion using a data iterator to avoid materializing...?"

This is a fantastic idea. This would be more efficiently implemented by marginalizing over random permutations of autoregressive masking orders, which yields uniform-rate masked language modeling. These two are equivalent -- see TabMT (Gulati & Roysdon, 2024) and (Kitouni et al., 2024) -- but the latter is more amenable to using with the XGBoost data iterator as done by Jolicoeur-Martineau et al.
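
To spell out the equivalence being invoked (our paraphrase of the argument in the cited works): drawing a uniformly random permutation π and a uniformly random position d, and masking the suffix M = {π(d), ..., π(D)}, gives

```latex
|M| \sim \mathrm{Unif}\{1, \ldots, D\},
\qquad
M \mid |M| = m \;\sim\; \mathrm{Unif}\big(\{\, S \subseteq \{1, \ldots, D\} : |S| = m \,\}\big),
```

which (up to the trivial empty-mask case) is exactly the mask distribution of masked modeling with a uniformly sampled masking rate; that reformulation is what makes the data-iterator trick applicable.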

The density estimation aspect could be explored further.

Thanks for the reference. We hope to investigate this further, either for camera-ready if accepted, or in a future submission if rejected.

Since errors in auto-regressive generation could accumulate faster given certain feature permutations

Indeed. We use a different random permutation for each generated or imputed sample, as a way to contribute to sampling diversity. But each sample could also incorporate multiple different permutations to improve its quality. However, how one would use the model's own outputs to evaluate generations is still an open research question (e.g., for LLMs, the varentropy-based Entropix sampler (Shrek & Frog, 2024) is one such exploratory idea).

Comment (Reviewer)

Thank you for your detailed response. I agree that there is value in a method that does not require extensive tuning, particularly for smaller datasets.

I will also stand by my score and assessment which seems to deviate from that of other reviewers. I see the fact that the main idea is simple as a strength rather than a weakness since simple ideas that work well are preferable to convoluted ones. Autoregressive approaches for tabular data have not received much attention and the direction proposed here is interesting and novel and synergizes well with XGBoost's handling of missing data. Furthermore, the main challenge with relying on XGBoost was adequately addressed by the second contribution proposed. I agree with the authors that the BaltoBot contribution could stand on its own given better experimental support.

I also found the additional experiments with TabPFN interesting despite the results not being the most promising when compared to XGBoost.

AC Meta-Review

The paper introduces UnmaskingTrees, a method for tabular data imputation and generation using gradient-boosted decision trees. The approach leverages an autoregressive process where features are sequentially unmasked, achieving state-of-the-art imputation performance. The method outperforms existing approaches like MissForest and ForestDiffusion, and is competitive at generating data with missing values. A key innovation is the integration of BaltoBot, a non-parametric method for conditional density estimation. The model offers computational efficiency, with faster training and inference compared to diffusion-based models. The authors provide comprehensive benchmarks across 27 datasets, but some aspects of the methodology (like handling large datasets and the impact of certain hyperparameters) need further exploration. The paper could benefit from clearer technical details and a deeper discussion of model limitations and assumptions. The reviewers also indicate that the proposed method is not very novel and follows the approach people already take when generating data with discriminative models. Although it may fall in the borderline range, overall I believe that this paper needs more work before it can be accepted for publication.

Additional Comments from the Reviewer Discussion

There were some concerns from the reviewers that were properly addressed during the discussion. However, at least one reviewer still suggests rejection. The other reviewer suggesting rejection did not respond to the authors; even so, I think that some of that reviewer's points of criticism remain unaddressed in the paper.

Final Decision

Reject