PaperHub
Overall rating: 6.0 / 10 (Poster; 5 reviewers; min 5, max 7, std 0.9)
Individual ratings: 7, 6, 5, 7, 5
Confidence: 2.6 | Correctness: 2.6 | Contribution: 3.2 | Presentation: 2.4
NeurIPS 2024

Generative Forests

OpenReview | PDF
Submitted: 2024-05-13 | Updated: 2024-11-06
TL;DR

A new class of tree-based generative models and a training algorithm with fast, boosting compliant induction and models that can be used for data generation, missing data imputation and density estimation.

Keywords
tabular data, generative models, boosting, trees

Reviews and Discussion

Review (Rating: 7)

This paper proposes a generative model for tabular data based on a forest of trees. It is evaluated on the quality of the generated samples (using an optimal transport based distance to a held out test set), on data imputation and density estimation.

Strengths

Generative modeling for tabular data is an important research topic, and exploiting the strengths of trees (ubiquitous in this modality) for this generative task is a promising direction.

The framework inherits the flexibility of boosting, and the theoretical results (under weak learning assumptions) provide guarantees that the underlying distribution can be approximated well enough with enough trees.

Weaknesses

Further empirical comparison could be conducted. For example, recent works have applied diffusion for tabular data (Kotelnikov et al.) as well as flow matching (Jolicoeur-Martineau et al.). It might be worth comparing to these more recent approaches.

Furthermore, more metrics could be included for the LifeLike experiment to provide more nuanced results. See Jolicoeur-Martineau et al. for examples of such metrics.

References:

Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023, July). Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning (pp. 17564-17579). PMLR.

Jolicoeur-Martineau, A., Fatras, K., & Kachman, T. (2024, April). Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees. In International Conference on Artificial Intelligence and Statistics (pp. 1288-1296). PMLR.

Questions

What is the computational cost for training and inference compared to Generative Trees? (It would be interesting to provide wall-clock times)

How are the parameters chosen (number of trees, number of splits, ...)?

Limitations

See questions

Author Response

We would especially like to thank the reviewer for singling out the importance of boosting and of our theoretical framework and results.

weaknesses

Further empirical comparison could be conducted. [...]

We would be happy to do so, but kindly note that at this stage we have received criticism that we cram a lot of experimental material into a small allotted space [kTDz-C]. Furthermore, our paper investigates three (3) different experimental settings on which we already have five (5) contenders (in fact 6 if we count the generative trees of [29], and even more if we consider "copy" to be one in our GEN-DISCRIM experiment, Section V.V.3.5); finally, the reviewers altogether suggest adding five (5) more contenders, in different parts of our three settings. As we write above [ALL], we would like to keep a good balance between theory and experiments, so it would be important for the reviewers to reach a consensus on an additional experiment that we could detail as part of the +1 page. We make a suggestion in [ALL] that would be easy to carry out during camera-ready preparation and which aligns with the reviewer's suggestion.

Furthermore, more metrics could be included for the LifeLike [...]

Good suggestion, but kindly keep in mind that this would require more space in our paper (see above). We suggest adding Gower and coverage in the appendix for our method and all our LifeLike experiments. Kindly note also that, out of the many metrics in Jolicoeur-Martineau et al., an optimal transport distance is the only one that provides a statistically meaningful comparison at the distribution level (see [kTDz-F]). Jolicoeur-Martineau et al. use Wasserstein's metric, which is computationally expensive and prevents them from computing it on medium+ sized datasets (cf. their Section B.2.2); we use a regularized version, now very popular since Cuturi [8], that still bears OT properties and is sufficiently fast that it could be computed on all our datasets, even on our "low-end" laptop.
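
For reference, a minimal sketch of how such an entropy-regularized OT distance between a real and a generated sample could be computed with the POT library; the data arrays, the cost matrix choice and the regularization value `reg` are placeholders for illustration, not the settings actually used in the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def sinkhorn_distance(X_real, X_gen, reg=0.1):
    """Entropy-regularized OT cost between two samples (Cuturi's Sinkhorn)."""
    a = ot.unif(len(X_real))           # uniform weights on real points
    b = ot.unif(len(X_gen))            # uniform weights on generated points
    M = ot.dist(X_real, X_gen)         # pairwise squared-Euclidean cost matrix
    return ot.sinkhorn2(a, b, M, reg)  # regularized OT value

# Example with random stand-ins for real / generated tabular data:
rng = np.random.default_rng(0)
print(sinkhorn_distance(rng.normal(size=(200, 5)), rng.normal(size=(200, 5))))
```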

questions

What is the computational cost for training and inference compared to Generative Trees? (It would be interesting to provide wall-clock times)

It is more computationally demanding than generative trees (GTs) for a single reason: when testing a split at a leaf, we have only one support to split in a GT (the one corresponding to the leaf), while in a generative forest (GF) we need to test the splitting of all supports in ${\mathcal P}({\mathcal T})$ that are included in the leaf's. Note that this has a positive side: we can train much smaller GFs compared to GTs. So somehow we are slower with GFs but end up training models with fewer splits to reach the same quality. We could put times, but such a question entails a broader one on our contenders and this could be sensitive [kTDz-F]. Note also that we have used a "low-end" laptop and a "medium+ end" desktop for our experiments, but we have systematically trained our models on the laptop. This was deliberate, as we wanted to make sure we could learn such models on a "simple" device. In this context, we are not sure we would not end up comparing pears and apples when it comes to times. At least we can propose to put the training times of our method for all our experiments, with a clear explanation of how / why we did so on a "low-end" laptop (in the appendix)?

How are the parameters chosen (number of trees, number of splits, ...)?

We deliberately did not optimize / hypertune these parameters, in order not to factor "human tuning" into the explanation for the quality of our models (keep in mind we have three experimental settings), so we just picked informed guesses that would also represent fast enough training, see [kTDz-E] [cvp8-F]. The other reason we have not done this is that we are convinced there is an automatic method to find the "right" parameters, which comes down to pruning our models, see [cvp8-E]. This would of course require a work / paper of its own.

Comment

Dear Authors,

Thank you for your detailed response. I'd be happy to see the results against Forest-Flow if the other reviewers also agree.

Comment

Kindly see the top of this page, or search for tags [Exp-0], [Exp-1] and [Exp-2].

Comment

Thank you for your response, I have raised my score by one point.

Review (Rating: 6)

This paper introduces generative forests (GF), a new class of generative models designed for tabular data and based on sets of trees (forests), and a training algorithm called GF.BOOST. The algorithm is designed to be simple, making it easy to implement with minor modifications to existing CART-style decision tree induction schemes.

The GF models and GF.BOOST algorithm are evaluated across various domains, demonstrating their capability in tasks such as data generation, missing data imputation, and density estimation. The results show that even a small number of trees within a generator can be effective, highlighting the potential of the proposed approach in handling tabular data.

Strengths

  • I believe the paper is a nice addition to the ML community. Prevailing methods for data imputation in DTs are based on heuristics and tricks, whereas this paper approaches the problem from a sound probabilistic point of view, which makes a lot of sense to me. This naturally leads to the ability to generate more data, which may have serious implications for the synthetic data generation literature.
  • The method is relatively easy to implement and can be done as a straightforward extension of CART-style tree induction methods.

Weaknesses

  • Experiments section may need some improvements:

    • Most datasets in Table 2 (which seems to be the main results table) are standard ML benchmarks. It would be really nice to have real-world Kaggle-style datasets with missing values and compare with strong baselines therein.
    • Adding traditional forests is also necessary to compare against established methods (XGBoost, LightGBM, etc.)
  • I find most of the math notations hard to follow and can be significantly simplified.

Questions

  • How computationally efficient is it to compute Eq. 4? From my understanding it is used for splitting a tree
  • Is there any mechanism to prune some trees in a forest (reduce T)? Traditional tree boosting builds trees sequentially, so there is an opportunity for early stopping. I was wondering how critical it is here to have a predefined T...
  • Is there any separate ablation on how well this method works for categorical features?

Limitations

  • there seem to be some limitations in the convergence analysis (e.g. symmetry of ℓ).
  • otherwise, having a separate (sub)section on limitations would be nice (e.g. runtime compared to traditional boosting methods).
Author Response

We thank the reviewer for noting the importance of our formal contribution!

weaknesses

It would be really nice to have real-world kaggle-style datasets with missing values

Actually, sigma-cabs is a Kaggle dataset with missing values. Stanford Open Policing is not from Kaggle, but it also has missing values and is real-world (see our Table A1). We understand the reviewer would like to have a few more? Perhaps we can use [uJTo-B]?

Adding traditional forests is also necessary to compare against [...]

We understand the reviewer would like us to add tests on supervised learning. We kindly refer to [cvp8-A] for a discussion on this matter.

I find most of the math notations hard to follow and can be significantly simplified

We attribute this to the fact that our approach is purely grounded in a supervised learning framework, which is indeed rare for generative models. We refer to [cvp8-B][uJTo-A] for discussions of these notions and their importance; we would use part of the +1 camera-ready page to hopefully bring improved explanations.

questions

How computationally efficient is it to compute Eq. 4?

[6MHb-A] Great question. We only need to keep those elements in ${\mathcal P}({\mathcal T})$ with nonzero empirical measure, because for all others their contribution to (4) is zero, so the complexity is linear in the size of the training sample (we cannot have more such elements in the set). We also refer to [kTDz-I] for side remarks on the code we provide.
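
To illustrate why only non-empty cells matter, here is a runnable toy demonstration (with scikit-learn CART trees standing in for a generative forest; this is not the paper's code): each example falls into exactly one cell of the intersection partition, the tuple of leaves it reaches across the trees, so the number of cells with positive empirical measure is at most the sample size, whatever the tree sizes.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Stand-in forest: independent CART trees fit on random targets (illustration only).
forest = [DecisionTreeRegressor(max_depth=5, random_state=k).fit(X, rng.normal(size=len(X)))
          for k in range(5)]
# Each example maps to the tuple of leaves it reaches -> one non-empty cell.
cells = Counter(map(tuple, np.stack([t.apply(X) for t in forest], axis=1)))
print(len(cells), "non-empty cells for m =", len(X))  # always <= m, here <= 500
```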

Is there any mechanism to prune some trees in a forest (reduce T)? 

Great question: as we say in [cvp8-E], there surely is, but it would deserve a paper of its own (we are talking about a pruning mechanism with provable properties).

I was wondering how critical here to have predefined T... [...]

We refer to [cvp8-F] for a similar discussion and the reason why we did not "hypertune" our parameters $T, J$.

Is there any separate ablation how well this method works for categorical features?

We would be happy to get additional context on this question: does the reviewer think about testing "all features" vs "only numerical features" to measure how much categorical features contribute to the model quality? (Note that we would have to push it to the appendix, see [ALL].)

limitations

there seems to be some limitations in convergence analysis (e.g. symmetry of l).

This is in fact a limitation of a previous paper [29], a limitation that we remove in our analysis, which is thus more general (see L497, which we can move to the main file using the +1 camera-ready page).

Comment

Hi, thanks for your rebuttal! However, I feel like my concerns on additional experiments were not addressed adequately.

Most datasets used seem to not have missing values, and the ones that do have them appear to be of small size, e.g. sigma-cabs has 5000 data points and 13 features. I don't understand why it was a problem to run the requested evaluations during the rebuttal. One could just get some dataset of reasonable size where tree-based models perform well (e.g. XGBoost), apply some standard method (e.g. https://www.kaggle.com/code/alexisbcook/missing-values) and evaluate against the proposed method. I feel like this is a major blocker for me to accept the paper without those results. The current evaluations do not convincingly show the advantage of applying this method specifically over others.

Comment

We forgot to mention an important point: in missing data imputation, the task is to predict missing values and then compare them to a known ground truth to compute the metrics. Thus, the missing values must have been removed by us from the domains, and so those domains must contain the values that will be removed before testing missing data imputation.

Note that this does not prevent the domain from having missing values "as is" (our methods can be trained if the domain has missing values), but the prediction of those cannot be assessed against a ground truth because the true values are not known; read: those predictions cannot be included in a metric. This is why we simulate missing data imputation by removing a fixed proportion of values in the domain (5%, Missing Completely At Random, MCAR in the jargon of missing data imputation), and then compute metrics based only on those values known in the full domain and then predicted by the models.

Comment

Yes, I understand that you're doing the 3 things mentioned. But how, for example, will data imputation eventually be used? My best guess is to train some model on it. My concern is how this method is going to fit into the final pipeline, where we could use alternative methods of data imputation with the final model. The question is how GF performs with respect to that. This is a point which I think kTDz is raising as well...

Comment

We apologize if we misunderstand the point. Say we have a training sample $S_m$ with missing values. A missing data imputation method $M$ provides another sample, $M(S_m)$, without missing values. It seems that the reviewer would like to then apply another method, say $A$, which produces an ML output (model, statistics, etc.) $A(M(S_m))$. The composition $A \circ M$ is the pipeline mentioned. Then we are being asked to evaluate the quality of $A(M(S_m))$ when, for example, $M$ is our approach, for some $A$ (against alternative methods).

This is not the way missing data imputation is evaluated: it is evaluated simply by comparing $M(S_m)$ to the ground truth, say $S$, in which the missing values of $S_m$ on which the evaluation has to take place are not missing. We thus compare $M(S_m)$ and $S$, independently of any post-processing $A$. This is why our evaluation is as it is in the paper, aggregating over all features a pointwise loss that depends on the feature type (perr for categorical, rmse for numerical).

Using such pointwise metrics defines a gold standard for the evaluation of missing data imputation; see for example Muzellec, Josse, Boyer and Cuturi, "Missing Data Imputation using Optimal Transport", ICML 2020; Zhao, Sun, Dezfouli and Bonilla, "Transformed Distribution Matching for Missing Value Imputation", ICML 2023, etc.
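
To make the protocol concrete, a minimal sketch of this evaluation (a 5% MCAR mask, then per-feature pointwise losses computed on the hidden entries only); the array layout, the helper names and the imputer itself are placeholders, not the paper's code.

```python
import numpy as np

def mcar_mask(X, frac=0.05, seed=0):
    """Hide a fraction of entries completely at random (MCAR)."""
    rng = np.random.default_rng(seed)
    return rng.random(X.shape) < frac  # True = value removed before imputation

def imputation_scores(X_true, X_imputed, mask, categorical_cols):
    """Per-feature pointwise losses on masked entries only:
    error rate (perr) for categorical columns, RMSE for numerical ones."""
    scores = {}
    for j in range(X_true.shape[1]):
        hidden = mask[:, j]
        if not hidden.any():
            continue
        if j in categorical_cols:
            scores[j] = np.mean(X_true[hidden, j] != X_imputed[hidden, j])
        else:
            diff = X_true[hidden, j].astype(float) - X_imputed[hidden, j].astype(float)
            scores[j] = np.sqrt(np.mean(diff ** 2))
    return scores
```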

Comment

I'd still encourage the authors to discuss their method a bit more from the supervised learning angle; this is where DTs are mainly used so far. I'm maintaining my original score.

Comment

We are confused. The reviewer would like us to use XGBoost, but we do not do supervised learning. We do (i) data generation, (ii) missing data imputation, (iii) density estimation. XGBoost and LightGBM address supervised learning: the missing values need to be in one column (= class). Our missing values can be everywhere, which is why we use MICE, which is state of the art (note that we have also optimized MICE, in particular augmenting the number of trees used in random-forest-based imputation to 100, the default value of 10 being arguably too small; see also [kTDz-E]). We also test CART trees in MICE. Thus, note that the contender method we use is always tree-based.

Since the reviewer mentions datasets with missing values many times, we assume the task we have to tackle is missing data imputation. We have started more experiments with MICE against our method using real-world domains bigger than sigma-cabs. The results will be communicated before the discussion deadline. Note that our results in missing data imputation are plots (Table 5) and we will not be able to communicate plots, so we will communicate summary metrics that we already use.

Comment

We thank the reviewer for all interactions. As we are a few minutes before the deadline, we would like to mention that we have started comparisons on much bigger domains than the ones in our current benchmark. As a comparison, our largest (real-world) dataset in the vs-MICE benchmark has $m \times d \approx 363$K. We are now testing on bigger domains, but we are facing scalability setbacks with some contenders [uJTo-L].

Nevertheless, for imputation experiments, we can already report, as an example, an experiment vs MICE (same parameters as in our paper) on the domain medical_charges (a real-world domain from US hospitals with $m \times d > 815$K [govWD]) with RMSE = 2.86E8 (us) vs RMSE = 2.50E8 (MICE), p-val = 0.38, so no statistically significant difference.

To make a uniform treatment among our experiments, we have to test those domains on our two other settings (data generation, density estimation) as well, using all our other contenders, so we propose to select 1-2 such domains approaching $m \times d = 1$M and add those to our benchmark. Our program runs smoothly on such domains [uJTo-L], but we have to fairly solve scalability issues for some of our contenders [uJTo-L].

The author(s).

[govWD] L. Grinsztajn, E. Oyallon and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS 2022.

Review (Rating: 5)

The paper proposes a generative model based on an ensemble of trees (forest) for tabular data. The proposed model enjoys the following peculiar properties:

  1. Compared to generative models based on a single tree, it offers improved expressiveness in terms of partitioning the input space (linear for generative trees vs. polynomial for generative forests in terms of decision nodes in a tree).
  2. The model offers unbiased density estimation for sufficiently large capacity, thanks to the fact that each node in each tree leverages decisions in proportion to the empirical distribution.
  3. Learning the structure of the trees is performed by minimizing an objective function based on the likelihood ratio risk.
  4. Computationally speaking, data generation and density estimation can be performed in parallel with respect to the different trees. Indeed, results can be aggregated a posteriori by taking the intersection of selected regions.
  5. Apart from data generation and density estimation, the model can be extended to deal with missing data and perform data imputation.

Experiments are conducted on a series of tabular datasets from the UCI repository. The proposed solution is shown to outperform/being on par with existing tabular data generation (the most relevant being adversarial random forests), data imputation (e.g. MICE) and density estimation baselines (e.g. kernel density estimation).

Strengths

  1. The problem of learning a good generative model for tabular data is relevant (Relevance).
  2. The solution based on a tree ensemble represents a significant advance over solutions based on a single decision tree (Significance).
  3. The proposed model offers the possibility to tackle three different tasks: density estimation, data generation and data imputation. Moreover, experiments demonstrate the efficacy of the solution when compared with corresponding baselines for each task (Significance).
  4. The theory of the paper seems correct; however, I haven't gone through the proofs (Correctness).
  5. Code is available at the time of submission. No additional check has been conducted to verify the reproducibility of the results (Reproducibility).

Weaknesses

  1. One of the major concerns is with the novelty of the proposed solution (Novelty). What is the relation with the work in [1]? The work in [1] also provides a model able to tackle the three above-mentioned tasks. Would it be possible to provide a comparison?
  2. The paper is sometimes overloaded by unnecessary formalism, rendering the text hard to follow (Clarity). For instance,
    1. Lines 92-109: why is the whole theory about risk minimisation for binary classification introduced, when the goal is to perform density estimation? Would simple likelihood learning suffice?
    2. Definition of tree. "…labeled with subsets of their tail node's variable domain". Do you mean "labeled with subsets of attribute values for the corresponding variable"?
    3. Lines 160-163. Can you make an example or otherwise leave out the sentence? Am I correct to say that the permutation invariance property is due to the final intersection operation, or am I missing something?
    4. Algorithm 3, Step 2.1: How is node selection performed, uniformly at random? Moreover, can you provide some more details about the resulting trees, are they balanced?
    5. Can you elaborate on the motivation for proving Theorem 5.2 and discuss the significance of its results or the link with the results in the experiments?
  3. All experiments are conducted on datasets with fewer than 34 variables (Soundness). Would it be possible to provide some comparisons with [1] on larger-dimensional datasets (see the reference for examples) and/or discuss the computational requirements?

Reference

[1] Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. AISTATS 2024

Questions

Overall, I like the extension of generative trees to ensembles. I’m willing to raise my score if all weaknesses are addressed during the rebuttal phase.

Limitations

There are no additional suggestions.

Author Response

We thank the reviewer for organising the review in a very clean way that allows for easy referencing, tagging two significant key strengths on which we elaborate further below, and being clearly open for discussion. We use W.X.Y to refer to the weakness section.

W.1 "One of the major concerns [...]"

There are two parts to this question, a technical one (relationship to [1]) and experimental (comparison with [1]).

  • technical part: [1] is based on diffusion models while we rely on stochastic generation, so the approaches are radically different (they solve ODEs for data generation, we essentially "just" sample Bernoulli random variables, see our algorithm starUpdate). We thank the reviewer for this reference as it in fact provides us with a contender type that is radically different from all our contenders (not just our method) [ALL].

At this point, we would like to highlight a key difference with [1]: it has no formal analysis of the algorithm, i.e., no convergence / consistency result of any kind. Experimental results are important to tell the story, but a good theory gives the backbone. Among all references pointed to by the reviewers, none provides rates of convergence. We do. We also emphasize that it is not complicated to show a consistency result in our case [kTDz-A2].

  • experimental comparison: the reviewers provided overall a number of potential additional contenders. Assuming a consensus emerges around our proposition in [ALL], we would be happy to provide an additional comparison with [1] on data generation, hoping our experimental section will remain readable using the +1 page of camera ready (we would not compare on missing data imputation as [1] make it explicit that they do worse than MICE and we are competitive).

W.2.1 "The paper is sometimes overloaded by unnecessary formalism [...] Line 92-109, why is the whole theory [...]"

[uJTo-A] The formalism we introduce is necessary because we manage to train a generative model using the framework of supervised learning, and our algorithm heavily relies on the loss definitions mentioned. Furthermore, the loss properties are also important because, it turns out, our convergence results are more general than those of [29]. Those definitions make the "slack" explicit. We are confident we can make the section "breathe" using part of the +1 camera-ready page.

W.2.2 "Definition of tree [...]"

Correct. We can paraphrase.

W.2.3 "Line 160-163. Can you make an example or in case leave out the sentence"

It is correct but the sentence should remain because it grounds the proof (not given) of Lemma 4. We can add the reviewer's remark and merge it.

W.2.4 "Algorithm 3 in Step 2.1 How is node selection performed, [...]"

We choose the heaviest leaf among all trees (the one with the largest empirical mass in its support). In the code, it is method choose_leaf of class Algorithm.

W.2.5 "Can you elaborate on the motivation of proving theorem 5.2 [...]"

This is an empirical convergence rate under the weak learning assumption in the boosting framework. To summarize its importance, this has been the cornerstone of boosting algorithms since Valiant's weak/strong model and since Schapire's first proof that boosting is achievable in the model (1990). Such a result is thus fundamental: it shows an explicit rate under assumptions that are so "weak" that not satisfying them would just preclude learning (in the supervised setting, which is easier to state, failing the weak learning assumption would mean getting updates no better than a fair coin, thus indeed totally useless to learn). Kearns and Mansour in [18] summarize very well the importance of boosting in the context of decision tree induction, which is close to our setting. The first sentence of the abstract of the popular paper "Additive logistic regression: a statistical view of boosting" by Friedman, Hastie and Tibshirani summarizes the importance of boosting as we just explained:

"Boosting is one of the most important recent developments in classification methodology."

W.3 All experiments conducted on datasets with less 34 variables [...] comparisons with [1] on larger dimensional datasets [...]

[uJTo-B] (We correct: it is fewer than 35.) We would be happy to do so, see [ALL], but note that among all datasets of [1], only 3 out of 27 have more than 35 features (we can use those?).

Comment

Dear authors,

Thank you for the answers, which have addressed some of my concerns about clarity. There are still a number of questions that I would like to see addressed and that refrain me from raising the score, also in light of the other reviews.

Novelty

While the difference between the proposed solution and Forest-Flow is clear from a model perspective, I don't understand the difference in terms of properties, especially regarding generalization (as also pointed out by reviewer kTDz). Firstly, Theorem 5.2 provides a result about a weak monotonic reduction of the density distance. Since the reduction is not strict, how do you ensure that the learnt model is unbiased? Secondly, does the mentioned consistency property imply that Forest-Flow is biased, whereas Generative Forests is not? As far as I understand, training Forest-Flow reduces the distance between the model and the empirical data distribution leveraging standard KL divergences through an ELBO-like formulation. Consistency results clearly hold in the infinite sample regime and for sufficiently large network capacity. Can you please clarify this aspect?

Clarity

It is mentioned that during learning, the leaf with the largest empirical mass is selected among all trees. How does this ensure that learning is unbiased, as uncertainty is not taken into account? Moreover, how does this criterion influence the topology of the trees, are they balanced?

Can you provide some synthetic visualisations of the convergence results of Theorem 5.2 and of the benefits of the strict properness property against other common generative losses, such as the KL divergence? This could help add value by making the concepts clearer.

From an implementation perspective, how are the different trees initialised and how is diversity across trees ensured throughout learning?

Experiments

Comparison results against Forest-Flow are promised, but it is not clear whether they will be released during the discussion phase or afterwards. Providing these during the discussion phase could definitely help in the decision.

Comment

Kindly see the top of this page, or search for tags [Exp-0], [Exp-1] and [Exp-2].

Comment

Dear reviewer,

Since you lodged your additional questions, we have reported a stream of experimental results [Exp-0][Exp-1][Exp-2][Exp-3], a consistency result [Th-0] and replied to your questions. You made it explicit in your questions that some information "[...] could definitely help in the decision [...]", information we have since lodged (we refer to the experiments on Forest-Flow, [Exp-0] and [Exp-1]), so we would like to hear back before the end of the discussion phase on updates relative to this decision (reviews will become invisible again afterwards).

Thank you.

Comment

Dear authors,

thank you for the answers and the additional experiments.

I haven't had time to check in close detail the theoretical aspects of generalization, but looking at the results and your answers, the outcome looks reasonable.

It is nice that you managed to provide an experimental comparison with Forest-Flow, and it is clear from these results that the two approaches, while being different in nature, have similar performance. I'm therefore happy to increase my score.

However, I will not increase my score even further as the issue of scalability is still not addressed and there are still unclear aspects.

Scalability: Specifically, there is no comparison on larger datasets (number of features > 34) against Forest-Flow. I would have expected some experiments on the three larger datasets you mentioned: libras, connectionist bench sonar and qsar biodegradation.

Unclear aspects: I have two last questions:

  1. If you replace the loss with a KL, do you get a good model? If so, then why do you need the proposed loss? I would appreciate a simplification of the formalisms introduced in the paper.
  2. I'm not sure that by choosing the leaf with the largest empirical mass you obtain the desired diversity across trees. Can you please elaborate on that? I suspect that by doing so you end up learning a single tree. How is this phenomenon prevented?
Comment

We thank the reviewer for these additional comments. Here is our reply, following the comment's order.

On Novelty. We have added here the proof that our method is consistent [Th-0] (see above) in the framework used by A.H.C. Correia, R. Peharz and C. P. de Campos, Joints in Random Forests, NeurIPS 2020. We do not want the reviewer to think that we claim any biasedness / inconsistency property for Forest-Flow. We do not know. The experiments we just added here [Exp-0][Exp-1] at least demonstrate its qualities as a contender. Note that our result in [Th-0] follows from a simple relationship between our generative forests and binary trees (Lemma A above), which allows us to branch immediately to the state of the art. Maybe a similar path exists for Forest-Flow?

On Clarity. We select the leaf with the largest empirical mass because it is a simple (the simplest?) way to meet conditions (a) and (c) in our weak learning assumption, and thus to get fast empirical convergence. We also realized while writing [Th-0] that this strategy has a positive impact in terms of consistency. Note that this strategy of picking the heaviest leaf is also used in DT induction [kmOT]. Depending on the domain, this can create a diversity of model "shapes", which we believe is a good thing because the model "shape" is also adaptive to the domain at hand. Strict properness is crucial because otherwise the minimization of the loss does not guarantee convergence to a good model (L108-L109). We alleviate the symmetry assumption of [29], and this turns out to be important in the light of the reviewer's question: the KL divergence can be formulated in the proper loss framework with asymmetric partial losses [rwID, Table 2]. Note also that the GAN's Jensen-Shannon divergence can also be represented.
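
For readers less used to the CPE vocabulary, a brief reminder in standard notation (generic symbols from the proper loss literature, not necessarily the paper's): with partial losses $\ell_1$ (positive class) and $\ell_{-1}$ (negative class), the pointwise risk of predicting $u \in [0,1]$ when the true class probability is $v$ is

$$L(u, v) = v\,\ell_1(u) + (1 - v)\,\ell_{-1}(u), \qquad \underline{L}(v) = \min_{u} L(u, v).$$

Properness means $v \in \arg\min_u L(u, v)$ for every $v$, and strict properness that this minimizer is unique; Matusita's loss is the classical symmetric example, with pointwise Bayes risk $\underline{L}(v) = 2\sqrt{v(1-v)}$, while the partial losses recovering the KL divergence mentioned above are asymmetric.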

On Implementation. Trees are all initialized to roots. Diversity is ensured by the adaptivity of the model "shape" during training (cf. above).

On Experiments, we have released the set of Experiments vs Forest-Flow (see above).

References:

[kmOT] M. Kearns and Y. Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC 1996.

[rwID] M. Reid and R.C. Williamson. Information, Divergence and Risk for Binary Experiments. JMLR 2011.

Comment

[uJTo-L] We thank the reviewer for this comment and their decision to raise their score.

We apologize for not replying sooner, but as soon as we got the last comments we decided to run some pure scalability experiments comparing our software and Forest-Flow's.

We are not sure why the reviewer says we mentioned using "[...] libras, connectionist bench sonar and qsar biodegradation [...]". We did not mention those. Looking for these datasets, it seems to us that they are also quite small (we searched OpenML and the UCI repository; these are the hits we got for the largest versions):

-libras (https://www.openml.org/search?type=data&sort=runs&id=299&status=active ?), 24 * 15 = 360 examples, 90 features

-connectionist bench sonar (https://archive.ics.uci.edu/dataset/151/connectionist+bench+sonar+mines+vs+rocks), 208 examples, 60 features

-qsar biodegradation (https://www.openml.org/search?type=data&sort=runs&id=1494&status=active), 1055 examples, 42 features

We do not know if the suggestion from the reviewer was a typo. In the spirit of saving time, we looked ourselves for substantially larger domains to accommodate scalability studies. We have started making scalability comparisons on the biggest domains of the benchmark of [govWD], considering only domains with $m \times d > 1$M.

As far as our experiments go (laptop), we are facing an issue with Forest-Flow: running the same code as we had for all other experiments results in R crashing with a "vector memory limit". Setting the available memory to 50 GB gives the same error; setting it to 150 GB results in the session aborting with a fatal error.

After several more tries with the same outcomes, we decided to follow recommendations from the webpage [jkkGA] on how to optimize parameters for memory management. Given the size of the dataset, we reduced the parameter duplicate_K, initially 100 (from the website, this amounts to a number of duplications of the training sample, so targeting this parameter looked like a no-brainer). We first reduced it to 50, but this washed the system out of application memory, so we had to run it with a smaller value. So far, duplicate_K = 10 seems to work on the domains tried, but the webpage [jkkGA] makes it clear that decreasing duplicate_K can lead to worse performance. So we will probably have to fine-tune our runs against Forest-Flow for fairness.

We do not have issues with our code as far as we can tell. The biggest domains we have tried so far have $m \times d \approx 1.5$M. For example, with T=200 trees and J=500 iterations, the max memory we use at run time is < 1 GB; for T=500 trees and J=2000 iterations, it is < 1.5 GB. As far as we can tell, the worst training times we get, with T=500 trees and J=2000 iterations, are < 7000s.

(This is not meant to criticise Forest-Flow's R implementation, but rather to point out the difficulties we faced in the scalability analysis, to wrap up our analysis and justify the choices we made in this short time.)

Regarding the last two Unclear aspects, here are our answers:

-On the loss we use (vs KL). The loss we choose, derived from Matusita's loss (class Statistics.java), is the one guaranteeing the best possible convergence rate in our theory (refs [24][18] in our paper). The KL is theoretically suboptimal from the boosting-rate standpoint.

-On learning a single tree / diversity. We initialize all trees to roots (roots do not affect generation because their support is the whole domain), then pick the heaviest leaf among all current trees. Hence, during the first $T$ (number of trees) iterations all roots are split first, since each root has a larger mass (1) than any leaf in a tree with depth > 0. Then we keep on choosing the heaviest leaf among all trees: the (tree, leaf) couple chosen depends on the current set of trees and on the data at hand.

Maybe an example will convey the idea that we indeed learn sets of trees and that their structure depends on the domain: suppose we have two trees, $\Upsilon_1$ and $\Upsilon_2$. At initialisation, they are root leaves, thus of weight 1 (the sum of the weights of all examples reaching them).

[A] Step 1: we split $\Upsilon_1$. Suppose the two (new) leaves have weights 0.8 and 0.2. At the next step, we have to split $\Upsilon_2$ because its root (leaf) has weight 1.

(two scenarios follow)

[B1] Scenario 1: we split $\Upsilon_2$ and its two new leaves have weights 0.6 and 0.4.

[C1] At step 3, the next leaf to be split is then in $\Upsilon_1$ (weight 0.8).

[B2] Scenario 2: we split $\Upsilon_2$ and its two new leaves have weights 0.1 and 0.9.

[C2] At step 3, the next leaf to be split is now in $\Upsilon_2$ (weight 0.9).

[and so on]
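
As a purely illustrative, runnable toy version of this "split the heaviest leaf in the whole forest" rule (leaves are just sets of training-row indices, and the split criterion here is a placeholder median cut on a random feature, standing in for the paper's loss-minimizing split; this is not GF.BOOST):

```python
import numpy as np

def toy_grow_forest(X, num_trees=2, num_iterations=6, seed=0):
    """Toy sketch: at each iteration, split the leaf with the largest empirical mass."""
    rng = np.random.default_rng(seed)
    # Each tree starts as a single root leaf holding all examples (mass 1).
    forests = [[np.arange(len(X))] for _ in range(num_trees)]
    for _ in range(num_iterations):
        # Pick the (tree, leaf) pair carrying the most training examples.
        t, i = max(((t, i) for t, tree in enumerate(forests) for i in range(len(tree))),
                   key=lambda ti: len(forests[ti[0]][ti[1]]))
        rows = forests[t].pop(i)
        j = rng.integers(X.shape[1])            # placeholder split criterion
        thr = np.median(X[rows, j])
        left, right = rows[X[rows, j] <= thr], rows[X[rows, j] > thr]
        forests[t] += [left, right]             # replace the leaf by its two children
        print(f"split tree {t}: leaf of mass {len(rows)} -> {len(left)} + {len(right)}")
    return forests

toy_grow_forest(np.random.default_rng(1).normal(size=(100, 4)))
```

On such data the two roots are split first (both have mass 1), after which the chosen (tree, leaf) pair depends on the masses produced so far, exactly as in the two scenarios above.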

We are conscious that we are lodging these comments just before the deadline. We thank the reviewer (all reviewers in fact) for a strong engagement around our paper.

[govWD] L. Grinsztajn, E. Oyallon and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS 2022.

Review (Rating: 7)

The paper presents a novel boosting algorithm based on ensembles of generative trees, addressing tasks such as binary classification, missing data imputation, and density estimation. The proposed algorithm optimizes specific loss functions and leverages the structure of the models to efficiently estimate densities given a generative model and partially specified observations. The authors demonstrate the practicality of their approach through comprehensive experiments, highlighting its effectiveness in binary classification compared to diverse state-of-the-art methods, including competitors like Adversarial Random Forests (ARF), CT-GAN, vine copula autoencoders (VCAE) and kernel density estimation (KDE).

The authors introduce a unique training methodology for these generative models, which is based on class probability estimation (CPE) in binary classification tasks. The loss functions are decomposed into partial losses for positive and negative classes, with the Bayes risk function defined as the optimal achievable loss given a certain positive base-rate. Properness and strict properness of the loss functions are critical properties, ensuring that the loss is minimized when the predicted probability matches the true class probability. The paper discusses various proper losses, such as square loss, logarithmic loss, and Matusita's loss, which are symmetric and differentiable. Furthermore, the paper explores loss functions in the context of density ratio estimation and introduces the likelihood ratio risk for assessing the performance of generative models. Using Bregman divergence and the generalized perspective transform, the authors propose a framework for evaluating the density ratios of different distributions.

This theoretical foundation supports the practical applications of their models, which show competitive performance on the proposed datasets. The authors carried out experiments on a total of 21 datasets from sources such as UCI, Kaggle, OpenML, and the Stanford Open Policing Project, or simulated data. These include datasets like iris, ringGauss, gridGauss, forestfires, tictactoe, ionosphere, student performance, winered, abalone, kc1, sigma-cabs, compas, artificial characters, jungle chess, open-policing-hartford, and electricity, with varying attributes and missing data conditions. The paper concludes that their models, trained with the novel generative tree algorithm, are strong contenders against existing state-of-the-art techniques.

Strengths

Regarding originality, the paper introduces a novel approach to generative modeling by creating a generative forest based on Class Probability Estimation (CPE). This method represents a significant advancement over current state-of-the-art algorithms. By leveraging ensemble methods and decision tree induction for generative tasks, the paper presents a fresh perspective that has not been extensively explored in the literature. The combination of well-known techniques in a new context adds substantial value to the field and showcases the authors' innovative approach to generative modeling.

The quality of the submission is commendable. The paper is well-organized and meticulously indexed, with a comprehensive appendix that supports the main content. The methodology is robust, encompassing a detailed mathematical explanation, pseudocode for the proposed algorithm, and extensive experimental validation. The authors have tested their model on 21 diverse datasets from reputable sources such as UCI, Kaggle, OpenML, and the Stanford Open Policing Project. This thorough evaluation across various tasks, including data generation, missing data imputation, and density estimation, provides strong evidence of the model's performance and versatility. Additionally, the paper continues the line of research in generative models, building on previous work and demonstrating significant progress in this domain.

The paper is clearly written and well-organized, ensuring that readers can easily follow the authors' reasoning and methodology. The detailed descriptions of the mathematical foundation, the pseudocode, and the experimental setup are particularly noteworthy. These elements would allow an expert to reproduce the results. Furthermore, the inclusion of a comprehensive appendix adds to the clarity by providing additional details without cluttering the main text.

The results presented in the paper are of high value. In that line, the authors have made a commendable effort by comparing their method with the closest approaches (ARF, copycat) cited in [42] and [29]. Furthermore, they have clearly outlined the distinctions and enhancements of their algorithm in comparison to these approaches.

The authors suggest that generative forests can effectively eliminate "the slack" when utilizing decision trees as discriminators. This implies that decision trees exhibit a more regulated and foreseeable behavior in contrast to neural networks. Decision trees are inherently more calibrated and provide more deterministic outcomes. As a result, the feedback they provide to the generator during training is more accurate and reliable. The proposed method seems to enhance binary classification performance compared to existing approaches and also addresses challenging tasks such as missing data imputation and density estimation. The model's ability to produce competitive results across a diverse array of datasets and tasks underscores its potential impact on both the research community and practical applications.

Weaknesses

In section 1, the authors delve into the use of generative AI in the ML community and highlight two significant features of their generative forest. They emphasize its simplicity and the strong convergence guarantees it offers in a weak/strong learning model, akin to Valiant’s PAC learning boosting model. However, it remains unclear whether this generative forest, based on Class Probability Estimation (CPE), can effectively handle multiclass scenarios.

Section 3 introduces the supervised loss for the generative tree. Nevertheless, from a synthesized perspective, the drafting in line 75 lacks clarity. The authors should consider enhancing the clarity of these lines.

Moving on to Section 4, in line 125, the authors refer to 'c', but it is not clearly connected to the context provided in the text.

Regarding the supplementary material, Table A3 is challenging to read as the footnotes overlap with the image.

Questions

In Table 2, the final row displays the wins and losses for "us". I am not closely following the numbers, as you have already chosen the green one, which includes p-values indicating significance. Another query pertains to comparing the number of trees. It seems unfair to compare T = 500 with competitors having fewer trees (10, 50, 100, 200). In simpler terms, I couldn't comprehend why you mentioned that all these approaches rely on models that are vastly distinct from each other. I know that in the article the authors mention that ARFs are trained with a varying number of trees T ∈ {10, 50, 100, 200}. Additionally, ARFs incorporate an algorithm for tree size selection, eliminating the need for manual selection. On the other hand, CT-GANs undergo training with different numbers of epochs E ∈ {10, 100, 300, 1000}, but I don't follow why these models are then compared with generative forests comprising T = 500 trees and trained over a total of J = 2000 iterations. That approach was used for Tables 2 and 4.

Limitations

The work presented by the authors is highly promising since they have introduced a novel methodology for constructing generative tree models based on class probability estimation (CPE) as a loss function in binary classification tasks. However, the paper would benefit from a dedicated section discussing the current limitations of this method. Specifically, it would be useful to understand whether this approach can be applied with the same efficacy to multiclass classification or, with some modifications, to regression tasks.

Additionally, it would be valuable to identify the types of tabular data that the proposed method might struggle with or the conditions under which its performance may be suboptimal. For instance, are there specific data types or distributions where this model does not perform as well? A thorough examination of these aspects could enhance the article.

Moreover, while the authors have laid a strong foundation for binary classification problems in their work, constructive suggestions for improvement include providing an analysis or discussion on how the current method could be extended or adapted to handle multiclass classification tasks or regression problems, which would be helpful.

In that line, another interesting thing could be identifying the specific types of tabular data that might present challenges for the proposed method, which would be beneficial. Under what conditions does the model's performance decline, and what are the characteristics of datasets where this approach is less effective?

Author Response

We thank the reviewer for noting that our approach "[...] presents a fresh perspective that has not been extensively explored in the literature [...]" and noting that the quality of our submission is commendable.

weaknesses

In section 1, [...]

[cvp8-A] We assume the reviewer wants to know if our models can tackle supervised learning with multiclass problems in the context of class probability estimation. The answer is yes, and it just comes from the fact that our generative models can easily provide the full distribution of missing values given the observed ones and the generative model. Indeed, supervised learning is "just" missing data imputation where the class label is missing, and the full distribution we obtain, in this case, is the complete Bayes rule estimate for $P[Y=c|X]$ (where $c$ ranges in any set, thus allowing multiclass). Note that we can do something "slicker" than supervised learning when additional variables are missing, because in this case we get the complete estimates for $P[\text{all missing variables} = \text{set of values} \mid \text{observed variables}]$. The pointwise max over the set of values necessarily contains a prediction for the class.
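
In other words (standard notation, not the paper's symbols), the classifier induced by treating the label as the only missing value is simply

$$\hat{y}(x) \;=\; \arg\max_{c}\; \hat{P}\big[\,Y = c \mid X_{\mathrm{obs}} = x\,\big],$$

where $\hat{P}$ is the conditional distribution returned by the generative forest for the missing label given the observed features $X_{\mathrm{obs}}$.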

Section 3 introduces the supervised loss for the generative tree. [...]

[cvp8-B] We would definitely be happy to use part of the +1 camera-ready page to expand on this part, as it grounds our approach in the supervised learning framework.

Moving on to Section 4, in line 125, [...]

[cvp8-C] We agree. In fact, part of the hardness probably comes from the fact that we did some "acrobatic" implicit referencing to save space: (C) is used earlier, in lines L119+. As for [cvp8-B], we believe we can use part of the +1 page to polish this.

Regarding the supplementary material, Table A3 [...]

Indeed! We apologize (this will be fixed); the obfuscated line is:

"Table A3: 2D density plots ( Generative forests x = Life Style Index and y = Customer Rating) for data generated on sigma_cabs, from models learned with 5% missing features [...]"

It is important because sigma_cabs is a real-world domain and some features appear to be "trickier" to generate than others.

questions

We directly reply here to the list of concerns.

[cvp8-D] First, ARFs have fewer trees, but they are much bigger than ours (the technical reason being that to get a good model overall, each tree needs to already be a good model of its own, because the distribution learned is a convex combination of the trees'; get one wrong, and the overall model can be heavily penalized). In the end, their total number of splits is equivalent to or bigger than ours. We propose to also indicate the average total number of splits for ARF.

[cvp8-E] "[...] Additionally, ARFs incorporate an algorithm for tree size selection, eliminating the need for manual selection [...]" Correct ! As we say in L311-L312, we do not have one in this paper. Such a matter would deserve a work of its own. However, the blueprint of such an algorithm seems intuitive: get inspiration from pruning algorithms in supervised decision tree induction.

"On the other hand, CT-GANs [...]" We understand that the reviewer points out that some parameters look like pears and apples. It is true and we accept the criticism: we deliberately wanted to compare techniques coming from many different model types [ALL]. It then came as a constraint to figure out some good model parameters to train for the contenders: in the case of KDE, we kept the kernel that gave overall the best results on our datasets; we tried to use a number of epochs that looked sound for the datasets that we had in CT-GANs, etc. . This explains why we give several different results for our contenders, in order not to "fix arbitrary" values.

limitations

The work presented [...] multiclass classification [...] regression.

Our method can indeed cope with supervised learning, and this comes from the abilities of our models [cvp8-A]. However, this would represent a fourth task, and we are not sure this represents a limitation of our work, which already contains experiments on 3 different tasks (no paper cited by the reviewers addresses this many tasks). Probably worth adding to a work of its own on [cvp8-E]?

Additionally, it would be valuable to identify [...] proposed method might struggle with [...]

[cvp8-F] We agree. We did not do it because we strongly suspect from our experiments that, more than "hard" datasets, there is actually a possibly suboptimal choice of our parameters $J, T$. By deliberately choosing those from a very small set in our experiments, we accepted the possible suboptimality of some of our results. We did not want to hypertune our method, and we believe there is a simple path forward which would then be useful to identify potentially hard datasets: first solve the pruning problem [cvp8-E]. We are happy to elaborate more on this in the +1 page.

Moreover, while the authors have laid a strong foundation [...]

We refer to [cvp8-A] on this.

Review (Rating: 5)

The authors introduce a novel generative model (Generative Forests - GF) for sampling and density estimation, leveraging an ensemble of trees. Their method partitions the input domain by considering all possible intersections of the supports of the leaves across the ensemble, resulting in a much finer partitioning than a single tree could achieve. For each subset of the input domain thus obtained, the model estimates a uniform density based on an empirical training measure. Efficient algorithms for sampling from such a model and evaluating the densities are derived.

To train the model, the authors propose an algorithm that iteratively splits a selected leaf node in one of the ensemble's trees. This splitting is guided by minimizing a strictly proper loss for a binary class probability estimation task.

The authors evaluate their method's performance against competing approaches across three main dimensions: the quality of synthetic data generation, missing data imputation, and density estimation.

Strengths

  • I find that the paper reads reasonably clearly. The language and notation used throughout the paper are very precise, although they occasionally appear overly complex.

  • The proposed method is interesting and I believe enjoys reasonable novelty.

  • The method and theoretical results all appear to be sound.

  • The proposed method's ability to estimate densities makes it more versatile than other generative methods commonly used for tabular synthetic data generation. Few methods can match its generality in handling various types of tabular data, except for other tree-based methods like Adversarial Random Forests (ARF). In this context, the positive results presented in comparison to ARF are promising.

  • The authors have made their implementation code available.

Weaknesses

While the authors adequately explain their ensemble tree-based model and its sampling and density computation methods, the explanation of the training algorithm is less thorough. There is insufficient discussion on how the method scales computationally with the number of trees and splitting iterations (see my first question). Additionally, more discussion on the motivation behind the formulation of the CPE task would be welcome.

According to the paper's Lemma 4.4, GF estimates a different density value at each non-empty intersection of the supports of each leaf, derived directly from the empirical training measure. This is similar to ARF or other works such as [1] or [2] when using uniform densities at the leaves, with the exception that these methods would output an averaged density estimated independently from the leaves of each tree in the ensemble. It is unclear why the proposed approach should be better. From a generalization perspective, I would intuitively assume that averaging densities from multiple trees has a regularizing effect compared to using a single tree (see [2]). On the other hand, GF's density estimation is based on a finer partitioning, whose cardinality grows with the product of the depths of all trees in the ensemble. While this could provide greater representational capacity, it would seem to make the method more susceptible to overfitting.

The theoretical analysis provided also focuses only on approximating this empirical training measure, without addressing overfitting and the approximation of the data-generating measure. For a model such as described above, it seems obvious that the less coarse the partitioning becomes, the more faithfully an empirical distribution can be approximated. I am therefore unsure of the practical significance of these results and I think the paper would benefit more from a discussion of generalization ability.

My main concern with the paper, however, lies with the evaluation and comparison to other methods:

  • For the many analyses presented in the main paper and the appendices, different methods are compared against, and for each method/scenario combination different datasets are used. This lack of consistency in comparisons and datasets in the presented results leaves an impression that results are missing.

  • The choice of datasets focuses on relatively small numbers of data points and mixes both synthetic toy data and real datasets together. This is compounded by some of the main results being presented only as summary data (#wins/#ties/#losses) over this heterogeneous group of datasets.

  • There is no serious attempt to compare the proposed method with state-of-the-art approaches for synthetic data generation, particularly neural network-based ones:

    • The comparison is limited to CT-GAN, which has been shown to underperform almost every other popular neural network method, including the TVAE method put forward as a baseline in the same paper.
    • Additionally, the authors do not tune any hyperparameters for CT-GAN beyond the number of epochs. As a GAN, CT-GAN can have convergence issues, and without proper tuning, it may fail to converge. This lack of tuning undermines the fairness of the comparison, especially since it is unclear how much the authors optimized the hyperparameters of their own method.
    • The authors also report unusually long training times on a very small dataset, presumably because no GPU is used to train these models and possibly because they train for too long to no avail. While I understand that the proposed method can be trained on a CPU alone, and maybe that is the point being made, I don't think this remark is made in good faith. In my experience, methods like CT-GAN, TVAE, and TabDDPM can achieve reasonable performance quickly (within minutes to tens of minutes) when trained on a consumer GPU (e.g., a 2080), even on larger datasets than those used in the reported experiments.
    • Results for CT-GAN on the two larger datasets are not reported, so the comparison primarily relies on very small datasets where neural networks might perform worse.

    A comparison with TabDDPM [3], a strong neural network-based generative method, would be valuable, as it is currently a standard benchmark in papers evaluating synthetic data and, in my opinion, a consistent contender for a state-of-the-art method.

  • The evaluation of generated samples relies solely on an Optimal Transport distance metric. Although I am not well-versed in the practical computation of this metric, the authors' comment about selecting an entropic regularization parameter that avoids numerical instabilities across all domains is not particularly reassuring. It raises the question of why other metrics were not used in conjunction.

    • Additionally, the setup described by the authors (GEN-DISCRIM), which uses Random Forests to distinguish real from synthetic data, appears both reasonable and well-founded. I am curious why this wasn't the primary evaluation method for comparing with other approaches. Instead, it is only used to compare GF against optimistic and pessimistic baselines.
    • On this topic, the authors might also want to look at, e.g., [4] for utility metrics in machine learning tasks that are common in comparing synthetic data generators, although I wouldn't personally fault them if they choose to disregard such metrics.
  • Finally, given that the proposed model is able to estimate densities and that evaluating the log-likelihood of held-out data is the gold standard for objectively measuring the quality of such models I don't understand why this evaluation doesn't have a more prominent place in the paper:

    • The authors only compare their method to KDE, which GF seems to underperform on several datasets, particularly those with a moderate number of dimensions. This is concerning, and I think it is important to investigate how the proposed method scales to higher-dimensional data compared to KDE.
    • ARF is missing as a baseline in these evaluations which is strange since it is another model that is capable of estimating densities. Additionally, other methods, such as [2], report consistently outperforming KDE, making it worthwhile to include as an additional comparison in my view.
    • Finally, the evaluation setup is not thoroughly explained in the main paper and only summary results against KDE are presented. It seems that the authors don't always use log-likelihood to compare these models due to their proposed model predicting 0 density for test samples (see question 3 below). This issue is only mentioned in the appendix but should, in my view, be disclosed in the main paper.

[1] Alvaro H. C. Correia, Robert Peharz, Cassio de Campos. Joints in Random Forests, 2020 (https://arxiv.org/abs/2006.14937)

[2] Hongwei Wen, Hanyuan Hang. Random Forest Density Estimation, 2022 (https://proceedings.mlr.press/v162/wen22c.html)

[3] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko. TabDDPM: Modelling Tabular Data with Diffusion Models, 2022 (https://arxiv.org/abs/2209.15421)

[4] Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic. Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark, 2023 (https://arxiv.org/abs/2310.16981)

Questions

  1. In line 218 the authors mention regarding finding the optimal split in $\text{splitPred}$ that "the optimisation is no more computationally demanding" than for decision tree induction. I'm not sure if this is a typo but it seems to me that, as the trees in the ensemble grow deeper, finding the best split could quickly become intractable. As an example, in the case where only single stumps are used, the potential cardinality of $\mathcal{P}(\mathcal{T})$ at iteration $j$ is $2^j$, which means that intersections with that many subsets have to be considered when finding the best split point. This seems like an untenable scaling of the computation with the iteration number, but maybe I am misunderstanding something about how the best split is selected as the authors' explanation is not clear.

  2. Regarding density modelling, the authors only compare with KDE. Since ARF also provides density estimates, why not compare against this method as well? Additionally, [2] would also be worthwhile to compare against.

  3. Could the authors clarify what they mean by "expected densities" that are used "to cope with the eventuality of zero (pointwise) density predictions" in appendix section V.V.3.7? If their proposed model assigns 0 probability to a subset of the domain that has non-zero mass under the data generating distribution, I would say that this is a defect of the model and a sign of poor regularization. Changing from the commonly used log-likelihood metric to one that doesn't penalize this so harshly should be surfaced and discussed in the main text in my opinion. Furthermore, it seems that this choice is made on a dataset-by-dataset basis; I can only assume depending on whether the proposed model yields $-\infty$ log-likelihoods, which seems to happen for most non-toy datasets. As it stands, these details are hidden in the appendix while the main paper presents a potentially misleading picture based only on #wins.

  4. Still regarding 0 density predictions, the authors propose a way to avoid this issue for GFs in line 173. It would be interesting if this was explored further and the resulting (more regularized) density model compared to other approaches for density estimation based on log-likelihood.

[2] Hongwei Wen, Hanyuan Hang. Random Forest Density Estimation, 2022 (https://proceedings.mlr.press/v162/wen22c.html)

Limitations

I find that the authors hardly discuss the limitations of their method. Most importantly, there is little discussion of training times and how they scale with the number of iterations/data dimensions/number of samples. There is also no discussion of how performance and generalization are expected to scale with these factors.

Author Response

We thank the reviewer for pointing out the soundness, novelty and versatility of our approach. We must confess we found the review a tad offensive at times [kTDz-D] [kTDz-H], accusing us of a complete lack of seriousness in some of our experiments, of not acting in good faith elsewhere, and of strange missing parts elsewhere. It is hard to swallow, not just because of the language implying we did not meet the ethical standards of scientific papers, but also because it sounds like an a priori guilty verdict that comes even before any explanation of ours. We apologize if our tone sounds "combative" at times.

We proceed in order of the review, with tags to help cross-referencing and navigation. Due to the length of our rebuttal, we put the answers to the "Questions" field here and postpone the rest of our rebuttal to additional comments.

Questions

In line 218 [...] I'm not sure if this is a typo but it seems to me that, as the trees in the ensemble grow deeper, finding the best split could quickly become intractable. [...]

[kTDz-I] Wrong: to find the best split, we only need to keep those elements in $\mathcal{P}(\mathcal{T})$ with $>0$ empirical measure, because the rest has zero measure (and the related calculations to find the best split are trivial for all those). So we only have to store the elements with $>0$ empirical measure, and their number is bounded by the size of the training sample. It has also been implemented this way (please check the commented code; this is the purpose of the class MeasuredSupportAtTupleOfNodes). In addition, as we get deeper, the computation can in fact be much quicker, because it also depends on the size of the training sample in the elements used to compute the effect of the split, and this size obviously decreases through splitting.

It is no more computationally demanding than DT induction because we then only need to parse the whole set of training points at the leaf and check whether they go left or right, which is exactly the same procedure as for DT induction. At the end, we need to split the corresponding elements of $\mathcal{P}(\mathcal{T})$ whose support is included in the leaf, instead of a single one for a DT, but these cannot number more than the size of the training sample at the leaf and thus cost in general less than the splitting test itself, for a global complexity that remains O(DT induction) (sketched). We would be happy to make this more explicit.
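To illustrate the bookkeeping, here is a minimal sketch of the idea (our illustrative Python, not the paper's Java implementation; names such as `PartitionElement` and `split_leaf` are ours, the actual code uses the class MeasuredSupportAtTupleOfNodes): each stored element of $\mathcal{P}(\mathcal{T})$ keeps the indices of the training points it contains, so the number of stored elements never exceeds the training sample size, and splitting a leaf only touches the elements supported by that leaf.

```python
# Minimal sketch of the bookkeeping idea (illustrative only, not the authors' Java code).
# A partition element keeps (i) the tuple of leaves (one per tree) defining its support
# and (ii) the indices of training points falling in it. Empty intersections are never stored.

from dataclasses import dataclass, field

@dataclass
class PartitionElement:
    leaves: tuple                                  # one leaf id per tree in the forest
    indices: list = field(default_factory=list)    # training points in the support

def split_leaf(elements, tree_id, leaf_id, goes_left):
    """Split a leaf of one tree; only elements supported by that leaf are touched.

    `goes_left(i)` tells whether training point i satisfies the new split predicate.
    The total number of stored elements stays bounded by the training sample size,
    because elements with no training point are dropped.
    """
    new_elements = []
    for el in elements:
        if el.leaves[tree_id] != leaf_id:
            new_elements.append(el)                # untouched: support outside the split leaf
            continue
        left = [i for i in el.indices if goes_left(i)]
        right = [i for i in el.indices if not goes_left(i)]
        for side, idx in (("L", left), ("R", right)):
            if idx:                                # keep only non-empty intersections
                leaves = list(el.leaves)
                leaves[tree_id] = (leaf_id, side)
                new_elements.append(PartitionElement(tuple(leaves), idx))
    return new_elements
```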

See also [6MHb-A].

Regarding density modelling the authors only compare with KDE. Since ARF also provides density estimates why not compare against this method as well? 

We hope we answered in [kTDz-H].

Could the authors clarify what they mean by "expected densities" that are used "to cope with the eventuality of zero (pointwise) density predictions" in appendix section V.V.3.7? [...] this is a defect of the model and a sign of poor regularization. 

Indeed, but this happened with our contender, not our technique: for generative forests, a mechanism prevents this, which we have implemented (see L173-L174), and for ensembles of generative trees (Section IV) this situation is impossible (L672-L673). We can make this clearer using part of the +1 page space.

[...] As it stands, these details are hidden in the appendix while the main paper presents a potentially misleading picture based only on #wins.

We agree but this comes as a consequence of (i) us adapting to our contender and (ii) us having to summarize all of this in a short format. See also [kTDz-C]. Since there is +1 page in the camera ready, we would be happy to oblige in moving more details to main file.

Still regarding 0 density predictions, the authors propose a way to avoid this issue for GFs in line 173. It would be interesting if this was explored further and the resulting (more regularized) density model compared to other approaches for density estimation based on log-likelihood.

Apart from indeed implementing this mechanism, we have not, for a simple reason: we did not optimize our models for density estimation because it was a side task (our "motto" could be "train a generative model, get missing data imputation and density estimation for free"). We are however convinced that there would be further optimization possible for the density estimation metrics that would most surely get better results (we can elaborate more in the camera ready using the +1 page)

Comment

[...] This is similar to ARF or other works such as [1] or [2] when using uniform densities at the leaves [...]

[kTDz-A] Even with this simplification of the models, this would not hold unless the trees used in ARF, [1] and [2] are very big. Consider for example Table 1 in our paper and the 50-stump model we obtain. As far as we can tell, each of the other methods would have to learn very big models to achieve the contrast we get between high and low densities (otherwise, the convex combination of densities, as e.g. in ARF, would lead to a very "blurry" solution -- extreme case for such methods: datasets with functional, non-stochastic dependences among variables, see [kTDz-B]). Hence, each tree would have to separately be a good model (and thus big), which is not our case. Our models could be very small.

This leads us to the formal standpoint: none of the methods cited have convergence rates; [1] and ARF stick to behaviour in the limit (consistency). For [2], approximation is guaranteed with respect to the sample size and not the model size (hence, to get a better model, one needs a strictly diverging sample size). Note also the restrictive assumptions made in all these papers to get the results (on the target density for [2] and ARF, on the model size and structure for [1]). We argue that our Weak Learning Assumption is much weaker and that it yields rates (on training). We can only stress the need for training rates (convergence), not just for their importance "per se" but also because they are not known for so many existing techniques.

[kTDz-A2] Regarding generalization, we disagree, both from the intuitive and the formal standpoint. From the intuitive standpoint, the claimed regularization effect can be sufficient to get good results but it is in no way necessary. From the formal standpoint, it is in fact not hard to show that our algorithm is consistent. A straightforward way to show it in the setting of [1] would be to use our Lemma 4 and then use the regularity assumptions about the model in [1] to show the (almost sure) convergence of our model to the true one, using e.g. the $\ell_1$ norm as in [1]. Note that more sophisticated proofs relying on different / weaker assumptions are possible. Our purpose is to give one such example and show that we can get the same asymptotic results as the papers cited in the review. Those being the references in the review's content, we hope this is sufficient to address the "generalization" issue the review mentions there. We would be happy to extend this using the +1 page in the camera-ready.

For the many analyses presented in the main paper and the appendices, different methods are compared against and for each method/scenario combination different datasets are used. This lack of consistency in comparisons and datasets in the presented results, leaves an impression that results are missing.

[kTDz-B] We do agree with the reviewer, but this essentially comes as the consequence of having experiments on three different settings -- data generation, missing data imputation, density estimation -- with five+ different contenders (as an example, none of the papers cited by the reviewer have this level of complexity in experiments). Each of the contenders has constraints in the form of the data used, which can depend on the setting. We refer to the Appendix (V.2) and the related papers for a discussion of settings and of the restrictions / bugs that we eventually experienced. The datasets that we kept were definitely worth keeping for at least one setting: for example, we could not run kc1 on ARF (thus, no "lifelike" results for kc1), yet the 2D heatmaps we get in Figure 3 definitely show that our models can model deterministic (functional, non-stochastic) dependences as well.

The choice of datasets focuses on relatively small numbers of data points and mixes both synthetic toy data and real datasets together. This is compounded by some of the main results being presented only as summary data (#wins/#ties/#losses) over this heterogeneous group of datasets.

[kTDz-C] We are not sure this is a criticism: the small number of data points (see e.g. Table 2) was only picked to show that a simple guess of our algorithm's parameters can beat a number of parameterisations of other algorithms. Summary data is just a consequence of the numerous experiments (settings x contenders) that we have run, which need to fit within the paper's constraints (we can only point to the suggestions of other reviewers to add even more experiments [ALL]). Indeed, we have considered heterogeneous datasets, but the main reason was only to show results on a wide range of datasets instead of focusing on a single category (e.g. real-valued) of datasets. We wanted to provide a fair "panoramic" picture of the results of our algorithm. See [ALL] for a proposal to all reviewers on additional experiments taking the +1 page into account.

Comment

There is no serious attempt [...] I don't think this remark is made in good faith [...] might perform worse [emphases ours].

[kTDz-D] We must confess we found these lines of comment a tad offensive. Note that the request thus addresses 1/3 of our settings and 1/5+ of our algorithms.

  • First, we are surprised by the comment: "The comparison is limited to CT-GAN, which has been shown to underperform almost every other popular neural network method, including the TVAE method put forward as a baseline in the same paper.", when in fact the CT-GAN paper, which indeed introduces CT-GAN and TVAE, says quite the opposite: "CTGAN achieves competitive performance across many datasets and outperforms TVAE on 3 datasets." (page 2). This is why we used CT-GANs.

  • Second, we are surprised by the next comment: "Additionally, the authors do not tune any hyperparameters for CT-GAN beyond the number of epochs. As a GAN, CT-GAN can have convergence issues, and without proper tuning, it may fail to converge. This lack of tuning undermines the fairness of the comparison, especially since it is unclear how much the authors optimized the hyperparameters of their own method."

[kTDz-E] The reason why a reader might have the impression that we do not comment on our hyperparameter tuning is that there was essentially none: we have just two parameters to tune for our algorithm ($J$, $T$). Each of our experiments is done using essentially a single choice of parameters, possibly shared among experiments. To make sure we would not disadvantage other contenders, we made sure to test them on a range of their parameters (number of trees for ARF, of epochs for CT-GAN, kernels for KDE, etc.). The reviewer may conclude that this is not enough for a fair comparison and that we should have tuned CT-GAN more to get better results, but then one may ask: given the results that we have, isn't a method with almost no necessary hyperparameter tuning better than one that requires heavy (dataset-dependent) machinery to tune its own?

  • Third, we were also surprised by the comment on times: "The authors also report unusually long training times on a very small dataset, presumably because no GPU is used to train these models and possibly training for too long to no avail. While I understand that the proposed method can be trained on a CPU alone, and maybe that is the point being made, I don't think this remark is made in good faith. In my experience, methods like CT-GAN, TVAE, and TabDDPM can achieve reasonable performance quickly (within minutes to tens of minutes) when trained on a consumer GPU (e.g., 2080) even on larger datasets than those used in the reported experiments."

[kTDz-F] We understand the reviewer is a strong proponent of CT-GAN and we respect that. Our only remark on times is in L850-L851 (we assume it is the one that led to the comments). Our point was not to discredit any (CT-GAN or other) NN algorithm based on times. Our paper is not a paper competing on computers "running on steroids" (for whatever specs that would represent); we never write so, and a short glimpse at the computers used (L758-L760) shows that it could not have been the case. We know that different techniques may require different hardware to run optimally -- this is why we eventually did not provide times, to avoid such discussions. And note that such discussions would be useless in our case: on all our datasets, our algorithm was deliberately run on the "low-end" Laptop (not even on our desktop!) because we had the constraint of being able to train our algorithms on very simple hardware. CT-GANs were trained on our (more powerful) desktop. We do not want to suggest adding this before L850, but we are confident that with this knowledge the reviewer will reconsider their comment about us not acting in good faith. If computation times are a sensitive topic, then we can just remove them -- this will leave space to treat questions from all reviewers -- or make a general statement to the effect that running times depend on the hardware used.

  • Fourth and last, we reply to the comment: "Results for CT-GAN on the two larger datasets are not reported, so the comparison primarily relies on very small datasets where neural networks might perform worse.". Indeed... but Table A6 explains that we had issues getting results on some datasets. This is not a criticism of CT-GAN's implementation. The reviewer may notice that VCAEs and ARFs also did not provide results sometimes (Table A7, Section V.2).
Comment

A comparison with TabDDPM [3] [...]

[kTDz-G] We value this opinion but we would like the reviewer to consider that we have, at this stage, five (5) contenders in our experiments (in fact, 7 according to other metrics, see Sk9u). This suggestion adds one more and, in fact, aggregating the suggestions of the other reviews [ALL], we would get a total of 12 contenders. Consider that the TabDDPM reference [3] contains comparisons with three contenders (CTABGAN+ is an extension of CTABGAN). At this stage, the reviewer already criticizes the sheer amount of results that forces us to cram a lot in a small space [kTDz-C]. Considering also that CT-GAN is arguably among the most cited tabular + NN generative methods, we would prefer to stick to our choices of contenders. This being said, we are happy to extend our list of contenders to a family of techniques not yet included in our benchmark, see [ALL] -- should all reviewers concur, of course.

The evaluation of generated samples relies solely on an Optimal Transport distance metric. [...] entropic regularization parameter [...]

[kTDz-F] This is a difficult question, not in technical terms (we regularize the OT metric with an entropic term because it considerably speeds up computations -- it is in fact standard now, as can be seen from the many papers citing Cuturi's paper [8], and it is largely necessary because of the computational burden of computing OT without it, see our rebuttal to Sk9u). It is difficult because, as the reviewer points out, there comes the question "why not another metric"? To compare densities, one has essentially three generic categories of "distortion measures" (some are not metrics): (i) based on the density (e.g. f-divergences), (ii) based on the parameters (e.g. Bregman divergences) and (iii) based on the support (e.g. OT). Of course, one can "break down" the computation of the metric into several ones, e.g. depending on features, but we wanted to avoid "hybrid" ones that are ultimately harder to explain. The interest of an OT metric is that it is computed using the "full" support of the distribution (a Bregman divergence depends on parameters = "aggregates"), it does not require absolute continuity assumptions (unlike f-divergences), and it is computed using a cost metric over this whole support (no part of the space "left behind"). Since we learn generative models, it looked like a maximally fine-grained way to assess the distribution learned.
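For concreteness, a minimal numpy sketch of the entropic-regularized OT computation (Cuturi-style Sinkhorn) that this discussion refers to; the cost matrix, regularization strength `reg` and uniform weights below are illustrative assumptions, not the exact setup of our experiments.

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=1.0, n_iter=500):
    """Entropic-regularized OT cost between two samples (Cuturi-style Sinkhorn).

    X: (n, d) real sample, Y: (m, d) generated sample; uniform weights assumed.
    `reg` is the entropic regularization strength; too small a value can underflow
    the Gibbs kernel, which is the numerical-stability issue mentioned above.
    """
    n, m = X.shape[0], Y.shape[0]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    # Pairwise squared Euclidean costs (features assumed numerically encoded / scaled).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / reg)                     # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                  # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # regularized transport plan
    return float((P * C).sum())              # transport cost under that plan

# Example usage on toy 2D samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
fake = rng.normal(loc=0.3, size=(200, 2))
print(sinkhorn_distance(real, fake))
```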

Additionally, the setup described by the authors (GEN-DISCRIM), [...] I am curious why this wasn't the primary evaluation method for comparing with other approaches.

[kTDz-G] Because all domains used for GEN-DISCRIM had to be supervised domains, i.e. we could not use it in general.

The authors only compare their method to KDE, which GF seems to underperform on several datasets [...]

[kTDz-H] We quote our introduction (emphases ours): "A GF can thus be used to generate data, but can also be used for side tasks like missing data imputation or density estimation.". Density estimation is a side task, one of the three tasks we considered in total. All papers cited by the reviewer in [1-4] focus on exactly one task, with the exception of Kotelnikov et al., which focuses on two, but the "side task" involves one contender only. We did investigate those side tasks because our generative models allow us to treat them "for free" (no additional training) and it would have been a pity not to see how they fare against techniques specifically designed for such tasks. Our objective was not to be better than techniques that were designed for the side tasks (think "no free lunch") but rather to fairly locate our technique with respect to well-known contenders. Again, keep in mind that we can treat those side tasks "for free". We were very pleased to see that we can compete very efficiently against KDE at least sometimes. We also invite the reviewer to check from L750-L752 that we did some tuning of KDE to select the best kernel on our data.

ARF is missing as a baseline in these evaluations [...] which is strange [...]  Additionally, other methods, such as [2] [...] worthwhile to include (emphasis ours). 

[kTDz-H] So far in the review, these are the fourth and the fifth baselines that we are being asked to add. There are two reasons why we did not use FORDE (such is the name). First, FORDE is just a wrapper around FORGE, which we use as a contender for synthetic data generation; since we substantially beat FORGE, we were concerned that beating FORDE would be, or would be seen as, unfair. Second, we make it clear that our objective was to get a panorama of many different techniques. This is, we believe, a strong selling point of our paper, exposed from the abstract: we do not just focus on comparing against neural nets or tree-based approaches. KDE is a very widely used family of techniques that brought kernels to our comparison. We believe this explains why we did not consider [2] either.

Comment

I thank the authors for their detailed response. I also apologize for my language but I want to reassure the authors that they are extending their interpretation of what I wrote far past what I actually wrote or ever meant to imply.

In order to be mindful of the authors' time I will focus the discussion on the main concerns that are stopping me from increasing my score. Most of these relate to the evaluation setup. I will also note that I had no context on what other reviewers would ask and I do sympathize with the authors but I need to evaluate the paper based on what is currently there and my opinion remains that the experimental setup is inadequate.

(1) Comparisons on Synthetic data generation

The authors claim that their main task is synthetic data generation and use that to justify not comparing with ARF (FORDE) on density estimation. I would expect any paper to compare to the best available methods especially on their main task but that is not the case here. Even if not TabDDPM, I believe TVAE or perhaps CTAB-GAN would provide a bigger challenge than CTGAN, particularly given that only the number of epochs was tuned.

(2) Single metric for evaluating synthetic data generation

  1. I remain somewhat skeptical of the OT metric used for synthetic data evaluation because it requires defining a distance in the domain of the data. It is not obvious to me how to meaningfully define one on a domain with differently-scaled numerical features and categoricals. I would therefore be much more comfortable if there was at least another metric used for comparison.

  2. I still don't understand why GEN-DISCR can only be used in supervised tasks. I assumed that L901 referred to the pipeline in the ARF paper and that the authors' pipeline only involved training discriminators to distinguish between real and fake data (L895-L896). If not GEN-DISCR, at least some ML utility metric.

(3) Density comparisons

  1. I would not frame ARF (FORDE) as an additional evaluation since the authors already compare to ARF for sample generation (FORGE). This sets my expectation that this comparison should have been done in the original paper despite the authors' claim that it is a side quest. Even looking at the documentation for the ARF package, it is necessary to estimate a density model (FORDE) first in order to then run sample generation (FORGE).

  2. I also want to remark that the authors beat FORGE on the OT metric used for synthetic evaluation. In my opinion, a density evaluation based on log-likelihoods would be that much more valuable as it offers a different type of metric, and one that is much less open to criticism for this type of data.

  3. Synthetic data for ARF and GF is just sampling from a density model. Therefore comparing the density models directly makes a lot more sense in my opinion.

(4) Scalability

My concerns with scalability were mostly addressed by the authors. However, I would very much welcome if the authors were able to share some actual training times for their algorithm as other reviewers also requested (Sk9u). I am particularly interested in seeing how these times vary with the size of the datasets.

(5) Results Reporting

I agree with the authors' own statement that they cram a lot of results into a small allocated space but, in my view, the problem is not too many results but too little space allocated for reporting them. I particularly do not agree with the pooling of toy datasets and real datasets into the same aggregated results.

(6) Minor point

I appreciate the authors' clarification on the OT metric. I wonder however if regularization is really necessary with such small datasets. Couldn't the assignment problem be solved exactly, avoiding a regularization parameter that will only raise further questions?

Comment

I really appreciate the authors' efforts to resolve some of my main concerns, particularly those regarding the use of a single evaluation metric and the need for stronger comparisons.

However, the authors only provide these new metrics for CTGAN and the new Forest-flow method. It would be valuable to see the same comparison against ARF, given that the authors themselves identify it as the primary competing method.

Comment

Thank you for the ARF results.

While I still have some reservations about the paper, I am increasing my score to a 5 to reflect the reduced set of concerns.

Author Response

To all reviewers

[ALL] We warmly thank all reviewers for their work and for all rating our paper's contribution as generally "good or excellent". Many questions have been asked, and we see it as a sign of the value found in our work that so many additional contenders were proposed (TVAE, TabDDPM, ARF-FORDE, RFDE, Forest-Flow), on top of the five (5) we already have (in fact, 6 if we include the generative trees of [29] -- ref in our paper -- and 7 if we add COPY in GEN-DISCRIM), across our three (3) distinct experimental settings. We hope it is obvious that it would be hard to cram substantially more content into our paper, as it has already been remarked that its current state can be hard to parse at points in the experiments [kTDz-C]. However, since there is a +1 page in the camera-ready, we propose to add Forest-Flow, because it is another kind of model that does not belong to the categories of models we have compared against (mixing elements of trees, neural nets, kernels or graphical models). The paper on Forest-Flow clearly mentions that their results are worse than MICE on missing data imputation; since we are competitive with MICE, we would be happy to instead compare with Forest-Flow on data generation.

We would like to stress the need to keep the current formal part in its current state and not reduce it: for example, none of the papers cited by the reviewers have convergence rates on training. We would also like to add a few comments on the statistical consistency of our method [kTDz-A2].

Tags like [ALL], [kTDz-C] have been put for easy search in the page's content.

Comment

In our rebuttals to reviews uJTo and 6MHb, we provided tags pointing to Official Comments we lodged to reviewer kTDz. After double checking, we realized those Official Comments were visible only to reviewer kTDz.

We have hopefully fixed the problem and all tags should now be visible. We sincerely apologize to reviewers for this mistake in lodging those.

Comment

Results for GF (us) vs CT-GAN ($E$ = 1000 epochs, ct) on Coverage, Density, F1, shown as mean (std dev).

p-values smaller than 0.01 are replaced by a symbol in color, ${\textcolor{green}{\blacktriangle}}$ meaning we significantly beat CT-GAN, ${\textcolor{red}{\blacktriangledown}}$ meaning CT-GAN significantly beat us.

Explanations, comments, discussion, see "Explanations related to Experiments [Exp-1], [Exp-2] and [Exp-3] (summary: more contenders, more metrics)" for details.

Results with $J$=2000 splits for $T$=500 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(ct) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ct) | pval | F1$\downarrow$(us) | F1$\downarrow$(ct) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.968 (0.010) | 0.798 (0.050) | ${\textcolor{green}{\blacktriangle}}$ | 1.031 (0.049) | 0.757 (0.025) | ${\textcolor{green}{\blacktriangle}}$ | 0.070 (0.030) | 0.100 (0.073) | 0.40 |
| circgauss | 0.945 (0.015) | 0.734 (0.041) | ${\textcolor{green}{\blacktriangle}}$ | 0.993 (0.045) | 0.401 (0.033) | ${\textcolor{green}{\blacktriangle}}$ | 0.514 (0.529) | 0.746 (0.033) | ${\textcolor{green}{\blacktriangle}}$ |
| gridgauss | 0.975 (0.009) | 0.828 (0.053) | ${\textcolor{green}{\blacktriangle}}$ | 0.995 (0.077) | 0.649 (0.058) | ${\textcolor{green}{\blacktriangle}}$ | 0.017 (0.015) | 0.034 (0.011) | 0.21 |
| randgauss | 0.961 (0.012) | 0.659 (0.049) | ${\textcolor{green}{\blacktriangle}}$ | 0.979 (0.023) | 0.582 (0.035) | ${\textcolor{green}{\blacktriangle}}$ | 0.014 (0.013) | 0.079 (0.053) | 0.06 |
| winered | 0.962 (0.028) | 0.808 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.961 (0.026) | 0.589 (0.129) | ${\textcolor{green}{\blacktriangle}}$ | 0.459 (0.031) | 0.654 (0.030) | ${\textcolor{green}{\blacktriangle}}$ |
| winewhite | 0.963 (0.002) | 0.894 (0.026) | ${\textcolor{green}{\blacktriangle}}$ | 0.989 (0.017) | 0.953 (0.035) | 0.05 | 0.492 (0.037) | 0.581 (0.043) | ${\textcolor{green}{\blacktriangle}}$ |

Results with $J$=500 splits for $T$=200 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(ct) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ct) | pval | F1$\downarrow$(us) | F1$\downarrow$(ct) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.964 (0.022) | 0.798 (0.05) | ${\textcolor{green}{\blacktriangle}}$ | 0.999 (0.055) | 0.757 (0.025) | ${\textcolor{green}{\blacktriangle}}$ | 0.060 (0.047) | 0.100 (0.073) | 0.75 |
| circgauss | 0.954 (0.014) | 0.734 (0.041) | ${\textcolor{green}{\blacktriangle}}$ | 0.968 (0.027) | 0.401 (0.033) | ${\textcolor{green}{\blacktriangle}}$ | 0.520 (0.021) | 0.746 (0.033) | ${\textcolor{green}{\blacktriangle}}$ |
| gridgauss | 0.964 (0.006) | 0.828 (0.053) | ${\textcolor{green}{\blacktriangle}}$ | 0.953 (0.062) | 0.649 (0.058) | ${\textcolor{green}{\blacktriangle}}$ | 0.034 (0.010) | 0.034 (0.011) | 0.98 |
| randgauss | 0.954 (0.010) | 0.659 (0.049) | ${\textcolor{green}{\blacktriangle}}$ | 0.936 (0.029) | 0.582 (0.035) | ${\textcolor{green}{\blacktriangle}}$ | 0.020 (0.005) | 0.079 (0.053) | 0.08 |
| winered | 0.954 (0.013) | 0.808 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.874 (0.042) | 0.589 (0.129) | ${\textcolor{green}{\blacktriangle}}$ | 0.503 (0.023) | 0.654 (0.030) | ${\textcolor{green}{\blacktriangle}}$ |
| winewhite | 0.952 (0.008) | 0.894 (0.026) | ${\textcolor{green}{\blacktriangle}}$ | 0.940 (0.019) | 0.953 (0.035) | 0.50 | 0.510 (0.027) | 0.581 (0.043) | ${\textcolor{green}{\blacktriangle}}$ |

Comment

This Comment addresses the following reviewers' requests:

  • uJTo: comments on ``Novelty''
  • kTDz: review on the need to discuss generalization ability and our reply [kTDz-A2]

We use the shorthands for papers:

[cpdJI] A.H.C. Correia, R. Peharz and C. P. de Campos. Joints in Random Forests. NeurIPS 2020

[llmDT] J.S. Leboeuf, F. LeBlanc and M. Marchand. Decision trees as partitioning machines to characterize their generalization properties. NeurIPS 2020

[lnCO] G. Lugosi and A. Nobel. Consistency of Data-driven Histogram Methods for Density Estimation and Classification. Ann. of Stats. 1996

We provide a consistency result of exactly the same nature as [cpdJI, Theorem 2]. Like in [cpdJI,lnCO], we assume for simplicity that the domain is $R^d$. Hereafter, the size $|\Upsilon|$ of a (binary) tree $\Upsilon$ is its total number of nodes. A key feature of our models that allows us to branch directly on the proof of strong consistency of [lnCO] is that a GF has density equal to the empirical density measured in each bin of the partition it induces. In [lnCO]'s notations, this is the empirical histogram $f_n$. We need one Lemma that connects our GFs with classical binary trees.

Lemma A: The partition achieved by a GF with trees $\Upsilon_1, \Upsilon_2, ..., \Upsilon_T$ can be achieved with a single tree $\Upsilon_*$ of size at most $\prod_i |\Upsilon_i|$.

Proof: the proof is simple and constructive. Suppose without loss of generality that $T=2$. Add one copy of $\Upsilon_2$ at each leaf of $\Upsilon_1$. The partition generated is exactly the same as that of the GF (L128-L129), and the size of this tree is no more than $|\Upsilon_1| \cdot |\Upsilon_2|$. The construction generalizes to more than 2 trees [end of the proof of Lemma A].
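For intuition, a tiny self-contained sketch of this grafting construction (our own toy encoding of trees as nested tuples, purely illustrative): grafting a copy of the second tree under every leaf of the first yields a single tree whose leaves index exactly the pairwise intersections of leaf supports.

```python
# Toy illustration of Lemma A's construction. A tree is either a leaf label (str)
# or a tuple (split, left, right). Grafting tree2 at every leaf of tree1 gives a
# single tree with at most |leaves(tree1)| * |leaves(tree2)| leaves, one per intersection.

def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def relabel(tree, prefix):
    # Tag every leaf of `tree` with the leaf of tree1 it is grafted under.
    if isinstance(tree, str):
        return f"{prefix}∩{tree}"
    split, left, right = tree
    return (split, relabel(left, prefix), relabel(right, prefix))

def graft(tree1, tree2):
    # Replace every leaf of tree1 by a (relabelled) copy of tree2.
    if isinstance(tree1, str):
        return relabel(tree2, prefix=tree1)
    split, left, right = tree1
    return (split, graft(left, tree2), graft(right, tree2))

t1 = ("x1<0", "A", "B")                 # 2 leaves
t2 = ("x2<0", "C", ("x2<1", "D", "E"))  # 3 leaves
print(leaves(graft(t1, t2)))            # 6 leaves: A∩C, A∩D, A∩E, B∩C, B∩D, B∩E
```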

Suppose without loss of generality that all trees have the same size, $N$, so that $|\Upsilon_*| \leq N^T$. Like in [cpdJI,lnCO], proving consistency involves computing and bounding a combinatorial parameter of the model class, the growth function $\Delta^*$ (see Definition 1 in [cpdJI] or Section 2 in [lnCO]). Lemma A tells us that the growth function of the trees in a GF is bounded by that of a single tree partitioning the domain. Proposition 2 in [cpdJI] bounds the growth function for any such tree. The VC dimension of size-$N$ decision trees is $O(N \log N)$ [llmDT]. Using [cpdJI], we obtain from Sauer's Lemma the bound $\Delta^* \leq m^{CN^{2T}\log N}$.

There are three conditions to get to strong consistency in [lnCO] (Theorem 1). Condition 2 reads $(1/m) \log \Delta^* \rightarrow 0$ (as the training size $m \rightarrow +\infty$). To satisfy Condition 2, we thus need our model parameters to grow slowly enough as a function of $m$, namely $(N_m^{2T_m}\log N_m \cdot \log m)/m \rightarrow 0$ (indexes are used as in [cpdJI] to indicate the dependence of the model parameters on $m$). This is similar to condition i) in [cpdJI, Theorem 2].

Condition 1 imposes that the maximal empirical mass in one element of the partition must also vanish with $m$. In our case, it is guaranteed by condition (c) in our weak learning assumption and the fact that our algorithm splits the heaviest leaf at each iteration.

Condition 3 imposes that the true mass in any "big" element of the partition must also vanish with $m$. This is also condition ii) in [cpdJI, Theorem 2] (our formulation would be no different; we spare the reviewers additional notation).

[Consistency result 1] Hence, we are able to prove, under the same conditions as Theorem 2 in [cpdJI], that, with our notations and those of [lnCO], $\int |{\color{red} R}(x) - {\color{blue} G}(x)|\,dx \rightarrow 0$ with probability 1.

(the statement + explanations can fit in 1/2 page; the proof could go in appendix)
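For readability, a compact restatement of the above (our paraphrase, with $\widehat{\mu}_m$ our shorthand for the empirical measure on the cells of $\mathcal{P}(\mathcal{T})$, and the bound on $\Delta^*$ taken from the Sauer-type argument above):

```latex
% Compact paraphrase of the three sufficient conditions of [lnCO], Theorem 1,
% instantiated with the growth-function bound derived above.
\begin{align*}
\text{(Cond. 2)}\quad & \frac{1}{m}\log \Delta^*
  \;\leq\; \frac{C\, N_m^{2T_m} \log N_m \, \log m}{m} \;\longrightarrow\; 0,\\
\text{(Cond. 1)}\quad & \max_{A \in \mathcal{P}(\mathcal{T})} \widehat{\mu}_m(A) \;\longrightarrow\; 0
  \qquad \text{(from condition (c) of the WLA + splitting the heaviest leaf)},\\
\text{(Cond. 3)}\quad & \text{the true mass of any ``big'' cell of the partition vanishes with } m,
\end{align*}
% which together give  \int |R(x) - G(x)| dx -> 0  with probability 1.
```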

Comment

This Comment addresses the following reviewers' requests:

  • uJTo: request to compare vs Forest-Flow
  • Sk9u: request to compare vs Forest-Flow
  • kTDz: points (1) (Jolicoeur-Martineau et al. substantially beat TVAE and CTAB-GAN) and (3.2) (more evaluation metrics, time), request to compare vs ARF on new metrics

We use the shorthands for papers:

[jfkGA] A. Jolicoeur-Martineau, K. Fatras and T. Kachman. Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees. AISTATS24

[noucyRF] M.F. Naeem, S.J. Oh, Y. Uh and Y. Choi. Reliable Fidelity and Diversity Metrics for Generative Models. ICML20

For the data generation experiment, [Exp-1] and [Exp-2] provide experiments vs Forest-Flow and/or on new metrics.

Forest-Flow: we used the R code provided in [jfkGA], run on the same machine on which we ran our experiments (the Laptop). Parameters were set following the default values in https://htmlpreview.github.io/?https://github.com/SamsungSAILMontreal/ForestDiffusion/master/R-Package/Vignette.html. We also followed the recommendation that fixing 50 trees per XGBoost model leads to state-of-the-art performance [jfkGA, Section 3.2]. With trees of depth at most 7, this represents a maximum of $J$=6,750 splits per model used. Times displayed are times to learn a model on each fold. Forest-Flow requires numerically reencoding domains with categorical/text values, whereas all the contenders we used so far process our domains "natively", without reencoding. To avoid encoding bias affecting results, we have run Forest-Flow on the fully numerical domains of our benchmark (also, the computation of some of our new metrics is better suited to fully numerical domains). This included domains with binary outcomes encoded as 0/1.

CT-GAN: we considered the domains in the intersection of Table A6 and those we used for Forest-Flow.

ARF: we considered the domains in the intersection of Table 2 and those we used for Forest-Flow.

Metrics: we added three metrics to our synthetic data generation experiment:

  • The Coverage and Density metrics from [noucyRF], which are better alternatives to precision and recall for generative model evaluation. We computed those metrics following the paper, using the training / test samples as built from our experiments; in the $k$-NN rule, we picked $k=5$ following the suggestion in [noucyRF, Section 3.4] (a minimal sketch of the computation is given after this list).
  • The F1 metric from [jfkGA], which is $F1_{disc}$ in their experiments. Since our domains are fully numerical, we used for prediction a $k$-NN rule with the same $k$ as in our Coverage + Density experiment.
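A minimal numpy sketch of how the Coverage and Density metrics of [noucyRF] can be computed with a $k$-NN rule (our illustrative implementation, not the exact code we ran; distances are Euclidean and features are assumed already numerically encoded and scaled):

```python
import numpy as np

def density_coverage(real, fake, k=5):
    """Density and Coverage of [noucyRF], computed with k-NN balls around real samples.

    real: (N, d) array of real samples, fake: (M, d) array of generated samples.
    For each real sample we take the radius of its k-th nearest real neighbour;
    Density counts how often fake samples fall inside these balls (normalized by k*M),
    Coverage is the fraction of real samples whose ball contains at least one fake sample.
    """
    # Pairwise distances real-real and real-fake.
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    # Radius of the k-NN ball around each real sample (column 0 is the self-distance 0).
    radii = np.sort(d_rr, axis=1)[:, k]
    inside = d_rf <= radii[:, None]          # (N, M): fake j inside ball of real i
    density = inside.sum() / (k * fake.shape[0])
    coverage = inside.any(axis=1).mean()
    return density, coverage

# Example usage on toy 2D samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))
fake = rng.normal(loc=0.2, size=(500, 2))
print(density_coverage(real, fake, k=5))
```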

We also ran those metrics on our full set of experimental results of CT-GAN and ARF. We only provide the results for 1000 epochs for CT-GAN and $T$=200 trees for ARF for readability, but we have computed all other results as well (as in Tables 2 and A6).

The parameters we fix for our models are the same as in our submission ($T=500, J=2000$ as in Table 2; $T=200, J=500$ as in Table A8).

Discussion:

  • vs Forest-Flow: there seem to be two different behaviours. On "small" domains ($m\times d$ small, $d$ small, our simulated domains), there is a slight tendency for our method to beat Forest-Flow on Sinkhorn, Coverage and F1, whereas the tendency is reversed for Density. On larger domains (e.g. artificial_characters, $m \times d >$ 80K), it seems we do significantly beat Forest-Flow by large margins ($p<0.01$ for most metrics, even with our small models). Also, execution times are slightly to our advantage (more so if we consider our smallest models). Caution is advised when comparing times since Forest-Flow is implemented in R and our code is in Java.

  • vs CT-GAN: we had to remove artificial_characters from our benchmark because of issues with CT-GAN (Table A6 in our paper). We also do not report times for comparison because the algorithms were run on different computers. The conclusion follows the broad lines of the conclusion in our paper, with the additional metrics we have considered.

  • vs ARF: the picture resembles that of Table 2 when we train our GFs with $T$=500 trees (we significantly beat ARF on >1/3 of the cases and are never statistically significantly beaten). When reducing the number of our trees to $T$=200, we still observe that our models beat ARF in many cases (statistical significance on 5 of them).

Comment

Results for GF (us) vs Forest Flow (ff) on Sinkhorn, Coverage, Density, F1, shown as mean (std dev). Time = average per fold (s).

p-values smaller than 0.01 are replaced by a symbol in color, ${\textcolor{green}{\blacktriangle}}$ meaning we significantly beat Forest Flow, ${\textcolor{red}{\blacktriangledown}}$ meaning Forest Flow significantly beat us.

Explanations, comments, discussion, see "[Exp-0] Explanations related to Experiments [Exp-1], [Exp-2] and [Exp-3] (summary: more contenders, more metrics)" for details.

Results with $J$=2000 splits for $T$=500 trees (us):


| Domain | Sink$\downarrow$(us) | Sink$\downarrow$(ff) | pval | Cov$\uparrow$(us) | Cov$\uparrow$(ff) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ff) | pval | F1$\downarrow$(us) | F1$\downarrow$(ff) | pval | Time(us) | Time(ff) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.285 (0.008) | 0.276 (0.001) | 0.07 | 0.968 (0.010) | 0.957 (0.020) | 0.09 | 1.031 (0.049) | 1.045 (0.020) | 0.28 | 0.070 (0.030) | 0.051 (0.031) | 0.07 | 45 | 35 |
| circgauss | 0.351 (0.005) | 0.354 (0.003) | 0.27 | 0.945 (0.020) | 0.956 (0.015) | 0.25 | 0.993 (0.045) | 0.989 (0.027) | 0.79 | 0.514 (0.529) | 0.530 (0.028) | 0.46 | 74 | 32 |
| gridgauss | 0.390 (0.002) | 0.393 (0.001) | 0.04 | 0.975 (0.009) | 0.954 (0.012) | 0.01 | 0.995 (0.077) | 1.013 (0.050) | 0.21 | 0.017 (0.015) | 0.045 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 65 | 36 |
| randgauss | 0.286 (0.003) | 0.287 (0.001) | 0.43 | 0.961 (0.012) | 0.927 (0.011) | 0.02 | 0.979 (0.023) | 1.000 (0.008) | 0.05 | 0.014 (0.013) | 0.028 (0.008) | 0.09 | 98 | 58 |
| winered | 0.980 (0.032) | 1.030 (0.029) | ${\textcolor{green}{\blacktriangle}}$ | 0.962 (0.028) | 0.954 (0.021) | 0.71 | 0.961 (0.026) | 1.001 (0.034) | 0.04 | 0.459 (0.031) | 0.458 (0.052) | 0.97 | 173 | 251 |
| winewhite | 1.064 (0.003) | 1.097 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.963 (0.002) | 0.945 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.989 (0.017) | 0.970 (0.023) | 0.13 | 0.492 (0.037) | 0.498 (0.041) | 0.78 | 654 | 661 |
| compas | 0.532 (0.007) | 0.891 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.556 (0.021) | 0.548 (0.027) | 0.32 | 0.430 (0.008) | 0.404 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.496 (0.021) | 0.526 (0.018) | 0.04 | 161 | 756 |
| artificial_characters | 0.821 (0.004) | 0.834 (0.016) | 0.12 | 0.947 (0.003) | 0.879 (0.008) | ${\textcolor{green}{\blacktriangle}}$ | 0.967 (0.013) | 0.774 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.429 (0.056) | 0.530 (0.032) | ${\textcolor{green}{\blacktriangle}}$ | 469 | 748 |

Results with $J$=500 splits for $T$=200 trees (us):


| Domain | Sink$\downarrow$(us) | Sink$\downarrow$(ff) | pval | Cov$\uparrow$(us) | Cov$\uparrow$(ff) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ff) | pval | F1$\downarrow$(us) | F1$\downarrow$(ff) | pval | Time(us) | Time(ff) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.283 (0.005) | 0.276 (0.001) | 0.09 | 0.964 (0.022) | 0.957 (0.020) | 0.31 | 0.999 (0.055) | 1.045 (0.020) | 0.07 | 0.060 (0.047) | 0.051 (0.031) | 0.75 | 6 | 35 |
| circgauss | 0.356 (0.003) | 0.354 (0.003) | 0.30 | 0.954 (0.014) | 0.956 (0.015) | 0.74 | 0.968 (0.027) | 0.989 (0.027) | 0.06 | 0.520 (0.021) | 0.530 (0.028) | 0.55 | 10 | 32 |
| gridgauss | 0.392 (0.002) | 0.393 (0.001) | 0.53 | 0.964 (0.006) | 0.954 (0.013) | 0.07 | 0.953 (0.062) | 1.013 (0.050) | 0.01 | 0.034 (0.010) | 0.045 (0.007) | 0.02 | 9 | 36 |
| randgauss | 0.290 (0.007) | 0.288 (0.001) | 0.55 | 0.954 (0.010) | 0.927 (0.011) | 0.04 | 0.936 (0.029) | 1.000 (0.008) | 0.01 | 0.020 (0.005) | 0.028 (0.008) | 0.11 | 11 | 58 |
| winered | 1.027 (0.037) | 1.030 (0.029) | 0.84 | 0.954 (0.013) | 0.955 (0.018) | 0.95 | 0.874 (0.042) | 1.009 (0.034) | ${\textcolor{red}{\blacktriangledown}}$ | 0.503 (0.023) | 0.458 (0.052) | 0.13 | 40 | 251 |
| winewhite | 1.120 (0.013) | 1.097 (0.007) | 0.05 | 0.952 (0.008) | 0.945 (0.007) | 0.18 | 0.940 (0.019) | 0.970 (0.023) | 0.03 | 0.510 (0.027) | 0.498 (0.041) | 0.48 | 214 | 661 |
| compas | 0.537 (0.015) | 0.891 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.546 (0.026) | 0.548 (0.027) | 0.74 | 0.424 (0.014) | 0.404 (0.013) | 0.03 | 0.500 (0.025) | 0.510 (0.032) | 0.36 | 29 | 756 |
| artificial_characters | 0.829 (0.006) | 0.834 (0.016) | 0.51 | 0.932 (0.018) | 0.879 (0.008) | ${\textcolor{green}{\blacktriangle}}$ | 0.897 (0.004) | 0.774 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.457 (0.057) | 0.530 (0.032) | ${\textcolor{green}{\blacktriangle}}$ | 90 | 748 |

Comment

Results for GF (us) vs Adversarial Random Forests (arf, $T$=200 trees) on Coverage, Density, F1, shown as mean (std dev).

p-values smaller than 0.01 are replaced by a symbol in color, ${\textcolor{green}{\blacktriangle}}$ meaning we significantly beat ARF, ${\textcolor{red}{\blacktriangledown}}$ meaning ARF significantly beat us.

Explanations, comments, discussion, see "[Exp-0] Explanations related to Experiments [Exp-1], [Exp-2] and [Exp-3] (summary: more contenders, more metrics)" for details.

Results with $J$=2000 splits for $T$=500 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(arf) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(arf) | pval | F1$\downarrow$(us) | F1$\downarrow$(arf) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.968 (0.010) | 0.960 (0.008) | 0.05 | 1.031 (0.049) | 0.976 (0.030) | 0.11 | 0.070 (0.030) | 0.086 (0.047) | 0.54 |
| circgauss | 0.945 (0.015) | 0.962 (0.021) | 0.35 | 0.993 (0.045) | 0.954 (0.011) | 0.05 | 0.514 (0.529) | 0.507 (0.032) | 0.92 |
| gridgauss | 0.975 (0.009) | 0.908 (0.010) | ${\textcolor{green}{\blacktriangle}}$ | 0.995 (0.077) | 0.630 (0.050) | ${\textcolor{green}{\blacktriangle}}$ | 0.017 (0.015) | 0.043 (0.012) | 0.04 |
| randgauss | 0.961 (0.012) | 0.953 (0.004) | 0.29 | 0.979 (0.023) | 0.940 (0.027) | ${\textcolor{green}{\blacktriangle}}$ | 0.014 (0.013) | 0.029 (0.010) | ${\textcolor{green}{\blacktriangle}}$ |
| winered | 0.962 (0.028) | 0.929 (0.010) | 0.14 | 0.961 (0.026) | 0.801 (0.028) | ${\textcolor{green}{\blacktriangle}}$ | 0.459 (0.031) | 0.531 (0.028) | ${\textcolor{green}{\blacktriangle}}$ |
| winewhite | 0.963 (0.002) | 0.946 (0.012) | 0.10 | 0.989 (0.017) | 0.941 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.492 (0.037) | 0.500 (0.027) | 0.04 |
| compas | 0.556 (0.021) | 0.560 (0.037) | 0.75 | 0.430 (0.008) | 0.440 (0.011) | 0.11 | 0.496 (0.021) | 0.520 (0.020) | 0.26 |
| artificial_characters | 0.947 (0.003) | 0.892 (0.014) | ${\textcolor{green}{\blacktriangle}}$ | 0.967 (0.013) | 0.747 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.429 (0.056) | 0.512 (0.037) | ${\textcolor{green}{\blacktriangle}}$ |

Results with $J$=500 splits for $T$=200 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(arf) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(arf) | pval | F1$\downarrow$(us) | F1$\downarrow$(arf) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.964 (0.022) | 0.9600 (0.008) | 0.53 | 0.999 (0.055) | 0.976 (0.030) | 0.42 | 0.060 (0.047) | 0.086 (0.047) | 0.23 |
| circgauss | 0.954 (0.014) | 0.962 (0.021) | 0.27 | 0.968 (0.027) | 0.954 (0.011) | 0.04 | 0.520 (0.021) | 0.507 (0.032) | 0.67 |
| gridgauss | 0.964 (0.006) | 0.908 (0.010) | ${\textcolor{green}{\blacktriangle}}$ | 0.953 (0.062) | 0.630 (0.050) | ${\textcolor{green}{\blacktriangle}}$ | 0.034 (0.010) | 0.043 (0.012) | 0.26 |
| randgauss | 0.954 (0.010) | 0.953 (0.004) | 0.94 | 0.936 (0.029) | 0.940 (0.027) | 0.74 | 0.020 (0.005) | 0.029 (0.010) | 0.14 |
| winered | 0.954 (0.013) | 0.929 (0.010) | 0.04 | 0.874 (0.042) | 0.801 (0.028) | ${\textcolor{green}{\blacktriangle}}$ | 0.503 (0.023) | 0.531 (0.028) | 0.01 |
| winewhite | 0.952 (0.008) | 0.946 (0.012) | 0.22 | 0.940 (0.019) | 0.941 (0.013) | 0.88 | 0.510 (0.027) | 0.500 (0.027) | 0.09 |
| compas | 0.546 (0.026) | 0.560 (0.037) | 0.10 | 0.424 (0.014) | 0.440 (0.011) | 0.03 | 0.500 (0.025) | 0.520 (0.020) | 0.33 |
| artificial_characters | 0.932 (0.018) | 0.892 (0.014) | ${\textcolor{green}{\blacktriangle}}$ | 0.897 (0.004) | 0.747 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.457 (0.057) | 0.512 (0.037) | 0.02 |

Final Decision

The authors propose a tree-based method as a generative model for tabular data.

  • Reviews: Initially, the paper received two very positive and three more critical reviews. Points of criticism were novelty, clarity, missing technical details, and the selection of experiments.
  • Rebuttal: The authors were able to also convince the more critical reviewers by clarifying misunderstandings and providing numerous new experimental results.
  • Decision: While I recommend acceptance, it should be noted that scores negatively correlate with confidence. The reason, as also pointed out by Reviewer uJTo, likely is that the simplicity of the approach is sometimes hidden behind overcomplicated mathematical concepts. At the very least, the authors should simplify notation and improve the paper's readability in the camera-ready version.