PaperHub
Overall rating: 6.0 / 10 (Poster; 5 reviewers; min 5, max 7, std 0.9)
Individual ratings: 7, 6, 5, 7, 5
Confidence: 2.6 | Correctness: 2.6 | Contribution: 3.2 | Presentation: 2.4
NeurIPS 2024

Generative Forests

OpenReview | PDF
Submitted: 2024-05-13 | Updated: 2024-11-06
TL;DR

A new class of tree-based generative models and a training algorithm with fast, boosting compliant induction and models that can be used for data generation, missing data imputation and density estimation.

Keywords
tabular data, generative models, boosting, trees

Reviews and Discussion

Review (Rating: 7)

This paper proposes a generative model for tabular data based on a forest of trees. It is evaluated on the quality of the generated samples (using an optimal transport based distance to a held out test set), on data imputation and density estimation.

Strengths

Generative modeling for tabular data is an important research topic, and exploiting the strengths of trees (ubiquitous in this modality) for this generative task is a promising direction.

The framework inherits the flexibility of boosting, and the theoretical results (under weak learning assumptions) provide guarantees that the underlying distribution can be approximated well enough with enough trees.

Weaknesses

Further empirical comparison could be conducted. For example, recent works have applied diffusion for tabular data (Kotelnikov et al.) as well as flow matching (Jolicoeur-Martineau et al.). It might be worth comparing to these more recent approaches.

Furthermore, more metrics could be included for the LifeLike experiment to provide more nuanced results. See Jolicoeur-Martineau et al. for examples of such metrics.

References:

Kotelnikov, A., Baranchuk, D., Rubachev, I., & Babenko, A. (2023, July). Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning (pp. 17564-17579). PMLR.

Jolicoeur-Martineau, A., Fatras, K., & Kachman, T. (2024, April). Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees. In International Conference on Artificial Intelligence and Statistics (pp. 1288-1296). PMLR.

Questions

What is the computational cost for training and inference compared to Generative Trees? (It would be interesting to provide wall-clock times)

How are the parameters chosen (number of trees, number of splits, ...)?

Limitations

See questions

Author Response

We would especially like to thank the reviewer for singling out the importance of boosting and of our theoretical framework and results.

weaknesses

Further empirical comparison could be conducted. [...]

We would be happy to do so, but kindly note that at this stage we have received criticism that we cram a lot of experimental material into a small allotted space [kTDz-C]. Furthermore, our paper investigates three (3) different experimental settings on which we already have five (5) contenders (in fact 6 if we count the generative trees of [29], and even more if we consider "copy" to be one in our GEN-DISCRIM experiment, Section V.V.3.5); finally, the reviewers altogether suggest adding five (5) more contenders, in different parts of our three settings. As we write above [ALL], we would like to keep a good balance between theory and experiments, so it would be important for the reviewers to reach a consensus on an additional experiment that we could detail as part of the +1 page. We make a suggestion in [ALL] that would be easy to carry out during camera-ready preparation and which aligns with the reviewer's suggestion.

Furthermore, more metrics could be included for the LifeLike [...]

Good suggestion, but kindly keep in mind that this would require more space in our paper (see above). We suggest adding Gower and coverage in the appendix for our method and all our LifeLike experiments. Kindly note also that, out of the many metrics in Jolicoeur-Martineau et al., an optimal transport distance is the only one that provides a statistically meaningful comparison at the distribution level (see [kTDz-F]). Jolicoeur-Martineau et al. use Wasserstein's metric, which is computationally expensive and prevents them from computing it on medium+ sized datasets (cf. their Section B.2.2); we use a regularized version, now very popular since Cuturi [8], that still bears OT properties and is sufficiently fast that it could be computed on all our datasets, even on our "low-end" laptop.
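
For reference, a minimal sketch of how such an entropy-regularized OT distance between a real and a generated sample could be computed with the POT library; the data arrays, the cost matrix choice and the regularization value `reg` are placeholders for illustration, not the settings actually used in the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def sinkhorn_distance(X_real, X_gen, reg=0.1):
    """Entropy-regularized OT cost between two samples (Cuturi's Sinkhorn)."""
    a = ot.unif(len(X_real))           # uniform weights on real points
    b = ot.unif(len(X_gen))            # uniform weights on generated points
    M = ot.dist(X_real, X_gen)         # pairwise squared-Euclidean cost matrix
    return ot.sinkhorn2(a, b, M, reg)  # regularized OT value

# Example with random stand-ins for real / generated tabular data:
rng = np.random.default_rng(0)
print(sinkhorn_distance(rng.normal(size=(200, 5)), rng.normal(size=(200, 5))))
```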

questions

What is the computational cost for training and inference compared to Generative Trees? (It would be interesting to provide wall-clock times)

It is more computationally demanding than generative trees (GTs) for a single reason: when testing a split at a leaf, we have only one support to split in a GT (the one corresponding to the leaf), while in a generative forest (GF) we need to test the splitting of all supports in ${\mathcal P}({\mathcal T})$ that are included in the leaf's. Note that this has a positive side: we can train much smaller GFs compared to GTs. So somehow we are slower with GFs but end up training models with fewer splits to reach the same quality. We could put times, but such a question entails a broader one on our contenders and this could be sensitive [kTDz-F]. Note also that we have used a "low-end" laptop and a "medium+ end" desktop for our experiments, but we have systematically trained our models on the laptop. This was deliberate, as we wanted to make sure we could learn such models on a "simple" device. In this context, we are not sure we would not end up comparing pears and apples when it comes to times. At least we can propose to put the training times of our method for all our experiments, with a clear explanation of how / why we did so on a "low-end" laptop (in the appendix)?

How are the parameters chosen (number of trees, number of splits, ...)?

We deliberately did not optimize / hypertune these parameters, in order not to factor "human tuning" into the explanation for the quality of our models (keep in mind we have three experimental settings), so we just picked informed guesses that would also represent fast enough training, see [kTDz-E] [cvp8-F]. The other reason we have not done this is that we are convinced there is an automatic method to find the "right" parameters, which comes down to pruning our models, see [cvp8-E]. This would of course require a work / paper of its own.

Comment

Dear Authors,

Thank you for your detailed response. I'd be happy to see the results against Forest-Flow if the other reviewers also agree.

Comment

Kindly see the top of this page, or search for tags [Exp-0], [Exp-1] and [Exp-2].

Comment

Thank you for your response, I have raised my score by one point.

Review (Rating: 6)

This paper introduces generative forests (GF), a new class of generative models designed for tabular data and based on sets of trees (forests), and a training algorithm called GF.BOOST. The algorithm is designed to be simple, making it easy to implement with minor modifications to existing CART-style decision tree induction schemes.

The GF models and GF.BOOST algorithm are evaluated across various domains, demonstrating their capability in tasks such as data generation, missing data imputation, and density estimation. The results show that even a small number of trees within a generator can be effective, highlighting the potential of the proposed approach in handling tabular data.

Strengths

  • I believe the paper is a nice addition to the ML community. Prevailing methods for data imputation in DTs are based on heuristics and tricks, whereas this paper approaches the problem from a sound probabilistic point of view, which makes a lot of sense to me. This naturally leads to the ability to generate more data, which may have serious implications for the synthetic data generation literature.
  • The method is relatively easy to implement and can be done as a straightforward extension of CART-style tree induction methods.

Weaknesses

  • Experiments section may need some improvements:

    • Most datasets in Table 2 (which seems to be the main results table) are standard ML benchmarks. It would be really nice to have real-world Kaggle-style datasets with missing values and compare with strong baselines therein.
    • Adding traditional forests is also necessary to compare against established methods (XGBoost, LightGBM, etc.)
  • I find most of the math notations hard to follow and can be significantly simplified.

Questions

  • How computationally efficient is it to compute Eq. 4? From my understanding it is used for splitting a tree
  • Is there any mechanism to prune some trees in a forest (reduce T)? Traditional tree boosting builds trees sequentially, so there is an opportunity for early stopping. I was wondering how critical it is here to have a predefined T...
  • Is there any separate ablation on how well this method works for categorical features?

Limitations

  • there seem to be some limitations in the convergence analysis (e.g. symmetry of ℓ).
  • otherwise, having a separate (sub)section on limitations would be nice (e.g. runtime compared to traditional boosting methods).
Author Response

We thank the reviewer for noting the importance of our formal contribution!

weaknesses

It would be really nice to have real-world kaggle-style datasets with missing values

Actually, sigma-cabs is a Kaggle dataset with missing values. Stanford Open Policing is not from Kaggle, but it also has missing values and is real-world (see our Table A1). We understand the reviewer would like to have a few more? Perhaps we can use [uJTo-B]?

Adding traditional forests is also necessary to compare against [...]

We understand the reviewer would like us to add tests on supervised learning. We kindly refer to [cvp8-A] for a discussion on this matter.

I find most of the math notations hard to follow and can be significantly simplified

We attribute this to the fact that our approach is purely grounded in a supervised learning framework, which is indeed rare for generative models. We refer to [cvp8-B][uJTo-A] for discussions of these notions and their importance; we would use part of the +1 camera-ready page to hopefully bring improved explanations.

questions

How computationally efficient is it to compute Eq. 4?

[6MHb-A] Great question. We only need to keep those elements in ${\mathcal P}({\mathcal T})$ with nonzero empirical measure, because for all others their contribution to (4) is zero, so the complexity is linear in the size of the training sample (we cannot have more such elements in the set). We also refer to [kTDz-I] for side remarks on the code we provide.
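
To illustrate why only non-empty cells matter, here is a runnable toy demonstration (with scikit-learn CART trees standing in for a generative forest; this is not the paper's code): each example falls into exactly one cell of the intersection partition, the tuple of leaves it reaches across the trees, so the number of cells with positive empirical measure is at most the sample size, whatever the tree sizes.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Stand-in forest: independent CART trees fit on random targets (illustration only).
forest = [DecisionTreeRegressor(max_depth=5, random_state=k).fit(X, rng.normal(size=len(X)))
          for k in range(5)]
# Each example maps to the tuple of leaves it reaches -> one non-empty cell.
cells = Counter(map(tuple, np.stack([t.apply(X) for t in forest], axis=1)))
print(len(cells), "non-empty cells for m =", len(X))  # always <= m, here <= 500
```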

Is there any mechanism to prune some trees in a forest (reduce T)? 

Great question: as we say in [cvp8-E], there surely is, but it would deserve a paper of its own (we are talking about a pruning mechanism with provable properties).

I was wondering how critical here to have predefined T... [...]

We refer to [cvp8-F] for a similar discussion and the reason why we did not "hypertune" our parameters $T, J$.

Is there any separate ablation how well this method works for categorical features?

We would be happy to get additional context on this question: does the reviewer think about testing "all features" vs "only numerical features" to measure how much categorical features contribute to the model quality? (Note that we would have to push it to the appendix, see [ALL].)

limitations

there seems to be some limitations in convergence analysis (e.g. symmetry of l).

This is in fact a limitation of a previous paper [29], a limitation that we remove in our analysis, which is thus more general (see L497, which we can move to the main file using the +1 camera-ready page).

Comment

Hi, thanks for your rebuttal! However, I feel like my concerns on additional experiments were not addressed adequately.

Most datasets used seem to not have missing values, and the ones that do have them appear to be of small size, e.g. sigma-cabs has 5000 data points and 13 features. I don't understand why it was a problem to run the requested evaluations during the rebuttal. One could just get some dataset of reasonable size where tree-based models perform well (e.g. XGBoost), apply some standard method (e.g. https://www.kaggle.com/code/alexisbcook/missing-values) and evaluate against the proposed method. I feel like this is a major blocker for me to accept the paper without those results. The current evaluations do not convincingly show the advantage of applying this method specifically over others.

Comment

We forgot to mention an important point: in missing data imputation, the task is to predict missing values and then compare them to a known ground truth to compute the metrics. Thus, the missing values must have been removed by us from the domains, and so those domains must contain the values that will be removed before testing missing data imputation.

Note that this does not prevent the domain from having missing values "as is" (our methods can be trained if the domain has missing values), but the prediction of those cannot be assessed against a ground truth because the true values are not known; read: those predictions cannot be included in a metric. This is why we simulate missing data imputation by removing a fixed proportion of values in the domain (5%, Missing Completely At Random, MCAR in the jargon of missing data imputation), and then compute metrics based only on those values known in the full domain and then predicted by the models.

Comment

Yes, I understand that you're doing the 3 things mentioned. But how, for example, will data imputation eventually be used? My best guess is to train some model on it. My concern is how this method is going to fit into the final pipeline, where we could use alternative methods of data imputation with the final model. The question is how GF performs with respect to that. This is a point which I think kTDz is raising as well...

Comment

We apologize if we misunderstand the point. Say we have a training sample $S_m$ with missing values. A missing data imputation method $M$ provides another sample, $M(S_m)$, without missing values. It seems that the reviewer would like to then apply another method, say $A$, which produces an ML output (model, statistics, etc.) $A(M(S_m))$. The composition $A \circ M$ is the pipeline mentioned. Then we are being asked to evaluate the quality of $A(M(S_m))$ when, for example, $M$ is our approach, for some $A$ (against alternative methods).

This is not the way missing data imputation is evaluated: it is evaluated simply by comparing $M(S_m)$ to the ground truth, say $S$, in which the missing values of $S_m$ on which the evaluation has to take place are not missing. We thus compare $M(S_m)$ and $S$, independently of any post-processing $A$. This is why our evaluation is as it is in the paper, aggregating over all features a pointwise loss that depends on the feature type (perr for categorical, rmse for numerical).

Using such pointwise metrics defines a gold standard for the evaluation of missing data imputation; see for example Muzellec, Josse, Boyer and Cuturi, "Missing Data Imputation using Optimal Transport", ICML 2020; Zhao, Sun, Dezfouli and Bonilla, "Transformed Distribution Matching for Missing Value Imputation", ICML 2023, etc.
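
To make the protocol concrete, a minimal sketch of this evaluation (a 5% MCAR mask, then per-feature pointwise losses computed on the hidden entries only); the array layout, the helper names and the imputer itself are placeholders, not the paper's code.

```python
import numpy as np

def mcar_mask(X, frac=0.05, seed=0):
    """Hide a fraction of entries completely at random (MCAR)."""
    rng = np.random.default_rng(seed)
    return rng.random(X.shape) < frac  # True = value removed before imputation

def imputation_scores(X_true, X_imputed, mask, categorical_cols):
    """Per-feature pointwise losses on masked entries only:
    error rate (perr) for categorical columns, RMSE for numerical ones."""
    scores = {}
    for j in range(X_true.shape[1]):
        hidden = mask[:, j]
        if not hidden.any():
            continue
        if j in categorical_cols:
            scores[j] = np.mean(X_true[hidden, j] != X_imputed[hidden, j])
        else:
            diff = X_true[hidden, j].astype(float) - X_imputed[hidden, j].astype(float)
            scores[j] = np.sqrt(np.mean(diff ** 2))
    return scores
```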

Comment

I'd still encourage the authors to discuss their method a bit more from the supervised learning angle; this is where DTs are mainly used so far. I'm maintaining my original score.

Comment

We are confused. The reviewer would like us to use XGBoost, but we do not do supervised learning. We do (i) data generation, (ii) missing data imputation, (iii) density estimation. XGBoost and LightGBM address supervised learning: the missing values need to be in one column (= class). Our missing values can be everywhere, which is why we use MICE, which is state of the art (note that we have also optimized MICE, in particular augmenting the number of trees used in random-forest-based imputation to 100, the default value of 10 being arguably too small; see also [kTDz-E]). We also test CART trees in MICE. Thus, note that the contender method we use is always tree-based.

Since the reviewer mentions datasets with missing values many times, we assume the task we have to tackle is missing data imputation. We have started more experiments with MICE against our method using real-world domains bigger than sigma-cabs. The results will be communicated before the discussion deadline. Note that our results in missing data imputation are plots (Table 5) and we will not be able to communicate plots, so we will communicate summary metrics that we already use.

Comment

We thank the reviewer for all interactions. As we are a few minutes before the deadline, we would like to mention that we have started comparisons on much bigger domains than the ones in our current benchmark. As a comparison, our largest (real-world) dataset in the vs-MICE benchmark has $m \times d \approx 363$K. We are now testing on bigger domains, but we are facing scalability setbacks with some contenders [uJTo-L].

Nevertheless, for imputation experiments, we can already report, as an example, an experiment vs MICE (same parameters as in our paper) on the domain medical_charges (a real-world domain from US hospitals with $m \times d > 815$K [govWD]) with RMSE = 2.86E8 (us) vs RMSE = 2.50E8 (MICE), p-val = 0.38, so no statistically significant difference.

To make a uniform treatment among our experiments, we have to test those domains on our two other settings (data generation, density estimation) as well, using all our other contenders, so we propose to select 1-2 such domains approaching $m \times d = 1$M and add those to our benchmark. Our program runs smoothly on such domains [uJTo-L], but we have to fairly solve scalability issues for some of our contenders [uJTo-L].

The author(s).

[govWD] L. Grinsztajn, E. Oyallon and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS 2022.

Review (Rating: 5)

The paper proposes a generative model based on an ensemble of trees (forest) for tabular data. The proposed model enjoys the following peculiar properties:

  1. Compared to generative models based on a single tree, it offers improved expressiveness in terms of partitioning the input space (linear for generative trees vs. polynomial for generative forests in terms of decision nodes in a tree).
  2. The model offers unbiased density estimation for sufficiently large capacity, thanks to the fact that each node in each tree leverages decisions in proportion to the empirical distribution.
  3. Learning the structure of the trees is performed by minimizing an objective function based on the likelihood ratio risk.
  4. Computationally speaking, data generation and density estimation can be performed in parallel with respect to the different trees. Indeed, results can be aggregated a posteriori by taking the intersection of selected regions.
  5. Apart from data generation and density estimation, the model can be extended to deal with missing data and perform data imputation.

Experiments are conducted on a series of tabular datasets from the UCI repository. The proposed solution is shown to outperform/being on par with existing tabular data generation (the most relevant being adversarial random forests), data imputation (e.g. MICE) and density estimation baselines (e.g. kernel density estimation).

Strengths

  1. The problem of learning a good generative model for tabular data is relevant (Relevance).
  2. The solution based on a tree ensemble represents a significant advance over solutions based on a single decision tree (Significance).
  3. The proposed model offers the possibility to tackle three different tasks: density estimation, data generation and data imputation. Moreover, experiments demonstrate the efficacy of the solution when compared with corresponding baselines for each task (Significance).
  4. The theory of the paper seems correct; however, I haven't gone through the proofs (Correctness).
  5. Code is available at the time of submission. No additional check has been conducted to verify the reproducibility of the results (Reproducibility).

Weaknesses

  1. One of the major concerns is with the novelty of the proposed solution (Novelty). What is the relation with the work in [1]? The work in [1] also provides a model able to tackle the three above-mentioned tasks. Would it be possible to provide a comparison?
  2. The paper is sometimes overloaded by unnecessary formalism, rendering the text hard to follow (Clarity). For instance,
    1. Lines 92-109: why is the whole theory about risk minimisation for binary classification introduced, when the goal is to perform density estimation? Would simple likelihood learning suffice?
    2. Definition of tree. "…labeled with subsets of their tail node's variable domain". Do you mean "labeled with subsets of attribute values for the corresponding variable"?
    3. Lines 160-163. Can you make an example or otherwise leave out the sentence? Am I correct to say that the permutation invariance property is due to the final intersection operation, or am I missing something?
    4. Algorithm 3, Step 2.1: How is node selection performed, uniformly at random? Moreover, can you provide some more details about the resulting trees, are they balanced?
    5. Can you elaborate on the motivation for proving Theorem 5.2 and discuss the significance of its results or the link with the results in the experiments?
  3. All experiments are conducted on datasets with fewer than 34 variables (Soundness). Would it be possible to provide some comparisons with [1] on larger-dimensional datasets (see the reference for examples) and/or discuss the computational requirements?

Reference

[1] Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees. AISTATS 2024

Questions

Overall, I like the extension of generative trees to ensembles. I’m willing to raise my score if all weaknesses are addressed during the rebuttal phase.

Limitations

There are no additional suggestions.

Author Response

We thank the reviewer for organising the review in a very clean way that allows for easy referencing, tagging two significant key strengths on which we elaborate further below, and being clearly open for discussion. We use W.X.Y to refer to the weakness section.

W.1 "One of the major concerns [...]"

There are two parts to this question, a technical one (relationship to [1]) and experimental (comparison with [1]).

  • technical part: [1] is based on diffusion models while we rely on stochastic generation, so the approaches are radically different (they solve ODEs for data generation, we essentially "just" sample Bernoulli random variables, see our algorithm starUpdate). We thank the reviewer for this reference as it in fact provides us with a contender type that is radically different from all our contenders (not just our method) [ALL].

At this point, we would like to highlight a key difference with [1]: it has no formal analysis of the algorithm, i.e., no convergence / consistency result of any kind. Experimental results are important to tell the story, but a good theory gives the backbone. Among all references pointed to by the reviewers, none provides rates of convergence. We do. We also emphasize that it is not complicated to show a consistency result in our case [kTDz-A2].

  • experimental comparison: the reviewers provided overall a number of potential additional contenders. Assuming a consensus emerges around our proposition in [ALL], we would be happy to provide an additional comparison with [1] on data generation, hoping our experimental section will remain readable using the +1 page of camera ready (we would not compare on missing data imputation as [1] make it explicit that they do worse than MICE and we are competitive).

W.2.1 "The paper is sometimes overloaded by unnecessary formalism [...] Line 92-109, why is the whole theory [...]"

[uJTo-A] The formalism we introduce is necessary because we manage to train a generative model using the framework of supervised learning, and our algorithm heavily relies on the loss definitions mentioned. Furthermore, the loss properties are also important because, it turns out, our convergence results are more general than those of [29]. Those definitions make the "slack" explicit. We are confident we can make the section "breathe" using part of the +1 camera-ready page.

W.2.2 "Definition of tree [...]"

Correct. We can paraphrase.

W.2.3 "Line 160-163. Can you make an example or in case leave out the sentence"

It is correct but the sentence should remain because it grounds the proof (not given) of Lemma 4. We can add the reviewer's remark and merge it.

W.2.4 "Algorithm 3 in Step 2.1 How is node selection performed, [...]"

We choose the heaviest leaf among all trees (the one with the largest empirical mass in its support). In the code, it is method choose_leaf of class Algorithm.

W.2.5 "Can you elaborate on the motivation of proving theorem 5.2 [...]"

This is an empirical convergence rate under the weak learning assumption in the boosting framework. To summarize its importance, this has been the cornerstone of boosting algorithms since Valiant's weak/strong model and since Schapire's first proof that boosting is achievable in the model (1990). Such a result is thus fundamental: it shows an explicit rate under assumptions that are so "weak" that not satisfying them would just preclude learning (in the supervised setting, which is easier to state, failing the weak learning assumption would mean getting updates no better than a fair coin, thus indeed totally useless to learn). Kearns and Mansour in [18] summarize very well the importance of boosting in the context of decision tree induction, which is close to our setting. The first sentence of the abstract of the popular paper "Additive logistic regression: a statistical view of boosting" by Friedman, Hastie and Tibshirani summarizes the importance of boosting as we just explained:

"Boosting is one of the most important recent developments in classification methodology."

W.3 All experiments conducted on datasets with less 34 variables [...] comparisons with [1] on larger dimensional datasets [...]

[uJTo-B] (We correct: it is fewer than 35.) We would be happy to do so, see [ALL], but note that among all datasets of [1], only 3 out of 27 have more than 35 features (we can use those?).

Comment

Dear authors,

Thank you for the answers, which have addressed some of my concerns about clarity. There are still a number of questions that I would like to see addressed and that refrain me from raising the score, also in light of the other reviews.

Novelty

While the difference between the proposed solution and Forest-Flow is clear from a model perspective, I don't understand the difference in terms of properties, especially regarding generalization (as also pointed out by reviewer kTDz). Firstly, Theorem 5.2 provides a result about a weak monotonic reduction of the density distance. Since the reduction is not strict, how do you ensure that the learnt model is unbiased? Secondly, does the mentioned consistency property imply that Forest-Flow is biased, whereas Generative Forests is not? As far as I understand, training Forest-Flow reduces the distance between the model and the empirical data distribution leveraging standard KL divergences through an ELBO-like formulation. Consistency results clearly hold in the infinite sample regime and for sufficiently large network capacity. Can you please clarify this aspect?

Clarity

It is mentioned that during learning, the leaf with the largest empirical mass is selected among all trees. How does this ensure that learning is unbiased, as uncertainty is not taken into account? Moreover, how does this criterion influence the topology of the trees, are they balanced?

Can you provide some synthetic visualisations of the convergence results of Theorem 5.2 and of the benefits of the strict properness property against other common generative losses, such as the KL divergence? This could help add value by making the concepts clearer.

From an implementation perspective, how are the different trees initialised and how is diversity across trees ensured throughout learning?

Experiments

Comparison results against Forest-Flow are promised, but it is not clear whether they will be released during the discussion phase or afterwards. Providing these during the discussion phase could definitely help in the decision.

Comment

Kindly see the top of this page, or search for tags [Exp-0], [Exp-1] and [Exp-2].

Comment

Dear reviewer,

Since you lodged your additional questions, we have reported a stream of experimental results [Exp-0][Exp-1][Exp-2][Exp-3], a consistency result [Th-0] and replied to your questions. You made it explicit in your questions that some information "[...] could definitely help in the decision [...]", information we have since lodged (we refer to the experiments on Forest-Flow, [Exp-0] and [Exp-1]), so we would like to hear back before the end of the discussion phase on updates relative to this decision (reviews will become invisible again afterwards).

Thank you.

Comment

Dear authors,

thank you for the answers and the additional experiments.

I haven't had time to check in close detail the theoretical aspects of generalization, but looking at the results and your answers, the outcome looks reasonable.

It is nice that you managed to provide an experimental comparison with Forest-Flow, and it is clear from these results that the two approaches, while being different in nature, have similar performance. I'm therefore happy to increase my score.

However, I will not increase my score even further as the issue of scalability is still not addressed and there are still unclear aspects.

Scalability: Specifically, there is no comparison on larger datasets (number of features > 34) against Forest-Flow. I would have expected some experiments on the three larger datasets you mentioned: libras, connectionist bench sonar and qsar biodegradation.

Unclear aspects: I have two last questions:

  1. If you replace the loss with a KL, do you get a good model? If so, then why do you need the proposed loss? I would appreciate a simplification of the formalisms introduced in the paper.
  2. I'm not sure that by choosing the leaf with the largest empirical mass you obtain the desired diversity across trees. Can you please elaborate on that? I suspect that by doing so you end up learning a single tree. How is this phenomenon prevented?
Comment

We thank the reviewer for these additional comments. Here is our reply, following the comment's order.

On Novelty. We have added here the proof that our method is consistent [Th-0] (see above) in the framework used by A.H.C. Correia, R. Peharz and C. P. de Campos, Joints in Random Forests, NeurIPS 2020. We do not want the reviewer to think that we claim any biasedness / inconsistency property for Forest-Flow. We do not know. The experiments we just added here [Exp-0][Exp-1] at least demonstrate its qualities as a contender. Note that our result in [Th-0] follows from a simple relationship between our generative forests and binary trees (Lemma A above), which allows us to branch immediately to the state of the art. Maybe a similar path exists for Forest-Flow?

On Clarity. We select the leaf with the largest empirical mass because it is a simple (the simplest?) way to meet conditions (a) and (c) in our weak learning assumption, and thus to get fast empirical convergence. We also realized while writing [Th-0] that this strategy has a positive impact in terms of consistency. Note that this strategy of picking the heaviest leaf is also used in DT induction [kmOT]. Depending on the domain, this can create a diversity of model "shapes", which we believe is a good thing because the model "shape" is also adaptive to the domain at hand. Strict properness is crucial because otherwise the minimization of the loss does not guarantee convergence to a good model (L108-L109). We alleviate the symmetry assumption of [29], and this turns out to be important in the light of the reviewer's question: the KL divergence can be formulated in the proper loss framework with asymmetric partial losses [rwID, Table 2]. Note also that the GAN's Jensen-Shannon divergence can also be represented.
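
For readers less used to the CPE vocabulary, a brief reminder in standard notation (generic symbols from the proper loss literature, not necessarily the paper's): with partial losses $\ell_1$ (positive class) and $\ell_{-1}$ (negative class), the pointwise risk of predicting $u \in [0,1]$ when the true class probability is $v$ is

$$L(u, v) = v\,\ell_1(u) + (1 - v)\,\ell_{-1}(u), \qquad \underline{L}(v) = \min_{u} L(u, v).$$

Properness means $v \in \arg\min_u L(u, v)$ for every $v$, and strict properness that this minimizer is unique; Matusita's loss is the classical symmetric example, with pointwise Bayes risk $\underline{L}(v) = 2\sqrt{v(1-v)}$, while the partial losses recovering the KL divergence mentioned above are asymmetric.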

On Implementation. Trees are all initialized to roots. Diversity is ensured by the adaptivity of the model "shape" during training (cf. above).

On Experiments, we have released the set of Experiments vs Forest-Flow (see above).

References:

[kmOT] M. Kearns and Y. Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC 1996.

[rwID] M. Reid and R.C. Williamson. Information, Divergence and Risk for Binary Experiments. JMLR 2011.

Comment

[uJTo-L] We thank the reviewer for this comment and their decision to raise their score.

We apologize for not replying sooner, but as soon as we got the last comments we decided to run some pure scalability experiments comparing our software and Forest-Flow's.

We are not sure why the reviewer says we mentioned using "[...] libras, connectionist bench sonar and qsar biodegradation [...]". We did not mention those. Looking for these datasets, it seems to us that they are also quite small (we searched OpenML and the UCI repository; these are the hits we got for the largest versions):

-libras (https://www.openml.org/search?type=data&sort=runs&id=299&status=active ?), 24 * 15 = 360 examples, 90 features

-connectionist bench sonar (https://archive.ics.uci.edu/dataset/151/connectionist+bench+sonar+mines+vs+rocks), 208 examples, 60 features

-qsar biodegradation (https://www.openml.org/search?type=data&sort=runs&id=1494&status=active), 1055 examples, 42 features

We do not know if the suggestion from the reviewer was a typo. In the spirit of saving time, we looked ourselves for substantially larger domains to accommodate scalability studies. We have started making scalability comparisons on the biggest domains of the benchmark of [govWD], considering only domains with $m \times d > 1$M.

As far as our experiments go (laptop), we are facing an issue with Forest-Flow: running the same code as we had for all other experiments results in R crashing with a "vector memory limit". Setting the available memory to 50 GB gives the same error; setting it to 150 GB results in the session aborting with a fatal error.

After several more tries with the same outcomes, we decided to follow recommendations from the webpage [jkkGA] on how to optimize parameters for memory management. Given the size of the dataset, we reduced the parameter duplicate_K, initially 100 (from the website, this amounts to a number of duplications of the training sample, so targeting this parameter looked like a no-brainer). We first reduced it to 50, but this washed the system out of application memory, so we had to run it with a smaller value. So far, duplicate_K = 10 seems to work on the domains tried, but the webpage [jkkGA] makes it clear that decreasing duplicate_K can lead to worse performance. So we will probably have to fine-tune our runs against Forest-Flow for fairness.

We do not have issues with our code as far as we can tell. The biggest domains we have tried so far have $m \times d \approx 1.5$M. For example, with T=200 trees and J=500 iterations, the max memory we use at run time is < 1 GB; for T=500 trees and J=2000 iterations, it is < 1.5 GB. As far as we can tell, the worst training times we get, with T=500 trees and J=2000 iterations, are < 7000s.

(This is not meant to criticise Forest-Flow's R implementation, but rather to point out the difficulties we faced in the scalability analysis, to wrap up our analysis and justify the choices we made in this short time.)

Regarding the last two Unclear aspects, here are our answers:

-On the loss we use (vs KL). The loss we choose, derived from Matusita's loss (class Statistics.java), is the one guaranteeing the best possible convergence rate in our theory (refs [24][18] in our paper). The KL is theoretically suboptimal from the boosting-rate standpoint.

-On learning a single tree / diversity. We initialize all trees to roots (roots do not affect generation because their support is the whole domain), then pick the heaviest leaf among all current trees. Hence, during the first $T$ (number of trees) iterations all roots are split first, since each root has a larger mass (1) than any leaf in a tree with depth > 0. Then we keep on choosing the heaviest leaf among all trees: the (tree, leaf) couple chosen depends on the current set of trees and on the data at hand.

Maybe an example will convey the idea that we indeed learn sets of trees and that their structure depends on the domain: suppose we have two trees, $\Upsilon_1$ and $\Upsilon_2$. At initialisation, they are root leaves, thus of weight 1 (the sum of the weights of all examples reaching them).

[A] Step 1: we split $\Upsilon_1$. Suppose the two (new) leaves have weights 0.8 and 0.2. At the next step, we have to split $\Upsilon_2$ because its root (leaf) has weight 1.

(two scenarios follow)

[B1] Scenario 1: we split $\Upsilon_2$ and its two new leaves have weights 0.6 and 0.4.

[C1] At step 3, the next leaf to be split is then in $\Upsilon_1$ (weight 0.8).

[B2] Scenario 2: we split $\Upsilon_2$ and its two new leaves have weights 0.1 and 0.9.

[C2] At step 3, the next leaf to be split is now in $\Upsilon_2$ (weight 0.9).

[and so on]
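
As a purely illustrative, runnable toy version of this "split the heaviest leaf in the whole forest" rule (leaves are just sets of training-row indices, and the split criterion here is a placeholder median cut on a random feature, standing in for the paper's loss-minimizing split; this is not GF.BOOST):

```python
import numpy as np

def toy_grow_forest(X, num_trees=2, num_iterations=6, seed=0):
    """Toy sketch: at each iteration, split the leaf with the largest empirical mass."""
    rng = np.random.default_rng(seed)
    # Each tree starts as a single root leaf holding all examples (mass 1).
    forests = [[np.arange(len(X))] for _ in range(num_trees)]
    for _ in range(num_iterations):
        # Pick the (tree, leaf) pair carrying the most training examples.
        t, i = max(((t, i) for t, tree in enumerate(forests) for i in range(len(tree))),
                   key=lambda ti: len(forests[ti[0]][ti[1]]))
        rows = forests[t].pop(i)
        j = rng.integers(X.shape[1])            # placeholder split criterion
        thr = np.median(X[rows, j])
        left, right = rows[X[rows, j] <= thr], rows[X[rows, j] > thr]
        forests[t] += [left, right]             # replace the leaf by its two children
        print(f"split tree {t}: leaf of mass {len(rows)} -> {len(left)} + {len(right)}")
    return forests

toy_grow_forest(np.random.default_rng(1).normal(size=(100, 4)))
```

On such data the two roots are split first (both have mass 1), after which the chosen (tree, leaf) pair depends on the masses produced so far, exactly as in the two scenarios above.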

We are conscious that we are lodging these comments just before the deadline. We thank the reviewer (all reviewers in fact) for a strong engagement around our paper.

[govWD] L. Grinsztajn, E. Oyallon and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS 2022.

Review (Rating: 7)

The paper presents a novel boosting algorithm based on ensembles of generative trees, addressing tasks such as binary classification, missing data imputation, and density estimation. The proposed algorithm optimizes specific loss functions and leverages the structure of the models to efficiently estimate densities given a generative model and partially specified observations. The authors demonstrate the practicality of their approach through comprehensive experiments, highlighting its effectiveness in binary classification compared to diverse state-of-the-art methods, including competitors like Adversarial Random Forests (ARF), CT-GAN, vine copula autoencoders (VCAE) and kernel density estimation (KDE).

The authors introduce a unique training methodology for these generative models, which is based on class probability estimation (CPE) in binary classification tasks. The loss functions are decomposed into partial losses for positive and negative classes, with the Bayes risk function defined as the optimal achievable loss given a certain positive base-rate. Properness and strict properness of the loss functions are critical properties, ensuring that the loss is minimized when the predicted probability matches the true class probability. The paper discusses various proper losses, such as square loss, logarithmic loss, and Matusita's loss, which are symmetric and differentiable. Furthermore, the paper explores loss functions in the context of density ratio estimation and introduces the likelihood ratio risk for assessing the performance of generative models. Using Bregman divergence and the generalized perspective transform, the authors propose a framework for evaluating the density ratios of different distributions.

This theoretical foundation supports the practical applications of their models, which show competitive performance on the proposed datasets. The authors carried out experiments on a total of 21 datasets from sources such as UCI, Kaggle, OpenML, and the Stanford Open Policing Project, or simulated data. These include datasets like iris, ringGauss, gridGauss, forestfires, tictactoe, ionosphere, student performance, winered, abalone, kc1, sigma-cabs, compas, artificial characters, jungle chess, open-policing-hartford, and electricity, with varying attributes and missing data conditions. The paper concludes that their models, trained with the novel generative tree algorithm, are strong contenders against existing state-of-the-art techniques.

Strengths

Regarding originality, the paper introduces a novel approach to generative modeling by creating a generative forest based on Class Probability Estimation (CPE). This method represents a significant advancement over current state-of-the-art algorithms. By leveraging ensemble methods and decision tree induction for generative tasks, the paper presents a fresh perspective that has not been extensively explored in the literature. The combination of well-known techniques in a new context adds substantial value to the field and showcases the authors' innovative approach to generative modeling.

The quality of the submission is commendable. The paper is well-organized and meticulously indexed, with a comprehensive appendix that supports the main content. The methodology is robust, encompassing a detailed mathematical explanation, pseudocode for the proposed algorithm, and extensive experimental validation. The authors have tested their model on 21 diverse datasets from reputable sources such as UCI, Kaggle, OpenML, and the Stanford Open Policing Project. This thorough evaluation across various tasks, including data generation, missing data imputation, and density estimation, provides strong evidence of the model's performance and versatility. Additionally, the paper continues the line of research in generative models, building on previous work and demonstrating significant progress in this domain.

The paper is clearly written and well-organized, ensuring that readers can easily follow the authors' reasoning and methodology. The detailed descriptions of the mathematical foundation, the pseudocode, and the experimental setup are particularly noteworthy. These elements would allow an expert to reproduce the results. Furthermore, the inclusion of a comprehensive appendix adds to the clarity by providing additional details without cluttering the main text.

The results presented in the paper are of high value. In that line, the authors have made a commendable effort by comparing their method with the closest approaches (ARF, copycat) cited in [42] and [29]. Furthermore, they have clearly outlined the distinctions and enhancements of their algorithm in comparison to these approaches.

The authors suggest that generative forests can effectively eliminate "the slack" when utilizing decision trees as discriminators. This implies that decision trees exhibit a more regulated and foreseeable behavior in contrast to neural networks. Decision trees are inherently more calibrated and provide more deterministic outcomes. As a result, the feedback they provide to the generator during training is more accurate and reliable. The proposed method seems to enhance binary classification performance compared to existing approaches and also addresses challenging tasks such as missing data imputation and density estimation. The model's ability to produce competitive results across a diverse array of datasets and tasks underscores its potential impact on both the research community and practical applications.

Weaknesses

In section 1, the authors delve into the use of generative AI in the ML community and highlight two significant features of their generative forest. They emphasize its simplicity and the strong convergence guarantees it offers in a weak/strong learning model, akin to Valiant’s PAC learning boosting model. However, it remains unclear whether this generative forest, based on Class Probability Estimation (CPE), can effectively handle multiclass scenarios.

Section 3 introduces the supervised loss for the generative tree. Nevertheless, from a synthesized perspective, the drafting in line 75 lacks clarity. The authors should consider enhancing the clarity of these lines.

Moving on to Section 4, in line 125, the authors refer to 'c', but it is not clearly connected to the context provided in the text.

Regarding the supplementary material, Table A3 is challenging to read as the footnotes overlap with the image.

Questions

In Table 2, the final row displays the wins and losses for "us". I am not closely following the numbers, as you have already chosen the green one, which includes p-values indicating significance. Another query pertains to comparing the number of trees. It seems unfair to compare T = 500 with competitors having fewer trees (10, 50, 100, 200). In simpler terms, I couldn't comprehend why you mentioned that all these approaches rely on models that are vastly distinct from each other. I know that in the article the authors mention that ARFs are trained with a varying number of trees T ∈ {10, 50, 100, 200}. Additionally, ARFs incorporate an algorithm for tree size selection, eliminating the need for manual selection. On the other hand, CT-GANs undergo training with different numbers of epochs E ∈ {10, 100, 300, 1000}, but I don't follow why these models are then compared with generative forests comprising T = 500 trees and trained over a total of J = 2000 iterations. That approach was used for Tables 2 and 4.

Limitations

The work presented by the authors is highly promising since they have introduced a novel methodology for constructing generative tree models based on class probability estimation (CPE) as a loss function in binary classification tasks. However, the paper would benefit from a dedicated section discussing the current limitations of this method. Specifically, it would be useful to understand whether this approach can be applied with the same efficacy to multiclass classification or, with some modifications, to regression tasks.

Additionally, it would be valuable to identify the types of tabular data that the proposed method might struggle with or the conditions under which its performance may be suboptimal. For instance, are there specific data types or distributions where this model does not perform as well? A thorough examination of these aspects could enhance the article.

Moreover, while the authors have laid a strong foundation for binary classification problems in their work, constructive suggestions for improvement include providing an analysis or discussion on how the current method could be extended or adapted to handle multiclass classification tasks or regression problems, which would be helpful.

In that line, another interesting thing could be identifying the specific types of tabular data that might present challenges for the proposed method, which would be beneficial. Under what conditions does the model's performance decline, and what are the characteristics of datasets where this approach is less effective?

Author Response

We thank the reviewer for noting that our approach "[...] presents a fresh perspective that has not been extensively explored in the literature [...]" and noting that the quality of our submission is commendable.

weaknesses

In section 1, [...]

[cvp8-A] We assume the reviewer wants to know if our models can tackle supervised learning with multiclass problems in the context of class probability estimation. The answer is yes, and it just comes from the fact that our generative models can easily provide the full distribution of missing values given the observed ones and the generative model. Indeed, supervised learning is "just" missing data imputation where the class label is missing, and the full distribution we obtain, in this case, is the complete Bayes rule estimate for $P[Y=c|X]$ (where $c$ ranges in any set, thus allowing multiclass). Note that we can do something "slicker" than supervised learning when additional variables are missing, because in this case we get the complete estimates for $P[\text{all missing variables} = \text{set of values} \mid \text{observed variables}]$. The pointwise max over the set of values necessarily contains a prediction for the class.
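
In other words (standard notation, not the paper's symbols), the classifier induced by treating the label as the only missing value is simply

$$\hat{y}(x) \;=\; \arg\max_{c}\; \hat{P}\big[\,Y = c \mid X_{\mathrm{obs}} = x\,\big],$$

where $\hat{P}$ is the conditional distribution returned by the generative forest for the missing label given the observed features $X_{\mathrm{obs}}$.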

Section 3 introduces the supervised loss for the generative tree. [...]

[cvp8-B] We would definitely be happy to use part of the +1 camera-ready page to expand on this part, as it grounds our approach in the supervised learning framework.

Moving on to Section 4, in line 125, [...]

[cvp8-C] We agree. In fact, part of the hardness probably comes from the fact that we did some "acrobatic" implicit referencing to save space: (C) is used earlier, in lines L119+. As for [cvp8-B], we believe we can use part of the +1 page to polish this.

Regarding the supplementary material, Table A3 [...]

Indeed! We apologize (this will be fixed); the obfuscated line is:

"Table A3: 2D density plots ( Generative forests x = Life Style Index and y = Customer Rating) for data generated on sigma_cabs, from models learned with 5% missing features [...]"

It is important because sigma_cabs is a real-world domain and some features appear to be "trickier" to generate than others.

questions

We directly reply here to the list of concerns.

[cvp8-D] First, ARFs have fewer trees, but they are much bigger than ours (the technical reason being that to get a good model overall, each tree needs to already be a good model of its own, because the distribution learned is a convex combination of the trees'; get one wrong, and the overall model can be heavily penalized). In the end, their total number of splits is equivalent to or bigger than ours. We propose to also indicate the average total number of splits for ARF.

[cvp8-E] "[...] Additionally, ARFs incorporate an algorithm for tree size selection, eliminating the need for manual selection [...]" Correct ! As we say in L311-L312, we do not have one in this paper. Such a matter would deserve a work of its own. However, the blueprint of such an algorithm seems intuitive: get inspiration from pruning algorithms in supervised decision tree induction.

"On the other hand, CT-GANs [...]" We understand that the reviewer points out that some parameters look like pears and apples. It is true and we accept the criticism: we deliberately wanted to compare techniques coming from many different model types [ALL]. It then came as a constraint to figure out some good model parameters to train for the contenders: in the case of KDE, we kept the kernel that gave overall the best results on our datasets; we tried to use a number of epochs that looked sound for the datasets that we had in CT-GANs, etc. . This explains why we give several different results for our contenders, in order not to "fix arbitrary" values.

limitations

The work presented [...] multiclass classification [...] regression.

Our method can indeed cope with supervised learning, and this comes from the abilities of our models [cvp8-A]. However, this would represent a fourth task, and we are not sure this represents a limitation of our work, which already contains experiments on 3 different tasks (no paper cited by the reviewers addresses this many tasks). Probably worth adding to a work of its own on [cvp8-E]?

Additionally, it would be valuable to identify [...] proposed method might struggle with [...]

[cvp8-F] We agree. We did not do it because we strongly suspect from our experiments that, more than "hard" datasets, there is actually a possibly suboptimal choice of our parameters $J, T$. By deliberately choosing those from a very small set in our experiments, we accepted the possible suboptimality of some of our results. We did not want to hypertune our method, and we believe there is a simple path forward which would then be useful to identify potentially hard datasets: first solve the pruning problem [cvp8-E]. We are happy to elaborate more on this in the +1 page.

Moreover, while the authors have laid a strong foundation [...]

We refer to [cvp8-A] on this.

Review (Rating: 5)

The authors introduce a novel generative model (Generative Forests - GF) for sampling and density estimation, leveraging an ensemble of trees. Their method partitions the input domain by considering all possible intersections of the supports of the leaves across the ensemble, resulting in a much finer partitioning than a single tree could achieve. For each subset of the input domain thus obtained, the model estimates a uniform density based on an empirical training measure. Efficient algorithms for sampling from such a model and evaluating the densities are derived.

To train the model, the authors propose an algorithm that iteratively splits a selected leaf node in one of the ensemble's trees. This splitting is guided by minimizing a strictly proper loss for a binary class probability estimation task.

The authors evaluate their method's performance against competing approaches across three main dimensions: the quality of synthetic data generation, missing data imputation, and density estimation.

Strengths

  • I find that the paper reads reasonably clearly. The language and notation used throughout the paper are very precise, although they occasionally appear overly complex.

  • The proposed method is interesting and I believe enjoys reasonable novelty.

  • The method and theoretical results all appear to be sound.

  • The proposed method's ability to estimate densities makes it more versatile than other generative methods commonly used for tabular synthetic data generation. Few methods can match its generality in handling various types of tabular data, except for other tree-based methods like Adversarial Random Forests (ARF). In this context, the positive results presented in comparison to ARF are promising.

  • The authors have made their implementation code available.

Weaknesses

While the authors adequately explain their ensemble tree-based model and its sampling and density computation methods, the explanation of the training algorithm is less thorough. There is insufficient discussion on how the method scales computationally with the number of trees and splitting iterations (see my first question). Additionally, more discussion on the motivation behind the formulation of the CPE task would be welcome.

According to the paper's Lemma 4.4, GF estimates a different density value at each non-empty intersection of the supports of each leaf, derived directly from the empirical training measure. This is similar to ARF or other works such as [1] or [2] when using uniform densities at the leaves, with the exception that these methods would output an averaged density estimated independently from the leaves of each tree in the ensemble. It is unclear why the proposed approach should be better. From a generalization perspective, I would intuitively assume that averaging densities from multiple trees has a regularizing effect compared to using a single tree (see [2]). On the other hand, GF's density estimation is based on a finer partitioning, whose cardinality grows with the product of the depths of all trees in the ensemble. While this could provide greater representational capacity, it would seem to make the method more susceptible to overfitting.

The theoretical analysis provided also focuses only on approximating this empirical training measure, without addressing overfitting and the approximation of the data-generating measure. For a model such as described above, it seems obvious that the less coarse the partitioning becomes, the more faithfully an empirical distribution can be approximated. I am therefore unsure of the practical significance of these results and I think the paper would benefit more from a discussion of generalization ability.

My main concern with the paper, however, lies with the evaluation and comparison to other methods:

  • For the many analyses presented in the main paper and the appendices, different methods are compared against, and for each method/scenario combination different datasets are used. This lack of consistency in comparisons and datasets in the presented results leaves an impression that results are missing.

  • The choice of datasets focuses on relatively small numbers of data points and mixes both synthetic toy data and real datasets together. This is compounded by some of the main results being presented only as summary data (#wins/#ties/#losses) over this heterogeneous group of datasets.

  • There is no serious attempt to compare the proposed method with state-of-the-art approaches for synthetic data generation, particularly neural network-based ones:

    • The comparison is limited to CT-GAN, which has been shown to underperform almost every other popular neural network method, including the TVAE method put forward as a baseline in the same paper.
    • Additionally, the authors do not tune any hyperparameters for CT-GAN beyond the number of epochs. As a GAN, CT-GAN can have convergence issues, and without proper tuning, it may fail to converge. This lack of tuning undermines the fairness of the comparison, especially since it is unclear how much the authors optimized the hyperparameters of their own method.
    • The authors also report unusually long training times on a very small dataset, presumably because no GPU is used to train these models and possibly because they train for too long to no avail. While I understand that the proposed method can be trained on a CPU alone, and maybe that is the point being made, I don't think this remark is made in good faith. In my experience, methods like CT-GAN, TVAE, and TabDDPM can achieve reasonable performance quickly (within minutes to tens of minutes) when trained on a consumer GPU (e.g., a 2080), even on larger datasets than those used in the reported experiments.
    • Results for CT-GAN on the two larger datasets are not reported, so the comparison primarily relies on very small datasets where neural networks might perform worse.

    A comparison with TabDDPM [3], a strong neural network-based generative method, would be valuable, as it is currently a standard benchmark in papers evaluating synthetic data and, in my opinion, a consistent contender for a state-of-the-art method.

  • The evaluation of generated samples relies solely on an Optimal Transport distance metric. Although I am not well-versed in the practical computation of this metric, the authors' comment about selecting an entropic regularization parameter that avoids numerical instabilities across all domains is not particularly reassuring. It raises the question of why other metrics were not used in conjunction.

    • Additionally, the setup described by the authors (GEN-DISCRIM), which uses Random Forests to distinguish real from synthetic data, appears both reasonable and well-founded. I am curious why this wasn't the primary evaluation method for comparing with other approaches. Instead, it is only used to compare GF against optimistic and pessimistic baselines.
    • On this topic, the authors might also want to look at, e.g., [4] for utility metrics in machine learning tasks that are common in comparing synthetic data generators, although I wouldn't personally fault them if they choose to disregard such metrics.
  • Finally, given that the proposed model is able to estimate densities and that evaluating the log-likelihood of held-out data is the gold standard for objectively measuring the quality of such models I don't understand why this evaluation doesn't have a more prominent place in the paper:

    • The authors only compare their method to KDE, which GF seems to underperform on several datasets, particularly those with a moderate number of dimensions. This is concerning, and I think it is important to investigate how the proposed method scales to higher-dimensional data compared to KDE.
    • ARF is missing as a baseline in these evaluations which is strange since it is another model that is capable of estimating densities. Additionally, other methods, such as [2], report consistently outperforming KDE, making it worthwhile to include as an additional comparison in my view.
    • Finally, the evaluation setup is not thoroughly explained in the main paper and only summary results against KDE are presented. It seems that the authors don't always use log-likelihood to compare these models due to their proposed model predicting 0 density for test samples (see question 3 below). This issue is only mentioned in the appendix but should, in my view, be disclosed in the main paper.

[1] Alvaro H. C. Correia, Robert Peharz, Cassio de Campos. Joints in Random Forests, 2020 (https://arxiv.org/abs/2006.14937)

[2] Hongwei Wen, Hanyuan Hang. Random Forest Density Estimation, 2022 (https://proceedings.mlr.press/v162/wen22c.html)

[3] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko. TabDDPM: Modelling Tabular Data with Diffusion Models, 2022 (https://arxiv.org/abs/2209.15421)

[4] Lasse Hansen, Nabeel Seedat, Mihaela van der Schaar, Andrija Petrovic. Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark, 2023 (https://arxiv.org/abs/2310.16981)

Questions

  1. In line 218 the authors mention regarding finding the optimal split in $\text{splitPred}$ that "the optimisation is no more computationally demanding" than for decision tree induction. I'm not sure if this is a typo but it seems to me that, as the trees in the ensemble grow deeper, finding the best split could quickly become intractable. As an example, in the case where only single stumps are used, the potential cardinality of $\mathcal{P}(\mathcal{T})$ at iteration $j$ is $2^j$, which means that intersections with that many subsets have to be considered when finding the best split point. This seems like an untenable scaling of the computation with the iteration number, but maybe I am misunderstanding something about how the best split is selected as the authors' explanation is not clear.

  2. Regarding density modelling, the authors only compare with KDE. Since ARF also provides density estimates, why not compare against this method as well? Additionally, [2] would also be worthwhile to compare against.

  3. Could the authors clarify what they mean by "expected densities" that are used "to cope with the eventuality of zero (pointwise) density predictions" in appendix section V.V.3.7? If their proposed model assigns 0 probability to a subset of the domain that has non-zero mass under the data generating distribution, I would say that this is a defect of the model and a sign of poor regularization. Changing from the commonly used log-likelihood metric to one that doesn't penalize this so harshly should be surfaced and discussed in the main text in my opinion. Furthermore, it seems that this choice is made on a dataset-by-dataset basis; I can only assume depending on whether the proposed model yields $-\infty$ log-likelihoods, which seems to happen for most non-toy datasets. As it stands, these details are hidden in the appendix while the main paper presents a potentially misleading picture based only on #wins.

  4. Still regarding 0 density predictions, the authors propose a way to avoid this issue for GFs in line 173. It would be interesting if this was explored further and the resulting (more regularized) density model compared to other approaches for density estimation based on log-likelihood.

[2] Hongwei Wen, Hanyuan Hang. Random Forest Density Estimation, 2022 (https://proceedings.mlr.press/v162/wen22c.html)

Limitations

I find that the authors hardly discuss the limitations of their method. Most importantly, there is little discussion of training times and how they scale with the number of iterations/data dimensions/number of samples. There is also no discussion of how performance and generalization are expected to scale with these factors.

Author Response

We thank the reviewer for pointing out the soundness, novelty and versatility of our approach. We must confess we found the review a tad offensive at times [kTDz-D] [kTDz-H], accusing us of a complete lack of seriousness in some of our experiments, of not acting in good faith elsewhere, and of strange missing parts elsewhere. It is hard to swallow, not just because of the language implying we did not meet the ethical standards of scientific papers, but also because it sounds like an a priori guilty verdict that comes even before any explanation of ours. We apologize if our tone sounds "combative" at times.

We proceed in order of the review, with tags to help cross-referencing and navigation. Due to the length of our rebuttal, we put the answers to the "Questions" field here and postpone the rest of our rebuttal to additional comments.

Questions

In line 218 [...] I'm not sure if this is a typo but it seems to me that, as the trees in the ensemble grow deeper, finding the best split could quickly become intractable. [...]

[kTDz-I] Wrong: to find the best split, we only need to keep those elements in $\mathcal{P}(\mathcal{T})$ with $>0$ empirical measure, because the rest has zero measure (and the related calculations to find the best split are trivial for all those). So we only have to store the elements with $>0$ empirical measure, and their number is bounded by the size of the training sample. It has also been implemented this way (please check the commented code; this is the purpose of the class MeasuredSupportAtTupleOfNodes). In addition, as we get deeper, the computation can in fact be much quicker, because it also depends on the size of the training sample in the elements used to compute the effect of the split, and this size obviously decreases through splitting.

It is no more computationally demanding than DT induction because we then only need to parse the whole set of training points at the leaf and check whether they go left or right, which is exactly the same procedure as for DT induction. At the end, we need to split the corresponding elements of $\mathcal{P}(\mathcal{T})$ whose support is included in the leaf, instead of a single one for a DT, but these cannot number more than the size of the training sample at the leaf and thus cost in general less than the splitting test itself, for a global complexity that remains O(DT induction) (sketched). We would be happy to make this more explicit.
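To illustrate the bookkeeping, here is a minimal sketch of the idea (our illustrative Python, not the paper's Java implementation; names such as `PartitionElement` and `split_leaf` are ours, the actual code uses the class MeasuredSupportAtTupleOfNodes): each stored element of $\mathcal{P}(\mathcal{T})$ keeps the indices of the training points it contains, so the number of stored elements never exceeds the training sample size, and splitting a leaf only touches the elements supported by that leaf.

```python
# Minimal sketch of the bookkeeping idea (illustrative only, not the authors' Java code).
# A partition element keeps (i) the tuple of leaves (one per tree) defining its support
# and (ii) the indices of training points falling in it. Empty intersections are never stored.

from dataclasses import dataclass, field

@dataclass
class PartitionElement:
    leaves: tuple                                  # one leaf id per tree in the forest
    indices: list = field(default_factory=list)    # training points in the support

def split_leaf(elements, tree_id, leaf_id, goes_left):
    """Split a leaf of one tree; only elements supported by that leaf are touched.

    `goes_left(i)` tells whether training point i satisfies the new split predicate.
    The total number of stored elements stays bounded by the training sample size,
    because elements with no training point are dropped.
    """
    new_elements = []
    for el in elements:
        if el.leaves[tree_id] != leaf_id:
            new_elements.append(el)                # untouched: support outside the split leaf
            continue
        left = [i for i in el.indices if goes_left(i)]
        right = [i for i in el.indices if not goes_left(i)]
        for side, idx in (("L", left), ("R", right)):
            if idx:                                # keep only non-empty intersections
                leaves = list(el.leaves)
                leaves[tree_id] = (leaf_id, side)
                new_elements.append(PartitionElement(tuple(leaves), idx))
    return new_elements
```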

See also [6MHb-A].

Regarding density modelling the authors only compare with KDE. Since ARF also provides density estimates why not compare against this method as well? 

We hope we answered in [kTDz-H].

Could the authors clarify what they mean by "expected densities" that are used "to cope with the eventuality of zero (pointwise) density predictions" in appendix section V.V.3.7? [...] this is a defect of the model and a sign of poor regularization. 

Indeed, but this happened with our contender, not our technique: for generative forests, a mechanism prevents this, which we have implemented (see L173-L174), and for ensembles of generative trees (Section IV) this situation is impossible (L672-L673). We can make this clearer using part of the +1 page space.

[...] As it stands, these details are hidden in the appendix while the main paper presents a potentially misleading picture based only on #wins.

We agree but this comes as a consequence of (i) us adapting to our contender and (ii) us having to summarize all of this in a short format. See also [kTDz-C]. Since there is +1 page in the camera ready, we would be happy to oblige in moving more details to main file.

Still regarding 0 density predictions, the authors propose a way to avoid this issue for GFs in line 173. It would be interesting if this was explored further and the resulting (more regularized) density model compared to other approaches for density estimation based on log-likelihood.

Apart from indeed implementing this mechanism, we have not, for a simple reason: we did not optimize our models for density estimation because it was a side task (our "motto" could be "train a generative model, get missing data imputation and density estimation for free"). We are however convinced that there would be further optimization possible for the density estimation metrics that would most surely get better results (we can elaborate more in the camera ready using the +1 page)

Comment

[...] This is similar to ARF or other works such as [1] or [2] when using uniform densities at the leaves [...]

[kTDz-A] Even with this simplification of the models, this would not hold unless the trees used in ARF, [1] and [2] are very big. Consider for example Table 1 in our paper and the 50-stump model we obtain. As far as we can tell, each of the other methods would have to learn very big models to achieve the contrast we get between high and low densities (otherwise, the convex combination of densities, as e.g. in ARF, would lead to a very "blurry" solution -- extreme case for such methods: datasets with functional, non-stochastic dependences among variables, see [kTDz-B]). Hence, each tree would have to separately be a good model (and thus big), which is not our case. Our models could be very small.

This leads us to the formal standpoint: none of the methods cited have convergence rates; [1] and ARF stick to behaviour in the limit (consistency). For [2], approximation is guaranteed with respect to the sample size and not the model size (hence, to get a better model, one needs a strictly diverging sample size). Note also the restrictive assumptions made in all these papers to get the results (on the target density for [2] and ARF, on the model size and structure for [1]). We argue that our Weak Learning Assumption is much weaker and that it yields rates (on training). We can only stress the need for training rates (convergence), not just for their importance "per se" but also because they are not known for so many existing techniques.

[kTDz-A2] Regarding generalization, we disagree, both from the intuitive and the formal standpoint. From the intuitive standpoint, the claimed regularization effect can be sufficient to get good results but it is in no way necessary. From the formal standpoint, it is in fact not hard to show that our algorithm is consistent. A straightforward way to show it in the setting of [1] would be to use our Lemma 4 and then use the regularity assumptions about the model in [1] to show the (almost sure) convergence of our model to the true one, using e.g. the $\ell_1$ norm as in [1]. Note that more sophisticated proofs relying on different / weaker assumptions are possible. Our purpose is to give one such example and show that we can get the same asymptotic results as the papers cited in the review. Those being the references in the review's content, we hope this is sufficient to address the "generalization" issue the review mentions there. We would be happy to extend this using the +1 page in the camera-ready.

For the many analyses presented in the main paper and the appendices, different methods are compared against and for each method/scenario combination different datasets are used. This lack of consistency in comparisons and datasets in the presented results, leaves an impression that results are missing.

[kTDz-B] We do agree with the reviewer, but this essentially comes as the consequence of having experiments on three different settings -- data generation, missing data imputation, density estimation -- with five+ different contenders (as an example, none of the papers cited by the reviewer have this level of complexity in experiments). Each of the contenders has constraints in the form of the data used, which can depend on the setting. We refer to the Appendix (V.2) and the related papers for a discussion of settings and of the restrictions / bugs that we eventually experienced. The datasets that we kept were definitely worth keeping for at least one setting: for example, we could not run kc1 on ARF (thus, no "lifelike" results for kc1), yet the 2D heatmaps we get in Figure 3 definitely show that our models can model deterministic (functional, non-stochastic) dependences as well.

The choice of datasets focuses on relatively small numbers of data points and mixes both synthetic toy data and real datasets together. This is compounded by some of the main results being presented only as summary data (#wins/#ties/#losses) over this heterogeneous group of datasets.

[kTDz-C] We are not sure this is a criticism: the small number of data points (see e.g. Table 2) was only picked to show that a simple guess of our algorithm's parameters can beat a number of parameterisations of other algorithms. Summary data is just a consequence of the numerous experiments (settings x contenders) that we have run, which need to fit within the paper's constraints (we can only point to the suggestions of other reviewers to add even more experiments [ALL]). Indeed, we have considered heterogeneous datasets, but the main reason was only to show results on a wide range of datasets instead of focusing on a single category (e.g. real-valued) of datasets. We wanted to provide a fair "panoramic" picture of the results of our algorithm. See [ALL] for a proposal to all reviewers on additional experiments taking the +1 page into account.

Comment

There is no serious attempt [...] I don't think this remark is made in good faith [...] might perform worse [emphases ours].

[kTDz-D] We must confess we found these lines of comment a tad offensive. Note that the request thus addresses 1/3 of our settings and 1/5+ of our algorithms.

  • First, we are surprised by the comment: "The comparison is limited to CT-GAN, which has been shown to underperform almost every other popular neural network method, including the TVAE method put forward as a baseline in the same paper.", when in fact the CT-GAN paper, which indeed introduces CT-GAN and TVAE, says quite the opposite: "CTGAN achieves competitive performance across many datasets and outperforms TVAE on 3 datasets." (page 2). This is why we used CT-GANs.

  • Second, we are surprised by the next comment: "Additionally, the authors do not tune any hyperparameters for CT-GAN beyond the number of epochs. As a GAN, CT-GAN can have convergence issues, and without proper tuning, it may fail to converge. This lack of tuning undermines the fairness of the comparison, especially since it is unclear how much the authors optimized the hyperparameters of their own method."

[kTDz-E] The reason why a reader might have the impression that we do not comment on our hyperparameter tuning is that there was essentially none: we have just two parameters to tune for our algorithm ($J$, $T$). Each of our experiments is done using essentially a single choice of parameters, possibly shared among experiments. To make sure we would not disadvantage other contenders, we made sure to test them on a range of their parameters (number of trees for ARF, of epochs for CT-GAN, kernels for KDE, etc.). The reviewer may conclude that this is not enough for a fair comparison and that we should have tuned CT-GAN more to get better results, but then one may ask: given the results that we have, isn't a method with almost no necessary hyperparameter tuning better than one that requires heavy (dataset-dependent) machinery to tune its own?

  • Third, we were also surprised by the comment on times: "The authors also report unusually long training times on a very small dataset, presumably because no GPU is used to train these models and possibly training for too long to no avail. While I understand that the proposed method can be trained on a CPU alone, and maybe that is the point being made, I don't think this remark is made in good faith. In my experience, methods like CT-GAN, TVAE, and TabDDPM can achieve reasonable performance quickly (within minutes to tens of minutes) when trained on a consumer GPU (e.g., 2080) even on larger datasets than those used in the reported experiments."

[kTDz-F] We understand the reviewer is a strong proponent of CT-GAN and we respect that. Our only remark on times is in L850-L851 (we assume it is the one that led to the comments). Our point was not to discredit any (CT-GAN or other) NN algorithm based on times. Our paper is not a paper competing on computers "running on steroids" (for whatever specs that would represent); we never write so, and a short glimpse at the computers used (L758-L760) shows that it could not have been the case. We know that different techniques may require different hardware to run optimally -- this is why we eventually did not provide times, to avoid such discussions. And note that such discussions would be useless in our case: on all our datasets, our algorithm was deliberately run on the "low-end" Laptop (not even on our desktop!) because we had the constraint of being able to train our algorithms on very simple hardware. CT-GANs were trained on our (more powerful) desktop. We do not want to suggest adding this before L850, but we are confident that with this knowledge the reviewer will reconsider their comment about us not acting in good faith. If computation times are a sensitive topic, then we can just remove them -- this will leave space to treat questions from all reviewers -- or make a general statement to the effect that running times depend on the hardware used.

  • Fourth and last, we reply to the comment: "Results for CT-GAN on the two larger datasets are not reported, so the comparison primarily relies on very small datasets where neural networks might perform worse.". Indeed... but Table A6 explains that we had issues getting results on some datasets. This is not a criticism of CT-GAN's implementation. The reviewer may notice that VCAEs and ARFs also did not provide results sometimes (Table A7, Section V.2).
Comment

A comparison with TabDDPM [3] [...]

[kTDz-G] We value this opinion but we would like the reviewer to consider that we have, at this stage, five (5) contenders in our experiments (in fact, 7 according to other metrics, see Sk9u). This suggestion adds one more and, in fact, aggregating the suggestions of the other reviews [ALL], we would get a total of 12 contenders. Consider that the TabDDPM reference [3] contains comparisons with three contenders (CTABGAN+ is an extension of CTABGAN). At this stage, the reviewer already criticizes the sheer amount of results that forces us to cram a lot in a small space [kTDz-C]. Considering also that CT-GAN is arguably among the most cited tabular + NN generative methods, we would prefer to stick to our choices of contenders. This being said, we are happy to extend our list of contenders to a family of techniques not yet included in our benchmark, see [ALL] -- should all reviewers concur, of course.

The evaluation of generated samples relies solely on an Optimal Transport distance metric. [...] entropic regularization parameter [...]

[kTDz-F] This is a difficult question, not in technical terms (we regularize the OT metric with an entropic term because it considerably speeds up computations -- it is in fact standard now, as can be seen from the many papers citing Cuturi's paper [8], and it is largely necessary because of the computational burden of computing OT without it, see our rebuttal to Sk9u). It is difficult because, as the reviewer points out, there comes the question "why not another metric"? To compare densities, one has essentially three generic categories of "distortion measures" (some are not metrics): (i) based on the density (e.g. f-divergences), (ii) based on the parameters (e.g. Bregman divergences) and (iii) based on the support (e.g. OT). Of course, one can "break down" the computation of the metric into several ones, e.g. depending on features, but we wanted to avoid "hybrid" ones that are ultimately harder to explain. The interest of an OT metric is that it is computed using the "full" support of the distribution (a Bregman divergence depends on parameters = "aggregates"), it does not require absolute continuity assumptions (unlike f-divergences), and it is computed using a cost metric over this whole support (no part of the space "left behind"). Since we learn generative models, it looked like a maximally fine-grained way to assess the distribution learned.
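For concreteness, a minimal numpy sketch of the entropic-regularized OT computation (Cuturi-style Sinkhorn) that this discussion refers to; the cost matrix, regularization strength `reg` and uniform weights below are illustrative assumptions, not the exact setup of our experiments.

```python
import numpy as np

def sinkhorn_distance(X, Y, reg=1.0, n_iter=500):
    """Entropic-regularized OT cost between two samples (Cuturi-style Sinkhorn).

    X: (n, d) real sample, Y: (m, d) generated sample; uniform weights assumed.
    `reg` is the entropic regularization strength; too small a value can underflow
    the Gibbs kernel, which is the numerical-stability issue mentioned above.
    """
    n, m = X.shape[0], Y.shape[0]
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    # Pairwise squared Euclidean costs (features assumed numerically encoded / scaled).
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / reg)                     # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                  # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # regularized transport plan
    return float((P * C).sum())              # transport cost under that plan

# Example usage on toy 2D samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 2))
fake = rng.normal(loc=0.3, size=(200, 2))
print(sinkhorn_distance(real, fake))
```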

Additionally, the setup described by the authors (GEN-DISCRIM), [...] I am curious why this wasn't the primary evaluation method for comparing with other approaches.

[kTDz-G] Because all domains used for GEN-DISCRIM had to be supervised domains, i.e. we could not use it in general.

The authors only compare their method to KDE, which GF seems to underperform on several datasets [...]

[kTDz-H] We quote our introduction (emphases ours): "A GF can thus be used to generate data, but can also be used for side tasks like missing data imputation or density estimation.". Density estimation is a side task, one of the three tasks we considered in total. All papers cited by the reviewer in [1-4] focus on exactly one task, with the exception of Kotelnikov et al., which focuses on two, but the "side task" involves one contender only. We did investigate those side tasks because our generative models allow us to treat them "for free" (no additional training) and it would have been a pity not to see how they fare against techniques specifically designed for such tasks. Our objective was not to be better than techniques that were designed for the side tasks (think "no free lunch") but rather to fairly locate our technique with respect to well-known contenders. Again, keep in mind that we can treat those side tasks "for free". We were very pleased to see that we can compete very efficiently against KDE at least sometimes. We also invite the reviewer to check from L750-L752 that we did some tuning of KDE to select the best kernel on our data.

ARF is missing as a baseline in these evaluations [...] which is strange [...]  Additionally, other methods, such as [2] [...] worthwhile to include (emphasis ours). 

[kTDz-H] So far in the review, these are the fourth and the fifth baselines that we are being asked to add. There are two reasons why we did not use FORDE (such is the name). First, FORDE is just a wrapper around FORGE, which we use as a contender for synthetic data generation; since we substantially beat FORGE, we were concerned that beating FORDE would be, or would be seen as, unfair. Second, we make it clear that our objective was to get a panorama of many different techniques. This is, we believe, a strong selling point of our paper, exposed from the abstract: we do not just focus on comparing against neural nets or tree-based approaches. KDE is a very widely used family of techniques that brought kernels to our comparison. We believe this explains why we did not consider [2] either.

Comment

I thank the authors for their detailed response. I also apologize for my language but I want to reassure the authors that they are extending their interpretation of what I wrote far past what I actually wrote or ever meant to imply.

In order to be mindful of the authors' time I will focus the discussion on the main concerns that are stopping me from increasing my score. Most of these relate to the evaluation setup. I will also note that I had no context on what other reviewers would ask and I do sympathize with the authors but I need to evaluate the paper based on what is currently there and my opinion remains that the experimental setup is inadequate.

(1) Comparisons on Synthetic data generation

The authors claim that their main task is synthetic data generation and use that to justify not comparing with ARF (FORDE) on density estimation. I would expect any paper to compare to the best available methods especially on their main task but that is not the case here. Even if not TabDDPM, I believe TVAE or perhaps CTAB-GAN would provide a bigger challenge than CTGAN, particularly given that only the number of epochs was tuned.

(2) Single metric for evaluating synthetic data generation

  1. I remain somewhat skeptical of the OT metric used for synthetic data evaluation because it requires defining a distance in the domain of the data. It is not obvious to me how to meaningfully define one on a domain with differently-scaled numerical features and categoricals. I would therefore be much more comfortable if there was at least another metric used for comparison.

  2. I still don't understand why GEN-DISCR can only be used in supervised tasks. I assumed that L901 referred to the pipeline in the ARF paper and that the authors' pipeline only involved training discriminators to distinguish between real and fake data (L895-L896). If not GEN-DISCR, at least some ML utility metric.

(3) Density comparisons

  1. I would not frame ARF (FORDE) as an additional evaluation since the authors already compare to ARF for sample generation (FORGE). This sets my expectation that this comparison should have been done in the original paper despite the authors' claim that it is a side quest. Even looking at the documentation for the ARF package, it is necessary to estimate a density model (FORDE) first in order to then run sample generation (FORGE).

  2. I also want to remark that the authors beat FORGE on the OT metric used for synthetic evaluation. In my opinion, a density evaluation based on log-likelihoods would be that much more valuable as it offers a different type of metric, and one that is much less open to criticism for this type of data.

  3. Synthetic data for ARF and GF is just sampling from a density model. Therefore comparing the density models directly makes a lot more sense in my opinion.

(4) Scalability

My concerns with scalability were mostly addressed by the authors. However, I would very much welcome if the authors were able to share some actual training times for their algorithm as other reviewers also requested (Sk9u). I am particularly interested in seeing how these times vary with the size of the datasets.

(5) Results Reporting

I agree with the authors' own statement that they cram a lot of results into a small allocated space but, in my view, the problem is not too many results but too little space allocated for reporting them. I particularly do not agree with the pooling of toy datasets and real datasets into the same aggregated results.

(6) Minor point

I appreciate the authors' clarification on the OT metric. I wonder however if regularization is really necessary with such small datasets. Couldn't the assignment problem be solved exactly, avoiding a regularization parameter that will only raise further questions?

Comment

I really appreciate the authors' efforts to resolve some of my main concerns, particularly those regarding the use of a single evaluation metric and the need for stronger comparisons.

However, the authors only provide these new metrics for CTGAN and the new Forest-flow method. It would be valuable to see the same comparison against ARF, given that the authors themselves identify it as the primary competing method.

Comment

Thank you for the ARF results.

While I still have some reservations about the paper, I am increasing my score to a 5 to reflect the reduced set of concerns.

Author Response

To all reviewers

[ALL] We warmly thank all reviewers for their work and for all rating our paper's contribution as generally "good or excellent". Many questions have been asked, and we see it as a sign of the value found in our work that so many additional contenders were proposed (TVAE, TabDDPM, ARF-FORDE, RFDE, Forest-Flow), on top of the five (5) we already have (in fact, 6 if we include the generative trees of [29] -- ref in our paper -- and 7 if we add COPY in GEN-DISCRIM), across our three (3) distinct experimental settings. We hope it is obvious that it would be hard to cram substantially more content into our paper, as it has already been remarked that its current state can be hard to parse at points in the experiments [kTDz-C]. However, since there is a +1 page in the camera-ready, we propose to add Forest-Flow, because it is another kind of model that does not belong to the categories of models we have compared against (mixing elements of trees, neural nets, kernels or graphical models). The paper on Forest-Flow clearly mentions that their results are worse than MICE on missing data imputation; since we are competitive with MICE, we would be happy to instead compare with Forest-Flow on data generation.

We would like to stress the need to keep the current formal part in its current state and not reduce it: for example, none of the papers cited by the reviewers have convergence rates on training. We would also like to add a few comments on the statistical consistency of our method [kTDz-A2].

Tags like [ALL], [kTDz-C] have been put for easy search in the page's content.

Comment

In our rebuttals to reviews uJTo and 6MHb, we provided tags pointing to Official Comments we lodged to reviewer kTDz. After double checking, we realized those Official Comments were visible only to reviewer kTDz.

We have hopefully fixed the problem and all tags should now be visible. We sincerely apologize to reviewers for this mistake in lodging those.

Comment

Results for GF (us) vs CT-GAN ($E$ = 1000 epochs, ct) on Coverage, Density, F1, shown as mean (std dev).

p-values smaller than 0.01 are replaced by a symbol in color, ${\textcolor{green}{\blacktriangle}}$ meaning we significantly beat CT-GAN, ${\textcolor{red}{\blacktriangledown}}$ meaning CT-GAN significantly beat us.

Explanations, comments, discussion, see "Explanations related to Experiments [Exp-1], [Exp-2] and [Exp-3] (summary: more contenders, more metrics)" for details.

Results with $J$=2000 splits for $T$=500 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(ct) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ct) | pval | F1$\downarrow$(us) | F1$\downarrow$(ct) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.968 (0.010) | 0.798 (0.050) | ${\textcolor{green}{\blacktriangle}}$ | 1.031 (0.049) | 0.757 (0.025) | ${\textcolor{green}{\blacktriangle}}$ | 0.070 (0.030) | 0.100 (0.073) | 0.40 |
| circgauss | 0.945 (0.015) | 0.734 (0.041) | ${\textcolor{green}{\blacktriangle}}$ | 0.993 (0.045) | 0.401 (0.033) | ${\textcolor{green}{\blacktriangle}}$ | 0.514 (0.529) | 0.746 (0.033) | ${\textcolor{green}{\blacktriangle}}$ |
| gridgauss | 0.975 (0.009) | 0.828 (0.053) | ${\textcolor{green}{\blacktriangle}}$ | 0.995 (0.077) | 0.649 (0.058) | ${\textcolor{green}{\blacktriangle}}$ | 0.017 (0.015) | 0.034 (0.011) | 0.21 |
| randgauss | 0.961 (0.012) | 0.659 (0.049) | ${\textcolor{green}{\blacktriangle}}$ | 0.979 (0.023) | 0.582 (0.035) | ${\textcolor{green}{\blacktriangle}}$ | 0.014 (0.013) | 0.079 (0.053) | 0.06 |
| winered | 0.962 (0.028) | 0.808 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.961 (0.026) | 0.589 (0.129) | ${\textcolor{green}{\blacktriangle}}$ | 0.459 (0.031) | 0.654 (0.030) | ${\textcolor{green}{\blacktriangle}}$ |
| winewhite | 0.963 (0.002) | 0.894 (0.026) | ${\textcolor{green}{\blacktriangle}}$ | 0.989 (0.017) | 0.953 (0.035) | 0.05 | 0.492 (0.037) | 0.581 (0.043) | ${\textcolor{green}{\blacktriangle}}$ |

Results with $J$=500 splits for $T$=200 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(ct) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ct) | pval | F1$\downarrow$(us) | F1$\downarrow$(ct) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.964 (0.022) | 0.798 (0.05) | ${\textcolor{green}{\blacktriangle}}$ | 0.999 (0.055) | 0.757 (0.025) | ${\textcolor{green}{\blacktriangle}}$ | 0.060 (0.047) | 0.100 (0.073) | 0.75 |
| circgauss | 0.954 (0.014) | 0.734 (0.041) | ${\textcolor{green}{\blacktriangle}}$ | 0.968 (0.027) | 0.401 (0.033) | ${\textcolor{green}{\blacktriangle}}$ | 0.520 (0.021) | 0.746 (0.033) | ${\textcolor{green}{\blacktriangle}}$ |
| gridgauss | 0.964 (0.006) | 0.828 (0.053) | ${\textcolor{green}{\blacktriangle}}$ | 0.953 (0.062) | 0.649 (0.058) | ${\textcolor{green}{\blacktriangle}}$ | 0.034 (0.010) | 0.034 (0.011) | 0.98 |
| randgauss | 0.954 (0.010) | 0.659 (0.049) | ${\textcolor{green}{\blacktriangle}}$ | 0.936 (0.029) | 0.582 (0.035) | ${\textcolor{green}{\blacktriangle}}$ | 0.020 (0.005) | 0.079 (0.053) | 0.08 |
| winered | 0.954 (0.013) | 0.808 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.874 (0.042) | 0.589 (0.129) | ${\textcolor{green}{\blacktriangle}}$ | 0.503 (0.023) | 0.654 (0.030) | ${\textcolor{green}{\blacktriangle}}$ |
| winewhite | 0.952 (0.008) | 0.894 (0.026) | ${\textcolor{green}{\blacktriangle}}$ | 0.940 (0.019) | 0.953 (0.035) | 0.50 | 0.510 (0.027) | 0.581 (0.043) | ${\textcolor{green}{\blacktriangle}}$ |

Comment

This Comment addresses the following reviewers' requests:

  • uJTo: comments on ``Novelty''
  • kTDz: review on the need to discuss generalization ability and our reply [kTDz-A2]

We use the shorthands for papers:

[cpdJI] A.H.C. Correia, R. Peharz and C. P. de Campos. Joints in Random Forests. NeurIPS 2020

[llmDT] J.S. Leboeuf, F. LeBlanc and M. Marchand. Decision trees as partitioning machines to characterize their generalization properties. NeurIPS 2020

[lnCO] G. Lugosi and A. Nobel. Consistency of Data-driven Histogram Methods for Density Estimation and Classification. Ann. of Stats. 1996

We provide a consistency result of exactly the same nature as [cpdJI, Theorem 2]. Like in [cpdJI,lnCO], we assume for simplicity that the domain is $R^d$. Hereafter, the size $|\Upsilon|$ of a (binary) tree $\Upsilon$ is its total number of nodes. A key feature of our models that allows us to branch directly on the proof of strong consistency of [lnCO] is that a GF has density equal to the empirical density measured in each bin of the partition it induces. In [lnCO]'s notations, this is the empirical histogram $f_n$. We need one Lemma that connects our GFs with classical binary trees.

Lemma A: The partition achieved by a GF with trees $\Upsilon_1, \Upsilon_2, ..., \Upsilon_T$ can be achieved with a single tree $\Upsilon_*$ of size at most $\prod_i |\Upsilon_i|$.

Proof: the proof is simple and constructive. Suppose without loss of generality that $T=2$. Add one copy of $\Upsilon_2$ at each leaf of $\Upsilon_1$. The partition generated is exactly the same as that of the GF (L128-L129), and the size of this tree is no more than $|\Upsilon_1| \cdot |\Upsilon_2|$. The construction generalizes to more than 2 trees [end of the proof of Lemma A].
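For intuition, a tiny self-contained sketch of this grafting construction (our own toy encoding of trees as nested tuples, purely illustrative): grafting a copy of the second tree under every leaf of the first yields a single tree whose leaves index exactly the pairwise intersections of leaf supports.

```python
# Toy illustration of Lemma A's construction. A tree is either a leaf label (str)
# or a tuple (split, left, right). Grafting tree2 at every leaf of tree1 gives a
# single tree with at most |leaves(tree1)| * |leaves(tree2)| leaves, one per intersection.

def leaves(tree):
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves(left) + leaves(right)

def relabel(tree, prefix):
    # Tag every leaf of `tree` with the leaf of tree1 it is grafted under.
    if isinstance(tree, str):
        return f"{prefix}∩{tree}"
    split, left, right = tree
    return (split, relabel(left, prefix), relabel(right, prefix))

def graft(tree1, tree2):
    # Replace every leaf of tree1 by a (relabelled) copy of tree2.
    if isinstance(tree1, str):
        return relabel(tree2, prefix=tree1)
    split, left, right = tree1
    return (split, graft(left, tree2), graft(right, tree2))

t1 = ("x1<0", "A", "B")                 # 2 leaves
t2 = ("x2<0", "C", ("x2<1", "D", "E"))  # 3 leaves
print(leaves(graft(t1, t2)))            # 6 leaves: A∩C, A∩D, A∩E, B∩C, B∩D, B∩E
```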

Suppose without loss of generality that all trees have the same size, $N$, so that $|\Upsilon_*| \leq N^T$. Like in [cpdJI,lnCO], proving consistency involves computing and bounding a combinatorial parameter of the model class, the growth function $\Delta^*$ (see Definition 1 in [cpdJI] or Section 2 in [lnCO]). Lemma A tells us that the growth function of the trees in a GF is bounded by that of a single tree partitioning the domain. Proposition 2 in [cpdJI] bounds the growth function for any such tree. The VC dimension of size-$N$ decision trees is $O(N \log N)$ [llmDT]. Using [cpdJI], we obtain from Sauer's Lemma the bound $\Delta^* \leq m^{CN^{2T}\log N}$.

There are three conditions to get to strong consistency in [lnCO] (Theorem 1). Condition 2 reads $(1/m) \log \Delta^* \rightarrow 0$ (as the training size $m \rightarrow +\infty$). To satisfy Condition 2, we thus need our model parameters to grow slowly enough as a function of $m$, namely $(N_m^{2T_m}\log N_m \cdot \log m)/m \rightarrow 0$ (indexes are used as in [cpdJI] to indicate the dependence of the model parameters on $m$). This is similar to condition i) in [cpdJI, Theorem 2].

Condition 1 imposes that the maximal empirical mass in one element of the partition must also vanish with $m$. In our case, it is guaranteed by condition (c) in our weak learning assumption and the fact that our algorithm splits the heaviest leaf at each iteration.

Condition 3 imposes that the true mass in any "big" element of the partition must also vanish with $m$. This is also condition ii) in [cpdJI, Theorem 2] (our formulation would be no different; we spare the reviewers additional notation).

[Consistency result 1] Hence, we are able to prove, under the same conditions as Theorem 2 in [cpdJI], that, with our notations and those of [lnCO], $\int |{\color{red} R}(x) - {\color{blue} G}(x)|\,dx \rightarrow 0$ with probability 1.

(the statement + explanations can fit in 1/2 page; the proof could go in appendix)
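For readability, a compact restatement of the above (our paraphrase, with $\widehat{\mu}_m$ our shorthand for the empirical measure on the cells of $\mathcal{P}(\mathcal{T})$, and the bound on $\Delta^*$ taken from the Sauer-type argument above):

```latex
% Compact paraphrase of the three sufficient conditions of [lnCO], Theorem 1,
% instantiated with the growth-function bound derived above.
\begin{align*}
\text{(Cond. 2)}\quad & \frac{1}{m}\log \Delta^*
  \;\leq\; \frac{C\, N_m^{2T_m} \log N_m \, \log m}{m} \;\longrightarrow\; 0,\\
\text{(Cond. 1)}\quad & \max_{A \in \mathcal{P}(\mathcal{T})} \widehat{\mu}_m(A) \;\longrightarrow\; 0
  \qquad \text{(from condition (c) of the WLA + splitting the heaviest leaf)},\\
\text{(Cond. 3)}\quad & \text{the true mass of any ``big'' cell of the partition vanishes with } m,
\end{align*}
% which together give  \int |R(x) - G(x)| dx -> 0  with probability 1.
```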

Comment

This Comment addresses the following reviewers' requests:

  • uJTo: request to compare vs Forest-Flow
  • Sk9u: request to compare vs Forest-Flow
  • kTDz: points (1) (Jolicoeur-Martineau et al. substantially beat TVAE and CTAB-GAN) and (3.2) (more evaluation metrics, time), request to compare vs ARF on new metrics

We use the shorthands for papers:

[jfkGA] A. Jolicoeur-Martineau, K. Fatras and T. Kachman. Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees. AISTATS24

[noucyRF] M.F. Naeem, S.J. Oh, Y. Uh and Y. Choi. Reliable Fidelity and Diversity Metrics for Generative Models. ICML20

For the data generation experiment, [Exp-1] and [Exp-2] provide experiments vs Forest-Flow and/or on new metrics.

Forest-Flow: we used the R code provided in [jfkGA], run on the same machine on which we ran our experiments (the Laptop). Parameters were set following the default values in https://htmlpreview.github.io/?https://github.com/SamsungSAILMontreal/ForestDiffusion/master/R-Package/Vignette.html. We also followed the recommendation that fixing 50 trees per XGBoost model leads to state-of-the-art performance [jfkGA, Section 3.2]. With trees of depth at most 7, this represents a maximum of $J$=6,750 splits per model used. Times displayed are times to learn a model on each fold. Forest-Flow requires numerically reencoding domains with categorical/text values, whereas all the contenders we used so far process our domains "natively", without reencoding. To avoid encoding bias affecting results, we have run Forest-Flow on the fully numerical domains of our benchmark (also, the computation of some of our new metrics is better suited to fully numerical domains). This included domains with binary outcomes encoded as 0/1.

CT-GAN: we considered the domains in the intersection of Table A6 and those we used for Forest-Flow.

ARF: we considered the domains in the intersection of Table 2 and those we used for Forest-Flow.

Metrics: we added three metrics to our synthetic data generation experiment:

  • The Coverage and Density metrics from [noucyRF], which are better alternatives to precision and recall for generative model evaluation. We computed those metrics following the paper, using the training / test samples as built from our experiments; in the $k$-NN rule, we picked $k=5$ following the suggestion in [noucyRF, Section 3.4] (a minimal sketch of the computation is given after this list).
  • The F1 metric from [jfkGA], which is $F1_{disc}$ in their experiments. Since our domains are fully numerical, we used for prediction a $k$-NN rule with the same $k$ as in our Coverage + Density experiment.
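A minimal numpy sketch of how the Coverage and Density metrics of [noucyRF] can be computed with a $k$-NN rule (our illustrative implementation, not the exact code we ran; distances are Euclidean and features are assumed already numerically encoded and scaled):

```python
import numpy as np

def density_coverage(real, fake, k=5):
    """Density and Coverage of [noucyRF], computed with k-NN balls around real samples.

    real: (N, d) array of real samples, fake: (M, d) array of generated samples.
    For each real sample we take the radius of its k-th nearest real neighbour;
    Density counts how often fake samples fall inside these balls (normalized by k*M),
    Coverage is the fraction of real samples whose ball contains at least one fake sample.
    """
    # Pairwise distances real-real and real-fake.
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    # Radius of the k-NN ball around each real sample (column 0 is the self-distance 0).
    radii = np.sort(d_rr, axis=1)[:, k]
    inside = d_rf <= radii[:, None]          # (N, M): fake j inside ball of real i
    density = inside.sum() / (k * fake.shape[0])
    coverage = inside.any(axis=1).mean()
    return density, coverage

# Example usage on toy 2D samples.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))
fake = rng.normal(loc=0.2, size=(500, 2))
print(density_coverage(real, fake, k=5))
```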

We also ran those metrics on our full set of experimental results of CT-GAN and ARF. We only provide the results for 1000 epochs for CT-GAN and $T$=200 trees for ARF for readability, but we have computed all other results as well (as in Tables 2 and A6).

The parameters we fix for our models are the same as in our submission ($T=500, J=2000$ as in Table 2; $T=200, J=500$ as in Table A8).

Discussion:

  • vs Forest-Flow: there seem to be two different behaviours. On "small" domains ($m\times d$ small, $d$ small, our simulated domains), there is a slight tendency for our method to beat Forest-Flow on Sinkhorn, Coverage and F1, whereas the tendency is reversed for Density. On larger domains (e.g. artificial_characters, $m \times d >$ 80K), it seems we do significantly beat Forest-Flow by large margins ($p<0.01$ for most metrics, even with our small models). Also, execution times are slightly to our advantage (more so if we consider our smallest models). Caution is advised when comparing times since Forest-Flow is implemented in R and our code is in Java.

  • vs CT-GAN: we had to remove artificial_characters from our benchmark because of issues with CT-GAN (Table A6 in our paper). We also do not report times for comparison because the algorithms were run on different computers. The conclusion follows the broad lines of the conclusion in our paper, with the additional metrics we have considered.

  • vs ARF: the picture resembles that of Table 2 when we train our GFs with $T$=500 trees (we significantly beat ARF on >1/3 of the cases and are never statistically significantly beaten). When reducing the number of our trees to $T$=200, we still observe that our models beat ARF in many cases (statistical significance on 5 of them).

Comment

Results for GF (us) vs Forest Flow (ff) on Sinkhorn, Coverage, Density, F1, shown as mean (std dev). Time = average per fold (s).

p-values smaller than 0.01 are replaced by a symbol in color, ${\textcolor{green}{\blacktriangle}}$ meaning we significantly beat Forest Flow, ${\textcolor{red}{\blacktriangledown}}$ meaning Forest Flow significantly beat us.

Explanations, comments, discussion, see "[Exp-0] Explanations related to Experiments [Exp-1], [Exp-2] and [Exp-3] (summary: more contenders, more metrics)" for details.

Results with $J$=2000 splits for $T$=500 trees (us):


| Domain | Sink$\downarrow$(us) | Sink$\downarrow$(ff) | pval | Cov$\uparrow$(us) | Cov$\uparrow$(ff) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ff) | pval | F1$\downarrow$(us) | F1$\downarrow$(ff) | pval | Time(us) | Time(ff) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.285 (0.008) | 0.276 (0.001) | 0.07 | 0.968 (0.010) | 0.957 (0.020) | 0.09 | 1.031 (0.049) | 1.045 (0.020) | 0.28 | 0.070 (0.030) | 0.051 (0.031) | 0.07 | 45 | 35 |
| circgauss | 0.351 (0.005) | 0.354 (0.003) | 0.27 | 0.945 (0.020) | 0.956 (0.015) | 0.25 | 0.993 (0.045) | 0.989 (0.027) | 0.79 | 0.514 (0.529) | 0.530 (0.028) | 0.46 | 74 | 32 |
| gridgauss | 0.390 (0.002) | 0.393 (0.001) | 0.04 | 0.975 (0.009) | 0.954 (0.012) | 0.01 | 0.995 (0.077) | 1.013 (0.050) | 0.21 | 0.017 (0.015) | 0.045 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 65 | 36 |
| randgauss | 0.286 (0.003) | 0.287 (0.001) | 0.43 | 0.961 (0.012) | 0.927 (0.011) | 0.02 | 0.979 (0.023) | 1.000 (0.008) | 0.05 | 0.014 (0.013) | 0.028 (0.008) | 0.09 | 98 | 58 |
| winered | 0.980 (0.032) | 1.030 (0.029) | ${\textcolor{green}{\blacktriangle}}$ | 0.962 (0.028) | 0.954 (0.021) | 0.71 | 0.961 (0.026) | 1.001 (0.034) | 0.04 | 0.459 (0.031) | 0.458 (0.052) | 0.97 | 173 | 251 |
| winewhite | 1.064 (0.003) | 1.097 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.963 (0.002) | 0.945 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.989 (0.017) | 0.970 (0.023) | 0.13 | 0.492 (0.037) | 0.498 (0.041) | 0.78 | 654 | 661 |
| compas | 0.532 (0.007) | 0.891 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.556 (0.021) | 0.548 (0.027) | 0.32 | 0.430 (0.008) | 0.404 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.496 (0.021) | 0.526 (0.018) | 0.04 | 161 | 756 |
| artificial_characters | 0.821 (0.004) | 0.834 (0.016) | 0.12 | 0.947 (0.003) | 0.879 (0.008) | ${\textcolor{green}{\blacktriangle}}$ | 0.967 (0.013) | 0.774 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.429 (0.056) | 0.530 (0.032) | ${\textcolor{green}{\blacktriangle}}$ | 469 | 748 |

Results with $J$=500 splits for $T$=200 trees (us):


| Domain | Sink$\downarrow$(us) | Sink$\downarrow$(ff) | pval | Cov$\uparrow$(us) | Cov$\uparrow$(ff) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(ff) | pval | F1$\downarrow$(us) | F1$\downarrow$(ff) | pval | Time(us) | Time(ff) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.283 (0.005) | 0.276 (0.001) | 0.09 | 0.964 (0.022) | 0.957 (0.020) | 0.31 | 0.999 (0.055) | 1.045 (0.020) | 0.07 | 0.060 (0.047) | 0.051 (0.031) | 0.75 | 6 | 35 |
| circgauss | 0.356 (0.003) | 0.354 (0.003) | 0.30 | 0.954 (0.014) | 0.956 (0.015) | 0.74 | 0.968 (0.027) | 0.989 (0.027) | 0.06 | 0.520 (0.021) | 0.530 (0.028) | 0.55 | 10 | 32 |
| gridgauss | 0.392 (0.002) | 0.393 (0.001) | 0.53 | 0.964 (0.006) | 0.954 (0.013) | 0.07 | 0.953 (0.062) | 1.013 (0.050) | 0.01 | 0.034 (0.010) | 0.045 (0.007) | 0.02 | 9 | 36 |
| randgauss | 0.290 (0.007) | 0.288 (0.001) | 0.55 | 0.954 (0.010) | 0.927 (0.011) | 0.04 | 0.936 (0.029) | 1.000 (0.008) | 0.01 | 0.020 (0.005) | 0.028 (0.008) | 0.11 | 11 | 58 |
| winered | 1.027 (0.037) | 1.030 (0.029) | 0.84 | 0.954 (0.013) | 0.955 (0.018) | 0.95 | 0.874 (0.042) | 1.009 (0.034) | ${\textcolor{red}{\blacktriangledown}}$ | 0.503 (0.023) | 0.458 (0.052) | 0.13 | 40 | 251 |
| winewhite | 1.120 (0.013) | 1.097 (0.007) | 0.05 | 0.952 (0.008) | 0.945 (0.007) | 0.18 | 0.940 (0.019) | 0.970 (0.023) | 0.03 | 0.510 (0.027) | 0.498 (0.041) | 0.48 | 214 | 661 |
| compas | 0.537 (0.015) | 0.891 (0.007) | ${\textcolor{green}{\blacktriangle}}$ | 0.546 (0.026) | 0.548 (0.027) | 0.74 | 0.424 (0.014) | 0.404 (0.013) | 0.03 | 0.500 (0.025) | 0.510 (0.032) | 0.36 | 29 | 756 |
| artificial_characters | 0.829 (0.006) | 0.834 (0.016) | 0.51 | 0.932 (0.018) | 0.879 (0.008) | ${\textcolor{green}{\blacktriangle}}$ | 0.897 (0.004) | 0.774 (0.016) | ${\textcolor{green}{\blacktriangle}}$ | 0.457 (0.057) | 0.530 (0.032) | ${\textcolor{green}{\blacktriangle}}$ | 90 | 748 |

Comment

Results for GF (us) vs Adversarial Random Forests (arf, $T$=200 trees) on Coverage, Density, F1, shown as mean (std dev).

p-values smaller than 0.01 are replaced by a symbol in color, ${\textcolor{green}{\blacktriangle}}$ meaning we significantly beat ARF, ${\textcolor{red}{\blacktriangledown}}$ meaning ARF significantly beat us.

Explanations, comments, discussion, see "[Exp-0] Explanations related to Experiments [Exp-1], [Exp-2] and [Exp-3] (summary: more contenders, more metrics)" for details.

Results with $J$=2000 splits for $T$=500 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(arf) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(arf) | pval | F1$\downarrow$(us) | F1$\downarrow$(arf) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.968 (0.010) | 0.960 (0.008) | 0.05 | 1.031 (0.049) | 0.976 (0.030) | 0.11 | 0.070 (0.030) | 0.086 (0.047) | 0.54 |
| circgauss | 0.945 (0.015) | 0.962 (0.021) | 0.35 | 0.993 (0.045) | 0.954 (0.011) | 0.05 | 0.514 (0.529) | 0.507 (0.032) | 0.92 |
| gridgauss | 0.975 (0.009) | 0.908 (0.010) | ${\textcolor{green}{\blacktriangle}}$ | 0.995 (0.077) | 0.630 (0.050) | ${\textcolor{green}{\blacktriangle}}$ | 0.017 (0.015) | 0.043 (0.012) | 0.04 |
| randgauss | 0.961 (0.012) | 0.953 (0.004) | 0.29 | 0.979 (0.023) | 0.940 (0.027) | ${\textcolor{green}{\blacktriangle}}$ | 0.014 (0.013) | 0.029 (0.010) | ${\textcolor{green}{\blacktriangle}}$ |
| winered | 0.962 (0.028) | 0.929 (0.010) | 0.14 | 0.961 (0.026) | 0.801 (0.028) | ${\textcolor{green}{\blacktriangle}}$ | 0.459 (0.031) | 0.531 (0.028) | ${\textcolor{green}{\blacktriangle}}$ |
| winewhite | 0.963 (0.002) | 0.946 (0.012) | 0.10 | 0.989 (0.017) | 0.941 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.492 (0.037) | 0.500 (0.027) | 0.04 |
| compas | 0.556 (0.021) | 0.560 (0.037) | 0.75 | 0.430 (0.008) | 0.440 (0.011) | 0.11 | 0.496 (0.021) | 0.520 (0.020) | 0.26 |
| artificial_characters | 0.947 (0.003) | 0.892 (0.014) | ${\textcolor{green}{\blacktriangle}}$ | 0.967 (0.013) | 0.747 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.429 (0.056) | 0.512 (0.037) | ${\textcolor{green}{\blacktriangle}}$ |

Results with $J$=500 splits for $T$=200 trees (us):


| Domain | Cov$\uparrow$(us) | Cov$\uparrow$(arf) | pval | Dens$\uparrow$(us) | Dens$\uparrow$(arf) | pval | F1$\downarrow$(us) | F1$\downarrow$(arf) | pval |
|---|---|---|---|---|---|---|---|---|---|
| ringgauss | 0.964 (0.022) | 0.9600 (0.008) | 0.53 | 0.999 (0.055) | 0.976 (0.030) | 0.42 | 0.060 (0.047) | 0.086 (0.047) | 0.23 |
| circgauss | 0.954 (0.014) | 0.962 (0.021) | 0.27 | 0.968 (0.027) | 0.954 (0.011) | 0.04 | 0.520 (0.021) | 0.507 (0.032) | 0.67 |
| gridgauss | 0.964 (0.006) | 0.908 (0.010) | ${\textcolor{green}{\blacktriangle}}$ | 0.953 (0.062) | 0.630 (0.050) | ${\textcolor{green}{\blacktriangle}}$ | 0.034 (0.010) | 0.043 (0.012) | 0.26 |
| randgauss | 0.954 (0.010) | 0.953 (0.004) | 0.94 | 0.936 (0.029) | 0.940 (0.027) | 0.74 | 0.020 (0.005) | 0.029 (0.010) | 0.14 |
| winered | 0.954 (0.013) | 0.929 (0.010) | 0.04 | 0.874 (0.042) | 0.801 (0.028) | ${\textcolor{green}{\blacktriangle}}$ | 0.503 (0.023) | 0.531 (0.028) | 0.01 |
| winewhite | 0.952 (0.008) | 0.946 (0.012) | 0.22 | 0.940 (0.019) | 0.941 (0.013) | 0.88 | 0.510 (0.027) | 0.500 (0.027) | 0.09 |
| compas | 0.546 (0.026) | 0.560 (0.037) | 0.10 | 0.424 (0.014) | 0.440 (0.011) | 0.03 | 0.500 (0.025) | 0.520 (0.020) | 0.33 |
| artificial_characters | 0.932 (0.018) | 0.892 (0.014) | ${\textcolor{green}{\blacktriangle}}$ | 0.897 (0.004) | 0.747 (0.013) | ${\textcolor{green}{\blacktriangle}}$ | 0.457 (0.057) | 0.512 (0.037) | 0.02 |

Final Decision

The authors propose a tree-based method as a generative model for tabular data.

  • Reviews: Initially, the paper received two very positive and three more critical reviews. Points of criticism were novelty, clarity, missing technical details, and the selection of experiments.
  • Rebuttal: The authors were able to also convince the more critical reviewers by clarifying misunderstandings and providing numerous new experimental results.
  • Decision: While I recommend acceptance, it should be noted that scores negatively correlate with confidence. The reason, as also pointed out by Reviewer uJTo, likely is that the simplicity of the approach is sometimes hidden behind overcomplicated mathematical concepts. At the very least, the authors should simplify notation and improve the paper's readability in the camera-ready version.