Generalized Random Forests Using Fixed-Point Trees
We introduce a fixed-point tree-growing algorithm that replaces the gradient-based splitting mechanism in GRFs, enabling faster and memory-efficient forest construction while preserving estimator accuracy.
Abstract
Reviews and Discussion
This paper presents an approach to improving the computational efficiency of GRF by replacing the conventional gradient-based tree-splitting criterion with a fixed-point approximation. The authors argue that their method maintains the theoretical guarantees of GRF while significantly improving scalability in high-dimensional settings. The paper includes extensive theoretical analysis, experimental validation on both synthetic and real-world datasets, and detailed comparisons with existing methods.
Questions for Authors
NA
Claims and Evidence
All the claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed method successfully constructs a causal forest while incorporating the computation of the derivative matrix. However, a more detailed interpretation and justification are needed to explain why the derivative matrix can be replaced by an arbitrary scalar.
Theoretical Claims
While the theoretical justification is strong, a more intuitive explanation of why fixed-point iteration is effective for tree-splitting would improve accessibility for a broader audience. Including visualizations or diagrams to contrast the gradient-based and fixed-point splitting mechanisms could enhance comprehension.
Experimental Designs or Analyses
The simulation results should demonstrate metrics beyond speed improvement, for example by highlighting enhanced accuracy with a smaller sample size.
Supplementary Material
NA
Relation to Broader Scientific Literature
NA
Essential References Not Discussed
NA
Other Strengths and Weaknesses
NA
Other Comments or Suggestions
NA
... explain why the derivative matrix can be replaced by an arbitrary scalar
A1: To show that choosing a different scalar does not change the stability or outcome of the splitting process, let us compare two types of fixed-point pseudo-outcomes. Consider first the pseudo-outcomes $\tilde\rho_i(\eta)$ built with an arbitrary scalar $\eta > 0$. Next, consider $\tilde\rho_i(1)$, where we choose $\eta = 1$. Clearly, we have the relationship $\tilde\rho_i(\eta) = \eta\,\tilde\rho_i(1)$. Using these pseudo-outcomes, the fixed-point approximations for the child node parameters become $\tilde\theta_C(\eta) = \hat\theta_P + \frac{1}{|C|}\sum_{i\in C}\tilde\rho_i(\eta)$ for arbitrary $\eta$, and $\tilde\theta_C(1) = \hat\theta_P + \frac{1}{|C|}\sum_{i\in C}\tilde\rho_i(1)$ for $\eta = 1$. Then the difference between the child nodes satisfies a simple scaling relationship,
$$\tilde\theta_{C_1}(\eta) - \tilde\theta_{C_2}(\eta) = \eta\bigl(\tilde\theta_{C_1}(1) - \tilde\theta_{C_2}(1)\bigr),$$
simply because $\tilde\rho_i(\eta) = \eta\,\tilde\rho_i(1)$.
Based on eq (17), the CART splitting criterion is $\tilde\Delta_\eta(C_1, C_2)$ for arbitrary $\eta$ and $\tilde\Delta_1(C_1, C_2)$ for $\eta = 1$. Again, these criteria satisfy a clear scaling relationship:
$$\tilde\Delta_\eta(C_1, C_2) = \eta^2\,\tilde\Delta_1(C_1, C_2).$$
This scaling shows clearly that multiplying the splitting criterion by any positive scalar does not affect the CART splits. Since CART only ranks splits based on the criterion, it selects the partition that maximizes $\tilde\Delta$, and
$$\arg\max_{(C_1, C_2)}\tilde\Delta_\eta(C_1, C_2) = \arg\max_{(C_1, C_2)}\tilde\Delta_1(C_1, C_2).$$
Therefore, choosing a different scalar does not change the stability or outcome of the splitting process. The absolute scale of $\tilde\Delta$ does not matter; only its ranking across candidate splits determines the final partition.
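To make the scale-invariance argument concrete, here is a minimal numerical sketch. It is not the paper's code: the covariate, the pseudo-outcomes, and the criterion form $\sum_{j=1,2}\bigl(\sum_{i\in C_j}\rho_i\bigr)^2 / n_{C_j}$ are toy stand-ins that we assume are an adequate proxy for eq (17).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pseudo-outcomes and a 1D splitting covariate (illustrative stand-ins,
# not the paper's data or estimating equations).
n = 200
x = rng.uniform(size=n)                                              # splitting covariate
rho = np.where(x < 0.6, -1.0, 2.0) + rng.normal(scale=0.5, size=n)   # pseudo-outcomes

def cart_criterion(rho, x, threshold):
    """GRF-style criterion: sum over children of (sum of pseudo-outcomes)^2 / n_child."""
    left, right = rho[x <= threshold], rho[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return -np.inf
    return left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)

thresholds = np.linspace(0.05, 0.95, 91)
crit_eta1 = np.array([cart_criterion(rho, x, t) for t in thresholds])        # eta = 1
crit_eta5 = np.array([cart_criterion(5.0 * rho, x, t) for t in thresholds])  # eta = 5

# Criteria differ by the factor eta^2, so the selected split is identical.
assert np.allclose(crit_eta5, 25.0 * crit_eta1)
assert thresholds[crit_eta1.argmax()] == thresholds[crit_eta5.argmax()]
print("optimal split:", thresholds[crit_eta1.argmax()])
```

Scaling the pseudo-outcomes by $\eta$ multiplies every candidate split's criterion by $\eta^2$, so the maximizer, and hence the chosen split, is unchanged.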
... Including visualizations ... to contrast the gradient-based and fixed-point splitting .... could enhance comprehension
A2: Thank you for this valuable suggestion. Below, we provide a simple, clear visualization that illustrates why the fixed-point approximation is effective for tree splitting, comparing it directly with the gradient-based method.
We simulate data from a simple VCM with 1D covariates $X_i$ and 2D regressors $W_i$. Outcomes are generated from $Y_i = W_i^\top \theta(X_i) + \varepsilon_i$, where the coefficient function $\theta(\cdot)$ varies with $X_i$ and $\varepsilon_i$ is mean-zero noise.
To simplify, we focus on a single candidate split at a threshold $x$. We define candidate partitions $C_1 = \{i : X_i \le x\}$ and $C_2 = \{i : X_i > x\}$ and denote the true splitting criterion by $\Delta(C_1, C_2)$. We compare it with three approximations:
- Gradient-based: $\tilde\Delta_{\mathrm{grad}}(C_1, C_2)$.
- Fixed point with $\eta = 1$: $\tilde\Delta_1(C_1, C_2)$.
- Fixed point with an arbitrary $\eta \neq 1$: $\tilde\Delta_\eta(C_1, C_2)$.
We plot these criteria against all possible splits from 0 to 1.
The visualization clearly shows:
- The curves for the true criterion $\Delta$, the gradient-based $\tilde\Delta_{\mathrm{grad}}$, and the fixed-point criterion with $\eta = 1$, i.e. $\tilde\Delta_1$, are all very close.
- Even the fixed-point criterion with an arbitrary $\eta \neq 1$, i.e. $\tilde\Delta_\eta$, although scaled differently, identifies the same optimal split.
- A vertical dashed line shows that the optimal split chosen by all four methods is the same.
Since CART chooses among candidate splits by ranking them according to the splitting criterion, the absolute scale of each criterion does not matter; only the rankings determine the chosen split. Thus, our fixed-point method is effective and stable.
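Below is a minimal sketch of the kind of comparison described above, under assumed stand-in choices: coefficient functions $\theta(x) = (1 + x, \sin 2\pi x)$, standard normal regressors, Gaussian noise, a node-refit quantity standing in for the true criterion $\Delta$, and the pseudo-outcome criterion $\sum_j \|\sum_{i\in C_j}\rho_i\|^2 / n_{C_j}$ for the approximations. It only illustrates the plotting setup and is not the paper's experiment.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# --- Assumed stand-in VCM: Y_i = W_i' theta(X_i) + eps_i -----------------
n = 2000
X = rng.uniform(size=n)                                      # 1D splitting covariate
W = rng.normal(size=(n, 2))                                  # 2D regressors
theta = np.column_stack([1.0 + X, np.sin(2 * np.pi * X)])    # assumed coefficient functions
Y = np.sum(W * theta, axis=1) + rng.normal(scale=0.5, size=n)

def ols(Wc, Yc):
    return np.linalg.lstsq(Wc, Yc, rcond=None)[0]

theta_P = ols(W, Y)                                  # parent-node estimate
psi = W * (Y - W @ theta_P)[:, None]                 # estimating-function values at theta_P
A_P = (W.T @ W) / n                                  # parent Jacobian (gradient method only)
rho_grad = psi @ np.linalg.inv(A_P).T                # gradient-based pseudo-outcomes
rho_fp1 = psi                                        # fixed-point pseudo-outcomes, eta = 1
rho_fp5 = 5.0 * psi                                  # fixed-point pseudo-outcomes, eta = 5

def crit_exact(mask):
    # "True" criterion stand-in: refit each child exactly, measure the shift from the parent.
    return sum(m.sum() * np.sum((ols(W[m], Y[m]) - theta_P) ** 2) for m in (mask, ~mask))

def crit_pseudo(rho, mask):
    # Pseudo-outcome criterion: sum over children of ||sum_i rho_i||^2 / n_child.
    return sum(np.sum(rho[m].sum(axis=0) ** 2) / m.sum() for m in (mask, ~mask))

thresholds = np.linspace(0.05, 0.95, 91)
curves = {"true": [], "gradient": [], "fixed point (eta=1)": [], "fixed point (eta=5)": []}
for t in thresholds:
    mask = X <= t
    curves["true"].append(crit_exact(mask))
    curves["gradient"].append(crit_pseudo(rho_grad, mask))
    curves["fixed point (eta=1)"].append(crit_pseudo(rho_fp1, mask))
    curves["fixed point (eta=5)"].append(crit_pseudo(rho_fp5, mask))

for label, vals in curves.items():
    plt.plot(thresholds, vals, label=label)
    plt.axvline(thresholds[int(np.argmax(vals))], linestyle="--", alpha=0.4)
plt.xlabel("candidate split threshold x")
plt.ylabel("splitting criterion")
plt.legend()
plt.show()
```

Because the regressors here are nearly isotropic, the parent Jacobian is close to the identity, so the gradient-based and $\eta = 1$ fixed-point curves nearly coincide, while the $\eta = 5$ curve is a scaled copy with the same maximizer.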
... should demonstrate metrics beyond ... a smaller sample size.
A3: Please refer to the comments under A5 for Reviewer 1 (Gxac) regarding the larger-scale experiments.
The paper introduces a computational improvement to generalized random forests (GRFs). In the tree-growing algorithm of GRFs, the paper proposes to replace the gradient-based approximation for evaluating the split criteria by a more computationally efficient fixed-point based approximation. Theoretical guarantees of statistical effectiveness of the proposed fixed-point approximation are established. Simulation experiments show the proposed algorithm can speed up training without compromising accuracy, especially in high-dimensional settings.
update after rebuttal
I thank the author(s) for their detailed rebuttal, which solves many of my concerns. Given that I have already given a positive rating, I will keep my score unchanged.
Questions for Authors
N/A
Claims and Evidence
- Claims in the paper are in general sound.
- Simulation studies miss some aspects, making the results less convincing. Please see my comments in “Experimental Designs Or Analyses”.
- Some additional clarification on the methodology would be appreciated. Please see my comments in “Methods And Evaluation Criteria”.
- In Section 8, the paper claims a particular scaling for the computational saving. Why is the saving not of the order of the eliminated Jacobian-inversion cost?
Methods and Evaluation Criteria
While the proposed algorithmic modification is technically sound, I do have two questions on the methodology:
- In (27) the paper proposes an approximation; how would this affect the theoretical results in Proposition 4.1?
- How would the approximation (27) be generalized beyond the setting of linear regression?
Theoretical Claims
The theoretical results are stated in a rigorous manner. The main assumptions and proofs for Theorems 4.2 and 4.3 are directly taken from Athey et al. (2019), and do not seem to have issues. I briefly reviewed the proof of Proposition 4.1, and it looks reasonable to me, though I did not carefully examine it line by line.
Experimental Designs or Analyses
The simulation studies are overall well-designed, except that some aspects are not discussed or investigated:
- It would be interesting to see the model performance in terms of speed and accuracy in additional settings. For example,
- Small and moderate sample size
- Correlated W: It would be interesting to see how the findings in Figure 1 translate into actual performance differences.
- Nonlinear models beyond VCM and HTE.
- Comparing Figures 4 and 18, it seems that the computational gain for VCM is higher than HTE. Could you provide any insights on this?
- Could you comment on the negative computational gain in Figure 18 under the setting with the RFG function and Trees = 50?
Supplementary Material
I carefully reviewed the appendices related to simulation and real data experiments, and briefly reviewed the technical details on proofs and implementation.
Relation to Broader Scientific Literature
The paper presents an interesting computational improvement to the tree-growing algorithm used in GRF (Athey et al., 2019). The modification can effectively speed up the model fitting procedure by removing the need to invert Jacobian matrices, while preserving the theoretical properties of the original GRF models. Hence, it provides a novel and solid contribution to the GRF literature in my opinion.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
Please see my comments in "Relation To Broader Scientific Literature" for assessment on the paper's novelty and contribution.
Other Comments or Suggestions
The paper is very well written, except for some minor issues:
- Line 5 of the abstract: "which in large dimensions..."
- Figures 2-4 provide somewhat duplicated evidence on computational saving. The author(s) may consider consolidating them.
- In the last paragraph of Section 6: "Figure 10 in Appendix 10"
Why is the saving not
A1: In the general case, removing the inverse should indeed provide a computational saving of the order of the matrix inversion cost. However, for the specific applications discussed in our paper, such as varying coefficient models (VCM) and heterogeneous treatment effect (HTE) models, the relevant product (eq 20) has a simple analytical form that does not require a full matrix inversion, so the computational saving instead scales with the cost of the required matrix-vector multiplications. In general settings without an analytical form for this product, our method would indeed yield savings of the larger order.
In (27) an approximation is proposed; how would this affect the theoretical results in Prop. 4.1?
A2: The approximation does not affect our theoretical results. Here we provide a brief and direct argument. Consider pseudo-outcomes defined with the one-step approximation in (27), and let the corresponding child node estimator be the one derived from these approximate pseudo-outcomes. A direct calculation shows that the difference between our original estimator and this approximate estimator is controlled by quantities that, under GRF's conditions, vanish in the limit where the radius of the parent node shrinks to zero and the node sample size grows, with the relevant sample moment matrix converging to a positive semi-definite limit. Consequently, the gap between the two estimators is asymptotically negligible.
This means the approximation does not alter the asymptotic behavior of our estimator. The theoretical results in Prop. 4.1 remain fully valid and unaffected.
How to generalize approximation (27) beyond linear regression?
A3: The closed-form approximation (27) applies specifically to linear-like settings, such as VCM and HTE models, where the relevant quantity can be replaced by the one-step exact-line-search approximation (27).
In a more general setting, such an approximation is not available, but one can still use the general fixed-point approximation, which still leads to a computational gain.
additional settings. 1) Small sample size
A4: To address this, we have run an experiment for small $n$ with data generated according to VCM Setting 3, tested on a separate set of 1,000 observations over 50 replications of each model, using a forest of 50 trees.
| n | K | Speedup | 100xMSE grad | 100xMSE FPT |
|---|---|---|---|---|
| 1000 | 16 | 1.82 | 2.97 | 2.94 |
| 2000 | 16 | 1.75 | 2.33 | 2.32 |
| 4000 | 16 | 1.74 | 2.00 | 1.99 |
In all cases, both the exact and approximate FPT methods still achieve a computational gain relative to GRF-grad, with no material impact on MSE.
additional settings. 2) Correlated W
A5: The case of correlated W has already been considered for HTE models in App C.4, Table 2.
additional settings. 3) Nonlinear models beyond VCM and HTE
A6: Due to space constraints, we will demonstrate other nonlinear models in the final manuscript.
Fig 4 and 18, it seems that the speed gain for VCM is higher than HTE. Explain?
A7: The computational gain is higher for VCM than HTE because VCM uses continuous regressors $W$, while HTE uses categorical ones. Continuous regressors provide a larger set of candidate splits. Evaluating these extra splits makes the gradient-based method more computationally demanding, so the simpler FPT approach achieves greater relative savings.
comment on the negative computational gain
A8: It appears to be caused by random fluctuations in computation rather than by the FPT algorithm itself. To clarify this, we repeated the experiment five times under the same conditions. The median computational gain is clearly positive, as shown below:
| Model | Setting | Trees | dim(X) | n | K | Speedup |
|---|---|---|---|---|---|---|
| HTE | 5 | 50 | 5 | 10000 | 256 | 1.38 |
These stabilized results are consistent with the gains observed in other HTE experiments.
This paper proposes a method to enhance Generalized Random Forests (GRF) by introducing Fixed-Point Trees (FPT) as a gradient-free alternative for recursive partitioning. The authors highlight the limitations of gradient-based methods in GRF, particularly in handling highly correlated regressors and ill-conditioned local Jacobian matrices. To address these issues, they develop a fixed-point approximation approach that eliminates the need for computing Jacobians and improves the stability of tree splits. Experimental results demonstrate that FPT-based splits lead to more robust partitions compared with gradient-based approaches, especially in scenarios with high regressor correlation.
Questions for Authors
- Given that FPT improves computational efficiency, are there specific scenarios where the gradient-based approach still offers advantages?
- In Figure 10 of Appendix F, for example, the Mean Squared Error (MSE) of the FPT method is higher than that of the gradient-based method under the conditions of VCM Setting 2. The FPT method is not always better than grad. Please elaborate on these results.
Claims and Evidence
I did not find any problematic claims.
Methods and Evaluation Criteria
The proposed methods make sense. However, they are somewhat lacking in novelty.
Theoretical Claims
The proofs in the theoretical analysis are largely motivated by Athey et al. (2019). What are the differences and the challenges?
Experimental Designs or Analyses
The experiments are comprehensive, including simulation and real-world data.
Supplementary Material
I mainly reviewed the experiments in the Supplementary Material.
Relation to Broader Scientific Literature
The authors propose a generalized random forests method to estimate heterogeneous effects in large dimensions, offering a scalable alternative for localized effect estimation in machine learning and causal inference applications.
Essential References Not Discussed
None.
Other Strengths and Weaknesses
Strengths:
- The paper effectively addresses the instability issue in traditional gradient-based splitting criteria when dealing with highly correlated regressors, providing a practical solution that enhances tree stability.
- The proposed method is extensively validated on two structured prediction models, VCM and HTE, demonstrating its versatility across different applications.
Weaknesses:
- The paper lacks a rigorous convergence analysis for its fixed-point iteration method. Specifically, it does not discuss whether the proposed approach satisfies key conditions, such as the contraction mapping principle, which are essential for ensuring convergence.
- In Section 3.4, the choice of step size η is arbitrary, with the paper setting η = 1 without theoretical justification. This assumption may not always hold and could potentially impact the stability of the splitting criterion.
- The paper primarily compares FPT with gradient-based approaches but does not benchmark it against other potential improvements in recursive partitioning. This limits the comprehensiveness of the method's evaluation.
- It seems that the contributions are somewhat limited.
Other Comments or Suggestions
Please see Other Strengths And Weaknesses.
The proofs in the theoretical analysis are largely motivated by Athey et al. (2019). What are the differences and the challenges?
A1: The main difference is our new splitting criterion based on a fixed-point approximation instead of a gradient-based approximation. Proposition 4.1 in our paper, which proves the asymptotic equivalence of our criterion, differs clearly from Proposition 2 in Athey et al. (2019). Our proof directly addresses the unique challenges posed by removing the Jacobian term. These challenges include establishing asymptotic equivalence and ensuring statistical accuracy without relying explicitly on the gradient information.
... lacks a rigorous convergence analysis for fixed-point iteration... does not discuss whether the proposed approach satisfies ... the contraction mapping principle, which are essential for ensuring convergence.
A2: We would like to clarify a misunderstanding regarding the role of the fixed-point method in our approach.
- Numerical convergence: Our method does not perform iterative fixed-point optimization. Instead, we take inspiration from fixed-point iteration but use only a single step of it to simplify the tree-splitting criterion. This single step gives us simplified pseudo-outcomes, which then go directly into a standard CART algorithm that decides how to split the data. Because we rely on CART, our method inherits CART's proven numerical stability. The stopping rules for CART are clear and well established (such as a minimum node size: fewer than 5 observations or less than 5% of the parent samples). Thus, numerical convergence does not depend on the contraction mapping principle and is always guaranteed. A schematic sketch of this single-step pipeline is given after this list.
- Statistical convergence: From a statistical viewpoint, Prop 4.1 shows clearly that our simplified fixed-point criterion is asymptotically equivalent to the original gradient-based criterion. Hence, all statistical convergence guarantees from the original GRF method still hold.
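For illustration, here is a minimal sketch of the single-step pipeline under assumed choices: a least-squares estimating equation, $\eta = 1$, toy VCM-style data, and scikit-learn's DecisionTreeRegressor standing in for the CART step. It is not the paper's implementation; it only shows schematically how pseudo-outcomes from one fixed-point step feed an ordinary CART split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # standard CART implementation

rng = np.random.default_rng(0)

# Assumed toy VCM-style data: Y_i = W_i' theta(X_i) + eps_i.
n, p, K = 5000, 3, 2
X = rng.uniform(size=(n, p))                     # splitting covariates
W = rng.normal(size=(n, K))                      # regressors
theta_fn = lambda x: np.column_stack([1.0 + x[:, 0], np.sin(2 * np.pi * x[:, 0])])
Y = np.sum(W * theta_fn(X), axis=1) + rng.normal(scale=0.5, size=n)

# Parent-node estimate from the estimating equation sum_i W_i (Y_i - W_i' theta) = 0.
theta_P = np.linalg.lstsq(W, Y, rcond=None)[0]

# Single fixed-point step (eta = 1): pseudo-outcomes are just the estimating-function
# values at the parent estimate -- no Jacobian construction or inversion is needed.
rho = W * (Y - W @ theta_P)[:, None]             # shape (n, K)

# Feed the pseudo-outcomes into ordinary CART; its usual stopping rules
# (e.g. a minimum leaf size) control the recursion, so there is no
# fixed-point iteration and no contraction condition to verify.
cart = DecisionTreeRegressor(max_depth=1, min_samples_leaf=5)
cart.fit(X, rho)                                 # multi-output regression tree
print("chosen split: feature", cart.tree_.feature[0], "at", cart.tree_.threshold[0])
```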
In Section 3.4, the choice of step size $\eta$ is arbitrary, with the paper setting $\eta = 1$ without theoretical justification. This assumption may not always hold and could potentially impact the stability of the splitting criterion.
A3: Please refer to our response to Reviewer 4 (DWaw) (A1 and A2) for the justifications supporting the choice of $\eta = 1$, as well as an explanation of why this choice does not impact the stability of the splitting criterion.
The paper primarily compares FPT with gradient-based approaches but does not benchmark it against other potential improvements in recursive partitioning...
A4: Thank you for raising this point. We would like to clarify a misunderstanding. Our method (FPT) only simplifies the criterion used for tree-splitting in generalized random forests. It does not attempt to improve the recursive partitioning algorithm itself. Both our approach (GRF-FPT) and the gradient-based method (GRF-grad) use exactly the same standard CART algorithm for recursive partitioning.
Moreover, to our knowledge, GRF-FPT and GRF-grad are the only existing methods for approximating the splitting criterion in generalized random forests. Thus, no other comparable methods exist for benchmarking.
It seems that the contributions are somewhat limited.
A5: We respectfully disagree. Our method applies broadly to a wide range of important statistical models—local linear regression (Friedberg et al., 2020), survival analysis, missing data problems (Yifan et al., 2023), nonparametric quantile regression, heterogeneous treatment effect estimation, and nonlinear instrumental variables regression (Athey et al., 2019).
More generally, our approach works for any model defined by nonparametric estimating equations of the form in eq (1). It is especially useful in high-dimensional settings, where the computational cost of GRF becomes a bottleneck. Our fixed-point method offers a practical and scalable alternative while preserving statistical guarantees. We believe this broad applicability and efficiency represent a clear and substantial contribution.
... are there specific scenarios where the gradient-based approach still offers advantages?
A6: No. In all our extensive studies, we have not found any clear scenario where GRF-grad outperforms GRF-FPT, either statistically or computationally.
In Fig 10 of App F, e.g. the MSE of the FPT method is higher than that of the gradient-based method under the conditions of VCM Setting 2. The FPT method is not always better than grad. Please elaborate on these results.
A7: Our simulations show that sometimes GRF-grad slightly outperforms GRF-FPT in accuracy, while at other times GRF-FPT slightly outperforms GRF-grad. The differences are typically small (about 3-5%). The key benefit we emphasize is that GRF-FPT achieves nearly identical statistical accuracy with a substantial improvement in computational speed.
This paper introduces a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in high-dimensional settings. The authors propose a gradient-free "fixed-point" approximation approach that eliminates the need for Jacobian estimation in the GRF splitting criterion, which is both computationally expensive and potentially unstable in high dimensions. The authors show that their method preserves the theoretical guarantees of GRFs (consistency and asymptotic normality) while significantly improving computational efficiency.
Questions for Authors
- Have you explored how the computational gains scale with very large datasets (e.g., millions of observations)? Are there any potential bottlenecks in scaling the approach to such datasets?
- The method shows particular advantages when the target dimension K is large. Are there any practical upper limits to K where memory or other constraints might become problematic?
Claims and Evidence
The claims about computational efficiency and maintained statistical accuracy are well-supported by both theoretical analysis and extensive empirical experiments. The authors provide formal proof of consistency and asymptotic normality, and the experiments show clear speed advantages across various problem settings while maintaining comparable statistical accuracy.
Methods and Evaluation Criteria
The proposed methods are appropriate for the problem at hand. The evaluation using both simulated data and real-world California housing data provides a good test of the method's performance in controlled and practical settings. The comparison metrics (computation time and estimation error) are appropriate.
Theoretical Claims
I checked the main theoretical claims (Proposition 4.1, Theorem 4.2, and Theorem 4.3) and the proofs appear sound. The authors rigorously show that their fixed-point approximation maintains the same asymptotic properties as the original gradient-based approach.
Experimental Designs or Analyses
The experimental design is sound, involving a comprehensive comparison across different settings (varying coefficient models and heterogeneous treatment effects) with different dimensionalities. The analysis of computation time and statistical accuracy is appropriate and thorough.
Supplementary Material
I reviewed the supplementary material, focusing on the proofs in Appendix B and the implementation details in Appendix C. The supporting material is detailed and provides the necessary information to understand the method fully.
Relation to Broader Scientific Literature
The work builds directly on generalized random forests and is well-contextualized within the literature on heterogeneous effect estimation and adaptive partitioning methods. The authors clearly explain how their method relates to and improves upon existing work.
Essential References Not Discussed
The paper provides a comprehensive review of relevant literature. No major omissions were found.
Other Strengths and Weaknesses
Strengths:
- The paper addresses a significant practical issue: the computational cost and potential instability of gradient-based splitting criteria in GRFs when dealing with high-dimensional targets. This is particularly important for applications where the treatment space is large, such as multi-arm experiments.
- The theoretical contribution is strong, with clear proofs showing that the fixed-point approximation maintains the same asymptotic properties as the original GRF method while significantly reducing computational complexity.
- The method is conceptually elegant and simple to implement, requiring only minor modifications to existing GRF algorithms. The authors carefully explain how their approach fits within the two-stage framework of GRFs.
Weaknesses:
- The real-world application, while illustrative, is somewhat limited in scope (only one dataset). Additional real-world applications in different domains could strengthen the practical relevance of the method.
- The paper mentions but does not deeply explore how the approach behaves in the presence of highly correlated features, which is a common challenge in real-world applications. Some additional experiments specifically designed to test robustness to multicollinearity would be valuable.
- The paper could benefit from more discussion of the hyperparameter sensitivity of the approach, particularly regarding the choice of subsample size in the honest splitting procedure.
- The qualitative interpretation of the results from the California housing dataset is somewhat limited. A deeper discussion of the substantive insights gained from the spatially varying coefficient estimates would enhance the applied contribution.
Other Comments or Suggestions
None
...Additional real-world applications in different domains could strengthen the practical relevance of the method.
A1: We agree with the reviewer. Space limits our discussion here. In the camera-ready version, we will add real-world applications from different domains.
...Some additional experiments designed to test robustness to multicollinearity would be valuable.
A2: We conducted an additional experiment using the VCM with highly correlated features. We modified our original VCM Setting 3 by generating correlated features whose pairwise correlation is governed by a parameter ρ.
Below is a clear summary of computational performance (speedup factor of GRF-FPT over GRF-grad) and statistical accuracy (MSE, multiplied by 100 for readability). Results are averages over 50 replications, tested on a separate set of samples using a forest of 10 trees:
| dim(X) | n | K | ρ | Speedup | 100xMSE grad | 100xMSE FPT |
|---|---|---|---|---|---|---|
| 5 | 10000 | 64 | 0.00 | 2.55 | 16.60 | 16.83 |
| 5 | 10000 | 64 | 0.50 | 2.53 | 15.48 | 15.47 |
| 5 | 10000 | 64 | 0.90 | 2.35 | 10.95 | 11.09 |
These results demonstrate clearly that GRF-FPT remains robust, stable, and computationally efficient, even under high multicollinearity. We will include this analysis in the revised manuscript.
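For concreteness, here is a minimal sketch of one common way to generate features whose pairwise correlation is governed by a single parameter ρ. An equicorrelated Gaussian design is assumed here; the exact correlation structure used in our experiment is not spelled out in this reply, so treat this only as an illustration.

```python
import numpy as np

def correlated_features(n, d, rho, rng):
    """Draw n samples of d features from N(0, Sigma) with Sigma = (1 - rho) I + rho 11'."""
    sigma = np.full((d, d), rho)
    np.fill_diagonal(sigma, 1.0)
    chol = np.linalg.cholesky(sigma)          # requires rho in (-1/(d-1), 1)
    return rng.normal(size=(n, d)) @ chol.T

rng = np.random.default_rng(0)
for rho in (0.0, 0.5, 0.9):
    feats = correlated_features(10_000, 5, rho, rng)
    # Empirical off-diagonal correlation should be close to the target rho.
    emp = np.corrcoef(feats, rowvar=False)
    off_diag = emp[~np.eye(5, dtype=bool)].mean()
    print(f"rho = {rho:.1f}: mean off-diagonal correlation = {off_diag:.2f}")
```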
The paper could benefit from more discussion of the hyperparameter sensitivity of the approach, particularly regarding the choice of subsample size in the honest splitting procedure.
A3: Because our method changes only the splitting criterion for the recursive partitioning, both GRF-FPT and GRF-grad share exactly the same subsample size. We do not expect the subsample size to change the relative computational advantage or statistical accuracy of our method.
To demonstrate, we ran an experiment varying the subsample ratio from 0.25 to 0.75 using VCM Setting 3. The table below summarizes our results (averaged over 50 replications, with a test set of 5,000 samples):
| dim(X) | n | K | subsample ratio | Speedup | 100xMSE grad | 100xMSE FPT |
|---|---|---|---|---|---|---|
| 2 | 10000 | 64 | 0.25 | 2.77 | 2.86 | 2.90 |
| 2 | 10000 | 64 | 0.50 | 3.10 | 2.91 | 2.90 |
| 2 | 10000 | 64 | 0.75 | 2.98 | 3.21 | 3.19 |
These results show clearly that the statistical accuracy (MSE) of GRF-FPT relative to GRF-grad does not depend strongly on the subsample ratio.
...A deeper discussion of the substantive insights gained from the spatially varying coefficient estimates would enhance the applied contribution.
A4: Our model (eq 36) clearly shows how the influence of housing features changes across different locations in California. Our results from the housing data uncover interesting geographic patterns.
Fig 5 clearly illustrates this point. In major urban centers such as LA, San Francisco, and Sacramento, housing prices decrease as the number of households increases. This likely reflects overcrowding in densely populated areas. In contrast, rural regions show the opposite trend: prices rise slightly as the number of households grows. This suggests that, in sparsely populated rural areas, a modest increase in households makes these places more attractive and livable.
Population size, however, consistently shows a negative effect on housing prices across nearly all of California, with less geographic variation. Higher populations generally mean lower housing prices, highlighting broader statewide pressures on housing affordability.
Have you explored how the computational gains scale with very large datasets? ...
A5: We ran additional experiments to clearly show how our method scales for very large datasets. Using a forest of 10 trees, we tested our method on VCM Setting 3 with K = 256, dim(X) = 2, and data size up to n = 500,000. The results appear below:
| dim(X) | K | n | Speedup |
|---|---|---|---|
| 2 | 256 | 10000 | 4.54 |
| 2 | 256 | 20000 | 3.59 |
| 2 | 256 | 50000 | 3.49 |
| 2 | 256 | 100000 | 3.11 |
| 2 | 256 | 200000 | 3.04 |
| 2 | 256 | 500000 | 3.08 |
These results clearly demonstrate that, even as the dataset grows very large, our method consistently remains faster than GRF-grad. While the relative speedup slightly decreases at first, it stabilizes at a consistent advantage as $n$ grows very large. Thus, our method has no clear computational bottleneck and maintains a robust advantage at scale.
...Are there any practical upper limits to K where memory or other constraints might become problematic?
A6: As the dimension $K$ grows large, memory usage increases for both methods, potentially hitting hardware constraints. Yet our method (GRF-FPT) is clearly less memory-intensive than GRF-grad.
Specifically, the per-tree memory of GRF-grad grows as the number of nodes times a per-node cost driven by the $K \times K$ Jacobian, whereas GRF-FPT's per-node cost grows only linearly in $K$. Thus, GRF-FPT remains practical even when $K$ becomes large and memory constraints become significant.
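As a rough, purely illustrative back-of-the-envelope comparison (assuming per-node bookkeeping on the order of $K^2$ values for GRF-grad versus $K$ values for GRF-FPT, 8-byte doubles, and a hypothetical node count; none of these constants come from the paper):

```python
# Rough back-of-the-envelope per-tree memory comparison (assumed orders and
# 8-byte doubles; constants are illustrative, not measured).
bytes_per_double = 8
nodes = 1_000                                        # hypothetical number of nodes per tree
for K in (16, 64, 256, 1024):
    grad_bytes = nodes * K * K * bytes_per_double    # O(nodes * K^2) bookkeeping
    fpt_bytes = nodes * K * bytes_per_double         # O(nodes * K)   bookkeeping
    print(f"K={K:5d}: grad ~ {grad_bytes/1e6:8.1f} MB, FPT ~ {fpt_bytes/1e6:6.2f} MB")
```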
The paper introduces a more efficient implementation of generalized random forests in which the original gradient-based approximation for evaluating the split criteria is replaced by a more computationally efficient fixed-point based approximation.
Generally speaking, the reviewers rated the paper positively. I believe that the authors adequately addressed most of the clarifying questions and believe that the additional empirical results will strengthen the paper considerably. Moving forward, I encourage the authors to include the additional empirical results mentioned in their replies to Gxac and fl8w. I additionally encourage the authors to include short discussions summarizing their replies A5 to Gxac; A1, A2, and A7 to fj8W; and A1 and A2 (perhaps including the visualization in a supplement) to DWas.