PaperHub
Score: 7.3/10
Poster, 4 reviewers
Ratings: 4, 6, 4, 4 (min 4, max 6, std 0.9)
Confidence: 3.0
Novelty: 3.3, Quality: 2.8, Clarity: 2.8, Significance: 3.3
NeurIPS 2025

Learning Counterfactual Outcomes Under Rank Preservation

OpenReview | PDF
Submitted: 2025-05-12, Updated: 2025-10-29
TL;DR

Learning the individualized treatment effects under the assumption of rank preservation.

Abstract

Keywords
Causal Inference, Nonparametric Identifiability, Individualized Treatment Effects

Reviews and Discussion

Review
Rating: 4

This paper presents a new approach for individual-level counterfactual inference that does not require a known structural causal model (SCM). The authors introduce a "rank preservation" assumption to identify the counterfactual outcome. Building on this, they propose a convex ideal loss function and an unbiased kernel-based estimator. Theoretical analysis shows their assumption is no stronger than those used in prior work, and experiments on semi-synthetic and real-world data demonstrate the method's effectiveness.

Strengths and Weaknesses

Strengths:

  • The paper is well-written, and its arguments are clear.
  • A significant advantage of the proposed method is its ability to perform counterfactual inference without relying on a known SCM.
  • The approach is supported by theoretical results that formally establish its properties.

Weaknesses:

  • The paper's core contribution relies on the rank preservation assumption. While the introduction briefly explains it, it lacks a discussion or illustrative examples to motivate why this assumption is appropriate and practical in real-world scenarios. Adding this would significantly strengthen the paper's introduction.
  • There are some minor typographical and organizational issues that should be addressed:
    • On line 184, "a SCM" should be corrected to "an SCM".
    • Colons appear to be missing in several places where definitions are introduced (e.g., lines 94, 159, 194).
    • The placement of Section 6 after Table 1 disrupts the flow; it would be more logical to place it before the table.

Questions

  • The proposed loss function is a key component of the method. Could the authors provide more intuition and discussion behind its design?
  • The paper claims that the method can be extended to continuous treatments and that Proposition 5.3 remains valid in this setting. However, a formal proof for this claim is not provided. Could the authors please discuss this point?

Limitations

yes

Final Justification

I think the authors' rebuttal addresses some of my concerns.

Formatting Issues

No

Author Response

Thank you very much for your positive evaluation of our paper. Below, we hope that our clarification addresses your concerns.

W1: The paper's core contribution relies on the rank preservation assumption. While the introduction briefly explains it, it lacks a discussion or illustrative examples to motivate why this assumption is appropriate and practical in real-world scenarios. Adding this would significantly strengthen the paper's introduction.

Response to W1: Thank you for your helpful suggestions. We first provide more explanations for Assumption 4.2 ($\rho(Y_x, Y_{x'}\mid Z)=1$). Assumption 4.2 implies that an individual’s factual and counterfactual outcomes have the same rank in the corresponding distributions of factual and counterfactual outcomes for all individuals. To better understand this, imagine two counterfactual worlds:

  • In the first world, every individual receives treatment $x$;
  • In the second world, every individual receives treatment $x'$.

For a given individual $i$, their outcomes in these two worlds are $y_{i,x}$ and $y_{i,x'}$, respectively. Assumption 4.2 states that the rank of $y_{i,x}$ among all outcomes in the first world, $\{y_{j,x}: j = 1, \ldots, N\}$, is the same as the rank of $y_{i,x'}$ among all outcomes in the second world, $\{y_{j,x'}: j = 1, \ldots, N\}$.

Then, as an illustrative example in clinical medicine, consider a scenario where each individual’s underlying health status ($U$) is fixed (for simplicity we do not consider $Z$ here), so its rank is determined prior to treatment. If all patients received the same treatment, the rank of each individual’s outcome would remain the same as the rank of that individual’s underlying health status.
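The two-worlds description above can be checked numerically. The sketch below is only an illustration: the functions `g` and `h` and the Gaussian noise are assumptions, not the paper's data-generating process. It fixes a single covariate stratum $Z = z$ (the conditioning that Assumption 4.2 requires) and verifies that two worlds, each strictly increasing in the same shared noise, order individuals identically:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Within a single covariate stratum Z = z, draw a shared individual noise U.
# g and h below are assumptions for illustration only.
z = 0.3
u = rng.normal(size=N)

def g(x, z):
    return 2.0 * x + z            # outcome location

def h(x, z):
    return 1.0 + 0.5 * x          # positive scale, so the outcome is strictly increasing in u

y_x0 = g(0, z) + h(0, z) * u      # world where everyone receives x = 0
y_x1 = g(1, z) + h(1, z) * u      # world where everyone receives x = 1

# Rank of each individual's outcome within each world
rank0 = np.argsort(np.argsort(y_x0))
rank1 = np.argsort(np.argsort(y_x1))
# Both worlds are strictly increasing in the same u, so the ranks coincide
```

Note that the scale `h` keeps the same (positive) sign under both treatments; if it changed sign, the two worlds would order individuals oppositely and rank preservation would fail.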

W2: There are some minor typographical and organizational issues that should be addressed.

Response to W2: Thank you for pointing this out. We will revise it accordingly. Thanks again.

Q1: The proposed loss function is a key component of the method. Could the authors provide more intuition and discussion behind its design?

Response to Q1: Thank you for raising this interesting question. The design of the loss function is driven more by the mathematics than by intuition. Inspired by Koenker and Bassett (1978) [reference 32 of the manuscript], we note that
$$\frac{\partial\, \mathbb{E}[\,|Y_{x'} - t|\mid Z=z\,]}{\partial t} = 2\,\mathbb{P}(Y_{x'} \leq t \mid Z=z) - 1.$$
To construct a loss function whose first derivative satisfies
$$2\,\{\mathbb{P}(Y_{x'} \leq t \mid Z=z) - \mathbb{P}(Y_{x} \leq y \mid Z=z)\} = 0$$
at the minimizer, it is natural to define the loss function as
$$R_{x'}(t\mid x, z, y) = \mathbb{E}[\,|Y_{x'} - t|\mid Z=z\,] + \big(1 - 2\,\mathbb{P}(Y_{x} \leq y \mid Z=z)\big)\, t = \mathbb{E}[\,|Y_{x'} - t|\mid Z=z\,] + \mathbb{E}[\operatorname{sign}(Y_x - y)\mid Z=z]\cdot t.$$
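The first-order condition makes the minimizer of this loss the quantile-matching value $t^* = F_{x'}^{-1}(F_x(y\mid z)\mid z)$. The following minimal sketch, with assumed Gaussian conditional distributions for a fixed $Z = z$ (not the paper's setting), checks that minimizing the empirical loss recovers that quantile:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Illustrative conditional samples for a fixed Z = z; the Gaussian choices
# are assumptions for this sketch only.
y_x  = rng.normal(loc=0.0, scale=1.0, size=n)   # factual world:       Y_x  | Z=z
y_xp = rng.normal(loc=2.0, scale=1.5, size=n)   # counterfactual world: Y_x' | Z=z
y_obs = 1.0                                     # the individual's observed factual outcome y

def loss(t):
    # Empirical version of R_{x'}(t | x, z, y) = E|Y_{x'} - t| + E[sign(Y_x - y)] * t
    return np.mean(np.abs(y_xp - t)) + np.mean(np.sign(y_x - y_obs)) * t

# The loss is convex in t, so a grid minimization suffices for the sketch.
grid = np.linspace(-5.0, 10.0, 3001)
t_hat = grid[np.argmin([loss(t) for t in grid])]

# Quantile matching predicted by rank preservation: t* = F_{x'}^{-1}(F_x(y))
tau = np.mean(y_x <= y_obs)
t_quantile = np.quantile(y_xp, tau)
# t_hat and t_quantile agree up to grid resolution and sampling noise
```

The sketch makes the connection explicit: setting the derivative to zero forces $\mathbb{P}(Y_{x'} \leq t \mid Z=z) = \mathbb{P}(Y_x \leq y \mid Z=z)$, i.e., rank matching between the two conditional distributions.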

Q2: The paper claims that the method can be extended to continuous treatments and that Proposition 5.3 remains valid in this setting. However, a formal proof for this claim is not provided. Could the authors please discuss this point?

Response to Q2: Thanks for your comments. In Appendix D, we have presented the loss function for continuous treatments along with a brief proof. A similar derivation can also be found in Theorem 1 of [1]. We would be happy to provide a more detailed proof if the current presentation requires further clarification.


[1] Nathan Kallus and Angela Zhou (2018), Policy Evaluation and Optimization with Continuous Treatments, AISTATS

Comment

Thanks for your response. The rebuttal addresses some of my concerns.

Review
Rating: 6

The authors introduce a novel condition (assumption 4.6) for counterfactual identifiability in RCTs that is a strict relaxation of the common "homogeneity + monotonicity" requirement. Roughly, this condition demands that "for any two individuals sharing the same observable pre-treatment features $Z = z$, if one individual fares better under any treatment $x$ they fare better under all treatments". The authors develop a consistent estimator for counterfactual learning under this condition.

Strengths and Weaknesses

The authors' condition cuts explicit dependence on $U_X$, and, in my view, directly captures the intuition used when justifying the monotonicity condition. This is a significant improvement over a classical result and the authors clearly communicate the advantage of their approach through examples and experiments.

As for weaknesses, I have a few comments:

  • I do not agree with line 169 stating arbitrary heteroskedastic models satisfy the condition. If $h(X, Z)$ changes sign between $X = x$ and $X = x'$ for some $Z = z$, then the assumption is violated since we have $h(x, z)\,h(x', z) < 0$. This is an important distinction since the authors' assumption does not allow a treatment to be beneficial to some individuals with the same $Z = z$ and harmful to others.
  • The authors should make the non-technical statement of their assumption explicit, rather than describing it solely in terms of the rank coefficient.

Questions

The authors did not benchmark their estimator in an example where "homogeneity + monotonicity" is satisfied. How would their estimation strategy compare to the classical quantile-based estimator in this setting?

Limitations

Yes

Final Justification

I am impressed with the authors' submission and any concerns raised by the other reviewers have not significantly affected my score.

"Homogeneity + monotonicity" is a classical, heavily cited assumption often taught to undergraduates. Fundamentally, it is a functional assumption which loses its intuition (no defiers) for continuous treatments and/or outcomes. The authors show that a weaker assumption, based purely on counterfactual reasoning, can achieve the same identifiability results.

Formatting Issues

None

Author Response

Thank you very much for your positive evaluation of our paper. Below, we hope that our clarification addresses your concerns.

W1: I do not agree with line 169 stating arbitrary heteroskedastic models satisfy the condition.

Response to W1: Thank you for pointing this out. We fully agree with you and we will revise it accordingly.

W2: The authors should make the non-technical statement of their assumption explicit, rather than describing it solely in terms of the rank coefficient.

Response to W2: Thank you for your helpful suggestions. Below, we provide more non-technical statements for Assumption 4.2 ($\rho(Y_x, Y_{x'}\mid Z)=1$).

Assumption 4.2 implies that an individual’s factual and counterfactual outcomes have the same rank in the corresponding distributions of factual and counterfactual outcomes for all individuals. To better understand this, imagine two counterfactual worlds:

  • In the first world, every individual receives treatment $x$;
  • In the second world, every individual receives treatment $x'$.

For a given individual $i$, their outcomes in these two worlds are $y_{i,x}$ and $y_{i,x'}$, respectively. Assumption 4.2 states that the rank of $y_{i,x}$ among all outcomes in the first world, $\{y_{j,x}: j = 1, \ldots, N\}$, is the same as the rank of $y_{i,x'}$ among all outcomes in the second world, $\{y_{j,x'}: j = 1, \ldots, N\}$. In clinical medicine, we might consider a scenario where each individual’s health status ($U$) is fixed, and its rank is determined prior to treatment. Furthermore, this rank may influence, or depend on, the rank of potential outcomes after treatment.

W3: The authors did not benchmark their estimator in an example where "homogeneity + monotonicity" is satisfied. How would their estimation strategy compare to the classical quantile-based estimator in this setting?

Response to W3: Thanks for your insightful comments. We add experiments by modifying the data generation process so that the data satisfy "homogeneity + monotonicity". Specifically, we set $\alpha = 1$ and add a constant to $Y_1$, i.e., $Y_1 = W_y \cdot Z + U_1 + 1$ and $Y_0 = W_y \cdot Z + U_1$ with $W_y \sim N(0, I_m)$. The corresponding numerical results are presented in the following tables. They show that under the 'homogeneity + monotonicity' setting, the proposed method performs less favorably than the quantile-based estimator.

| Methods | Sim-5 In-sample | Sim-5 Out-sample | Sim-10 In-sample | Sim-10 Out-sample | Sim-20 In-sample | Sim-20 Out-sample | Sim-40 In-sample | Sim-40 Out-sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quantile-Reg | 1.772±0.903 | 1.759±0.926 | 1.356±0.648 | 1.354±0.654 | 1.607±0.507 | 1.600±0.503 | 1.563±0.595 | 1.585±0.593 |
| Ours | 1.405±0.110 | 1.393±0.110 | 1.347±0.274 | 1.347±0.292 | 1.593±0.753 | 1.606±0.748 | 1.611±0.813 | 1.620±0.823 |

| Methods | Sim-5 In-sample | Sim-5 Out-sample | Sim-10 In-sample | Sim-10 Out-sample | Sim-20 In-sample | Sim-20 Out-sample | Sim-40 In-sample | Sim-40 Out-sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Quantile-Reg | 1.188±0.906 | 1.162±0.896 | 0.876±0.630 | 0.873±0.647 | 1.243±0.718 | 1.231±0.714 | 1.074±0.730 | 1.091±0.725 |
| Ours | 0.984±0.157 | 0.967±0.164 | 0.779±0.460 | 0.771±0.499 | 1.537±0.949 | 1.532±0.944 | 1.138±0.793 | 1.157±0.800 |
Review
Rating: 4

This paper proposes an approach for individual-level counterfactual inference without requiring structural causal models (SCMs). The approach is based on the introduction of the Rank Preservation Assumption, which posits that confounders account for all the correlation between potential outcomes, with correlation defined in terms of Kendall’s rank coefficient. Under this assumption, the authors derive a convex loss function and an unbiased estimator for the counterfactual outcome. They propose a kernel-based estimator and demonstrate its performance through theoretical analysis and empirical evaluation on both semi-synthetic and real-world datasets. Compared to existing methods, which typically require estimating an SCM along with a monotonicity assumption (i.e., the potential outcome is a monotonic function of unobserved variables) or assume equality of conditional quantile functions, this paper’s approach relies on weaker identifiability conditions and a simpler estimation procedure.

Strengths and Weaknesses

The paper addresses the complex problem of counterfactual inference by introducing the rank preservation assumption. The authors prove that this assumption is strictly weaker than homogeneity and strict monotonicity, which are commonly used in prior work. Based on this assumption, the authors derive a theoretically grounded loss function that avoids SCM estimation and bi-level optimization, making their method practically implementable. The proposed estimator also demonstrates strong empirical performance across multiple benchmarks and is backed by consistency, unbiasedness, and convergence guarantees.

Despite these contributions, the paper could benefit from more details on the assumption, in particular a more intuitive explanation of the rank preservation assumption (Assumption 4.2). While the definition is mathematically rigorous, its practical implications may not be immediately obvious. A brief paragraph and example are provided, but this may not be sufficient to help readers build strong intuition. Including additional pedagogical material to discuss in depth the different assumptions would significantly improve accessibility.

In addition, the exposition of the paper and in particular in Section 5, the derivation and approximation of the ideal loss, could be streamlined. The writing is mathematically dense and is difficult to follow. Distinguishing between the intuition, formal derivation, and implementation steps could enhance clarity.

Questions

The authors claim that Assumption 4.2 holds for common models such as those with additive noise $Y = g(X, Z) + U$ or heteroscedastic noise $Y = g(X, Z) + h(X, Z)\,U$. Could you elaborate on why this is the case?

  • How can practitioners assess or empirically validate the rank preservation assumption in real-world applications?

  • In the experiments, the proposed method appears to outperform all competitors consistently. This is somewhat surprising, as methods usually have specific regimes where they excel and others where they do not. Could you provide some intuition for these results? Have you explored other scenarios or settings? Did you use simple methods for estimating nuisance functions for the baselines? This is particularly surprising in settings where the assumption is violated, such as in Table 4 of the appendix, where some competitive methods are missing. Also, in non-simulated settings, how do you choose or tune the bandwidth parameter?

  • The reliance on kernel-based estimation could limit scalability in high-dimensional or large-sample settings. How do you handle scenarios with many continuous confounders? In particular, in your experiments, the estimator appears quite stable even when the covariate dimension reaches 40; could you clarify why?

minor remarks:

  • Finally, the abstract states that rank preservation is “not stronger” than homogeneity and strict monotonicity, whereas the main text (e.g., Proposition 4.1) shows that it is in fact strictly weaker. Clarifying this language in the abstract would help avoid potential confusion for readers.
  • In line 120, the references cited do not seem directly relevant to the point made. Highlighting heterogeneity does not necessarily imply that one must estimate individual-level effects rather than conditional effects. One could, for instance, use an extreme form of stratification with very small group sizes instead.

Limitations

The paper provides limited guidance on how to assess the plausibility of the rank preservation assumption in practice. As it is untestable from observational data, incorporating a discussion or any diagnostic tools would strengthen the method’s practical value.

Final Justification

The authors will include in the revised version more insight about the rank assumption as well as details on the estimation process which will reinforce the contribution.

Formatting Issues

no

Author Response

Thank you very much for your positive evaluation of our paper. Below, we hope that our clarification addresses your concerns.

Q1: The authors claim that Assumption 4.2 holds for common models such as those with additive noise $Y = g(X, Z) + U$ or heteroscedastic noise. Could you elaborate on why this is the case?

Response to Q1: Thanks for your comments, and we apologize for any lack of clarity. Consider the additive noise model ($Y = g(X, Z) + U$) as an example. This model satisfies both the homogeneity and strict monotonicity conditions (Assumptions 3.1 and 3.2). Then, by Proposition 4, $\rho(Y_x, Y_{x'}\mid Z)=1$ (Assumption 4.2) holds.

Q2: How can practitioners assess or empirically validate the rank preservation assumption in real-world applications?

Response to Q2: Thanks for your comments. In response to your question, we would like to clarify the following two points:

First, the rank preservation assumption is unverifiable empirically. This challenge is not unique: all causal inference methods rely on fundamental assumptions that cannot be directly tested, such as unconfoundedness or the validity of instrumental variables. The plausibility of these assumptions must instead be evaluated through domain expertise and contextual knowledge. Consequently, the choice of appropriate causal methods and identifiability assumptions depends on the specific problem at hand. This is the key difference between causal and non-causal methods.

Second, to assess the robustness of the proposed method, we typically perform sensitivity analyses under scenarios where key assumptions are violated. In Appendix E, we explored the empirical performance when the rank preservation assumption is violated. It indicates that the proposed method performs stably well even if rank preservation is violated.

Q3: In the experiments, the proposed method appears to outperform all competitors consistently. This is somewhat surprising, as methods usually have specific regimes where they excel and others where they do not. Could you provide some intuition for these results? Have you explored other scenarios or settings? Did you use simple methods for estimating nuisance functions for the baselines? This is particularly surprising in settings where the assumption is violated, such as in Table 4 of the appendix, where some competitive methods are missing. Also, in non-simulated settings, how do you choose or tune the bandwidth parameter?

Response to Q3: Thanks for your comments. We would first like to clarify that, as shown in Table 3, although our method is overall stably better than the baselines, CFQP outperforms our method on out-of-sample $\sqrt{\epsilon_{PEHE}}$ on the IHDP dataset, and CFRNet outperforms our method on in-sample $\epsilon_{ATT}$ on the Jobs dataset.

In addition, on the synthetic dataset, as shown in Table 1, our method outperforms the baselines in all scenarios, because the data generation process already satisfies the ranking assumption. These results are intuitive: all the baselines (except Quantile-Reg and Ours) are not designed for estimating counterfactual outcomes; they are designed to estimate the conditional average treatment effect $E[Y_{x'} \mid Z=z]$. In addition, they do not use the information in $y_x$ when estimating $y_{x'}$. As for the methods that do estimate counterfactual outcomes, namely Quantile-Reg and Ours: since the data-generating mechanisms violate the homogeneity assumption (Assumption 3.1) on which Quantile-Reg relies, Quantile-Reg underperforms our method.

Moreover, we explored other scenarios and settings, as shown in Table 2 and Table 4. Taking Table 4 as an example, compared to the original results in Table 1, the advantage of our method is smaller, and some results (3 out of 8) do not significantly outperform the best baselines.

We do not use simple methods for estimating nuisance functions for the baselines; all methods are implemented as intended, and all results are compared fairly.

Finally, regarding tuning the bandwidth parameter in non-simulated settings: we treat it as a hyperparameter and select it via grid search. We will add these details in the revised version. Thanks again.

Q4: The reliance on kernel-based estimation could limit scalability in high-dimensional or large-sample settings. How do you handle scenarios with many continuous confounders? In particular, in your experiments, the estimator appears quite stable even when the covariate dimension reaches 40; could you clarify why?

Response to Q4: We thank the reviewer for the questions. Note that we do not apply the kernel function directly to the original covariates. Instead, we first learn a low-dimensional covariate representation and then use this representation as the input; see the code in the supplementary material for more details (line 42 in the Ours1.py file). This is a standard implementation in the causal machine learning community. We will clarify this issue in the revised manuscript.
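As a rough illustration of this two-stage design, the sketch below uses PCA as a stand-in for the learned representation (the actual code uses a learned MLP; PCA here is purely an assumption for the sketch) and then computes Gaussian kernel weights on the low-dimensional representation rather than on the raw 40-dimensional covariates:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 500, 40, 5

Z = rng.normal(size=(n, d))   # high-dimensional covariates

# Stand-in for the learned representation: PCA via SVD of the centered data.
Zc = Z - Z.mean(axis=0)
_, _, Vt = np.linalg.svd(Zc, full_matrices=False)
R = Zc @ Vt[:k].T             # n x k low-dimensional representation

def kernel_weights(r0, R, bandwidth):
    """Normalized Gaussian kernel weights computed on the representation."""
    d2 = np.sum((R - r0) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return w / w.sum()

w = kernel_weights(R[0], R, bandwidth=3.0)   # weights for the first unit
```

Because the kernel operates in $k = 5$ dimensions rather than $d = 40$, the usual curse-of-dimensionality degradation of kernel smoothing is mitigated, which is consistent with the stability observed at covariate dimension 40.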

Q5: Finally, the abstract states that rank preservation is “not stronger” than homogeneity and strict monotonicity, whereas the main text (e.g., Proposition 4.1) shows that it is in fact strictly weaker. Clarifying this language in the abstract would help avoid potential confusion for readers.

Response to Q5: Thank you for pointing this out. We will revise the abstract to make it consistent.

Q6: In line 120, the references cited do not seem directly relevant to the point made. Highlighting heterogeneity does not necessarily imply that one must estimate individual-level effects rather than conditional effects.

Response to Q6: Thank you for your valuable comment. The cited references here all emphasize the distinction between individualized treatment effects and conditional average treatment effects, rather than asserting that one must estimate individual-level effects instead of conditional effects. We will revise this subsection to clarify this point further.

Comment

Q2: Thank you for the responses, but unfortunately they do not fully satisfy me. In causal inference there are hypotheses that are not testable; however, we still try to provide guidance to users. For example, for the unconfoundedness assumption, by discussing with experts, you try to build a DAG and identify the variables related to the treatment and the outcome, and then, as you mention, you also turn to sensitivity analysis.

My point, which is in fact shared by the other reviewers, is that more intuition and guidance are needed on this key assumption. You have provided some interpretative elements in your responses to the other reviewers, and it is really crucial to include these in the main body of the text, while also trying to go further and provide concrete examples where the assumption makes sense.

The method works better under this assumption, and as shown in the new simulation, the results may be less favorable than the quantile method under the “homogeneity + monotonicity” assumption, which makes sense. But for a user working with data, what would you recommend, and in which cases? The paper would greatly benefit from an in-depth discussion of these scenarios to maximize its practical impact.

Q3: I am not sure you have answered the question, which was: why does Table 4 not include all the same methods as Table 1? Could you provide the results with all the methods? In addition, you answer: "We do not use simple methods for estimating nuisance functions for the baselines; all methods are implemented as it should be and all the results are compared in a fair way." Please, can you be more specific? I was asking how the nuisance components are estimated and for more details on the estimation.

Q4: Thank you for the clarification, but the details should be included in the manuscript. Estimation is important and if you use dimensionality reduction before applying kernel, which indeed makes sense in high dimension it should be stated also to better understand the impact of the different choices.

Comment

Dear Reviewer 4AFD,

Thank you very much for your positive evaluation of our paper and for your helpful suggestions. Below, we hope that our clarification addresses your concerns.

Q2: My point, which is in fact shared by the other reviewers, is that more intuition and guidance are needed on this key assumption. You have provided some interpretative elements in your responses to the other reviewers, and it is really crucial to include these in the main body of the text, while also trying to go further and provide concrete examples where the assumption makes sense. The method works better under this assumption, and as shown in the new simulation, the results may be less favorable than the quantile method under the “homogeneity + monotonicity” assumption, which makes sense. But for a user working with data, what would you recommend, and in which cases? The paper would greatly benefit from an in-depth discussion of these scenarios to maximize its practical impact.

Response: Thank you very much for your kind comments and helpful suggestions. We fully agree with you that providing guidance on assumptions to users is important for real-world applications.

We apologize for not fully addressing the reviewer's concern in our previous response due to our misunderstanding, and we are glad that you noticed we provided some interpretative elements in our responses to the other reviewers. Following your suggestions, we added an additional example below to illustrate the rank preservation assumption (heterogeneity + monotonicity).

| Component | Description in the Paper | Real-World Analogue |
| --- | --- | --- |
| Treatment $X$ | Binary (0 vs 1) | Type of instruction a student receives in an online math course. $X = 0$: standard pre-recorded video lectures; $X = 1$: interactive AI tutor with real-time adaptation |
| Observed covariate $Z$ | A measured feature that jointly influences both treatments | Student’s prior algebra knowledge score from a placement test |
| Unobserved noise $U_x$ | Captures individual-specific factors; allowed to differ across treatments | Each learner’s intrinsic motivation / self-discipline during the course. Motivation while watching videos ($U_0$) may differ from that triggered by AI tutor challenges ($U_1$). |
| Outcome $Y_x$ | Must be strictly monotone in $U_x$ for any fixed $Z$ | Final exam score after the course |

In addition, we provide more heuristic discussions.

First, as a direct response, compared with the quantile method (ref. 13 in the manuscript), we recommend our method. The reasons are two-fold:

  • (a) Our method relies on weaker assumptions (Proposition 4.4) and accommodates "heterogeneity", making it applicable to a broader range of real-world scenarios;
  • (b) Our empirical results show that our method outperforms the quantile method in the setting "heterogeneity+monotonicity", and performs slightly better or comparably to the quantile method in the setting "homogeneity + monotonicity".

Second, although our rank preservation assumption is weaker than the combination of "homogeneity + monotonicity" assumptions, it is still a strong assumption and may be violated in practice. Thus, in practice, we also recommend conducting a sensitivity analysis to further assess the robustness of the conclusions to potential violations of this assumption. Specifically, when the rank preservation is relaxed into the following Assumption R:

  • Assumption R: $|\,\mathbb{P}(Y_{x} \leq y_x \mid Z=z) - \mathbb{P}(Y_{x'} \leq y_{x'} \mid Z=z)\,| \leq \alpha$ for $0 < \alpha < 1$. When $\alpha = 0$, Assumption R reduces to the rank preservation assumption. We can then explore how the bounds on $y_{x'}$ vary with $\alpha$ (a novel sensitivity analysis method). By theoretical analysis (similar to Theorem 5.2), we can show the following conclusion.
  • Let $R_{x'}(t \mid x, z, y, \beta) = \mathbb{E}[\,|Y_{x'} - t|\mid Z=z\,] + \mathbb{E}[\operatorname{sign}(Y_x - y)\mid Z=z]\cdot t + 2\beta t$. Under Assumption R, $y_{x'} \in [y^l, y^u]$, where the lower bound $y^l$ is the unique minimizer of $R_{x'}(t \mid x, z, y, \alpha)$ and the upper bound $y^u$ is the unique minimizer of $R_{x'}(t \mid x, z, y, -\alpha)$.

Thus, we can conduct experiments to evaluate how the conclusions (or $y^l$ and $y^u$) vary with the value of $\alpha$ in real-world applications.
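A small numerical sketch of this sensitivity analysis (with assumed Gaussian conditional distributions for a fixed $Z = z$, not the paper's setting) minimizes the relaxed loss at $\beta = \pm\alpha$ and checks that the resulting interval $[y^l, y^u]$ brackets the point estimate obtained at $\alpha = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Illustrative conditional samples for a fixed Z = z (assumptions for the sketch)
y_x  = rng.normal(0.0, 1.0, size=n)   # Y_x   | Z=z
y_xp = rng.normal(2.0, 1.5, size=n)   # Y_{x'} | Z=z
y_obs, alpha = 0.5, 0.1               # observed factual outcome, relaxation level

def loss(t, beta):
    # Empirical R_{x'}(t | x, z, y, beta) = E|Y_{x'}-t| + E[sign(Y_x-y)] t + 2 beta t
    return (np.mean(np.abs(y_xp - t))
            + np.mean(np.sign(y_x - y_obs)) * t
            + 2.0 * beta * t)

# The relaxed loss stays convex in t, so grid minimization suffices here.
grid = np.linspace(-6.0, 10.0, 3201)
y_lower = grid[np.argmin([loss(t, alpha) for t in grid])]    # beta = +alpha gives y^l
y_upper = grid[np.argmin([loss(t, -alpha) for t in grid])]   # beta = -alpha gives y^u
y_point = grid[np.argmin([loss(t, 0.0) for t in grid])]      # alpha = 0 point estimate
# y_lower <= y_point <= y_upper, and the interval widens as alpha grows
```

Intuitively, the extra $2\beta t$ term shifts the matched quantile level by $\mp\beta$, so the bounds correspond to quantiles of $Y_{x'}$ at levels $\mathbb{P}(Y_x \leq y \mid Z=z) \mp \alpha$.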

Comment

Q3: I am not sure you have answered the question, which was why Table 4 does not include all the same methods as Table 1. Could you provide the results with all the methods? In addition, how are the nuisance components estimated? Please give more details on the estimation.

Response: Thank you for your helpful comments. At that time, due to space limitations (and since we originally planned to include this table, Table 4, in the main text), we aimed to keep it concise by presenting only the more representative and competitive baselines. Following your suggestions, we have rerun the code, and the corresponding results are provided below.

| Methods | Sim-10 (Rank=0.3) In | Sim-10 (Rank=0.3) Out | Sim-10 (Rank=0.5) In | Sim-10 (Rank=0.5) Out | Sim-40 (Rank=0.3) In | Sim-40 (Rank=0.3) Out | Sim-40 (Rank=0.5) In | Sim-40 (Rank=0.5) Out |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T-learner | 7.05±0.77 | 7.06±0.79 | 4.44±0.64 | 4.44±0.67 | 15.07±0.61 | 15.03±0.85 | 9.45±0.31 | 9.41±0.42 |
| X-learner | 7.01±0.81 | 7.05±0.85 | 5.29±0.44 | 4.98±0.48 | 15.16±0.64 | 15.05±0.88 | 9.43±0.29 | 9.40±0.42 |
| BNN | 6.85±0.80 | 6.89±0.84 | 4.41±0.66 | 4.41±0.69 | 15.14±0.60 | 15.03±0.86 | 9.31±0.31 | 9.33±0.43 |
| TARNet | 6.92±0.79 | 6.96±0.82 | 4.38±0.59 | 4.38±0.60 | 14.91±0.61 | 14.80±0.82 | 9.33±0.36 | 9.33±0.45 |
| CFRNet | 6.85±0.79 | 6.87±0.85 | 4.34±0.60 | 4.35±0.65 | 15.11±0.67 | 14.99±0.94 | 9.30±0.17 | 9.29±0.36 |
| CEVAE | 6.95±0.76 | 6.98±0.81 | 4.38±0.70 | 4.40±0.70 | 14.97±0.50 | 14.89±0.87 | 9.31±0.33 | 9.32±0.46 |
| DragonNet | 6.97±0.83 | 7.02±0.88 | 4.43±0.62 | 4.44±0.62 | 15.04±0.54 | 14.94±0.77 | 9.18±0.30 | 9.16±0.50 |
| DeRCFR | 6.96±0.71 | 7.00±0.76 | 4.38±0.67 | 4.41±0.67 | 15.06±0.74 | 14.98±1.00 | 9.12±0.37 | 9.17±0.59 |
| DESCN | 7.31±1.82 | 7.37±1.83 | 4.26±0.74 | 4.29±0.74 | 14.88±0.96 | 14.81±1.23 | 9.20±0.40 | 9.20±0.46 |
| ESCFR | 6.91±0.69 | 6.96±0.76 | 4.41±0.65 | 4.41±0.68 | 15.01±0.58 | 14.86±0.82 | 9.25±0.18 | 9.26±0.31 |
| CFQP | 6.89±0.67 | 6.90±0.69 | 4.27±0.57 | 4.26±0.58 | 14.49±0.56 | 14.39±0.55 | 9.12±0.27 | 9.10±0.28 |
| Quantile-Reg | 6.79±0.79 | 6.63±0.84 | 4.12±0.63 | 4.14±0.65 | 13.12±0.62 | 13.25±0.87 | 8.39±0.30 | 8.43±0.42 |
| Ours | 6.00±1.22 | 6.01±1.29 | 3.88±1.25 | 3.90±1.30 | 10.76±0.52 | 10.84±0.43 | 6.75±0.24 | 6.80±0.21 |

For the comment "how the nuisance components are estimated", we would like to clarify that the estimation method is included in Appendix E, where we stated:

"We run all experiments on the Google Colab platform. For the representation model, we use the MLP for the base model and tune the layers in {1, 2, 3}. In addition, we adopt the logistic regression model as the propensity model. We tune the learning rate in {0.001, 0.005, 0.01, 0.05, 0.1}. For the kernel choice, we select the kernel function between the Gaussian kernel function and the Epanechnikov kernel function, and tune the bandwidth in {1, 3, 5, 7, 9}."

Q4: Thank you for the clarification, but the details should be included in the manuscript. Estimation is important and if you use dimensionality reduction before applying kernel, which indeed makes sense in high dimension it should be stated also to better understand the impact of the different choices.

Response: Thank you for your comment. We will include the associated details in the revised version. Thanks again.

We hope that the above clarification addresses your concerns.

Warm regards,

Authors

Comment

Thank you for your clarification, which addresses my concerns.

Review
Rating: 4

This paper addresses individual-level counterfactual outcome estimation by introducing a new identifiability assumption called rank preservation, claiming it is weaker than the traditional homogeneity and strict monotonicity assumptions. Building on this, the authors propose an ideal loss function for unbiased counterfactual estimation and develop a kernel-based estimator for empirical implementation. Theoretical results, including identifiability proofs, unbiasedness properties, and consistency analysis, are provided. The paper also critiques quantile regression-based approaches and demonstrates empirical results on semi-synthetic and real-world datasets.

Strengths and Weaknesses

Strengths

  • The paper addresses a challenging problem in counterfactual inference where full knowledge of the SCM is unavailable, a realistic scenario in many applications.
  • Introducing the rank preservation assumption provides an alternative path to identifiability that appears, theoretically, to relax previous requirements like strict monotonicity.
  • The proposed ideal loss circumvents the limitations of previous quantile regression methods, including their reliance on bi-level optimization.
  • The authors provide theoretical analysis for the proposed estimator.
  • Experimental results demonstrate consistent improvements over baseline methods.

Weaknesses

  • All identifiability assumptions (including this) are restrictive

While the authors rightly critique homogeneity and strict monotonicity assumptions for being strong and hard to verify, the proposed rank preservation assumption is similarly unverifiable in practice. Whether ranks of factual and counterfactual outcomes align across individuals is fundamentally untestable, especially when counterfactuals are unobserved. The assumption may be mathematically weaker, but its practical plausibility remains speculative.

Suggestion: The paper would benefit from a more candid discussion acknowledging that all identifiability assumptions in counterfactual inference, including rank preservation, impose unverifiable structure on unobserved quantities. Comparing empirical consequences of assumption violations (as partially done in Appendix E) is a good direction that could be expanded.

  • Reliance on propensity score estimation and kernel choices

The method depends on accurate estimation of propensity scores and bandwidth parameters for kernel smoothing. Theoretical sections highlight this, but practical guidelines for tuning and robustness checks are minimal.

Suggestion: More systematic sensitivity analyses, especially in high-dimensional settings, would clarify the method's reliability. Exploring alternatives to kernel smoothing for scalability would also improve practical value.

Questions

see suggestions above

Limitations

yes

Justification for Final Rating

see discussion above.

Formatting Issues

No

Author Response

Thank you very much for your positive evaluation of our paper. We hope that the clarification below addresses your concerns.

W1: All identifiability assumptions (including this) are restrictive. The paper would benefit from a more candid discussion acknowledging that all identifiability assumptions in counterfactual inference, including rank preservation, impose unverifiable structure on unobserved quantities. Comparing empirical consequences of assumption violations (as partially done in Appendix E) is a good direction that could be expanded.

Response to W1: Thanks for your helpful suggestions. We agree with you that all identifiability assumptions are restrictive and unverifiable or untestable. We will add a discussion of this point in the conclusion.

We also agree that comparing the empirical consequences of assumption violations is helpful. In Appendix E, we explored the empirical performance when the rank preservation assumption is violated. As a complement, we can further investigate the impact of this violation theoretically. Specifically, rank preservation can be relaxed into the following Assumption R:

  • Assumption R: $|\mathbb{P}(Y_x \leq y_x \mid Z=z) - \mathbb{P}(Y_{x'} \leq y_{x'} \mid Z=z)| \leq \alpha$ for some $0 \leq \alpha < 1$.

When $\alpha = 0$, Assumption R reduces to the rank preservation assumption. We can then explore how the bounds on $y_{x'}$ vary with $\alpha$ (a novel sensitivity analysis method). By a theoretical analysis similar to that of Theorem 5.2, we can show the following conclusion.

  • Proposition 5.7: Let $R_{x'}(t \mid x, z, y, \beta) = \mathbb{E}[\,|Y_{x'} - t|\mid Z=z] + \mathbb{E}[\operatorname{sign}(Y_x - y) \mid Z=z] \cdot t + 2\beta t$. Under Assumption R, $y_{x'} \in [y^l, y^u]$, where the lower bound $y^l$ is the unique minimizer of $R_{x'}(t \mid x, z, y, \alpha)$, and the upper bound $y^u$ is the unique minimizer of $R_{x'}(t \mid x, z, y, -\alpha)$.
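The bounds in Proposition 5.7 can be approximated numerically by minimizing a Monte-Carlo version of the risk over a grid of candidate values $t$. The conditional distributions and factual outcome below are assumed Gaussians chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Assumed toy conditional laws within a stratum Z = z (illustration only):
y_x  = rng.normal(0.0, 1.0, n)   # samples from Y_x  | Z = z
y_xp = rng.normal(1.0, 1.0, n)   # samples from Y_x' | Z = z
y_obs = 0.5                      # factual outcome y

def risk(t, beta):
    """Monte-Carlo version of R_{x'}(t | x, z, y, beta) in Proposition 5.7."""
    return (np.mean(np.abs(y_xp - t))
            + np.mean(np.sign(y_x - y_obs)) * t
            + 2.0 * beta * t)

grid = np.linspace(-3.0, 5.0, 801)

def minimizer(beta):
    return grid[int(np.argmin([risk(t, beta) for t in grid]))]

alpha = 0.2
lower, upper = minimizer(alpha), minimizer(-alpha)  # interval bounds on y_{x'}
point = minimizer(0.0)  # alpha = 0 recovers the rank-preservation point estimate
# With these Gaussians the point estimate sits near 1.5, between the bounds.
```

As $\alpha$ grows, the $2\beta t$ term tilts the objective, pushing the two minimizers apart and widening the interval, which matches the intuition that a larger allowed rank discrepancy yields looser bounds.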

W2: Reliance on propensity score estimation and kernel choices. More systematic sensitivity analyses, especially in high-dimensional settings, would clarify the method's reliability. Exploring alternatives to kernel smoothing for scalability would also improve practical value.

Response to W2: Thanks for your constructive suggestions. Empirically, we have explored several sensitivity analyses for various components of the proposed method, such as covariate dimensions (Tables 1 and 2), violations of rank preservation (Table 2 and Appendix E), heterogeneity degrees (Figure 1), and kernel choices (Appendix E).
Theoretically, we complement these with a sensitivity analysis for the violation of rank preservation (see our response to W1).
We leave a more thorough treatment of high-dimensional settings to future work. Thanks again for your kind comments.

Comment

Thank you for your rebuttal and additional experiments. While I appreciate the clarification, it does not fully address all of my concerns (i.e., sensitivity analyses in high-dimensional settings). I will keep my score unchanged.

Comment

Dear Reviewer niUF,

Thank you very much for your kind comments and timely feedback. In response, we have conducted additional sensitivity analyses in high-dimensional settings. The results are shown in the table below, where the data-generating processes are the same as those in Table 1 of the manuscript, except that the covariate dimension is increased from 5–40 to 80–160.

| Methods | Sim-80 In-sample | Sim-80 Out-sample | Sim-120 In-sample | Sim-120 Out-sample | Sim-160 In-sample | Sim-160 Out-sample |
|---|---|---|---|---|---|---|
| X-learner | 6.74±0.37 | 6.67±0.58 | 7.55±0.39 | 7.67±0.49 | 8.95±0.39 | 8.78±0.32 |
| TARNet | 6.45±0.51 | 6.49±0.64 | 7.77±0.47 | 7.86±0.62 | 9.34±0.57 | 9.15±0.52 |
| DragonNet | 6.32±0.35 | 6.34±0.54 | 7.18±0.39 | 7.22±0.43 | 9.17±0.41 | 8.97±0.46 |
| ESCFR | 6.20±0.50 | 6.22±0.57 | 7.61±0.56 | 7.68±0.62 | 8.89±0.38 | 8.66±0.39 |
| Quantile-Reg | 5.70±0.56 | 5.76±0.74 | 6.61±0.56 | 6.63±0.58 | 8.02±0.59 | 7.72±0.46 |
| Ours | 4.89±0.59 | 4.91±0.64 | 5.51±0.38 | 5.60±0.42 | 6.59±0.19 | 6.55±0.23 |

From this table, we can see that our method consistently outperforms the competing methods.

We hope that the above clarification addresses your concerns.

Warm regards,

Authors

Final Decision

This paper introduces a new assumption of rank preservation for counterfactual outcome estimation. The authors propose a learning framework that shows both theoretical guarantees and strong empirical performance.

The main strength of the paper is its novelty in relaxing traditional assumptions, providing a more flexible path for counterfactual inference, which was highly valued by the reviewers, while the main weakness is that rank preservation cannot be tested in practice.

Overall, the contribution is clear and significant. With stronger discussion of the practical meaning of rank preservation, the impact of the paper would be even greater.