PaperHub
Overall rating: 7.2 / 10
Poster · 4 reviewers
Reviewer scores: 4, 4, 3, 4 (min 3, max 4, std 0.4)
ICML 2025

Active Treatment Effect Estimation via Limited Samples

OpenReview | PDF
Submitted: 2025-01-20 · Updated: 2025-07-26
TL;DR

Active Treatment Effect Estimation via Limited Samples

Abstract

Keywords
causality

Reviews and Discussion

Official Review (Rating: 4)

The paper proposes a new active learning strategy for experimental design and an accompanying ATE estimator, derived from literature on estimators with finite sample guarantees. This is especially relevant to cases where experimental sampling must be constrained due to cost or other concerns. Additionally, the paper briefly discusses an extension to scenarios where SUTVA is violated.

Questions for Authors

In Table 1, the sample complexity for Addanki et al. is stated as $O(d\log(d) + d/\epsilon)$, which is the sample complexity for leverage sampling. However, only the proposed ITE estimator in that paper relies on leverage sampling. Their proposed ATE estimator instead recursively partitions the data using the GSW algorithm from Harshaw et al., so presumably the sample complexity would be different. Am I mistaken?

Claims and Evidence

The key contributions stated in the introduction are all supported by clear evidence.

Methods and Evaluation Criteria

The proposed methods make sense for the problem.

Theoretical Claims

I only briefly reviewed the proofs, and do not have any immediate concerns about correctness.

The proof of Lemma 4.4 borrows notation from Ghadiri et al. without defining it or referencing that paper, making it hard to follow without previously reading the other paper. In particular, $2y = t + Z \bigodot \mu$. Also, it does not appear that $\mu$ is formally defined in the main body of the text, but is presumably consistent with the definition from Ghadiri et al.

Experimental Design and Analysis

For the primary proposed algorithm, the experimental section was sound and the baselines were consistent with other papers in this area.

However, there is no experimental section for the SUTVA relaxed CGAS method. This does not impact the primary argument of the paper, but perhaps hints that the Section 6 material is better saved for an expanded sequel with more analysis.

One suggestion: The difference in sample complexity between the proposed RWAS method and the previously proposed RAHT method is primarily a function of the feature space size $d$, so I suspect further experiments investigating empirical performance differences as $d$ varies may be interesting. Further reinforcing this, in Table 2, the Twins data set shows the largest difference between RWAS and RAHT and also happens to have the largest feature space.

Supplementary Material

I briefly reviewed the appendix, which is primarily comprised of proofs. I have no immediate concerns.

Relation to Broader Literature

The paper adapts the proposed method from Ghadiri et al. to include a more efficient sampling strategy, derived from the more theoretical work of Chen and Price. The proposed estimator shows improved sample efficiency both theoretically and empirically.

Within the much broader literature, experiments with limited sample sizes are quite prevalent due to a number of reasons, including high costs or ethical concerns. The proposed methodology seems like a potentially valuable solution to this common scenario.

Missing Essential References

Nothing notable is omitted, to my knowledge.

Other Strengths and Weaknesses

The paper does a good job of presenting the problem and discussing the solution. It was relatively easy to follow their arguments and evaluation.

There are some minor notational issues, and while Section 6 appears to be quite valuable, it does not appear to have the space it deserves, including empirical validation.

Other Comments or Suggestions

I think there may be notational issues with the index $j$ in Definition 4.1. Initially we extract the sample with index $r(i)=j$, so presumably $X_{r(i)} = X_j$, which is used to create $P_{ij}$. In a following sentence where $A$ is defined, $j$ now denotes the $j$th coordinate of $X_i$, and we confusingly have $X_{r(i),j}$ despite previously defining $r(i)=j$ a few sentences prior.

In the Section 6 header, 'assumption' is misspelled as 'assupmtion'.

Author Response

Thanks for your advice! Here is our response point by point:

Q1: The proof of Lemma 4.4 borrows notation from Ghadiri et al. without defining it or referencing that paper, making it hard to follow without previously reading the other paper. In particular, $2y = t + Z \bigodot \mu$. Also, it does not appear that $\mu$ is formally defined in the main body of the text, but is presumably consistent with the definition from Ghadiri et al.

Thanks for the careful review and wonderful comment! Here $\mu$ denotes the vector composed of $Y_i(0)+Y_i(1)$, and $\bigodot$ denotes the Hadamard product. We agree that, even though we had previously stated that these local notations were adopted from Ghadiri et al., it is important to make this explicit again in the main text or appendix. We have revised the manuscript accordingly, and the clarified notation will be presented with full rigor in the final camera-ready version.

Q2: Add experiments under network interference

Thanks for your advice. First, we have conducted large-scale experiments across a range of values for both $d$ and $n$, as shown in our response to RW kg3L. Second, we have fully implemented your suggestion and carried out additional experiments under settings with network interference. Specifically, we followed both the simulation setup and the real-world dataset configuration used in the work by Lu et al. [1], fully replicating their basic settings. We refer the reader to the experimental results at https://anonymous.4open.science/r/Causal-Active-Sample-B0C5/README.md.

Q3: Experiments focusing on $d$

Refer to RW kg3L.

Q4: Notation issue

Thanks for your comment! We will fix this notational issue by choosing another symbol to denote the row of the vector or matrix, to avoid potential confusion.

Q5: The proposed ATE estimator in Addanki et al. instead recursively partitions the data using the GSW algorithm from Harshaw et al., so presumably the sample complexity would be different.

Thanks for your comment. The sample complexity for Addanki et al. is $O(d\log(d)+d/\epsilon)$ for ITE estimation; for ATE estimation, their paper controls the error as follows: "ATE estimation. For ATE estimation we give a randomized algorithm that selects at most $s$ individuals for treatment/control assignment and obtains an error of $\widetilde{O}\left(\sigma / \sqrt{s}+\left(\left\|\boldsymbol{\beta}^1\right\|+\left\|\boldsymbol{\beta}^0\right\|\right) / s\right)$." We will revise this entry in the final version. Thanks for your wonderful comment!


[1] X. Lu, Y. Wang, Z. Zhang. Adjusting auxiliary variables under approximate neighborhood interference. arXiv preprint arXiv:2411.19789.

Official Review (Rating: 4)

Experimental design for estimating treatment effects does not generally have strong finite-sample guarantees, especially as the dimensionality of the covariates grows. Recent works implement experimental design based on leverage scores. This work proposes an alternative approach called IRD, which helps achieve a sample complexity for the estimation error that is linear in the covariate dimensionality. The method is validated with a variety of standard semi-synthetic experiments.

Update after rebuttal: after considering the additional results provided, I have decided to increase my score.

Questions for Authors

  1. Specifically what role do partitioning and subsampling play in the proposed method?

  2. Does this method easily extend to multiple treatments?

Claims and Evidence

The contribution is clear and the method is sufficiently demonstrated on a variety of standard benchmarks. It would be helpful to see results that specifically highlight the finite-sample guarantee as a function of covariate dimensionality. Without this, there is little intuition on the supposed performance benefits.

Methods and Evaluation Criteria

The core novelties of this method like the adaptation of IRD are never fully described. Section 4.1 gives some intuition and a broad overview of the goals for the proposed method, but the steps in Algorithms 1 and 2 are not justified in detail. In my view, the authors need to put more effort into clearly explaining the motivation behind the new algorithm, beyond referencing recent works that they are building upon.

Theoretical Claims

I did not check the lengthy appendices that contain all of the proofs. However, I took a brief look and there are no obvious errors.

Experimental Design and Analysis

The paper presents numerous standard benchmarks in treatment-effect estimation like IHDP and the Boston dataset, for different subsample sizes.

Supplementary Material

I took a quick look over the proofs that comprise most of the supplementary material.

Relation to Broader Literature

The contributions are clearly stated in relation to recent works like those of Ghadiri et al. and Harshaw et al.

Missing Essential References

References are sufficient.

Other Strengths and Weaknesses

The problem of active sampling for treatment-effect estimation with high-dimensional covariates is clearly significant. The solution appears to have clear benefits over other recent works. It would be very helpful to better describe the method so that readers can understand the key contributions.

Other Comments or Suggestions

Some of the language is inappropriate for a venue like ICML. For instance, in the beginning of Section 4, the authors use exaggerated adjectives like "outstanding" and "superior" without reference to specific claims or objectives.

Author Response

Thanks for your review and comments! Here are the responses to all of your questions.

Q1: It would be helpful to see results as a function of covariate dimensionality.

Thanks for your suggestion! We provide the following supplementary experiments on the performance differences as $d$ varies, with $n=1000$ samples. The advantage of our RWAS algorithm over traditional CRA/GSW methods is especially apparent when the sample dimension $d$ is large, as a result of the optimal sample complexity of the active sampling process.

| $d$ | HT | Hajek | CRA | GSW | RAHT | SC | 4-Vectors | RWAS (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 1.98 (0.78) | 1.27 (0.53) | 0.98 (0.40) | 0.93 (0.38) | 1.08 (0.47) | 1.13 (0.65) | 1.15 (0.53) | 1.08 (0.46) |
| 20 | 3.67 (0.88) | 2.88 (0.57) | 2.10 (0.49) | 2.28 (0.70) | 2.41 (0.59) | 2.50 (0.73) | 2.46 (0.62) | 2.19 (0.55) |
| 50 | 1.14 (0.74) | 1.03 (0.50) | 1.02 (0.39) | 0.80 (0.64) | 0.80 (0.56) | 0.80 (0.68) | 0.92 (0.61) | 0.71 (0.47) |
| 100 | 1.82 (0.95) | 1.62 (0.86) | 2.22 (0.53) | 1.78 (0.87) | 1.53 (0.83) | 1.73 (0.90) | 1.80 (0.85) | 1.51 (0.82) |

Q2: Further description of the core novelties of this method, such as the adaptation of IRD.

Thanks for your concern. Please see the response to RW PucS.

Q3: Revise wording such as "outstanding"

Thank you for this helpful note. In response, we have carefully revised the wording in the beginning of Section 4 to remove exaggerated adjectives such as "outstanding" and "superior". We now use more neutral, objective language and anchor our claims directly to specific empirical results and theoretical guarantees. This revision aims to improve clarity and expression throughout the manuscript. We sincerely appreciate your feedback on this matter.

Q4: The role of partitioning and subsampling

Thanks for your concern! The overall goal of both partitioning and subsampling is to obtain efficient causal effect estimators at limited sample-collection and computation cost, but the two focus on different components.

  • Subsampling finds the most representative samples in terms of covariate distribution by selecting and up-weighting ``influential'' directions in the data.

  • Partitioning seeks well-balanced assignments and provides unbiased effect estimators with controllable expected error.

In Algorithm 1, partitioning and subsampling are deployed on two disjoint subsets of the entire sample: one applies GSW with regression adjustment, and the other applies IRD. Through weighting and averaging, we obtain an unbiased estimate with the total sample size being O(d).

Besides, the key idea is to improve the MSE through well-balanced assignments and representative samples. As stated in line 65, two types of techniques are applied to achieve these goals: partitioning and subsampling. The IRD algorithm selects representative samples to ensure a controllable estimation error. At the same time, the GSW design is applied to achieve balanced assignments by partitioning the samples into treatment and control groups. This motivates the following reasoning, which serves as another line of motivation:

  • We can use O(1) samples for the GSW regression-adjusted estimator.
  • However, learning the regression coefficient requires O(d) samples.
  • Therefore, we divide the data into two disjoint sets, one of which applies GSW with regression adjustment, and the other applies IRD, using O(d) samples to learn the regression coefficient from the first set.
  • These two sets each construct an estimator, and after weighting and averaging, we obtain an unbiased estimate, with the total sample size being O(d). Note that using either set alone as an estimator may be biased, because we actively select individuals whose X has good representational properties for the estimation; this group is not an independent, uniformly random draw from the entire sample. A toy numerical sketch of the weighting step is given below.
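Below is a minimal, self-contained numerical sketch of this weighting-and-averaging step. The equal weights and the toy numbers are illustrative assumptions, not the exact construction in Algorithm 1; the only point is that a fixed convex combination of two unbiased estimators stays unbiased.

```python
def combine_estimators(tau_hat_ird, tau_hat_gsw, w=0.5):
    """Weighted average of two ATE estimates computed on disjoint subsets.

    tau_hat_ird : estimate from the actively selected (IRD) subset
    tau_hat_gsw : regression-adjusted GSW estimate from the complementary subset
    w           : combination weight (0.5 is an illustrative assumption)
    """
    return w * tau_hat_ird + (1.0 - w) * tau_hat_gsw

# Toy usage: two estimates of the same ATE combined into one.
print(combine_estimators(1.9, 2.1))  # -> 2.0
```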

Q5: Does this method easily extend to multiple treatments?

Thanks for your question. It is not complicated to propose a solution for a multiple-treatment extension. For example, the estimator of the treatment effect between two treatment levels $Z=z_1$ and $Z=z_2$ can be constructed through our method by restricting treatment allocation to $\{z_1, z_2\}$, as long as the positivity assumption $P(Z_i=z)>0$ holds for the targeted levels. On the other hand, it is possible to find methods to compare different treatment pairs through one sampling process, or to combine multiple processes to further increase efficiency, which requires further research. It might be highly non-trivial due to the proof related to the martingale in the GSW design.

Official Review (Rating: 3)

The authors considered the problem of estimating the causal effect in an active sampling framework. In particular, they proposed a method called RWAS which attains a sample query complexity of $O(d/\epsilon)$ to achieve $\epsilon$-approximation error, where $d$ is the number of covariates. Moreover, they also provided a lower bound, showing the optimality of the proposed method up to some constant. They performed experiments on synthetic and real datasets showing that the proposed method has better performance than baseline methods.

Update after rebuttal

I checked the proof of the lower bound given in the rebuttal and it looks correct to me. I suggested adding these explanations to the paper. I adjusted my score. However, I still believe that the paper is not well-written. Therefore, I could not recommend a clear acceptance.

Questions for Authors

  • What are the technical novelties compared to Chen & Price (2019)? It seems that the results are an adaptation of results in Chen & Price (2019) to the current setting. The explanation of the proposed method in lines 134-142 is vague. For instance, it is not clear what is meant by "This approach serves as a refinement of Chen & Price (2019), in which we transfer the strategy (Definition 5 in Chen & Price (2019)) to the design-based setting."
  • Please explain Algorithm 1 line by line and explain the intuition behind each part there.
  • Lines 149-157 are not well written. For instance, "We say it is a 'good $\epsilon$-reweighting sampling strategy' if (i) Defining a matrix". Moreover, what is the second condition (ii) here?
  • Regarding the statement of Theorem 5.1: first, please fix "\forall any" to "For any". Moreover, the statement is a little bit weird, as it says that "for any algorithm whose output $\hat{\tau}$ satisfies $|\hat{\tau}-\tau|\leq 0.1$ with probability 0.75", the number of sample queries is at least $2.86d/\epsilon$; but $|\hat{\tau}-\tau|\leq 0.1$ does not depend on $\epsilon$ and is only required to be less than 0.1.
  • In line 287, what is the notation "*"? In line, what is the notation $\tilde{\mathcal{N}}(i)$?

Claims and Evidence

The paper is a theoretical work, and proofs are provided for the theorems and lemmas in the paper. However, the paper is not well written and it is hard to follow some parts of it. For instance, some terms or definitions are not defined. Moreover, it is often assumed that the reader is completely knowledgeable about all aspects of the problem. The authors should give more explanation or context about the proposed method.

Methods and Evaluation Criteria

The main metric for comparing the methods is the number of sample queries needed to achieve an $\epsilon$-approximation error, which is a reasonable evaluation criterion.

Theoretical Claims

Unfortunately, I only had time to check the proof of Lemma 4.2.

Experimental Design and Analysis

I did not check the codes, but I read the experiment section. Based on the text, it seems that the experimental results are sound and the authors also provided some explanations for the plots.

Supplementary Material

I just checked Appendix A regarding the proof of Lemma 4.2.

Relation to Broader Literature

Identifying causal effects is one of the main goals in many areas of empirical science. One of the main approaches in causal effect identification is to perform experiments. This paper considers active sampling in experiment design in order to provide order-optimal methods in estimating the causal effect.

Missing Essential References

I am not an expert in the specific area of active sampling for causal inference. Therefore, I am not sure whether any important reference is missed.

Other Strengths and Weaknesses

Strengths:

  • The proposed method is order-optimal for active sampling to estimate the causal effect. For this purpose, they also gave a lower bound on the number of queries required by any algorithm to achieve an $\epsilon$-approximation error.

  • The experimental results showed that the proposed method has better performance compared to previous work in most real datasets.

Weakness:

  • Some parts of the paper are not well written, and some notations or terms are used without being defined, which makes it hard to read the paper or validate the proofs.

Other Comments or Suggestions

I suggest revising the whole paper and making sure that all the terms are defined. Moreover, please provide a general description of the proposed method first and then describe the details of each part. In the current version, for instance, it is hard to understand how Algorithm 1 works, such as the purpose of the GSW design. As another example, the term HT is used before stating that it refers to the Horvitz-Thompson estimator.

Author Response

Thank you for your thoughtful review and valuable questions! We address your questions point by point in the following.

Q1: However, the authors should provide more explanation or context about the proposed method.

  1. Insightful motivation.

Motivating example. Suppose $X$ is $n \times 2$, and we want a subset to approximate $X^\top X$ for regression. Random sampling may cluster points along a single direction, leading to rank deficiency or poor conditioning. In contrast, Algorithm 2 adapts sampling probabilities to promote diversity: it increases the chance of selecting points in underrepresented directions by updating $B_i$. This ensures that new samples help span missing dimensions. After a few rounds, the selected subset captures both primary axes and preserves the geometry.

To avoid over-representing outliers, constraints on $\alpha_i$ limit how much influence any point can have. As a result, the spectrum of the reweighted matrix $A^\top A$ remains close to that of the full dataset (identity matrix after normalisation). We avoid redundancy while preserving rare but critical directions, achieving near-optimal regression performance with far fewer, well-chosen samples.
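To make this concrete, here is a toy, self-contained sketch of the diversity-promoting idea in two dimensions. This is not the paper's Algorithm 2: the scoring rule below (weighting each point by how poorly the currently covered directions, tracked in a matrix $B$, explain it) is a simplified stand-in used only to illustrate how updating $B_i$ steers new samples toward underrepresented directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.normal(size=(n, d))

selected = []
B = 1e-3 * np.eye(d)                      # tracks directions covered so far
for _ in range(20):
    # Score each point by how poorly the current subset covers its direction.
    scores = np.einsum('ij,jk,ik->i', X, np.linalg.inv(B), X)
    probs = scores / scores.sum()
    i = rng.choice(n, p=probs)            # repeats are possible in this toy version
    selected.append(i)
    B += np.outer(X[i], X[i])             # update coverage with the new sample

# The Gram matrix of the small selected subset stays close to the full one.
print(np.round(X.T @ X / n, 2))
print(np.round(X[selected].T @ X[selected] / len(selected), 2))
```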

  2. Detailed interpretation of the main algorithm
| Step | Description |
| --- | --- |
| Line 1 | Subsequently assign the Bernoulli trial to the cases selected by IRD. |
| Lines 2–4 | The IRD algorithm (Alg. 2) is deployed to select a subset $S$ representative of the covariate distribution of the entire sample, and to derive the estimate of the adjustment parameter $\hat{\beta}^{act}$. |
| Line 5 | Compute the Horvitz-Thompson estimator on $S$ following the Bernoulli trial assigned in Line 1. |
| Lines 6–7 | Randomly sample a subset from the complement of $S$ and deploy the GSW design to derive a balanced assignment. Adjust the HT estimator to improve efficiency. |
| Line 8 | Weight the causal effect estimators on the representative set $S$ and the set $\bar{S}_{m'}$ to yield the combined estimator using well-balanced assignments and well-representative samples. |
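For readers who prefer code, a rough end-to-end sketch of how these steps compose is given below. Every function body is a simplified stand-in (IRD is replaced by uniform subsampling and GSW by independent coin flips), and the subset sizes and equal weights are illustrative assumptions rather than the choices made in Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def ird_select(X, m):
    """Stand-in for the IRD subsampling step (Alg. 2): here a uniform draw."""
    return rng.choice(len(X), size=m, replace=False)

def bernoulli_assign(idx):
    """Line 1: independent Bernoulli(1/2) treatment assignment."""
    return rng.integers(0, 2, size=len(idx))

def gsw_assign(idx):
    """Stand-in for the Gram-Schmidt Walk design: here also a coin flip."""
    return rng.integers(0, 2, size=len(idx))

def ht_estimate(y_obs, z):
    """Horvitz-Thompson ATE estimate under p = 1/2 Bernoulli assignment."""
    return np.mean(2 * z * y_obs - 2 * (1 - z) * y_obs)

# Toy data: linear potential outcomes with a constant treatment effect of 2.
n, d = 200, 5
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y0 = X @ beta + rng.normal(scale=0.1, size=n)
y1 = y0 + 2.0

# Lines 2-4: select a representative subset S and fit the adjustment coefficient.
S = ird_select(X, m=4 * d)
zS = bernoulli_assign(S)                        # Line 1: Bernoulli trial on S
yS = np.where(zS == 1, y1[S], y0[S])
beta_act = np.linalg.lstsq(X[S], yS, rcond=None)[0]

# Line 5: HT estimator on S.
tau_S = ht_estimate(yS, zS)

# Lines 6-7: sample from the complement of S, assign, and regression-adjust.
comp = np.setdiff1d(np.arange(n), S)
M = rng.choice(comp, size=4 * d, replace=False)
zM = gsw_assign(M)
yM = np.where(zM == 1, y1[M], y0[M])
tau_M = ht_estimate(yM - X[M] @ beta_act, zM)   # adjust outcomes before HT

# Line 8: weight and average the two estimators (equal weights assumed here).
tau_hat = 0.5 * tau_S + 0.5 * tau_M
print(f"{tau_hat:.3f}")
```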

For another line of motivation, please refer to our response to RW RneH, Q4.

  3. Additional comparison
  • The technical novelty (i): First, we refine the previous result in Chen & Price (2019) to adapt it to a finite-sample setting. This relies on refining the parameter in the $\epsilon$-reweighting strategy, compared to that of Chen & Price (2019), Definition 2.1. Intuitively, consider the OLS estimator regressing $Y$ on $X$ with any arbitrary weight $w_i$. We have $E[w_i (\hat{Y}_i^{ols} - Y_i)] = 0$, so the expectation of the noise term vanishes. However, in finite samples, such a sum may deviate from zero, necessitating further refinement of the weighting parameters.

  • The technical novelty (ii): It brings significant challenges since, compared with the setting in Chen & Price (2019), each individual's outcome is not fixed as $Y_i$; instead, the outcome is randomly one of $Y_i(0)$ and $Y_i(1)$, jumping between the real world and the counterfactual world.

Q2: Technical novelties compared to Chen & Price (2019)?

Kindly see Q1-Additional comparison

Q3: Explain Algorithm 1

Kindly see Q1-Detailed interpretation of main algorithm.

Q4: The writing of lines 149-157, and the meaning of condition (ii).

The second condition (ii) is intended to impose an upper bound on the total weighting factors $\alpha_i$. Specifically, this condition ensures that the cumulative weights are controlled, preventing any single sample or a small group of samples from excessively dominating the overall estimator.

To clarify the mathematical intuition, we provide a rewritten version of the paragraph. Briefly, it includes (i) a spectral approximation condition, (ii) a bounded cumulative weight condition, and (iii) a single-point influence condition. Due to space limitations, we kindly refer you to the anonymous link: https://anonymous.4open.science/r/Causal-Active-Sample-B0C5/README.md.

Q5: The parameter analysis of the lower bound

Overall, setting the bound as $|\hat{\tau}-\tau|\leq \epsilon$ or as $0.1$ is correct in either case (as long as $\epsilon < 0.1$). We sincerely appreciate your suggestion: explicitly including the $\epsilon$ parameter in the estimation error bound improves clarity and helps readers better understand the overall process and how the Shannon-Hartley upper bound applies to the estimation error.

Q6: Notation meaning

"∗" is a "null option"; we also refer the reviewer to Algorithm 4 in the appendix. The definition of the notation $\mathcal{N}_b^{\pi}(i)$ is deferred to the above anonymous link.


We are eager to hear your feedback. We’d deeply appreciate it if you could let us know whether your concerns have been addressed.

Reviewer Comment

Thanks for the response. It is now somewhat clearer to me what the contributions of the current work are. However, I still believe that the paper is not well-written — at least, I personally find it difficult to follow. This might be due to the fact that I am not a complete expert in this specific area of causal inference, and some concepts that the authors assume to be known could benefit from clearer explanations. To fairly assess this paper, I suggest relying more on the other reviewers who appear to be more familiar with this area.

Regarding the responses, I still have one question about the lower bound. I think the statement in the paper, as well as the explanation in the rebuttal, is not quite correct. If I choose a very small $\epsilon < 0.1$, then an estimator that merely guarantees $|\tau - \hat{\tau}| < 0.1$ would still require access to $\Omega(d/\epsilon)$ samples, which is quite counterintuitive. I still believe that the condition $|\tau - \hat{\tau}| < 0.1$ should be replaced with $|\tau - \hat{\tau}| < \epsilon$, and that the authors are actually proving this bound for any $\epsilon \in (0, 0.1)$.

Author Comment

Thank you very much for your prompt response!


Such a counterintuitive case of data generation does exist: the estimation error is only required to be controlled within $0.1$ with some probability, yet achieving this still requires samples of at least order $O(d/\epsilon)$.

Intuition. Even under this mild condition on the estimation error, one can still generate "hard, unsatisfactory" instances that make estimation "difficult", requiring many samples.

Example. In our Appendix E ("The proof of the lower bound"), we provide a counterintuitive construction of the data-generating process under the super-population perspective. We consider the following construction: $\tau(X_i) = L(X_i) + \mu$. Here $L(\cdot)$ is a pre-fixed function, selected from the linear family $\mathcal{L}$ and satisfying $\| L \|_D = 1$. The noise $\mu$ is "important": it is i.i.d. Gaussian noise satisfying $\mu \sim N\left(0, \frac{1}{\epsilon}\right)$. $D$ is the uniform distribution on the $d$-dimensional Euclidean space, and $X_i$ is sampled from $D$. We set $\tau(X_i)$ as the corresponding ITE value $\tau_i$, and then we can naturally provide a feasible construction of $Y_i(1), Y_i(0)$.

Why this example makes sense. We follow the technique used in active sampling.

Step 1. First, there exists a subset $\{L_1, L_2, \ldots, L_s\}$ which satisfies $s \geq 2^{1.919d}$ and in which each pair is separated by a nontrivial "distance". Specifically, there exists a subset $\{L_1, L_2, \ldots, L_s\} =: \mathscr{L} \subseteq \mathcal{L}$ with $s \geq 2^{0.7d}$, $\|L_i\|_D = 1$, and $\|L_i\|_{\infty} \leq 1$. Moreover, $\|L_i - L_j\|_D \geq 0.2$.

This construction is achieved via a recursive greedy algorithm. As illustrated in Appendix E (page 17), we first restrict $\mathcal{L}$ to all functions mapping to $\{\pm 1\}^d$. On this basis, we recursively pick a legitimate function and, in each step, remove all functions within distance $0.2$ of the selected one. This process outputs at least $2^{1.919d}$ functions according to our Lemma E.1 in Appendix E.

Step 2. We derive a lower bound on the mutual information between the randomly chosen generating function $L_j$ and any algorithm's output.

Let $I(L_j; \hat{\tau})$ denote the mutual information between a uniformly randomly chosen function $L_j \in \mathcal{L}$ and the algorithm's output $\hat{\tau}$, given $|\mathcal{L}| = s$ observations $(X_1, \ldots), (X_2, \ldots), \ldots, (X_s, \ldots)$ generated from $\tau(X_i) = L_j(X_i) + \text{noise}$, with noise $\sim \mathcal{N}(0, \frac{1}{\epsilon})$. By Fano's inequality, we get $I(L_j; \hat{\tau}) = H(L_j) - H(L_j \mid \hat{\tau}) \geq 0.4\log(s) \geq 1.43$ (where $H(\cdot)$ denotes entropy).
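For reference, the generic form of Fano's inequality being invoked here (Cover & Thomas, 1991) states that for a uniformly chosen $L_j$ among $s$ candidates and any decoded guess $\hat{L}$ with error probability $P_e = P(\hat{L} \neq L_j)$,

$$H(L_j \mid \hat{L}) \leq h(P_e) + P_e \log(s-1), \qquad \text{and hence} \qquad I(L_j; \hat{L}) \geq (1 - P_e)\log s - h(P_e),$$

where $h(\cdot)$ is the binary entropy. How this generic bound is instantiated with the continuous output $\hat{\tau}$ (and yields the constants $0.4$ and $1.43$ above) follows the authors' Appendix E.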

Step 3. We can also obtain an upper bound on the above mutual information.

By the Shannon–Hartley theorem, this leads to $I(L_j; \hat{\tau}) \leq \frac{s\epsilon}{2}$.
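One way to read this step, assuming unit signal power per observation (consistent with $\|L_i\|_{\infty} \leq 1$ above) and Gaussian noise of variance $\sigma^2 = 1/\epsilon$, is via the Gaussian channel capacity bound over $s$ channel uses:

$$I(L_j; \hat{\tau}) \;\leq\; \frac{s}{2}\log\!\left(1 + \frac{P}{\sigma^2}\right) \;\leq\; \frac{s}{2}\log(1+\epsilon) \;\leq\; \frac{s\epsilon}{2},$$

using $P \leq 1$ and $\log(1+x) \leq x$ (with the natural logarithm).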

Combining Steps 1-3 leads to the expected result: $1.43 \leq I(L_j; \hat{\tau}) \leq \frac{s\epsilon}{2}$ yields $s = \Omega(2.86d/\epsilon)$, namely, the requirement of $O(d/\epsilon)$ samples.

Inspired by your insightful consideration, we have polished the statement of our theorem (under the super-population perspective); it will be presented as follows in the camera-ready version:

Theorem 5.1 (Lower bound). For any fixed dimension $d$ and any sufficiently small $\epsilon \in (0, 0.1)$, there exists a feasible set of $\{\boldsymbol{Y}^1, \boldsymbol{Y}^0\}$ such that any algorithm whose output $\hat{\tau}$ satisfies $|\hat{\tau}-\tau| \leq 0.1$ with probability $0.75$ needs at least $2.86d/\epsilon$ sample queries. Specifically, such a data-generating process can be chosen as i.i.d. Gaussian noise $\mu \sim N(0, \frac{1}{\epsilon})$ and $\tau(X_i) = L(X_i) + \mu$, where $X_i$ is sampled from some distribution $D$ and $\| L \|_D = 1$.

Insight. This demonstrates that the sample complexity of our method is optimal and cannot be improved: even under a relaxed requirement on the estimation error, there still exist challenging instances for which achieving accurate estimation requires a complexity of $O(d/\epsilon)$.


References

T. Cover and J. Thomas (1991). Elements of Information Theory. pp. 38–42. ISBN 978-0-471-06259-2.

R. V. L. Hartley (July 1928). "Transmission of Information". Bell System Technical Journal, 7(3): 535–563.


Your suggestions have already been incorporated into the revised version to improve its clarity and accessibility for a broader research audience. Moreover, it would mean a great deal if our efforts to address your concerns could be reflected in your evaluation. We would be deeply grateful if you would consider revisiting your initial score in light of our discussion.

Official Review (Rating: 4)

This paper developed a finite-sample estimator with sample complexity analysis for causal effect estimation. The paper demonstrated the near-optimality of the sample size, and further extended the framework to social networks. Numerical experiments with simulated and real-world data supported the effectiveness of the proposed estimator.

Update after rebuttal:

I have read through the authors' response and will be keeping my original scores.

Questions for Authors

In the experiments, there are several settings where CRA / GSW actually outperformed the proposed estimator. Can you provide more intuition or justification for when the proposed estimators tend to perform better or worse?

Claims and Evidence

Overall the proposed framework is well-described and the main claims in the paper are well-supported. The main theoretical results include the ATE estimation error upper bound and a matching lower bound under a superpopulation perspective. The proof sketch provided for the upper bound is clearly presented.

Methods and Evaluation Criteria

The upper bound was established via two main steps: the IRD algorithm, adapted from Chen & Price (2019)'s results to the design-based setting, and the use of the coefficients obtained from IRD to adjust the finite-sample estimator. The lower bound shows that this upper bound is near-optimal. Based on the experiments with synthetic and real-world data, the proposed estimator did not substantially outperform all baselines, but achieved comparable or better results in most of the scenarios.

Theoretical Claims

To the best of my knowledge, the upper bound proof is correct.

Experimental Design and Analysis

Based on the presented content in the main paper, the experimental design appears to be sound.

Supplementary Material

I was able to review the overall proof of the upper bound in rough detail; it appears to be correct to the best of my knowledge.

Relation to Broader Literature

The broader literature was well-addressed.

Missing Essential References

I did not notice missing essential references.

Other Strengths and Weaknesses

The paper was well-organized, claims and proof sketches were clearly stated.

Other Comments or Suggestions

Minor typo: Section 4, first paragraph, Iterative Reweighting in Design-based "Set- ting".

Author Response

Acknowledgement and General Response to RW kg3L, RW PucS, RW RneH, RW fVjN

Thrilled to receive such a positive reception from dear reviewers! Also, sincerely thank all reviewers for their insightful suggestions! In this general response, we carefully synthesized the reviewers’ suggestions to facilitate the AC’s assessment and, more importantly, use them as a blueprint to guarantee the quality of the camera-ready version. We have addressed these points:

  1. Detailed and intuitive explanation of motivation, methods and algorithms (RW PucS, RW RneH). In our rebuttal and our revised version, we provide (i) insightful motivation and intuition, (ii) additional comparison with previous literature and (iii) detailed interpretation of the main algorithm, especially for audiences who are unfamiliar with causality or active sampling.

  2. Experiments (RW kg3L, RW fVjN). We provide additional performance comparisons with respect to the dimension $d$, the sample size $n$, network interference, etc., to check in which circumstances the method is more advantageous. These empirical results back up our theories and intuitions.

  3. Clarify some misleading statements and notations (RW kg3L, RW PucS, RW fVjN). Thanks for the meticulousness. Our improved version has made a few isolated definitions more transparent and intuitive.

If you have any further questions, please don't hesitate to contact us directly. We will respond as promptly as possible and do our utmost to address every concern thoroughly. Let's enjoy this fruitful discussion together.


For RW kg3L: Thanks, and please kindly find below our concise and clear rebuttal addressing your concerns.

Truly appreciate your strong endorsement! Here is our response:

Q1: the synthetic experimental design appears to be sound. In the experiments, there are several settings where CRA / GSW actually outperformed the proposed estimator. Can you provide more intuition or justifications as to why the proposed estimators tend to perform better or worse?

Thanks for your advice! More intuitions are as follows:

  1. The advantage of our RWAS algorithm over traditional CRA/GSW methods is especially pronounced when the sample dimension $d$ is large. This is because we have proven, through bounding arguments, that the sample complexity of RWAS is near-optimal for the active sampling process achieving a $(1+\epsilon)$ relative fitting error, namely $O(d/\epsilon)$. As a result, when the sample dimension $d$ is relatively large, our error is smaller than that of other methods.

  2. The data structure (i.e., whether there exists a subset with efficient global representational capability) plays an important role in comparing the performance of these algorithms. Compared to the GSW algorithm, our key differences are that

    • We adopt the IRD active sampling strategy (yielding a subset whose $X^\top X$ distribution closely mirrors the overall distribution), thus obtaining regression coefficients from $X$ to the individual treatment effect (ITE $= Y(1)-Y(0)$); and
    • We apply regression adjustment to GSW itself (leveraging the regression coefficients from the former step). Intuitively, suppose the covariate distribution of this subset closely approximates the overall distribution. In that case, (i) our regression-adjusted GSW will be more efficient, and (ii) our IRD will produce a smaller fitting error for estimating the overall ATE.
  3. Motivated by your inspiration, we provide the following supplementary experiments on the performance differences as $d$ varies, which align with the above intuition.

| $d$ | HT | Hajek | CRA | GSW | RAHT | SC | 4-Vectors | RWAS (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 1.98 (0.78) | 1.27 (0.53) | 0.98 (0.40) | 0.93 (0.38) | 1.08 (0.47) | 1.13 (0.65) | 1.15 (0.53) | 1.08 (0.46) |
| 20 | 3.67 (0.88) | 2.88 (0.57) | 2.10 (0.49) | 2.28 (0.70) | 2.41 (0.59) | 2.50 (0.73) | 2.46 (0.62) | 2.19 (0.55) |
| 50 | 1.14 (0.74) | 1.03 (0.50) | 1.02 (0.39) | 0.80 (0.64) | 0.80 (0.56) | 0.80 (0.68) | 0.92 (0.61) | 0.71 (0.47) |
| 100 | 1.82 (0.95) | 1.62 (0.86) | 2.22 (0.53) | 1.78 (0.87) | 1.53 (0.83) | 1.73 (0.90) | 1.80 (0.85) | 1.51 (0.82) |

The results indicate that our RWAS method dominates the previous methods, especially when $d$ is large, which is relevant in practice. The code is available at https://anonymous.4open.science/r/Causal-Active-Sample-B0C5/README.md.

Q2: Minor typo.

Sincere thanks for your careful review! We have corrected this typo from "Set-ting" to "Setting".

Final Decision

Almost all reviewers are favorable to the paper.

There is one issue with the lower bound with the way it was stated. Authors show that a specific data generating process could explain the counterintuitive-looking claim in the lower bound (I read through it and seems satisfactory). I encourage the authors to update the statement of the theorem for lower bound as they have promised.

The authors show that, given $n$ units with covariates to be assigned to treatment and control groups, in the active learning setting one can sub-select only $O(d/\epsilon)$ units for assignment, and they also propose a way to partition them, achieving an ATE estimation error of at most $\epsilon$ when the potential outcomes follow a linear model. The authors exploit a sub-selection procedure of Chen and Price (2019) that actively sub-selects covariates for labeling in linear regression, representative of the full set of covariates, and further combine it with partitioning using the Gram-Schmidt Walk type experimental design of Harshaw et al. (2024) that preserves covariate balancing.

Given the results of Chen and Price (2019) and their advantages over leverage score sampling (employed in previous active ATE estimation works), the results here are not totally surprising. However, the final result establishes near-optimal sample complexity, tightening existing results.

Therefore I recommend accept.