PaperHub

Overall rating: 4.9/10 · Poster · 4 reviewers (min 2, max 3, SD 0.4)
Individual ratings: 3, 3, 3, 2

ICML 2025

Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing

Submitted: 2025-01-11 · Updated: 2025-07-24
TL;DR

This paper proposes a randomization inference framework using conformal inference to selectively borrow external controls, ensuring exact type I error control and valid post-selection inference, with application to a lung cancer trial.

Abstract

Keywords
causal inference, data fusion, randomization test, real-world data and evidence, small sample size

Reviews and Discussion

Official Review
Rating: 3

This paper proposes to use Fisher Randomization Test (FRT) in RCTs when leveraging external controls (EC). Since FRT only uses the randomization distribution of a test statistic under the sharp null, it always provides valid type-I error control regardless of how the potentially biased ECs are incorporated. In this way, ECs are used as a tool for constructing more powerful test statistics. In doing so, the authors recognize the issue of bias in ECs, and propose to improve the test statistic by selectively borrowing ECs by thresholding their conformal p-values. To choose the threshold, a cross validation paradigm is also proposed. The proposed methods are demonstrated via simulations and real data illustrations.
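The selection mechanism the review describes can be sketched in a few lines. This is a minimal illustration with hypothetical function names and simulated scores, not the authors' implementation: each EC receives a conformal p-value by ranking its nonconformity score against the RCT-control calibration scores, and ECs whose p-values exceed a threshold gamma are borrowed.

```python
import numpy as np

def conformal_pvalues(calib_scores, ec_scores):
    """Conformal p-value for each external control (EC): the relative rank
    of its nonconformity score among the RCT-control calibration scores."""
    n = len(calib_scores)
    return np.array([(1 + np.sum(calib_scores >= s)) / (n + 1)
                     for s in ec_scores])

def selective_borrow(calib_scores, ec_scores, gamma):
    """Borrow ECs whose conformal p-value exceeds the threshold gamma.
    gamma = 0 recovers full borrowing; gamma = 1 recovers no borrowing."""
    p = conformal_pvalues(calib_scores, ec_scores)
    return np.flatnonzero(p > gamma)

rng = np.random.default_rng(0)
calib = np.abs(rng.normal(0, 1, 25))   # scores from 25 RCT controls
ec = np.abs(rng.normal(0, 1, 50))      # scores from 50 candidate ECs
selected = selective_borrow(calib, ec, gamma=0.1)
```

When EC and RCT-control scores are exchangeable, the p-values are (super-)uniform, so roughly a fraction gamma of unbiased ECs is discarded; biased ECs concentrate at small p-values and are dropped.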

Questions for the Authors

  • Is there a connection between the MSE of $\hat{\tau}$ and the power of the RCT?
  • Is it possible to get back to the estimation problem based on this approach?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I briefly checked the proofs which seem correct to me.

Experimental Design and Analysis

I checked the simulation setups and they seem sound.

Supplementary Material

I reviewed the proof and additional simulation results.

Relation to Broader Scientific Literature

  • Proposes the use of FRT for borrowing information from ECs in RCT analysis.
  • Demonstrates the use of selective borrowing in developing more powerful test statistics.
  • Demonstrates the use of conformal inference in selective borrowing.

Missing Important References

I did not find any important missing references, to my knowledge.

Other Strengths and Weaknesses

Strengths:

  • The results are rich.
  • Highlights the importance of FRT in RCT analysis and suggests a new way of using ECs.

Weakness: In general, this paper contains a lot of results, but the motivation and results could be organized in a better way. I list some points of confusion I had while reading the paper:

  • The comparison with existing methods that motivates the proposal is not fully reasonable. In particular, the hybrid doubly robust estimators are aimed at providing more accurate estimators, but the authors switch gears to a randomization test for the sharp null, which makes the proposal not fully comparable with the existing methods. This makes the arguments motivating the current proposal a bit weak.
  • The definition of conformal p-values is not new in this paper, and it takes up too much space.
  • I am not sure what the connection is between the MSE of $\hat{\tau}$ and the power of the resulting RCT. While reading the paper, I sometimes felt that the authors wanted to improve the MSE of the test statistic, but at other times that they wanted to improve the power of the RCT. I would suggest the authors clean up the storyline.

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate your insightful and constructive comments. Below, we have provided our detailed, point-by-point responses.

1. Connection between the MSE of $\hat{\tau}_\gamma$ and the power of FRT

(i) Variance and power: Theorem 2.4 (power analysis for consistent test statistics) shows that a lower variance of a consistent test statistic leads to a higher power of FRT.

(ii) Bias and power: Simulation-based power analysis demonstrates that borrowing biased ECs can severely reduce FRT power.

(iii) MSE and power: Taking both variance and bias into account, we observe that a lower MSE of the test statistic is associated with higher FRT power. While it remains theoretically challenging to link MSE and power when the test statistic is irregular, our experiments show that MSE-guided $\gamma$ selection offers a unified solution that performs well for both estimation accuracy and power.

We also revised our storyline as follows:

Our primary goal is to improve the power of the Fisher Randomization Test (FRT) for testing the null hypothesis of no treatment effect by borrowing external controls (ECs), compared to relying solely on RCT data (No Borrowing). This is guided by three key insights: (i) Theorem 2.4 shows that the power of FRT can be increased by reducing the variance of a consistent test statistic, which can be achieved by borrowing unbiased ECs to augment the small RCT control sample; (ii) borrowing biased ECs leads to inconsistency and significantly reduces the power of FRT; (iii) these insights motivate our use of Conformal Selective Borrowing to enhance power by borrowing unbiased ECs and discarding biased ones from a larger EC pool.

To this end, we introduce an intermediate objective: testing the exchangeability of each EC and selectively borrowing those deemed unbiased. We recognize that the power of this intermediate testing step is limited by the size of the RCT control group. To address this, we propose tuning the selection threshold $\gamma$ (the significance level of the exchangeability test) to directly target our primary objective. Although tuning based on empirical FRT power would be ideal, it is impractical because (i) it requires specifying an alternative hypothesis and (ii) it is computationally expensive to compute power across a grid of $\gamma$ values.

As a practical alternative, we use the MSE of the Conformal Selective Borrowing estimator as a proxy to guide $\gamma$ selection. This MSE-guided approach offers several benefits: (i) experiments show it improves FRT power over RCT-only analysis, even though the theoretical connection between MSE and FRT power is challenging due to the irregular nature of the test statistic; (ii) it yields strong selection performance and supports our intermediate objective (see Fig. 2 and Fig. 14); (iii) with MSE-guided $\gamma$, CSB serves as both a powerful test statistic and an accurate ATE estimator; (iv) the empirical MSE can be approximated by leveraging the No Borrowing estimator, and we provide a non-asymptotic excess risk bound for the adaptive procedure.
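To make the MSE-guided tuning concrete, here is a hypothetical sketch, not the paper's exact criterion: the squared-bias term is proxied by the gap to the consistent No Borrowing estimate, added to an estimated variance, and the threshold minimizing this proxy over a fixed grid is chosen.

```python
import numpy as np

def select_gamma(tau_hat, tau_nb, var_hat, grid):
    """MSE-guided threshold selection over a grid (illustrative proxy):
    squared bias relative to the No Borrowing estimate plus an estimated
    variance, minimized over gamma.

    tau_hat: dict gamma -> CSB estimate; var_hat: dict gamma -> variance."""
    mse = {g: (tau_hat[g] - tau_nb) ** 2 + var_hat[g] for g in grid}
    return min(grid, key=mse.get)

# Toy numbers: full borrowing (gamma = 0) is biased but low-variance;
# larger gamma borrows fewer ECs, reducing bias but increasing variance.
grid = [0.0, 0.1, 0.2, 0.5, 1.0]
tau_hat = {0.0: 0.24, 0.1: 0.14, 0.2: 0.14, 0.5: 0.14, 1.0: 0.14}
var_hat = {0.0: 0.001, 0.1: 0.002, 0.2: 0.003, 0.5: 0.004, 1.0: 0.005}
g_star = select_gamma(tau_hat, tau_nb=0.14, var_hat=var_hat, grid=grid)
```

In this toy setup the smallest proxy MSE is attained at an intermediate threshold, which borrows the unbiased-looking ECs while avoiding the bias of full borrowing.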

2. Conformal selective borrowing (CSB) as a powerful test statistic and a reliable, efficient estimator

Although our primary motivation is to improve the power of FRT, the final method, CSB with MSE-guided $\gamma$ selection, applies to both estimation and testing:

(i) Theorems 3.7 and 3.8 provide non-asymptotic excess risk bounds for the proposed estimator.

(ii) In simulations, CSB shows better estimation performance than adaptive lasso selective borrowing (ALSB) in terms of MSE under small and moderate sample sizes. This is demonstrated in Figures 6(C) and 11(C), where both methods are compared purely on estimation accuracy without involving inference procedures.

(iii) In real-world experiments, CSB improves both robustness and efficiency over No Borrowing (NB) and Full Borrowing (FB). Specifically, CSB reduces the standard error by 20% compared to NB and mitigates the bias of FB. The ATE estimate under CSB (0.138) is close to NB (0.142), while FB overestimates it (0.241).

These results suggest that CSB is not only effective for testing under sharp nulls but also serves as a reliable and efficient estimator of treatment effects.

3. Presentation

Following the reviewer’s suggestion, we moved the full conformal $p$-value (which is computationally infeasible) and jackknife+ (a special case of CV+) to the appendix. We kept the split conformal $p$-value and CV+ $p$-value in the main text, as both are used in our experiments.

Official Review
Rating: 3

This paper proposes a method for combining (potentially biased) external controls (ECs) with data from a randomized control trial, in an effort to improve power to detect causal effects, without sacrificing Type 1 error (false positive) control. For controlling Type 1 errors, the key insight is to use a Fisher Randomization Test (FRT), which allows for computation of exact p-values for Fisher's "sharp null" hypothesis (potential outcomes are constant across treatment/control) with any test statistic, including those using potentially biased ECs. For improving power, there is no "free lunch" (as noted by the authors and other related work), but an approach is proposed for "selective borrowing" of less biased ECs, which is illustrated empirically in simulated data and a real-data application to a chemotherapy randomized trial.

Update after rebuttal: As stated below, I will keep my score. My impression of the paper is somewhere between a 3 and a 4.

Questions for the Authors

I have listed these in priority order, with the first two questions being particularly important to clarify.

  1. I am less familiar with randomization inference (a la the FRT). Does this procedure also provably control the false positive rate under the more conventional general null hypothesis that $E[Y_1 - Y_0] = 0$? I would assume not, since it takes $Y_1, Y_0$ as fixed, as opposed to random?
  2. How does CSB compare to ALSB under an apples-to-apples comparison? For context: The comparison in Figure 6 is not "apples to apples", instead it shows FRT + CSB versus ALSB with asymptotic inference. It would be nice to see either (a) FRT+CSB versus FRT+ALSB, or (b) CSB and ALSB considered head-to-head where both use asymptotic inference.
  3. Which of the theoretical results are the most novel, in the view of the authors? To me, it seems like most of the theory is relatively straightforward, and the main contributions are a bit more conceptual (e.g., the realization that one can use the FRT here), but I'm open to arguments that there are more novel theoretical contributions here.
  4. What is the concern with "post-selection inference" here? I don't quite follow the claim (see 090-092 LHS, "We account for selection uncertainty in FRT and offer valid post-selection inference"). The only justification I see is on lines 222-225 (LHS), "By using $T(A) = |\hat{\tau}_{\gamma}|$ as the test statistic for FRT and allowing $\mathcal{E}(\gamma)$ to vary with resampling $A$ in FRT could account for selection uncertainty and provide valid post-selection inference". This is mainly just a conceptual argument for why there is no post-selection inference concern, right? I.e., under the null, the entire procedure is run to get a sampling distribution of the test statistic, which can depend in arbitrary ways on the data?

Claims and Evidence

The main claims as I understood them:

  • Their approach will have valid level / false positive rate in finite samples for the Fisher sharp null.
  • Their power can be better than "no borrowing" (i.e., just using the RCT) if the bias is small
  • Their approach is better suited to finite samples than other selective borrowing approaches like Gao et al. 2023.

The claims are supported by clear and convincing arguments, and backed up by synthetic experiments where the bias can be controlled. I did find one claim a little difficult to understand (regarding post-selection inference), which I have mentioned in "Questions for the Authors" below, but I think this is a minor point.

Methods and Evaluation Criteria

Yes, for the most part (one important baseline is missing, excluded due to computational concerns). See my comment under "Relation to Broader Scientific Literature".

Theoretical Claims

I did not see any issues in the proofs. I checked the proof of Theorem 2.3, which seemed fairly immediate from the definition of the FRT, and Theorem 2.4, which similarly seems to follow fairly directly from auxiliary lemmas from related work (which I did not check). While I did not review the proofs of Propositions 3.1-3.6 in depth, they seem to follow from standard arguments, though I'm a little less familiar with the conformal prediction literature. Theorems 3.7 and 3.8 follow from standard results (and some algebra, which I did not check in detail) in non-asymptotic statistics.

Experimental Design and Analysis

The experimental design for the synthetic data seems sound to me, and is broadly similar to how I would have set up a synthetic data experiment for this method, probing failure modes as a function of the bias. The real-data experiment is more an "illustrative application / case study" rather than an experiment that tests a particular hypothesis, but that's a fairly standard approach for these types of papers in my experience.

Supplementary Material

Yes, I skimmed the entire supplement, and read the related work (Section A), some of the proofs (Sections B.1, B.2, B.7, B.8, etc), and the comparison to ALSB in detail (Section C.4, Figure 6).

Relation to Broader Scientific Literature

This paper adds yet another method to a growing literature (well-documented in Section A) on combining observational and experimental data in an adaptive fashion, without making assumptions on the validity of the observational data.

The core insight of this paper, in my view, is that Fisher Randomization Tests can be used with any test statistic, and therefore they are a good candidate for "improvement" with external controls, since the false positive rate is controlled exactly regardless of how good the test statistic is.

Similar to other work in this area, this paper identifies, at least experimentally, a "no free lunch" phenomenon, i.e., there are moderate biases which can lead to loss of performance. However, beyond some intuition-building theory (e.g., Theorem 2.4), the authors do not formally analyze this relationship between bias and power. That said, I am sympathetic to the difficulty in doing so.

It is harder to judge the significance of the conformal selective borrowing approach. While it is novel, it is not clear (on its own, without the FRT component) how this approach compares to other selective borrowing approaches, especially given the lack of experimental comparison (on an apples-to-apples basis) with the Adaptive Lasso Selective Borrowing (ALSB) approach.

Missing Important References

Overall, I found that the related work (in the Appendix) was quite comprehensive, but would have appreciated having more of that content in the main paper, if space permits.

Other Strengths and Weaknesses

Beyond the lack of direct apples-to-apples comparison with ALSB, I have two (relatively minor) concerns regarding clarity / significance.

  1. I'm not sure how to square the null hypothesis tested under the FRT (the "sharp null") with the more conventional null hypothesis that, e.g., the average treatment effect is equal to zero, which doesn't require that $Y_1 = Y_0$, but rather just that $E[Y_1] = E[Y_0]$. Hence, while the FRT controls the false discovery rate under the sharp null, it's not clear that it controls the false discovery rate under more conventional null hypotheses. It may be worth clarifying this point in the paper, or otherwise commenting on what is "lost" by relying on FRT.
  2. It's not clear to me how novel / original some of the theoretical results are, and it would be worth clarifying which theoretical results are considered most novel by the authors. Theorem 2.3, for instance, seems like an obvious consequence of FRT, and I'm surprised that it's given as a Theorem in this paper, as opposed to being cited from elsewhere (though given my lack of familiarity with the FRT literature, I don't have a citation to provide).

Other Comments or Suggestions

Some minor grammar / presentation points:

  • On Line 056, right-hand column: "Let the binary treatment denote by $A$" -> "Let $A$ denote the binary treatment"
  • Figure 1 is extremely small. Note that you can use figure* in two-column formats to get a figure that crosses both columns, which would be better for this figure.

Author Response

We sincerely appreciate your insightful and constructive comments. Below, we have provided our detailed, point-by-point responses.

1. Clarification on sharp null hypothesis

The sharp null hypothesis $Y_i(1) = Y_i(0)$ for all $i$ in the RCT states that there is no individual treatment effect for any unit. Another common form of the sharp null is conditional independence, $Y_i \perp A_i \mid X_i$, meaning the observed outcome is independent of treatment assignment given covariates. In contrast, the hypothesis that the average treatment effect (ATE) is zero is often called the "weak null."

In a finite-sample exact sense, FRT guarantees Type I error control under the sharp null but cannot guarantee it under the weak null. However, recent work has shown that FRT can also asymptotically control the Type I error under the weak null using studentized or pre-pivoted test statistics (Wu & Ding, 2021; Cohen & Fogarty, 2022). The RCT sample size is typically small in our context, such as trials for rare diseases. Asymptotic approximations may be unreliable, so we rely on the sharp null for a finite-sample exact test.

While this reliance on the sharp null rather than the weak null may be viewed as a limitation of FRT, to our knowledge, there is currently no testing procedure that controls Type I error in a finite-sample exact sense under the weak null without additional distributional assumptions. Section 6 also discusses possible future directions beyond the sharp null.

2. Significance of the Conformal Selective Borrowing (CSB) and comparison to Adaptive Lasso Selective Borrowing (ALSB)

Compared to existing selective borrowing approaches such as ALSB, the significance of CSB can be summarized as follows:

(i) Model-free flexibility: CSB is a model-free approach that allows flexible choice of conformal scores depending on data characteristics. For example, in our real-world application where the outcome exhibits heavy tails and heteroscedasticity (see Figure 14), we use conformalized quantile regression (Romano et al., 2019) for selection. Distance-based scores such as nearest-neighbor conformity scores (Shafer & Vovk, 2008) can be used for binary outcomes. This flexibility is difficult to achieve with model-based methods like ALSB.

(ii) Estimation: CSB performs better than ALSB in terms of MSE under small and moderate sample sizes. This is demonstrated in Figures 6(C) and 11(C), where both methods are compared purely on estimation accuracy without involving inference procedures.

(iii) Computation: CSB is compatible with the FRT, while ALSB is not readily applicable with FRT due to its computational complexity. This highlights an advantage of CSB when exact finite-sample inference is desired.

(iv) Apples-to-apples comparison under asymptotic inference (Asym): We conducted a comparison between CSB+Asym and ALSB+Asym (see here). CSB+Asym generally achieves better Type I error control than ALSB+Asym and performs comparably when $b=1$.

3. Novelty and implications of theoretical results

Theorem 2.3 builds on classical FRT results (e.g., Lehmann and Romano, 2005, Theorem 15.2.1), but to our knowledge, this is the first formal result establishing the validity of FRT in the context of hybrid controlled trials. It confirms that FRT can be applied in this setting and clarifies how it should be applied to ensure validity. Specifically, Theorem 2.3 provides the following practical guidance and important caveats:

(i) Type I error control is guaranteed only when we permute assignments according to the actual experimental design, that is, permuting treatment assignments only within the RCT while keeping the EC assignments fixed. An important caveat is that permuting across all treated and control units, including ECs, would invalidate the theorem.

(ii) To maintain validity, the test statistic should vary with the permuted treatment vector $A$, meaning the selected set of ECs should also be updated under each permutation to account for selection uncertainty. Fixing the EC selection across permutations would also invalidate the theorem.

Theorems 3.7 and 3.8 provide novel non-asymptotic MSE bounds that guide the design of the adaptive selection threshold in finite samples. Please also see our response to Reviewer f66S, Point 1.

4. Post-selection inference

The reviewer's understanding is correct. The selection uncertainty is fully incorporated into the reference distribution by allowing the selected set to vary with the resampled treatment vector $A$ in the FRT. Therefore, there is no post-selection inference concern when following this principle. However, as noted in the previous point, a key caveat is that the selected EC set should not be fixed during permutation, as this would ignore selection uncertainty and invalidate the test.
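This principle can be made concrete with a small schematic (hypothetical helper names, not the authors' code): the statistic, including any EC selection performed inside it, is recomputed for every resampled assignment vector, and only the RCT assignments are permuted while EC labels stay fixed.

```python
import numpy as np

def frt_pvalue(y_rct, a_rct, ec_data, statistic, n_perm=999, seed=0):
    """Fisher randomization test p-value. `statistic` must re-run the full
    pipeline (EC selection included) for each assignment vector; only the
    RCT assignments are permuted, while EC labels stay fixed."""
    rng = np.random.default_rng(seed)
    t_obs = abs(statistic(y_rct, a_rct, ec_data))
    t_null = np.array([abs(statistic(y_rct, rng.permutation(a_rct), ec_data))
                       for _ in range(n_perm)])
    # finite-sample valid p-value that includes the observed statistic
    return (1 + np.sum(t_null >= t_obs)) / (n_perm + 1)

# Toy check with a difference-in-means statistic that ignores the ECs
rng = np.random.default_rng(1)
a = np.array([1] * 10 + [0] * 10)
y = 5.0 * a + rng.normal(0, 1, 20)        # large treatment effect
diff_means = lambda y_, a_, _ec: y_[a_ == 1].mean() - y_[a_ == 0].mean()
p = frt_pvalue(y, a, None, diff_means)
```

Because the selection step lives inside `statistic`, the selected EC set automatically varies with each permuted assignment, which is exactly the caveat stated above.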

5. Presentation

We moved the related work to the main text, revised Line 056, and enlarged Figure 1.

Reviewer Comment

Thank you for the response - those clarifications are very helpful. I am inclined to keep my score (I'm somewhere between a 3 and a 4 after the response). Reading the review of GG5D, I would also suggest being more explicit upfront that there is no free lunch here, that you have proposed a valid approach that "might" improve power, if all goes well.

Author Comment

Thank you very much for your thoughtful follow-up and valuable suggestions. We're pleased that your score has risen to 3.5, even though the official score may not reflect this according to the new scoring scale.

We fully agree that it is important to be upfront about the limitations. Our proposed method controls the Type I error for testing the sharp null and might improve the power of RCT-only analysis when the bias of ECs is either absent or detectable. When the bias is difficult to detect, our method may incur some power loss, though it still maintains valid Type I error control.

We also appreciate the importance of positioning our method within the context of the existing literature. The no-free-lunch limitation is recognized in existing review papers (Oberst et al., 2022; Lin et al., 2024), which point out that no method can uniformly and significantly outperform RCT-only analysis across varying levels of hidden bias, although different approaches optimize the risk-reward trade-off from different perspectives. The most challenging scenarios are those where bias is present but complex to correct or difficult to detect. Our main difference from existing literature is twofold: (i) we prioritize exact Type I error control in small samples first, then seek to improve power; (ii) we optimize the risk-reward trade-off between no borrowing and full borrowing from the perspective of conformal selective borrowing, motivated by our real data, where some ECs are unbiased while others are not.

We will incorporate this important discussion more explicitly in the introduction section to better align with your and Reviewer GG5D’s helpful feedback.

Official Review
Rating: 3

This paper proposes a randomization inference framework that can combine the data from randomized controlled trials with external controls. The proposed method controls the Type-I error in finite samples by leveraging conformal inference to select appropriate samples from external controls. In particular, the selection threshold provides an interpolation between no-borrowing and full-borrowing approaches, which can be tuned by minimizing the mean squared error. Some simulation results and real-world applications show the applicability of the proposed method.

Questions for the Authors

See above.

Claims and Evidence

Basically, the claims in the submission are clear and supported by proofs or simulation studies. However, I feel that the bounds in Theorems 3.7 and 3.8 are loose and not easy to comprehend.

In addition, I don't quite understand why the authors relate $\hat{\tau}_{\hat{\gamma}}$ to super efficiency in the second column of Line 295. Can the authors provide some theoretical justification for this claim?

Methods and Evaluation Criteria

The proposed methods make sense to me. However, in the simulation setups, the dimension of the covariates is $p=2$, which is too small and won't lead to meaningful conclusions. Can the authors try some larger values of $p$? Also, the current covariates $X$ have independent coordinates. Can we introduce any dependence within the coordinates of $X$?

Theoretical Claims

I basically checked all the proofs, and they look correct. A minor question is how $\Phi$ is related to the variance of $\epsilon_{\gamma}$. Can the authors provide some short discussion on it?

Experimental Design and Analysis

I checked the experimental results and already pointed out my concerns for the simulation setups above.

Moreover, for Figures 5 and 10, the authors should also report the value of $\hat{\gamma}$. The reason is that the proposed method CSB under the "optimal" value $\hat{\gamma}$ does not perform as well as the cases under other values of $\gamma$. This is a severe issue that may limit the impact of this paper.

Supplementary Material

Yes, I read all parts of the supplementary materials.

Relation to Broader Scientific Literature

This paper combines approaches from the conformal inference literature with randomization tests to propose a new framework for combining data from randomized controlled trials with external controls. The estimator relies on the one proposed by Li et al. (2023), but the authors also relax the mean exchangeability condition imposed by Li et al. (2023).

Li, X., Miao, W., Lu, F., and Zhou, X.-H. Improving efficiency of inference in clinical trials with external control data. Biometrics, 79(1):394–403, 2023.

Missing Important References

Not that I am aware of.

Other Strengths and Weaknesses

As I mentioned above for the simulation results, the proposed method under the optimal choice of the selection parameter does not seem to perform as well as the cases under other arbitrary choices.

Other Comments or Suggestions

  1. Second column of Lines 178 to 182: The sentence "Based on..." seems to repeat what has been discussed in the preceding paragraph.

  2. Second column of Line 225: There should be a division in the definition of $p_j^{\mathrm{split}}$. The same issue occurs for $p_j^{\mathrm{jackknife+}}$ and $p_j^{\mathrm{cv+}}$.

  3. The plot in Figure 2(D) needs to be zoomed in.

Ethics Review Issues

The paper exceeds the 8-page limits and potentially discloses the authors' identities.

Author Response

We sincerely appreciate your insightful and constructive comments. Below, we have provided our detailed, point-by-point responses.

1. Bounds in Theorems 3.7 and 3.8

We explain the key terms in Theorems 3.7 and 3.8 and their practical implications as follows:

(i) The term $c\Delta|\delta_1|$ involves $\delta_1$, the bias of the consistent No Borrowing estimator, which is of order $1/n$, and $\Delta$, the maximum bias across all $\gamma \in \Gamma$. This motivates us to apply pre-propensity-score matching on all ECs to prevent potentially large bias during the data preparation phase.

(ii) The term $c\Delta\Phi\sqrt{\log(|\Gamma|/\iota)}$ involves $\Phi$, the largest standard deviation proxy $\phi$ across $\gamma \in \Gamma$, which is of order $O(1/\sqrt{n})$. The grid size $|\Gamma|$ is fixed (e.g., 10 in our experiments), making this term well-controlled.

(iii) The term $\max(c\Phi^2\sqrt{\log(|\Gamma|/\iota)},\, c\Phi^2\log(|\Gamma|/\iota))$ arises from sub-exponential tail bounds and scales similarly to (ii).

(iv) The terms $\max_{\gamma\in\Gamma} |\hat{V}(\hat{\tau}_{\gamma} - \hat{\tau}_1) - \kappa_{\gamma}^2|$ and $\max_{\gamma\in\Gamma} |\hat{V}(\hat{\tau}_{\gamma}) - \sigma^2_{\gamma}|$ correspond to the estimation errors for $\kappa^2_{\gamma}$ and $\sigma^2_{\gamma}$. This motivates the use of a sufficiently large number of bootstrap replicates to ensure accurate variance estimation.

Overall, Theorems 3.7 and 3.8 provide concrete guidance for implementing $\gamma$ selection in practice and justify the design of our MSE-guided adaptive procedure, even if the bounds themselves are conservative due to their non-asymptotic and worst-case nature.
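Point (iv) above motivates using enough bootstrap replicates for accurate variance estimation. A generic nonparametric bootstrap sketch (not the paper's exact resampling scheme) looks like this:

```python
import numpy as np

def bootstrap_variance(data, estimator, n_boot=2000, seed=0):
    """Nonparametric bootstrap estimate of an estimator's variance:
    resample the data with replacement and take the variance of the
    replicated estimates."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = np.array([estimator(data[rng.integers(0, n, n)])
                     for _ in range(n_boot)])
    return reps.var(ddof=1)

x = np.random.default_rng(1).normal(0, 1, 200)
v = bootstrap_variance(x, np.mean)   # roughly var(x)/200, i.e. about 0.005
```

The Monte Carlo error of the variance estimate shrinks with `n_boot`, which is the practical reason for taking a sufficiently large number of replicates.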

2. Clarifying the connection to super-efficiency-type behavior

The reviewer questions the super efficiency of $\hat{\tau}_{\hat{\gamma}}$. We acknowledge that this was a misstatement; our intent was not to claim that it is super-efficient, but rather that it exhibits behavior similar to the Hodges estimator. We have revised the sentence as follows:

"This phenomenon highlights that $\hat{\tau}_{\hat{\gamma}}$ behaves similarly to the Hodges estimator (Le Cam, 1953) and to integrated estimators in data fusion (Yang et al., 2023; Oberst et al., 2022): improving upon the baseline estimator (here, the No Borrowing estimator) in certain regions of the parameter space (where there is no bias in ECs) inevitably leads to worse performance in other regions (where the bias in ECs is difficult to detect)."

3. Additional simulations

We additionally consider $p = 5$ and $X \sim N(0, \Sigma)$, where $\Sigma$ is a Toeplitz matrix with $(i,j)$-th entry $\Sigma_{ij} = \rho^{|i-j|}$ and $\rho = 0.3$, to introduce dependence among the coordinates of $X$. We did not consider larger $p$ since the sample size is small, with only 25 RCT controls. The simulation results (see here) show similar patterns and demonstrate the robustness of our method.
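The covariate design described here can be reproduced with a short snippet (a sketch under the stated setup: $p = 5$, $\Sigma_{ij} = \rho^{|i-j|}$, $\rho = 0.3$; the function names are illustrative):

```python
import numpy as np

def toeplitz_cov(p, rho):
    """Toeplitz covariance matrix with (i, j)-th entry rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def draw_covariates(n, p=5, rho=0.3, seed=0):
    """Draw n covariate vectors X ~ N(0, Sigma) with dependent coordinates."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(p), toeplitz_cov(p, rho), size=n)

X = draw_covariates(100)   # 100 draws of a 5-dimensional covariate
```

Adjacent coordinates have correlation 0.3, decaying geometrically with distance, which is a standard way to induce mild dependence in simulation designs.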

4. $\Phi$ and $\epsilon_\gamma$

$\Phi$ is the largest standard deviation proxy $\phi$ over $\gamma \in \Gamma$. If $\epsilon_\gamma$ is Gaussian, then $\phi$ equals its standard deviation. We previously mislabeled $\phi$ as a variance proxy and have corrected this.

5. Value and performance of $\hat{\gamma}$

Since $\hat{\gamma}$ is adaptive and varies across simulation replicates, we separately report its values in Figure 4 and analyze its behavior in Section C.2.

We do not expect $\hat{\gamma}$ to outperform all fixed $\gamma$ values uniformly. Prior work has shown that no method can uniformly outperform No Borrowing (corresponding to $\gamma = 1$) without additional assumptions (Oberst et al., 2022; Lin et al., 2024). Instead, our proposed $\hat{\gamma}$ is designed to improve power when the bias of ECs is either absent or detectable. When the bias is difficult to detect, it is reasonable that our method may incur some inevitable power loss, though within an acceptable range. Importantly, Conformal Selective Borrowing+FRT always controls the Type I error, even in such challenging cases.

6. Presentation

We removed the sentence "Based on...", corrected the missing division symbols, and zoomed in Figure 2(D).

7. Ethical issue

As per ICML guidelines, the Impact Statement is not counted toward the 8-page limit. We initially understood this to apply to the Software and Data section and had no intention of violating the page limit.

We did not disclose any identities or URLs in the submission. We mentioned the availability of the R package solely to highlight the applicability and reproducibility of our work. We sincerely apologize if our identity may have been inadvertently revealed in the package documentation, which is external to the submission; this was entirely unintentional and not meant to compromise anonymity. To avoid any potential concern, we have removed the Software and Data section from the manuscript.

Official Review
Rating: 2

Authors study integration of external (historical) controls into randomized controlled trials in a principled way using conformal p-values, to guarantee type-1 error rates with finite samples under potential violation of the exchangeability assumption between trial and external controls.

Update after rebuttal

I thank the authors for their detailed response. In particular, the motivation/logic they outline in the beginning through 3 items: (i), (ii), (iii) are helpful and a more detailed and polished version should definitely be included in the updated manuscript. While those items helped me understand the contributions of the manuscript a little better, I still have concerns regarding sample splitting in the RCT to test ECs against: this decreases power (reduced sample size in the RCT, as the samples we use for testing ECs should not be used again in downstream analyses). While there is some empirical evidence suggesting improved power overall, the fundamental tradeoff here should be more clearly analyzed theoretically. Therefore, I maintain my score.

Questions for the Authors

See Claims and Evidence

Claims and Evidence

What bothers me the most is that it is not clear what the authors claim to do versus what they end up doing. I can understand the idea of using conformal inference to "select" external controls based on their conformity scores, which is straightforward. However, the following question is not answered sufficiently in my opinion:

  • You use RCT controls to select in the first place. In that case, why would you not still be limited statistically by the number of controls in the RCT?

The question above is not answered or discussed, either verbally or as a theorem. By the latter, I mean the following. The authors claim that their new approach controls the Type I error and improves power. I do not see a result where they "prove" they improve power. And they should be clearer that this "power" relates only to identifying whether the treatment has any effect, not to the size of the effect.

A criticism of this conformal approach would be that each control is tested for pooling individually, and the high-probability guarantees are then obtained straightforwardly by a union bound. As the authors briefly mention, this will suffer from low power and may introduce bias. This, to me, is the point where the paper makes its main contribution via an improved selection mechanism (correct me if I am wrong). However, my main issue here is the following:

  • You motivate minimizing the MSE of the estimator as a proxy to guide better selection and leverage the consistency of the estimators in doing so. That consistency is subject to the same issues (e.g., only asymptotic validity) that you claim to be immune to.
  • The results you have here again do not connect to improved power for the FRT; rather, they seem self-contained and remain a proxy/heuristic for improving power.

Please correct me if I am wrong. My main challenge with this paper is that I do not understand what is the single most important result it is trying to prove (higher power for the FRT?), and how the things you do/prove in the paper connect to that. I think the flow of the methodology and the motivation of the paper are not very clear as they stand.

Methods and Evaluation Criteria

The real-world experiment is well described and interesting. It is a suitable experiment to run for this paper.

Theoretical Claims

I did not check the proofs carefully. Most results seem to be standard from the conformal inference literature.

Experimental Design and Analysis

The conformal selective borrowing approach in the real-world experiments does not seem to make a big difference compared to the no-borrowing approach. An exception is slightly reduced p-values and SEs, which might be a function of sample size alone. This is in line with my earlier intuition/concern about the lack of improved power, as the selection is still limited by the RCT (see Claims and Evidence).

Supplementary Material

There is none.

Relation to Prior Literature

The current manuscript focuses on a specific problem and claims to improve on it using ideas from conformal inference. It does not relate to the broader scientific literature in a significant way.

Missing Important References

There is a vast body of work on historical controls, especially in the epidemiology literature. The current manuscript does a poor job of covering it.

Other Strengths and Weaknesses

See Claims And Evidence

Other Comments or Suggestions

See Claims And Evidence

Author Response

We sincerely appreciate your insightful and constructive comments. Below, we have provided our detailed, point-by-point responses.

1. Clarifying the objective and the role of MSE minimization in improving FRT power

Our primary objective is to improve the power of the Fisher Randomization Test (FRT), which is limited when using only RCT data (No Borrowing). To achieve this, we rely on three key insights:

(i) Variance and power: Theorem 2.4 (power analysis for consistent test statistics) shows that power can be improved by reducing the variance of a consistent test statistic, which can be achieved by borrowing unbiased ECs to augment the limited sample size of the RCT control arm;

(ii) Bias and power: borrowing biased ECs renders the full borrowing estimator inconsistent, severely reducing the power of FRT;

(iii) Bias-variance trade-off and power: these motivate the use of Conformal Selective Borrowing to improve FRT power of RCT-only analysis by borrowing unbiased ECs and discarding biased ones from a large EC pool.
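As a purely illustrative sketch of insight (i) — pooling unbiased ECs into the control arm shrinks the variance of a difference-in-means estimator — the following toy Monte Carlo simulation (our own construction; `diff_in_means_se` and its parameters are not from the paper) compares the empirical standard error of the no-borrowing and full-borrowing estimators when the ECs are exchangeable:

```python
import numpy as np

def diff_in_means_se(n_t, n_c, n_ec, sigma=1.0, n_rep=2000, seed=0):
    """Monte Carlo SE of the difference-in-means ATE estimator, with and
    without pooling unbiased external controls (ECs) into the control arm.
    Illustrative toy setup: zero treatment effect, Gaussian outcomes."""
    rng = np.random.default_rng(seed)
    est_nb, est_fb = [], []
    for _ in range(n_rep):
        yt = rng.normal(0.0, sigma, n_t)         # treated RCT arm
        yc = rng.normal(0.0, sigma, n_c)         # control RCT arm
        ec = rng.normal(0.0, sigma, n_ec)        # unbiased ECs (same mean as yc)
        est_nb.append(yt.mean() - yc.mean())                 # No Borrowing
        est_fb.append(yt.mean() - np.r_[yc, ec].mean())      # Full Borrowing
    return np.std(est_nb), np.std(est_fb)

se_nb, se_fb = diff_in_means_se(n_t=30, n_c=30, n_ec=100)
```

With `n_c = 30` and `n_ec = 100` unbiased ECs, the borrowing SE is close to the theoretical `sqrt(1/30 + 1/130) ≈ 0.20` versus `sqrt(1/30 + 1/30) ≈ 0.26` without borrowing, matching the variance-reduction rationale in (i).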

Based on above insights, we introduce our intermediate objective: testing the exchangeability of each EC. We acknowledge that the number of RCT controls statistically limits the power of this exchangeability testing. Therefore, we propose tuning the selection threshold γ\gamma (i.e., the significance level of the EC exchangeability test) to optimize our primary objective directly. While using empirical FRT power to tune γ\gamma would be ideal, this approach (i) requires specifying an alternative hypothesis and (ii) is computationally intensive, as computing FRT power across a grid of γ\gamma values is costly.

This leads to our final choice: using the MSE of the Conformal Selective Borrowing estimator as a proxy to optimize γ\gamma. Our MSE-guided γ\gamma selection offers several advantages: (i) experiments show it achieves our primary goal of improving FRT power compared to RCT-only analysis, though we acknowledge that the theoretical link between MSE and FRT power is challenging due to the irregularity of the test statistic; (ii) it yields strong selection performance and supports our intermediate objective (see Fig. 2 and Fig. 14); (iii) with MSE-guided adaptive γ\gamma, Conformal Selective Borrowing serves as both a powerful test statistic and an accurate ATE estimator; (iv) empirical MSE can be approximated by leveraging the No Borrowing estimator, and we provide a non-asymptotic excess risk bound for the adaptive procedure.

2. Finite-sample validity of FRT

For the validity of FRT (Type I error control), we do not rely on asymptotic arguments such as estimator consistency, as shown in Theorem 2.3. This is a key advantage of FRT over existing integrative methods whose validity depends on asymptotic theory. For power improvement, the optimality of the selected γ\gamma does rely on the consistency of the No Borrowing estimator, as shown in Theorem 3.8. However, our experiments show that Conformal Selective Borrowing with adaptive γ\gamma improves FRT power even in small-sample settings.
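The finite-sample logic of the FRT can be sketched in a few lines: under the sharp null, all potential outcomes are fixed, so re-randomizing the treatment labels and recomputing the statistic yields its exact null distribution, whatever statistic (including an EC-augmented one) is plugged in. The sketch below is a generic Monte Carlo FRT, not the paper's specific test statistic:

```python
import numpy as np

def frt_p_value(y, z, stat=None, n_draws=5000, seed=0):
    """Monte Carlo Fisher randomization test p-value under the sharp null
    H0: Y(1) = Y(0). Outcomes y are held fixed; treatment labels z are
    re-randomized and the statistic recomputed to build the null reference
    distribution. The +1 correction keeps the p-value valid in finite samples."""
    if stat is None:
        # default statistic: absolute difference in means
        stat = lambda y, z: abs(y[z == 1].mean() - y[z == 0].mean())
    rng = np.random.default_rng(seed)
    t_obs = stat(y, z)
    count = 0
    for _ in range(n_draws):
        z_perm = rng.permutation(z)          # re-randomize assignments
        count += stat(y, z_perm) >= t_obs
    return (1 + count) / (1 + n_draws)
```

Any test statistic can be substituted via `stat` — for instance, one built from a selectively borrowed EC sample — and Type I error control holds by construction, since validity comes from the randomization distribution rather than from properties of the statistic.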

3. Significance in real-world experiments

The real-world experiments show that Conformal Selective Borrowing (CSB) improves both robustness and efficiency over No Borrowing (NB) and Full Borrowing (FB): (i) Given that CSB+FRT theoretically controls Type I error in finite samples, which existing integrative methods do not, it yields a significant result (p < 0.05) compared to the borderline NB result (p = 0.055), addressing the underpower issue in the original study. (ii) CSB reduces the standard error by 20% compared to NB. (iii) CSB mitigates the bias of FB; the ATE estimate under CSB (0.138) is close to NB (0.142), while FB overestimates it (0.241). We acknowledge that the extent of improvement depends on the quality of the real data at hand. To further evaluate our method in finite samples, our simulations show that CSB can improve FRT power by up to 45% compared to NB.

4. Related work on historical controls

Due to the manuscript's space limitation at the original submission, we included the comprehensive literature review on historical control borrowing in Appendix A. We acknowledge the value of such a review in the main text and will move the related work on historical controls to the Introduction section at the resubmission. Please let us know if we have missed any relevant literature.

Final Decision

This paper introduces a new method to integrate observational external controls with data from an RCT to estimate causal effects while controlling the risk that the external controls introduce bias into the estimates. Most reviewers found the introduction of the Fisher randomization test, alongside an adaptively tuned selective borrowing step, to be a worthwhile addition to the literature that would be of interest to others working on the problem of integrating RCTs and observational data. The main concern, raised by one reviewer, is that the tuning in the selective borrowing step targets a proxy objective (MSE) instead of the true power. In my view this is a reasonable choice, and the concerns here can be addressed by revising the paper to be clearer about what strategy is being chosen and why.