Strategic A/B testing via Maximum Probability-driven Two-armed Bandit
This paper proposes a new and more powerful approach to A/B testing based on a two-armed bandit process.
Abstract
Reviews and Discussion
This paper builds on Strategic Two-Sample Test via the Two-Armed Bandit Process to enhance the detection of small average treatment effects. It proposes a more powerful one-sided two-sample test by adjusting the balance between the mean and volatility terms, yielding a statistic that is more concentrated under the null and less so under the alternative. The framework is adapted to the Rubin Causal Model (RCM), where only one potential outcome per subject is observed, with a doubly robust estimator used for causal effect imputation. To address sensitivity to sample ordering, the authors incorporate meta-analysis by repeatedly reordering samples and recalculating p-values. Theoretically, they show that as the sample size approaches infinity, the asymptotic distribution converges to a spike, ensuring valid inference.
Questions for the Authors
No.
Claims and Evidence
The validity of the proposed statistic is questionable. For instance, in Equation (2), classical statistical inference typically expresses the mean term as the sample average of for . However, the authors instead define it as , raising a fundamental issue: what is the meaning of when ? Since this formulation underpins the entire paper, its incorrect mathematical structure casts doubt on the validity of the conclusions, theory, and experiments. The authors should carefully reexamine their theoretical derivations, algorithmic details, and experimental code.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes. The proofs for the theorems on the asymptotic properties of PWTAB are standard and rely on the strategic CLT. However, as noted in the Claims and Evidence section, the mathematical formulation of the proposed statistics seems incorrect, casting doubt on the validity of the theoretical results.
Experimental Design and Analysis
Yes.
Supplementary Material
Yes. I mainly reviewed the proofs of the theorems and the simulation details.
Relation to Existing Literature
This paper primarily addresses the problem of one-sided two-sample testing, with a particular focus on paired two-sample testing. Compared to previous works, such as Strategic Two-Sample Test via the Two-Armed Bandit Process, which establish a more comprehensive theoretical framework—including proofs of the strategic central limit theorem (CLT) and other asymptotic properties—this paper places greater emphasis on practical implementation and the technical challenges that arise in finite samples.
Despite repeatedly highlighting its connection to the two-armed bandit, the constructed bandit framework assigns rewards to the two arms as exact opposites, effectively reducing it to a one-armed bandit. Moreover, in traditional two-sample testing, data from both populations are fully observed, eliminating the partial observation challenge that is central to bandit problems. As a result, this paper primarily addresses two-sample testing and has only a limited connection to the bandit literature.
Missing Important References
I mention only one important paper, Strategic Two-Sample Test via the Two-Armed Bandit Process. While this paper does not cite the foundational work, much of its content serves as a technical extension of it.
Other Strengths and Weaknesses
Strengths: The weighting and permutation tricks are beneficial for finite-sample performance in statistical inference.
Other Comments or Suggestions
No.
We sincerely appreciate the reviewer’s thoughtful feedback and valuable suggestions. We also sincerely apologize for any difficulties or confusion arising from insufficient clarity in the presentation of some fundamental equations and references in the paper. Below, we will discuss the issues you have raised:
1. Effectiveness of the Proposed Statistic
- Notation Clarification:
We define , where and denotes the indicator function. Since our paper does not introduce or define the notation for , we will explicitly define as stated above, immediately following Equation (2).
- Intuitive Explanation and Theoretical Perspective:
Using instead of reduces the variance of the mean term while preserving the asymptotic properties of the statistic. This leads to a faster and more stable convergence, as also verified experimentally. Replacing with represents a key methodological improvement.
2. Motivation for Linking to the Two-Armed Bandit Framework
- Differences in Research Objectives:
The classical two-armed bandit model focuses on maximizing the average reward by balancing exploration and exploitation. In contrast, our model does not seek to identify the arm with the highest return. Instead, it aims to maximize the target probability through collaborative arm selection. Within the hypothesis testing framework, this maximized target probability corresponds to the tail probability .
- Clarification in Writing:
Given the distinct objectives of our proposed model compared to the classical framework, we appreciate the opportunity to improve clarity in our manuscript. To this end, we will explicitly define our proposed model following the second paragraph of Section 2.2, thereby clearly distinguishing it from the classical model.
3. Missing References
We sincerely apologize for the unintentional omission of the reference to “Z. Chen et al., Strategic Two-Sample Test via the Two-Armed Bandit Process” during the drafting process. This reference has been included in the revised manuscript along with a detailed comparison:
- Research Background:
Chen et al. addressed the one-sided two-sample testing problem under independent, batch-wise paired observations. While valuable, their approach is not designed to address hypothesis testing in more complex frameworks, such as causal inference scenarios involving missing data or confounding variables. In contrast, our proposed method incorporates advanced techniques, enabling its application to hypothesis testing within causal inference frameworks. This adaptability enhances its relevance to real-world research contexts.
- Theoretical Contributions:
Chen et al.’s study primarily focuses on the asymptotic properties of the test statistic, without investigating its behavior in finite samples. This limitation can result in inflated Type I errors in small-sample settings. We effectively control the Type I error under finite samples by modifying the mean term as the mean of and introducing a weighting factor . Their result represents a special case of our framework when .
- Algorithm Robustness:
Chen et al. compute the p-value from a single ordered sample sequence, resulting in unstable statistical power in both simulations and real-world experiments. To enhance robustness, we introduce a permutation-based meta-analysis, recalculating the p-value across multiple sample reorderings. This improvement significantly strengthens the algorithm’s reliability and practical utility.
In summary, while our work draws inspiration from “Strategic Two-Sample Test via the Two-Armed Bandit Process” in its use of bandit strategies for hypothesis testing, our study offers more generalized insights across research frameworks, theoretical advancements, and technical implementations. We hope this clarification underscores the distinct contributions of our work while acknowledging the foundational influence of Chen et al.’s research.
We sincerely appreciate the reviewer’s keen observation, which has enabled us to refine the manuscript and better contextualize our contributions within the existing literature.
Problem:
This work aims to address the limitations of traditional A/B testing in detecting minor treatment effects. The key challenges are: (i) data distributions between the treatment and control groups may differ due to confounding effects, (ii) even when distributions are balanced, measured outcomes can still exhibit high variance, and (iii) test statistics rely on the normality assumption of the central limit theorem, which may not always hold.
Methods Used:
To this end, this work proposes a novel statistical testing framework that: (i) relaxes the normality assumption by leveraging bandit-inspired distributions, drawing on prior results on the strategic central limit theorem (Chen et al., 2022) [1]; (ii) introduces a weighted test statistic to control Type I error; (iii) employs a doubly robust method to obtain unbiased, low-variance causal estimates; and (iv) utilizes a permutation test to enhance statistical power.
Results:
On empirical evaluations, the authors compare their methods, i.e., Permuted WTAB and WTAB, with existing methods (i.e., z-DML, CUPED, and DIM) on both synthetic data and real-world ride-sharing data.
On synthetic data, the authors show that their Permuted WTAB achieves the highest statistical powers compared to other methods while maintaining a similar Type I error rate.
Similarly, on real-world ride-sharing A/B testing datasets, they show that their Permuted WTAB consistently achieves lower p-values compared to CUPED.
Overall — Main contributions, novelty and impact:
- This work introduces the first hypothesis testing method for estimating causal effects that replaces the assumption of normally distributed random variables with bandit-derived distributions, significantly improving statistical power. Its efficacy is further improved by the weighted version and doubly robust estimation.
- The proposed method is novel and theoretically grounded.
- The proposed method has major business implications, particularly in optimizing A/B tests for small treatment effects, which could drive more profitable marketing strategies.
Area Of Improvement:
However, given the current empirical results, additional evaluations are needed to clarify whether this approach is indeed better than other existing methods. Moreover, the current writing and presentation need improvement, which I describe in later sections.
Ref:
[1] Chen, Zengjing, Shui Feng, and Guodong Zhang. "Strategy-driven limit theorems associated bandit problems." arXiv preprint arXiv:2204.04442 (2022). https://arxiv.org/pdf/2204.04442
=================================================================================================================================================
Update After Rebuttal
- The authors have provided a clearer statement on the contribution of their work, particularly in how they leverage the strategic central limit theorem framework to relax the exchangeability assumption, thereby improving Type I error control while enhancing statistical power.
- The authors have addressed the typo and the missing legend in Figure 2. They are also aware of areas where the presentation could be improved and have made an effort to address them.
- They have added a discussion on how to select the regularization parameter λ in a data-driven manner and explained why the stacking approach did not perform as expected, along with suggestions for how to address this issue.
- The authors have stated their intention to include the proof of Lemma 2.1.
- An additional experiment on a real-world dataset has been included. The authors also explained the rationale for generating synthetic data from real-world sources, which is reasonable.
Given the above, I believe the authors have made a substantial effort to address the key concerns raised.
While we are unable to see the full revision due to the constraints of the rebuttal phase, I believe that the proposed revisions and clarifications significantly strengthen the work.
-- I think the merits of the paper outweigh the remaining minor concerns (such as presentation), and I therefore maintain my original recommendation of accept.
-- However, if the committee feels it is preferable for the authors to take more time and submit a fully revised version in a future round, I would also support that direction.
=================================================================================================================================================
Questions for the Authors
- I may be wrong, as I have not scrutinized every detail; however, I am unsure whether you have a proof for Lemma 2.1?
- In line 258, you mention that the ensemble method, specifically stacking, improves efficacy. However, in your experiments on synthetic datasets (Figure 6), stacking is not consistently the best method and, in some cases (e.g., the bottom-left panel for function III), performs worse than the compared methods. Could you provide an intuition or explanation for why stacking underperforms in certain scenarios?
Claims and Evidence
Their weighted version of the two-arm bandit with a permutation test effectively demonstrates control over Type I error while achieving higher statistical power than both state-of-the-art methods and the standard mean estimator on synthetic datasets.
However, in real-world datasets, they only present variance reduction and p-value results without providing analyses on Type I error control or statistical power.
Methods and Evaluation Criteria
- The authors appropriately utilize both synthetic and real-world datasets to evaluate their statistical testing methods, ensuring a comprehensive assessment of their approach across controlled and practical settings.
- The authors selected both the baseline model, i.e., the mean average estimator, and state-of-the-art methods, such as DML and CUPED, to ensure a fair comparison.
- As mentioned earlier, they also need evaluations of Type I error and statistical power for the real-world dataset.
Theoretical Claims
I have checked the proofs of their theoretical claims; the derivations follow almost the same steps as the strategic central limit theorem [1], with modifications to include the weighted version.
Ref:
[1] Chen, Zengjing, Shui Feng, and Guodong Zhang. "Strategy-driven limit theorems associated bandit problems." arXiv preprint arXiv:2204.04442 (2022). https://arxiv.org/pdf/2204.04442
Experimental Design and Analysis
Their experimental designs and analyses are valid; however, additional evaluations on Type I error control and statistical power analysis for real-world datasets are needed to ensure comprehensive assessment.
Supplementary Material
I have reviewed the entire supplementary material, including the proofs of Theorem 4.1 and Theorem 4.2 and the additional experimental results for synthetic and real-world datasets.
Relation to Existing Literature
Unlike traditional normality-based hypothesis tests, this work introduces a Bandit-distributed framework, providing an alternative to standard A/B testing. The incorporation of weighted test statistics, doubly robust estimation, and permutation testing further strengthens treatment effect estimation.
Missing Important References
The key paper from which their results are derived is provided: [1].
Ref:
[1] Chen, Zengjing, Shui Feng, and Guodong Zhang. "Strategy-driven limit theorems associated bandit problems." arXiv preprint arXiv:2204.04442 (2022). https://arxiv.org/pdf/2204.04442
Other Strengths and Weaknesses
This work cleverly integrates strategy-based statistical testing, which challenges the normality assumption, with existing approaches such as the permutation test and the doubly-robust estimator. While the proof is largely adapted from prior work with slight modifications, its implications are significant in accurately detecting minor treatment effects.
Other Comments or Suggestions
- The authors make frequent use of the term exchangeability, but its precise meaning remains ambiguous. For instance, exchangeability might refer to data-level exchangeability, meaning whether the data are generated i.i.d. Alternatively, it could pertain to the exchangeability of treatment assignment given the observed data. In the context of the strategy central limit theorem, I interpret exchangeability as referring to the rewards derived from the sequence of arm choices. Clarifying this distinction would enhance the paper’s rigor and readability.
- On line 73, it would be better to first introduce the full term Two-Arm Bandit before using its acronym.
- Line 193: mean --> main
- In Figure 2(b), there is no legend to describe what each curve means. Please include them.
- In Section 3.1, it would be better to first introduce what Theorem 4.1 is and then use it.
- In your Lemma 2.1, please explicitly define the notation for the alternative hypothesis before using it.
- In line 410, the paragraph titled "Another simulations" should be more clearly described. Please rephrase it to indicate that it presents results on an ML-based method or another relevant categorization for better clarity.
- In general, the authors assume that readers are already familiar with the intent of each section and proceed without sufficient introduction. It would be beneficial to include brief overviews or contextual transitions at the beginning of each section to improve clarity and guide the reader through the flow of the paper.
We sincerely thank the reviewer for the comprehensive and insightful feedback. Below, we provide a point-by-point response to the issues raised:
1. The Use of “Exchangeability”
We appreciate the reviewer’s observation regarding the term “exchangeability” and fully agree that a clear definition is essential. In the revised manuscript, we have explicitly defined exchangeability here as referring to the rewards derived from the sequence of arm choices.
2. Presentation Adjustments
- Two-Arm Bandit Introduction:
We acknowledge the suggestion regarding the introduction of the full term “Two-Armed Bandit” before using its acronym. In the revised version, we ensure clarity by introducing the full term at first mention (e.g., “Two-Armed Bandit (TAB)”).
- Typo Corrections:
Line 193: We have corrected the typo from “mean” to “main”.
Line 410: We have changed the paragraph title from “Another simulations” to “More ML-based simulation studies” for clearer expression.
- Figure Improvements:
In Figure 2(b), we will add a detailed legend to describe each curve, ensuring that the visual representation is self-explanatory. The pink dashed, cyan solid, and orange dotted lines represent , and , respectively. Additionally, we will update the caption of Figure 2(b) to “The empirical type I error rate across different and , fixed ”.
- Section Transitions and Introductions:
We agree that additional contextual transitions at the beginning of sections would improve readability. We will include brief overviews in Section 3.1 and elsewhere to better guide the reader through our arguments and experimental results.
- Lemma 2.1 Clarification:
We have revised Lemma 2.1 to explicitly define the alternative hypothesis before applying it, thereby eliminating potential ambiguity.
3. Theoretical and Experimental Clarifications
Regarding the reviewer’s question about Lemma 2.1, its proof is provided in Chen et al.'s article (Z. Chen et al., “Strategy-driven limit theorems associated bandit problems,” Theorem 3.3). To improve the readability of our manuscript, we will include the full proof in the appendix of the revised version.
4. Ensemble Method (Stacking) Performance
We thank the reviewer for highlighting the performance discrepancies of the stacking method in the synthetic experiments. We identify two key factors contributing to this discrepancy:
- First, the current implementation uses a limited selection of primary learners. We are actively investigating the incorporation of additional machine learning models as primary learners to enhance the performance of the stacking method.
- Second, the choice of primary learners and their respective weights in the ensemble may not be optimal under all configurations, leading to suboptimal aggregation of predictions. We are exploring the use of more advanced meta-learners (e.g., random forests) instead of simple linear regression to better assign weights to different primary learners and further improve the stacking method’s performance.
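For concreteness, a richer stacking configuration along these lines could look as follows; the specific learners, hyperparameters, and the use of scikit-learn's `StackingRegressor` with a random-forest meta-learner are illustrative assumptions, not the authors' actual implementation.

```python
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Diverse primary learners plus a non-linear meta-learner in place of
# simple linear regression, as discussed in the points above.
stacked_outcome_model = StackingRegressor(
    estimators=[
        ("lgbm", LGBMRegressor()),
        ("xgb", XGBRegressor()),
        ("rf", RandomForestRegressor(n_estimators=200)),
    ],
    final_estimator=RandomForestRegressor(n_estimators=200),
    cv=5,  # out-of-fold predictions from the primary learners feed the meta-learner
)
# stacked_outcome_model.fit(X, y) could then serve as the outcome (nuisance)
# model inside the doubly robust estimation step.
```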
5. Additional Evaluations on Real-World Data
We appreciate the reviewer’s suggestion regarding a more comprehensive evaluation on real-world data. To address this, we have conducted additional experiments using synthetic data based on real-world data. The results of these additional experiments are summarized in Table 1.
Table 1: Type I error rates and statistical power based on synthetic data derived from real-world dataset.
| Method | Metric | PWTAB | WTAB | z-DML | CUPED | DIM |
|---|---|---|---|---|---|---|
| LightGBM | Type I Error | 0.052 | 0.052 | 0.044 | 0.050 | 0.048 |
| LightGBM | Power | 0.758 | 0.738 | 0.744 | 0.740 | 0.498 |
| XGBoost | Type I Error | 0.052 | 0.034 | 0.046 | 0.050 | 0.048 |
| XGBoost | Power | 0.758 | 0.738 | 0.746 | 0.740 | 0.498 |
| Stacking | Type I Error | 0.052 | 0.052 | 0.046 | 0.050 | 0.048 |
| Stacking | Power | 0.764 | 0.732 | 0.746 | 0.740 | 0.498 |
These results provide compelling evidence of the effectiveness of our proposed PWTAB method in real-world scenarios. When the null hypothesis holds, all methods maintain Type I error rates close to 0.05, preserving the reliability of statistical inference in practical settings. Under the alternative hypothesis, the proposed method consistently outperforms competing methods in terms of statistical power. PWTAB achieves the highest statistical power when used with LightGBM or XGBoost, and its performance is further enhanced when combined with the ensemble learning algorithm Stacking.
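As context for how empirical rates like those in Table 1 are typically obtained, the following is a generic Monte Carlo sketch; the function name and the data-generator callback are hypothetical and not taken from the paper.

```python
import numpy as np

def rejection_rate(p_value_fn, data_generator, n_reps=500, alpha=0.05, seed=0):
    """Fraction of replications in which the test rejects at level alpha.

    With `data_generator` drawing from a zero-effect (null) configuration this
    estimates the Type I error; with a non-zero injected effect it estimates power.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_reps):
        sample = data_generator(rng)       # e.g., an (outcomes, treatment) tuple
        rejections += p_value_fn(sample) < alpha
    return rejections / n_reps
```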
We sincerely appreciate the reviewer’s constructive comments, which have been invaluable in improving the clarity, rigor, and overall impact of our work.
I thank the authors for their thorough response.
- On the use of exchangeability:
A clearer explanation of exchangeability would significantly enhance both the readability of the paper and the clarity of where the impact of the current work lies. In your revision, please clearly highlight the advantages of your proposed approach compared to traditional A/B testing. Specifically, with the use of the strategy-driven limit theorem framework, the underlying test statistics no longer require the normality assumption typically made in traditional A/B testing. Therefore, your approach offers superior control over Type I errors and increased statistical power.
- On better presentation of your work:
As Reviewer KBY3 mentioned, the authors should be mindful of the presentation. That is, the authors should either rely less on mathematical equations or clearly articulate the intuition behind each mathematical expression. Even in sections that do not require mathematical expressions, the authors should still ensure that the work clearly conveys intuition and provides smooth transitions between ideas. For example, lines 241 to 257 could have better intuition and transitions. That is, in lines 243 to 249, the authors could instead say:
" Traditional methods such as CUPAC, which rely solely on linear regression, might fail to capture these intricate patterns. To overcome this limitation, advanced machine learning methods are introduced. Specifically, LightGBM (Ke et al., 2017)—a state-of-the-art gradient boosting algorithm—is employed within the double machine learning (DML) framework (Chernozhukov et al., 2018).
Intuitively, the DML approach mitigates overfitting and reduces regularization biases by partitioning the dataset into multiple subsets. Each subset is used iteratively to estimate conditional relationships, ensuring robustness and improved predictive performance. "
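To make the cross-fitting recipe in this suggested wording concrete, here is a minimal sketch of a partially linear DML estimator with LightGBM nuisance models; the model choices, the one-sided normal p-value, and the function name are illustrative assumptions, not necessarily the paper's z-DML implementation.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import KFold
from lightgbm import LGBMRegressor

def dml_effect(X, T, Y, n_folds=5, seed=0):
    """Cross-fitted partially linear DML sketch: residualize the outcome Y and
    the treatment indicator T on covariates X with LightGBM, fold by fold,
    then regress the outcome residuals on the treatment residuals."""
    y_res, t_res = np.zeros(len(Y)), np.zeros(len(T))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        y_res[test] = Y[test] - LGBMRegressor().fit(X[train], Y[train]).predict(X[test])
        t_res[test] = T[test] - LGBMRegressor().fit(X[train], T[train]).predict(X[test])
    theta = np.sum(t_res * y_res) / np.sum(t_res ** 2)   # treatment effect estimate
    score = t_res * (y_res - theta * t_res)              # Neyman-orthogonal score
    se = np.sqrt(np.mean(score ** 2) / len(Y)) / np.mean(t_res ** 2)
    return theta, se, stats.norm.sf(theta / se)          # one-sided p-value (H1: effect > 0)
```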
There are additional sections where the presentation could be improved; however, I leave it to the authors to identify and enhance these sections on their own.
- On stacking methods:
Thank you so much for your clarification. It would be great to include them into your discussion section.
- On data-driven lambda:
I agree with Reviewer KBY3. It would be great to include some discussion on how to choose lambda in a data-driven approach.
- On the proof of Lemma 2.1:
Your theoretical results are primarily based on the work of Z. Chen et al. ("Strategy-driven limit theorems associated with bandit problems"). To ensure your manuscript is self-contained, please also include the detailed derivation of Lemma 2.1 in your appendix.
- On your additional Evaluations on Real-World Data:
I request that the authors clearly explain the rationale behind generating synthetic data from real-world data. Could the experiments not have been conducted directly on real-world data?
I thank the authors once again. I believe your manuscript is becoming clearer and will thus have greater impact.
We sincerely thank you for your continued positive feedback and insightful suggestions on our novel strategic A/B testing method. Through the revisions outlined below, we aim to further strengthen our manuscript and earn your full support.
On the use of exchangeability
We fully agree with your observation. Our proposed test statistic exhibits a concentrated, spike-like distribution around zero under the null hypothesis and a bimodal distribution away from zero under the alternative. By not relying on the normality assumption, our approach achieves superior control over Type I errors while enhancing statistical power. To highlight these advantages, we have revised the manuscript to explicitly compare our method with traditional A/B testing, emphasizing the flexibility afforded by the strategy-driven limit theorem framework.
On better presentation of our work
We value your guidance on improving readability. In addition to addressing the issue you pointed out regarding lines 241 to 257, we have minimized the use of extraneous mathematical formulas in the revised manuscript, retaining only those essential to our research. For these, we have added clear, intuitive explanations. Additionally, we have enhanced the logical flow and coherence throughout the text. For example, the section from line 271 (left) to line 220 (right) has been revised as follows:
“To address this issue, we perform multiple sample reorderings, repeatedly calculate the p-value of the test statistic, and aggregate these via meta-analysis to enhance the robustness of statistical inference.”
The section from line 232 (right) to line 237 (right) has been revised to:
“However, varying sample orderings can yield inconsistent p-values, and the conclusions drawn from individual p-values may be unclear. To resolve this, we apply meta-analysis to synthesize an overall p-value, improving the reliability of the results derived from individual p-values (Walker et al., 2008; Lee, 2019).”
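To make this meta-analysis step concrete, the following is a minimal sketch of aggregating order-dependent p-values across random reorderings with the Cauchy combination rule (the aggregation another reviewer attributes to Algorithm 1); `order_sensitive_test` is a hypothetical callback standing in for the order-dependent test, and the default of 25 reorderings mirrors the B=25 mentioned in the reviews.

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combination(p_values):
    """Cauchy combination rule: T = mean of tan((0.5 - p) * pi);
    the combined p-value is P(Cauchy(0, 1) > T)."""
    p = np.clip(np.asarray(p_values, dtype=float), 1e-15, 1 - 1e-15)
    return cauchy.sf(np.mean(np.tan((0.5 - p) * np.pi)))

def permuted_p_value(samples, order_sensitive_test, n_reorderings=25, seed=0):
    """Recompute an order-dependent p-value over several random reorderings of
    the samples and aggregate the results into a single overall p-value."""
    rng = np.random.default_rng(seed)
    p_values = [order_sensitive_test(rng.permutation(samples))
                for _ in range(n_reorderings)]
    return cauchy_combination(p_values)
```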
On data-driven λ
We fully agree with you and Reviewer KBY3 on the importance of a data-driven approach to selecting λ. As detailed in our rebuttal to Reviewer KBY3, we have proposed a data-driven approach for selecting λ, which we have now incorporated into the revised manuscript for clarity and completeness.
On the proof of Lemma 2.1
We fully agree with you. To ensure the coherence of the paper, we have independently included the detailed derivation of Lemma 2.1 in the appendix of the latest revised version.
On the additional Evaluations on Real-World Data
We appreciate the opportunity to clarify the rationale behind this approach. Our decision to generate synthetic data stems from two key practical constraints associated with real-world A/B testing datasets:
- Limited Availability of Real-World Data: Real-world A/B testing datasets are often constrained in size and scope, which can limit their suitability for comprehensive statistical evaluations. Synthetic data allows us to scale experiments and explore a wider range of scenarios while preserving the distributional characteristics of real-world data.
- Absence of Ground Truth for Strategy Improvements: The average treatment effect in real-world datasets is typically unknown, making it difficult to accurately estimate critical metrics such as empirical Type I error rates and statistical power—both essential for validating our method’s performance. By generating synthetic data based on real-world data, we can control the average treatment effect while preserving the original data distribution, thereby enabling precise and reliable estimation of these metrics.
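For illustration, one way such semi-synthetic data could be constructed is sketched below; the bootstrap resampling, the additive effect injection, and all names are assumptions and may differ from the authors' actual generation procedure.

```python
import numpy as np

def semi_synthetic_ab_data(real_outcomes, ate=0.0, n=20000, treat_frac=0.5, seed=0):
    """Build a semi-synthetic A/B dataset from real outcomes: resample the
    observed outcome distribution, randomize treatment assignment, and inject
    a known average treatment effect `ate`, so that empirical Type I error
    (ate = 0) and power (ate > 0) can be measured against ground truth."""
    rng = np.random.default_rng(seed)
    y = rng.choice(np.asarray(real_outcomes, dtype=float), size=n, replace=True)
    t = (rng.random(n) < treat_frac).astype(int)
    return y + ate * t, t
```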
We hope these clarifications and revisions fully address your concerns. Thank you again for your valuable input, which has greatly improved our manuscript.
This paper introduces a novel approach to A/B testing focused on detecting minor average treatment effects (ATEs) in large-scale applications. The authors propose a maximum probability-driven two-armed bandit process with a weighted mean volatility statistic and incorporation of permutation methods. The key theoretical contribution is the strategic central limit theorem (SCLT), which yields more concentrated distributions under the null hypothesis and less concentrated distributions under alternatives, thereby enhancing statistical power.
The proposed permuted weighted two-armed bandit (PWTAB) method incorporates doubly robust estimation for counterfactual outcomes. Experiments on both synthetic and real-world ride-sharing company data demonstrate PWTAB consistently outperforms standard methods like DIM, CUPED, and z-DML while maintaining proper Type I error control.
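As background on the doubly robust estimation mentioned in this summary, a minimal AIPW-style estimator of the average treatment effect is sketched below; the gradient-boosting nuisance models and the omission of cross-fitting are simplifying assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def aipw_ate(X, T, Y):
    """Doubly robust (AIPW) sketch: impute both potential outcomes with an
    outcome model and correct the imputations with inverse-propensity-weighted
    residuals; the estimator stays consistent if either nuisance model is right."""
    mu1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1]).predict(X)
    mu0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0]).predict(X)
    e = np.clip(GradientBoostingClassifier().fit(X, T).predict_proba(X)[:, 1], 0.01, 0.99)
    psi = mu1 - mu0 + T * (Y - mu1) / e - (1 - T) * (Y - mu0) / (1 - e)
    return psi.mean(), psi.std(ddof=1) / np.sqrt(len(Y))   # ATE estimate and its standard error
```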
Questions for the Authors
- Beyond the threshold approach, are there data-driven methods to select optimal λ values?
- How well does the method generalize to domains beyond ride-sharing (e.g., e-commerce) where metrics and effect sizes differ?
Claims and Evidence
The claims are generally well-supported by evidence:
- The central claim that WTAB improves statistical power is backed by both theoretical analysis (SCLT) and empirical results showing superior performance in different simulation settings.
- Type I error control is verified through comprehensive simulation studies in Table 2, with empirical rates remaining close to the nominal α=0.05 level across varied configurations.
- Empirical evidence in Figure 4 demonstrates PWTAB consistently outperforms comparison methods, particularly for nonlinear functions.
Methods and Evaluation Criteria
The methodological approach effectively addresses the problem of detecting minor treatment effects:
- The weighted mean-volatility statistic (Eq. 5) provides a flexible framework balancing detection power with Type I error, with weight parameter λ carefully chosen to maximize statistical power.
- The permutation-based approach (Algorithm 1) using Cauchy combination addresses the "p-value lottery" problem, with B=25 permutations determined sufficient through empirical testing.
The evaluation criteria include both Type I error control and statistical power across varied conditions (linear/nonlinear functions, heterogeneous effects, different noise levels σε ∈ {0.5, 0.6}).
Theoretical Claims
The theoretical proofs seem sound and rigorous:
- Theorem 4.1 establishes that the asymptotic distribution follows a spike distribution.
- Theorem 4.2 demonstrates Type I error control and consistency against fixed alternatives, under .
- The weighted statistic maintains the same optimal policy structure (Eq. 10), with λ constrained by a threshold to ensure proper convergence.
Experimental Design and Analysis
The experiments are thorough and well-designed:
- Synthetic data tests span 32 configurations combining four different functions F(X1,X2), four G(X1,X2) (including two null hypotheses GI, GII), and two noise levels (σε=0.5, 0.6).
- Sample size n=20,000 realistically represents large-scale A/B testing scenarios.
- Real-world validation uses three datasets (A, B, C) from a ride-sharing company, with results in Figure 5 showing PWTAB achieves smaller p-values regardless of the machine learning algorithm used.
The authors rigorously compared their approach against DIM, CUPED, and z-DML baselines, showing consistent improvements particularly for nonlinear function settings.
Supplementary Material
I went over the appendix proofs briefly, nothing seemed out of place.
Relation to Existing Literature
The paper effectively connects to relevant literature across:
- A/B testing
- Causal inference
- Multi-armed bandits
- Permutation tests
Missing Important References
Nothing completely relevant seems to be omitted from the manuscript
Other Strengths and Weaknesses
Strengths:
- Addresses an economically significant problem with a theoretically grounded solution.
- The integration of bandit algorithms with traditional A/B testing creates an innovative hybrid methodology.
- Demonstrates superior performance for nonlinear relationships where CUPED falters.
- The doubly robust estimation approach provides protection against model misspecification.
Weaknesses:
- The mathematical density may limit adoption by practitioners without strong statistical backgrounds.
- Limited guidance on practical λ selection beyond the 0.03 threshold.
- The paper could better explain the intuition behind why breaking exchangeability improves performance.
Other Comments or Suggestions
Nothing to add here.
We sincerely appreciate the reviewer’s careful evaluation of our work and the constructive feedback provided. Below, we address the key weaknesses and questions raised:
1. Adding a More Intuitive Explanation of Mathematical Densities
We thank the reviewer for this valuable suggestion. We acknowledge that the extensive mathematical derivations may pose a barrier to practitioners with limited statistical backgrounds. In the revised version, we have incorporated an intuitive explanation section to explain the form of different probability densities under different hypotheses. We believe this additional exposition will facilitate a broader understanding and practical application of our method.
To illustrate, consider the case when the null hypothesis holds with fixed. Given that the optimal policy parameter has an equal probability of being 0 or 1, assume that is observed and that . Consequently, according to the optimal policy, , implying that will be observed. This process continues with until there exists some index such that . Under the assumption that the null hypothesis holds, it is likely that , resulting in , which leads to the observation of , a reward that is more likely to exceed 0. This brief discussion shows that the optimal policy will control the value of to fluctuate around 0 under the null hypothesis, thereby concentrating its distribution around 0. A similar rationale applies when the alternative hypothesis holds.
2. Guidance on λ Selection
We agree that guidance on selecting λ is crucial. The threshold value of 0.03 was derived empirically from our synthetic experiments. However, we are actively exploring more data-driven methods for selecting λ. We propose a data-driven approach that first discretizes the range of λ and then employs bootstrapping techniques to generate multiple datasets. For each candidate λ, we compute the Type I error rate across these datasets. The optimal λ is chosen as the one that maximizes statistical power while controlling the Type I error.
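For concreteness, a possible bootstrap sketch of this selection procedure is given below; the injected-effect step used to approximate power, the tolerance rule, and all names are illustrative assumptions layered on top of the description above, not the authors' implementation.

```python
import numpy as np

def select_lambda(data, p_value_fn, lambdas, injected_effect,
                  n_boot=200, alpha=0.05, seed=0):
    """Bootstrap-based lambda selection sketch: for each candidate, estimate the
    Type I error on effect-removed (null) resamples and the power on resamples
    with a small injected effect, then return the candidate with the highest
    power among those whose Type I error stays close to alpha.
    `p_value_fn(sample, lam)` is a hypothetical callback returning the p-value
    of the weighted statistic for a given lambda."""
    rng = np.random.default_rng(seed)
    null_data = data - data.mean()                      # emulate H0: zero effect
    tol = alpha + 2 * np.sqrt(alpha * (1 - alpha) / n_boot)
    best_lam, best_power = None, -1.0
    for lam in lambdas:
        rej_null = rej_alt = 0
        for _ in range(n_boot):
            idx = rng.integers(0, null_data.size, null_data.size)
            rej_null += p_value_fn(null_data[idx], lam) < alpha
            rej_alt += p_value_fn(null_data[idx] + injected_effect, lam) < alpha
        if rej_null / n_boot <= tol and rej_alt / n_boot > best_power:
            best_lam, best_power = lam, rej_alt / n_boot
    return best_lam
```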
3. More Real-world Applications
We appreciate the reviewer’s insightful query on the generalizability of our method to domains such as e-commerce. Although our current real-world validation is based on ride-sharing data, our preliminary experiments in other domains indicate that the method demonstrates strong potential. We are confident that our approach can be generalized to most companies conducting A/B testing. We intend to extend our experimental evaluation to include additional domains, such as a food delivery company and an internet technology company, thereby providing a more robust demonstration of the method’s versatility and robustness.
4. Breaking Exchangeability
We appreciate the reviewer’s interest in this aspect. We will clarify the advantages of breaking exchangeability in the revised manuscript by explicitly detailing how it contributes to enhanced performance. Traditional hypothesis testing methods based on the Central Limit Theorem (CLT) are inherently data-driven; once i.i.d. samples are observed, the construction of the test statistic is independent of the sample order, implying that the data are exchangeable. In contrast, our proposed testing framework is goal-driven—it seeks to progressively construct the test statistic from the available data to maximize statistical power. In our proposed two-armed bandit framework, earlier data actively influences the construction of the current test statistic, making the data non-exchangeable. This shift toward a maximum-probability objective enables the optimal construction of the test statistic, thereby enhancing testing performance.
Once again, we are grateful for the reviewer’s positive comments and valuable suggestions. We are committed to incorporating these improvements to enhance the clarity, interpretability, and impact of our work.
This paper received two positive scores (Accept and Weak Accept) and one Reject. Upon initiating discussion among the reviewers, it was found that the reviewer with the negative score had a main concern about the correctness of one of the steps in the theorem proofs. However, during the reviewer discussion, reviewer QGJr (positive review) chimed in to clarify that the concern is mitigated and that they believe the proof provided is sound.
In light of this, it seems to me that the main concern of the reviewer voting “Reject” should be dispelled; hence, I recommend accepting this paper.