PaperHub
7.3 / 10
Spotlight · 5 reviewers · Min 3, Max 4, Std 0.4
Scores: 4, 4, 3, 4, 4
ICML 2025

Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

An adaptive learn-then-test calibration procedure for efficient and statistically valid hyperparameter selection.

Abstract

Keywords
Calibration · Learn-then-Test · Hyperparameter Selection

Reviews and Discussion

Review (Rating: 4)

This paper makes progress on the problem of hyperparameter selection with statistical guarantees. The work provides a new, adaptive algorithm for hyperparameter selection that extends the previously introduced Learn-Then-Test (LTT) framework. The LTT framework uses p-values of hypotheses corresponding to hyperparameter selections to obtain high-probability bounds on the selection of reliable hyperparameters. The present manuscript extends this framework by using e-values rather than p-values. E-values accumulate evidence multiplicatively, which allows combining evidence across studies, or in this case, across sets of collected data. The paper first provides some theoretical intuition and then demonstrates the utility of the proposed approach in offline-to-online RL and prompt engineering.
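As a minimal background sketch of the property alluded to here (standard e-value/e-process facts from the testing-by-betting literature, not material quoted from the paper under review): an e-value for a null hypothesis is a non-negative statistic with expectation at most one under the null; per-round e-values combine multiplicatively into an e-process, and Ville's inequality makes thresholding that e-process anytime-valid.

```latex
% Background sketch (standard definitions, not from the paper):
% an e-value is a statistic e >= 0 with E_{H_0}[e] <= 1; combining per-round
% e-values multiplicatively yields an e-process, a non-negative supermartingale
% under H_0, so Ville's inequality gives an anytime-valid rejection rule:
E^t \;=\; \prod_{s=1}^{t} e^s ,
\qquad
\mathbb{P}_{H_0}\!\left(\exists\, t \ge 1 :\; E^t \ge \tfrac{1}{\delta}\right) \;\le\; \delta .
```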

Questions For Authors

Q1. What would happen if I, for instance, applied a simple grid search and took an argmax? I might be able to achieve TPR=1, correct?

Q2: Why is reliability in line 313L approximated over a single rollout and not an expectation over multiple rollouts? That seems like an odd choice, especially for highly stochastic domains such as those considered in the paper, but maybe I am missing something.

Q3. The work states that there are other methods without guarantees (line 010). Could you elaborate why there are no comparisons against such methods?

Claims And Evidence

The main contribution of this paper is a novel algorithm for hyperparameter testing called adaptive LTT (aLTT). The main claim about this algorithm is that it guarantees rigorous control over the FWER and FDR metrics while reducing the number of testing rounds compared to the non-adaptive version of the algorithm.

To provide evidence for this claim, the work outlines the mathematical properties of e-values and then provides an application study which includes experiments in offline RL as well as prompt engineering.

Even though the paper claims that the algorithm provides guarantees, there is no formal proposition statement that would quantify that. In line 246R, the text states that “if an FWER-controlling method [...] is used, the resulting set [hyperparameter set] is (α, δ)-FWER-controlling”. I believe it would be beneficial to state this as an explicit theorem of the algorithm and provide a proof, even if this proof simply refers to properties or propositions analyzed by other works.

For the experimental section, the claim that aLTT requires fewer rounds is supported in both applications in Figures 1-4. In both cases, the algorithm also achieves a higher true positive rate, indicating it selected fewer hyperparameters that have poor performance. What is missing and would be nice to see is a measure of the achieved reliability.

Methods And Evaluation Criteria

The proposed benchmarks are both problems where hyperparameter tuning is relevant. However, in both cases, the hyperparameter problem becomes significantly more difficult when one does not have access to an online version of the problem. This is not addressed, and unlimited access seems to be assumed. As a result, e.g. in Figure 2, there are 5000 rounds. If each round collects only a single trajectory of, say, 500 steps in Half-Cheetah, one gets access to 2.5 million environment steps. This is sufficient to train online agents rather than hyperparameter-tune on offline data.

Theoretical Claims

As mentioned above, the theoretical claims are solely in prose and I think a formal proposition would round out section 4 nicely. Since things are only in prose, there are no proofs to evaluate. I am not very familiar with this type of work but I am willing to believe that the algorithm is correct to the extent that it follows from direct application of e-value properties and corresponding optimization algorithms.

Experimental Designs Or Analyses

The experimental design is mostly reasonable, I believe. The paper evaluates the TPR, which seems to be the key indicator relevant to the approach. However, I think there is a lack of simple baselines to put things into perspective. For instance, how well could I do with a simple grid search? What happens if I use a very small $\varepsilon$ such as 0.01? While the main claims are supported, I think more ablations to understand the relative behavior of the approach and to put it into perspective with other approaches in the domain would be quite useful.

It is also not clear to me how the choices for $\alpha$ were made. 0.57 in the RL settings seems quite arbitrary. An ablation of varying $\alpha$ might be useful to remove any concerns about the choice of this value.

Supplementary Material

I looked at some of the experiments in the Appendix to get a better understanding of the individual performance.

Relation To Broader Scientific Literature

It seems to me that the closely related scientific literature has been discussed. However, the introduction in line 10R states "hyperparameter optimization is typically modeled as a bandit problem". This is a very broad statement. There are many other approaches for hyperparameter optimization besides bandits. There is lots of work on Bayesian optimization for adaptive hyperparameter selection and on evolutionary strategies. There is also work on PAC-Bayes hyperparameter selection. None of these are mentioned.

Essential References Not Discussed

I’m not sufficiently familiar with the literature to make a statement here.

Other Strengths And Weaknesses

Weaknesses:

  • It is unclear to me how this approach could handle multiple correlated hyperparameters. The experimental setting only considers a single fixed hyperparameter over which to search. In practice, the problem is much more high-dimensional.

Other Comments Or Suggestions

I believe the text would benefit from a better description of e-values and e-processes. I had to conduct my own search to gain clarity on what these are. Specifically, in line 100L, the text states the key property of e-values is that they can be combined to obtain e-processes, which are then not explained. This section would benefit from a concise statement of the property that makes this possible and an explanation of e-processes.

I think line 162L is supposed to be "with the expectation".

Overall, I think this is a decently written paper that requires some fine-tuning here and there, thus I am recommending weak accept.

Author Response

Thank you for your insightful comments. Below we address your questions point by point:

Claims And Evidence

  • Missing Theorem: We will add the following theorem on the statistical validity of aLTT: "Given a hyperparameter set $\Lambda$, a reliability level $\alpha \in [0,1]$, and an error level $\delta \in [0,1]$, aLTT with an FWER- or FDR-controlling selection rule returns a final prediction set $\hat{\Lambda}^{\rm aLTT, T}$ that satisfies $(\alpha,\delta)$-FWER control or $(\alpha,\delta)$-FDR control, respectively." and provide a proof in the Appendix.

  • Reliability of selected parameters: Models in the predicted set have reliability above $\alpha$ with probability at least $1-\delta$, ensuring valid hyperparameter selection. However, in our experiments, realized performance often exceeds the target reliability (https://ibb.co/CpGgwVtP). We agree that reporting the best model's reliability is insightful and will include it in the revised version.

  • Methods And Evaluation Criteria: Training an agent online is not always feasible due to safety concerns (e.g., robotic surgery). Alternatively, training with off-policy data requires accounting for covariate shift using propensity scores [1]. Even with known propensity scores, valid performance guarantees must be established via policy rollouts, as done in aLTT.

[1] Uehara M, Shi C, Kallus N. A review of off-policy evaluation in reinforcement learning.

Experimental Designs Or Analyses:

  • A single grid search has poor TPR and lacks FWER/FDR control, making it unsuitable for statistically valid hyperparameter selection. As an alternative, we benchmark validation-based selection, which tests hyperparameters on a grid and selects those with empirical risk below $\alpha$. Our experiments show that this approach results in high empirical FWER/FDR, making it unreliable.

  • The reliability parameter $\alpha$ must be chosen by the user based on the desired level of reliability for the application. Accordingly, the choice of this parameter must be informed by some knowledge about realistic values for the average risk. In the RL example, one could use the off-policy data log to estimate a reasonable target reliability level $\alpha$. To evaluate the impact of $\alpha$ here, we perform an ablation study by varying its value (https://ibb.co/xq5fxXTn).

Relation To Broader Scientific Literature: In the paper we mention various approaches based on Bayesian optimization [1-4]; however, we do not mention evolutionary strategies or PAC-Bayes methods. Thanks for pointing this out; in the revised paper we will add references to [5-8]. (References: https://ibb.co/twNJDDCd)

Weaknesses: Our approach extends to multiple correlated hyperparameters by associating a multivariate parameter with a single hypothesis. Appendix B.2 presents a wireless resource allocation experiment where hyperparameters include transmit power level and scheduling policy, which are inherently correlated. Even in such cases, aLTT significantly improves testing efficiency and system performance. If space allows, we will move this discussion to the main text.

Clarifying e-values: We acknowledge that readers unfamiliar with these concepts may need more guidance. In the revised version, we will explicitly direct them—already in the introduction—to Section 4.1, which explains e-values, e-processes, and their role in testing.

Questions For Authors:

  • A simple grid search selecting the hyperparameter with the lowest empirical risk yields a maximum TPR of 1/|reliable hyperparameters|, which can be much lower than 1 when reliable hyperparameters are numerous. For instance, in our RL experiment, with nine reliable hyperparameters, grid search would achieve at most 0.11 TPR, far below aLTT’s 0.85. Moreover, this approach lacks FWER/FDR guarantees, as shown in our experiment.

  • The reliability requirement is an expectation over multiple roll-outs. The definition in (15) is the reward for a single roll-out, while the expectation of (15) over the roll-out distribution defines the reliable hyperparameters, i.e. $\Lambda^{\rm rel}=\{\lambda\in\Lambda: R(\lambda)=\mathbb{E}_{P_Z}[R(\lambda,Z)]> \alpha\}$. We will clarify this in the main text as "The reliability for a single roll-out is measured via the cumulative reward obtained by a policy $\lambda$ on the MDP $\mathcal{E}$, which is defined as…"

  • We choose to compare aLTT only against methods that are statistically valid, as the goal of this paper is to perform 'statistically valid hyperparameter selection.' To illustrate that methods without formal guarantees can violate the target FWER and FDR levels, we consider a validation-based baseline that picks a hyperparameter if its empirical risk is below $\alpha$. Figures: RL experiment (https://ibb.co/xq2KnrGH) and the APE experiment (https://ibb.co/QFxgkc0G).

Reviewer Comment

Dear authors, thank you for the proposed changes to the manuscript. I'm going to be trusting that these are included appropriately. Especially the experiment validating the parameter choice seems quite insightful. That being said, I'm happy to recommend acceptance. One caveat I would like to emphasize again is that I am not very familiar with the literature. Some reviewers have raised concerns here and I want to clearly state that my recommendation largely excludes an assessment of the relation to the literature.

Author Comment

Thank you very much for your response and for taking the time to review our rebuttal. We appreciate your positive feedback and we will make sure that the proposed changes and the additional material, including the experiment validating the parameter choice, are appropriately incorporated into the final version of the manuscript.

Review (Rating: 4)

This paper extends the LTT strategy to accommodate adaptive MHT procedures. The resulting procedure is more data-efficient and can be used in situations where the loss function as a function of $\lambda$ is not a priori known, but must be evaluated (with some non-negligible associated cost) on a per-$\lambda$ basis. Experiments are performed on offline RL via MuJoCo, prompt engineering, and wireless resource allocation (Appendix).

Questions For Authors

No further questions. Great job, this is a wonderful paper!

Claims And Evidence

For the most part, yes. It is clear to me why this procedure works.

However, why no main theorem/proof? To a reader who is an expert in both LTT and e-processes, it is clear that combining these tools leads to a stronger result than in the original LTT paper (accounting for temporal dependencies, allowing early stopping, etc.) But a theorem would substantially help to clarify this point and speak to the power of aLTT in comparison. And a proof, as well. Right now it feels like the paper is incomplete if there’s no theorem or proof of the main theoretical claim (that this is a valid risk-controlling procedure) even though I believe this to be true based on the arguments made in the text of the paper.

Methods And Evaluation Criteria

I believe the method is very well evaluated.

Theoretical Claims

Copy-pasting above:

Why no main theorem/proof? To a reader who is an expert in both LTT and e-processes, it is clear that combining these tools leads to a stronger result than in the original LTT paper (accounting for temporal dependencies, allowing early stopping, etc.) But a theorem would substantially help to clarify this point and speak to the power of aLTT in comparison. And a proof, as well. Right now it feels like the paper is incomplete if there’s no theorem or proof of the main theoretical claim (that this is a valid risk-controlling procedure) even though I believe this to be true based on the arguments made in the text of the paper.

Experimental Designs Or Analyses

Yes. Seems sound.

Supplementary Material

Yes. Appendix B.2.

Relation To Broader Scientific Literature

It may be a good idea to reference selectivenet: https://proceedings.mlr.press/v97/geifman19a

At this point, it is known that the selectivenet procedure is not statistically correct. In other words, the guarantee does not hold.

However, it does have a similar adaptive search through $\Lambda$-space idea to the one presented here, so I believe it is worth a cite.

Essential References Not Discussed

Other Strengths And Weaknesses

I think the paper is very strong. The idea is truly great, and can really improve the statistical and computational efficiency of LTT. The experiments are fantastic, very strong, relevant, and modern.

My main gripe is the lack of formal theoretical statement being made. It makes it unclear what new algorithm with accompanying theory the paper is proposing.

Other Comments Or Suggestions

As stopping criteria, the authors suggest (1) stopping once $|\hat{\Lambda}^{\rm aLTT, t}| \geq d$ or (2) stopping once $t = t_{\rm max}$. Are these the most natural stopping criteria? I would have thought there would be a different criterion. For example, in some cases you may want the smallest or largest $\lambda$ satisfying the constraint of having a population risk less than $\alpha$.

Is there any way to show/argue that this strategy is tight? For example, what does it do in the monotone case? (Can it somehow reduce to RCPS, or something close to RCPS?)

“potentially resulting in empty calibration sets” —> “potentially resulting in empty set of selected prompts $\hat\Lambda^{\rm rel}$.”

Author Response

Thank you for taking the time to review our paper and for your positive comments! Below, we address each comment point by point:

  • "Why no main theorem/proof?...". - Thank you for the suggestion! In the revised version, we will add the following theorem on the statistical validity of aLTT “Given a hyperparameter set Lambda\\Lambda, a reliability level alphain[0,1]\\alpha \\in [0,1], and an error level deltain[0,1]\\delta \\in [0,1], aLTT with an FWER or FDR-controlling selection rule returns a final prediction set hatLambdarmaLTT,T\\hat{\\Lambda}^{{\\rm aLTT},T} that satisfies (alpha,delta)(\\alpha,\\delta)-FWER control or (alpha,delta)(\\alpha,\\delta)-FDR control, respectively.” and provide a proof in the Appendix.

  • "It may be a good idea to reference selectivenet...". - Thanks for bringing this related work to our attention, we will include it in the related work.

Other Comments Or Suggestions

  • We consider the stopping criterion $|\hat\Lambda^{\rm aLTT,t}|>d$ or $t=t_{\rm max}$ because we don't make any assumption on the relation between the value of $\lambda$ and its risk (e.g. monotonicity as in RCPS). For example, in the case of prompt selection, there is no clear notion of a larger or smaller prompt. However, if some preference or information about the parameters is available, it is possible to include this information in the design of the stopping criterion. We will clarify this in the revised text.
  • Yes, assuming that the loss function is monotone, one can use aLTT in a similar manner to RCPS. The e-processes in aLTT can be used to obtain upper confidence bounds on the risk, which can then be used to implement the selection procedure as in RCPS. The validity would follow as in the RCPS paper. However, note that our paper operates in the most general setting, where we remain agnostic to any relationship between risk and hyperparameters.
  • Thanks for spotting the typo!
Review (Rating: 3)

This paper essentially proposes the following method, as summarized in Section 2.3: Sequential and Adaptive Hyperparameter Selection. Given candidate hyperparameters $\Lambda = \lambda_1, \dots, \lambda_N$, for rounds $t = 1, \dots, T$, where $T$ is some stopping time:

  • Choose $\mathcal{I}^t \subset \Lambda$ based on an adaptive acquisition policy (that is, based only on information seen so far). For each $\lambda_i^t \in \mathcal{I}^t$, sample new data $Z_i^t$ independent of the past.
  • Return a set $\hat \Lambda^{\textup{aLTT}, t}$ according to some decision rule on the data, which is an estimate of $\{\lambda \in \Lambda : E_{P_Z}[R(\lambda_i^t, Z_i^t)] \leq \alpha\}$. The rule is in the form of a multiple testing algorithm for the $N$ null hypotheses that $E_{P_Z}[R(\lambda_i^t, Z_i^t)] > \alpha$, which corrects for multiple testing over all of $\Lambda$.

In this work, the authors are free to use any adaptive acquisition policy and stopping time $T$, due to the special form of the decision rule, which is based on multiple testing algorithms applied to e-processes. The theoretical observation is that the final set $\hat \Lambda^{\textup{aLTT}, T}$ has the risk-controlling properties of the original LTT proposal, and the main, empirical observation is that it has much better sample efficiency, which is useful for scenarios where testing is costly.
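To make the loop just summarized concrete, below is a minimal Python sketch of one plausible instantiation: bounded risks in [0, 1], an ε-greedy acquisition rule that evaluates one hyperparameter per round, multiplicative betting e-processes, and e-BH selection. All function names, the fixed betting fraction, and the default constants are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def e_bh(E, delta):
    """e-BH procedure: reject the k* hypotheses with the largest e-values,
    where k* = max{k : k-th largest e-value >= N / (k * delta)}."""
    E = np.asarray(E, dtype=float)
    N = len(E)
    order = np.argsort(-E)           # indices sorted by decreasing e-value
    k_star = 0
    for k in range(1, N + 1):
        if E[order[k - 1]] >= N / (k * delta):
            k_star = k
    return list(order[:k_star])


def altt_sketch(sample_risk, N, alpha=0.2, delta=0.1, t_max=5000,
                eps=0.25, bet=0.5, d=None, seed=0):
    """Illustrative aLTT-style loop: adaptive acquisition + betting e-processes
    + e-BH selection. Not the paper's code; constants and names are assumptions.

    sample_risk(i) -> a fresh risk value R(lambda_i, Z) in [0, 1].
    Null H_i: E[R(lambda_i, Z)] > alpha (i.e., lambda_i is unreliable).
    """
    rng = np.random.default_rng(seed)
    E = np.ones(N)                   # one e-process per hyperparameter
    selected = []
    for t in range(t_max):
        # epsilon-greedy acquisition: explore uniformly with probability eps,
        # otherwise exploit the currently largest e-process
        i = int(rng.integers(N)) if rng.random() < eps else int(np.argmax(E))
        r = sample_risk(i)
        # betting update: for g in [0, 1/(1 - alpha)] and r in [0, 1],
        # 1 + g * (alpha - r) is non-negative and has mean <= 1 under H_i
        g = bet / (1.0 - alpha)
        E[i] *= 1.0 + g * (alpha - r)
        selected = e_bh(E, delta)    # FDR-controlling selection at level delta
        if d is not None and len(selected) >= d:
            break                    # early stopping once enough discoveries
    return selected
```

The per-round e-BH call and the early-stopping check mirror the stopping criteria discussed elsewhere in the reviews; an FWER-controlling variant would replace `e_bh` with an e-to-p conversion followed by a p-value-based procedure.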

Questions For Authors

To more convincingly show this method's utility compared to LTT, we should equip LTT with fixed-time valid $p$-values, which are more powerful than anytime-valid ones. So: (1) Does your method still dominate LTT with non-adaptive acquisition, when LTT uses $p$-values based on the hedged capital process of Waudby-Smith and Ramdas (2024), which Angelopoulos et al (2021) call the WSR bound? (2) What about $p$-values based on the central limit theorem, since $\frac{1}{\sqrt T} \sum_{t = 1}^T R(\lambda_i^t, Z_i^t)$ is asymptotically normal?

This would serve as a more fully convincing demonstration that aLTT can outdo LTT. When the stopping time $T$ is fixed, my hunch is that it cannot always be the case, but may be the case in your applications of interest. I would change my score from Reject to Weak Reject if this is confirmed to be indeed the case (and possibly more if other reviewers concur that the application examples are compelling).

Claims And Evidence

In Section 4.3 it is claimed that $\hat \Lambda^{\rm aLTT, T}$ is $(\alpha, \delta)$-FWER- or FDR-controlling as defined in Definitions 2.1 and 2.2. No proof is provided, as it follows immediately from the anytime-valid properties of e-processes and their resulting p-values.

Regarding this claim, however, it is important that the multiple testing procedure involves a correction over all of $\Lambda$ at each step, rather than just $\mathcal{I}^t$. But Section 2.3 states, as item 1, that $\mathcal{I}^t \subset \Lambda$ is selected for "testing". This is misleading; they are for sampling the $Z_i$'s, not for testing, as evidenced by the notation in Section 5, which refers to the p-values $1,\dots, N$. On the contrary, every parameter in $\Lambda$ is tested at every time step.

On the empirical side, in Section 5 they show that their method indeed performs as expected w.r.t. FWER and FDR, and has good performance in terms of some metric (true positive rate, length of the prompt) with few rounds of sampling required. These claims are well supported, though it should be mentioned in Figure 3 that the FDR of the FDR-controlling method is nearly zero, while its FWER is less than $\delta$---a curious behavior which is not implied by the theory.

However, the claim that aLTT is much better than LTT is at least partly unfair to LTT. In the experiment of Section 5.1, LTT is the same algorithm as aLTT at the final time step, where the stopping time $T = 5000$ is a fixed constant. However, even though LTT cannot use adaptive acquisition policies and stopping times (decreasing its power), it can take p-values as input that were not derived from e-processes (increasing its power). I suggest specific comparisons to LTT later.

Methods And Evaluation Criteria

The original LTT motivation was to give a guarantee of absolute safety for high risk applications like medical imaging, which was why the FWER guarantee which appears in that paper has to hold in finite samples. In the RL and LLM examples, it's unclear why a finite-sample guarantee would be absolutely required.

That caveat aside, the method itself and the FWER criterion make sense for these problems, and the TPR and prompt length criteria of Section 5 seem like good measurements (though I would defer to the opinion of an applied MLer here.) Less clear to me is why the FDR criterion would make sense here---if $\alpha = 0.2$, then on average, 20% of the $\lambda$'s might be unreliable, which doesn't seem attractive if the goal is to pick a final reliable one.

As for the evaluation criteria, perhaps this is my unfamiliarity with RL, but it would seem that Section 5.1 requires the MDP to be stationary every time it is queried, since aLTT needs the transcripts $Z$ of states/actions/rewards to be i.i.d. I would think that in the real world, interacting with the environment repeatedly changes the underlying MDP. Also, perhaps the loss function in Section 5.2.1 should be stated explicitly?

Theoretical Claims

No proofs to check, though as previously mentioned, ideally it's stated clearly that $\mathcal{I}^t$ is only for sampling, not for testing.

Experimental Designs Or Analyses

The experiments showcase well the utility of aLTT on the example problems. Though, as previously mentioned, the RL problem may be too stylized.

Supplementary Material

Section A seems fine.

Relation To Broader Scientific Literature

Xu et al (2021) stated the e-process bandit framework from which this article inherits. Waudby-Smith and Ramdas (2024) proposed the particular e-process that is used here. And Angelopoulos et al (2021) cast the hyperparameter optimization problem as a multiple testing problem in the form of LTT.

This paper might be distinguished by the recognition that these tools can be fruitfully applied to settings that require expensive queries from an environment, such as policy selection in offline RL and prompt engineering for LLMs. Even if these methods are not directly applied (due to the conservatism displayed in Figure 3), perhaps they could be adjusted in practice and used as a heuristic when the finite sample guarantee is not precisely required.

Essential References Not Discussed

Xu et al. (2021) is cited, but the bandit multiple testing setup there is precisely the one of which the current one is a special case, and this framing is not discussed. An earlier paper using the same framework, but without e-processes, is Jamieson and Jain (2018), which is not cited.

Other Strengths And Weaknesses

The paper is relatively clear, but there was a point of frustration regarding whether $\mathcal{I}^t$ is tested over or simply used for sampling, as previously mentioned.

I think this is an interesting paper with an application to a domain not previously considered by Xu et al (2021) and Angelopoulos et al (2021), but methodologically it is not new. Its strength would then lie in its impact to, e.g. prompt engineering or RL policy selection, though I don't think I can evaluate this area.

Other Comments Or Suggestions

  • To remind the reader, you could refer to $N$ as the cardinality of $\Lambda$ wherever it appears.
  • Equations (4) and (5) should be $\alpha$, not 1.
  • Box in Section 2.3: change "selected for testing" to something like "selected for estimating the risk more carefully"?
  • Figure 3: Dashed lines are hard to see, why not say "triangles"? Also, it might be nice to walk the reader through this figure more.
Author Response

Thank you for your insightful comments. Below we address your questions:

Questions For Authors: Yes, adaptive LTT (aLTT) substantially outperforms LTT, even when LTT uses fixed-time valid p-values. The table below reports final TPR for aLTT and LTT using p-values from Waudby-Smith & Ramdas (2024) and Hoeffding-Bentkus p-values (Angelopoulos et al., 2021). The latter is tight for 0-1 losses. In all cases, aLTT with adaptive acquisition ($\epsilon<1$) achieves significantly higher true positive rates (TPR), demonstrating its advantage stems from data-adaptive acquisition rather than LTT's use of weaker p-values. We will include these results in the revised paper.

                                LTT WSR       LTT HB p-value   aLTT ε=1 (non-adaptive)   aLTT ε=0.25 (adaptive)
Online RL                       0.14 ±0.07    0 ±0.0           0.15 ±0.06                0.86 ±0.05
Automated Prompt Engineering    0.27 ±0.06    0.27 ±0.06       0.27 ±0.06                0.65 ±0.05

P-values based on the central limit theorem are asymptotically valid but unsuitable for rigorous finite-sample guarantees. Such p-values can inflate FDR and lack statistical soundness—for instance, prediction sets from these p-values yield an empirical FWER of 0.154, exceeding the intended $\delta=0.10$.

Claims and Evidence

  • We acknowledge the ambiguity in the term "testing" and will replace it with "evaluation" when referring to the data-adaptive acquisition policy. Specifically, at each time step $t$, a subset of hyperparameters $\mathcal{I}^t \subseteq \Lambda$ is selected for evaluation, updating only the e-processes in $\mathcal{I}^t$ (see Eq. 14). This clarifies that all hypotheses in $\Lambda$ are tested, but only a subset is evaluated at each step.

  • We believe Figure 3 aligns with theory, as both empirical FWER and FDR stay below $\delta$. It shows FWER (blue) and FDR (red) for an FWER-controlling procedure (dashed triangle) and an FDR-controlling procedure (solid square). Since FWER is stricter than FDR, for either scheme (triangle or square), the FWER curve (blue) is always higher than the FDR curve (red). Likewise, as the FWER-controlling procedure (dashed triangle) is stricter, its FDR and FWER curves are lower than those of the FDR-controlling procedure (solid square).

Methods and Eval

  • RL and LLMs are widely used in high-stakes and high-risk applications. For example, in the field of robotics, RL algorithms are used for robotic surgery, autonomous navigation, robotic arm control, and more. In these cases, policies are often trained offline using simulators to mitigate the risks associated with training policies in the real world. However, due to the mismatch between simulators and the real world, the actual performance of these policies is often unknown. In Section 5.1 we exemplify how aLTT can be used to identify policies that satisfy real-world performance guarantees, while limiting the number of real-world interactions. Similarly, LLMs are used in critical fields such as medical diagnosis and legal applications. For example, Med-PaLM is a large language model (LLM) designed to provide high-quality answers to medical questions. The performance of these models can vary depending on how they are prompted, rendering their use potentially harmful. In the automated prompt engineering example proposed in Section 5.2 we showcase how aLTT can be used to identify prompt templates whose accuracy falls above the minimum reliability level.

  • The FWER criterion ensures high-confidence reliability, whereas FDR prioritizes statistical power in large-scale testing. FDR control is common in exploratory research, including genomics, neuroimaging, and social sciences, enabling broader discovery while accepting a higher false positive rate.

  • We assume the standard episodic RL setting [2], where agent interactions occur in episodes with resets to the initial state. Risk (15) measures average episodic performance, and reliability requirements apply over the initial state distribution $P_{s^1}$ and policy roll-out.

[2] Sutton RS, Barto AG. Reinforcement learning: An introduction.

Related Work

  • In the revised text, this relation will be further clarified as “This work aims to improve the data efficiency of LTT by leveraging recent advances in sequential multiple hypothesis testing (MHT) based on e-processes (Vovk & Wang, 2021; Waudby-Smith & Ramdas, 2024; Xu et al., 2021). Unlike LTT, which employs standard p-value hypothesis testing, we propose an adaptive learn-then-test strategy based on the hypothesis testing framework proposed in Xu et al. (2021) to perform statistically valid hyperparameter selection for machine learning models and generate finite-sample performance certificates.”
  • Thanks for bringing Jamieson and Jain (2018) to our attention!
Reviewer Comment

Thank you for acknowledging my comments! To a few of your points, here are just a few followups, which may not require a response.

  • Thank you, it makes sense that the LTT + WSR procedure will be dominated. But for LTT + CLT, after T = 5000 rounds, I would have expected normality to kick in, and its small sets may be attractive to a practitioner... this comparison may be worth adding as well. I would find it compelling that it dominates the CLT by so much.
  • In Figure 3, what I had meant is that it should not be construed that the FDR-controlling approach also controls FWER. The FWER is always below $\delta$, which is not guaranteed by the FDR control. Perhaps this is also worth mentioning.
  • Thank you for your explanation in Methods and Eval. I actually do find the RL and LLMs examples compelling, especially Med-PaLM. In robotic surgery and autonomous navigation, the environment does not seem static, leading to a slightly different MDP from episode to episode (that is, every time $Z$ is queried). But the point that it is a standard assumption is well taken.
  • I am not altogether convinced by the utility of FDR---the reasons for its appeal in a genomics or neuroimaging scenario do not seem to apply here---but perhaps a practitioner may find it appealing as a heuristic way to narrow down the $\lambda$ space.

Also in "Other Comments Or Suggestions", please disregard my note on Equations (4) and (5), which is incorrect, though it might be nice to remark that the $\alpha$ dependence lies in $\Lambda^{\textup{unrel}}$. The other comments you can take or leave.

I have improved my score to Weak Accept (originally Reject).

Author Comment

Thank you for the thoughtful follow-up comments. We agree that including a comparison with LTT + CLT could be a valuable addition—especially to illustrate its power and practical appeal, even if it lacks formal statistical guarantees. We’ll consider adding this comparison in the revised version. Thank you as well for the clarification regarding the potential misinterpretation of Figure 3. You’re absolutely right that one should not conclude that FDR control implies FWER control, as the observed behavior is empirical and specific to our experimental setup. We’ll make sure to clarify this point in the revision.

Review (Rating: 4)

The authors propose Adaptive Learn-Then-Test (aLTT), a method designed for calibration of AI applications over discrete sets of hyperparameters. The problem is formulated as a multiple hypothesis testing (MHT) task. Hypotheses are tested using e-processes, enabling adaptive and sequential selection of hyperparameters while updating the e-processes based on accumulating evidence over time. The method is applied to two settings: policy selection for offline Reinforcement Learning and prompt engineering, showing improvements in true positive rate compared to non-adaptive baselines.

Questions For Authors

  1. What is the intuition behind the inability of aLTT to uncover the entire set of valid hyperparameters in the experiments? Understanding this limitation more deeply could help assess the practical utility and potential extensions of the method.

Claims And Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods And Evaluation Criteria

The proposed method and evaluation criteria are appropriate and well-justified for the calibration of AI applications.

Theoretical Claims

There are no formal theoretical proofs presented in the paper. However, I reviewed the formulas provided and verified that they are consistent with their descriptions in the text. The claims regarding the algorithm’s behavior, as derived from these formulas, appear sound and align with the expected outcomes.

Experimental Designs Or Analyses

The experimental design and analyses are sound, clearly described, and well-aligned with the goals of the paper.

Supplementary Material

I reviewed the entirety of the supplementary material. I appreciated the detailed discussion of FWER and FDR-controlling procedures, as well as the additional experiments, including:

  • Ablation studies on the betting strategy,
  • Quantile risk control,
  • Wireless resource allocation,
  • Task-level results on the instruction induction dataset.

Relation To Broader Scientific Literature

Key contributions:

  • Introduction of e-processes for calibration in AI applications.
  • Adaptive, sequential selection of hyperparameters in a statistically principled manner.

Essential References Not Discussed

There are no essential references missing from the current discussion.

Other Strengths And Weaknesses

Strengths:

  • Clear and accessible writing.
  • Well-explained and principled algorithm design.
  • Strong experimental evidence.

Weaknesses:

  • No noticeable weaknesses.

Other Comments Or Suggestions

No additional comments or suggestions.

Author Response

Thank you for taking the time to review our paper! Below is our response to your question and your comment regarding the absence of formal proofs:

  • The intuition behind aLTT’s inability to discover all hyperparameters is that, in our experiments, we use finite datasets, which may sometimes be insufficient for aLTT to accumulate enough evidence to identify the entire set of valid hyperparameters.

  • In the revised text we have decided to formally state the validity of aLTT with the following theorem: "Given a hyperparameter set $\Lambda$, a reliability level $\alpha \in [0,1]$, and an error level $\delta \in [0,1]$, aLTT with an FWER- or FDR-controlling selection rule returns a final prediction set $\hat{\Lambda}^{\rm aLTT, T}$ that satisfies $(\alpha,\delta)$-FWER control or $(\alpha,\delta)$-FDR control, respectively."

Review (Rating: 4)

The submission proposes an adaptive algorithm for bandit multiple hypothesis testing with e-processes and anytime-valid false discovery rate control. The algorithm is dubbed "adaptive learn then test" (aLTT) for its connections to the "learn then test" (LTT) framework, which finds a subset of hyper-parameters that provide risk control, but in a non-adaptive fashion.

Experiments on a reinforcement learning and a prompt selection task showcase the behavior of the proposed method in comparison to LTT. aLTT displays larger true positive rates across experiments, while controlling false discovery rate (or family-wise error rate).

Questions For Authors

I have no further questions for the authors

Claims And Evidence

The connections between this submission and Xu et al. (2021) should be clarified. The proposed aLTT algorithm boils down to Algorithm 1 in the reference. It should be made clear whether the contribution of the submission is to apply the algorithm of Xu et al. to machine learning settings, or to propose a novel algorithm.

Methods And Evaluation Criteria

The evaluation criteria are appropriate

Theoretical Claims

Theoretical claims are mentioned in the main text without formal proofs. Anytime FDR validity does not immediately follow from validity of the e-BH procedure in the general case (see Wang et al., "Anytime-valid FDR control with the stopped e-BH procedure"). Within the context of the submission, the arms are independent so FDR validity should be satisfied. It should be discussed when the proposed algorithm fails to avoid misuse.

Experimental Designs Or Analyses

Could the authors expand on how ground-truth is defined in each experiment to evaluate true positive rate?

Supplementary Material

No, I did not review supplementary material

Relation To Broader Scientific Literature

The submission builds upon bandit multiple hypothesis testing, conformal risk control, and e-processes, which are well-established ideas in their research areas.

Essential References Not Discussed

Connections with Xu et al. (2021) should be discussed

Other Strengths And Weaknesses

Strengths

  • The problem of statistically-valid hyper-parameter selection is compelling
  • The idea of using anytime valid inference for this problem is neat

Weaknesses

  • Presentation is hard to follow at times
  • Certain claims are hand-wavy and should be made precise

I will expand on my comments and I am looking forward to discussing with the authors!

Other Comments Or Suggestions

Definition of hyper-parameter

Hyper-parameters are introduced in the submission and generally understood by the community as train-time quantities. The LTT framework and the proposed method, however, focus on calibration of fixed, pre-trained models, or am I missing something?

In the offline reinforcement learning setting, it was somewhat confusing to see the policies themselves being treated as hyper-parameters. I thought the experiment was going to tackle the problem of choosing, for example, the strength of a regularization parameter to guarantee risk control after training.

Clarification questions on proposed method

The considered problem boils down to the bandit multiple hypothesis testing problem described in Xu et al. (2021). That is, at each round of the algorithm, the adaptive mechanism chooses a subset of arms to query, and then updates their respective wealth processes according to the outcomes of the queries. It should be clarified and discussed how aLTT differs from the previous work of Xu et al.

The aLTT procedure is introduced as querying a subset of the arms at each round. In practice, the $\epsilon$-greedy procedure only selects one arm at a time. This falls short of showcasing the proposed method in full. Could the authors expand on reasonable choices of adaptive mechanisms different from the greedy one?

It should be discussed under which assumptions aLTT provides anytime valid FDR control, as this property does not immediately follow from validity of the e-BH procedure in a general setting (see Wang et al., "Anytime-valid FDR control with the stopped e-BH procedure").

Could the authors expand on the choice of providing FWER control by converting the e-processes to p-processes?

Could the authors expand on the message they are trying to convey in point 2 of the highlighted box on page 3? I am not sure I understand with respect to what it is claimed that the $Z_i$ "can be arbitrarily dependent". $Z_i^t \sim P_Z$, so the $Z_i^t$'s are i.i.d.

Experiments

How is ground-truth determined to compute TPR in both experiments?

I have a few questions on the prompt engineering experiment:

  • How many prompts are being tested?
  • What kind of tasks are being considered? It would be interesting to include some examples of good and bad prompts to verify whether there are distinctive features in the two groups.
  • I am not sure how to verify the claim that "stricter accuracy requirements are seen to reduce the number of discovered reliable prompts, leading to an increase in the minimal instruction length". I could have a set $\{p_1, p_2\}$ with minimal prompt length $\ell_1$, and a larger set $\{p_3, p_4, p_5\}$ with minimal instruction length $\ell_2 > \ell_1$.
  • Instead of minimal instruction length, it might be more informative to report the bottom $\delta$ quantile in order to account for the failure probability.

Minor comments

  • Lines 16-17, right column: repetition of "or"
  • Lines 99-100, left column: e-values are not p-values.
  • Definition 2 is not the usually considered notion of FDR: $\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}] = \mathbb{E}\!\left[\frac{\lvert S \cap S_0 \rvert}{\lvert S \rvert \vee 1}\right] = \mathbb{E}\!\left[\frac{\lvert S \cap S_0 \rvert}{\lvert S \rvert} \,\middle|\, \lvert S \rvert > 0\right] \cdot \mathbb{P}[\lvert S \rvert > 0]$.
  • Algorithm 1: the mechanism $\mathcal{Q}$ should also be a function of $\mathcal{D}^{t-1}$ to reflect adaptability?
  • Line 296, left column: $\mathcal{E}$ is overloaded with the e-processes
  • Figures: it might be more readable to report LTT performance as a horizontal line since it does not depend on $t$
Author Response

Thank you for your insightful comments. Below we address your questions point by point:

Theorem and Stopped e-process: In the revised version, we will formally state the aLTT guarantee in a theorem and provide a proof in the Appendix, which follows from the validity of the e-processes: "Given a hyperparameter set $\Lambda$, a reliability level $\alpha \in [0,1]$, and an error level $\delta \in [0,1]$, aLTT with an FWER- or FDR-controlling selection rule returns a final prediction set $\hat{\Lambda}^{\rm aLTT, T}$ that satisfies $(\alpha,\delta)$-FWER control or $(\alpha,\delta)$-FDR control, respectively." Regarding the validity of the e-BH procedure, we will add the following statement clarifying that e-BH maintains statistical validity under the scenario discussed in the paper, while for the alternative scenario in which data is re-used at a later testing round it is necessary to use the procedure of [1]: "Finally, note that while the e-BH procedure directly applies to the sequential testing of Section 2.3, where we allow the risk estimates $\mathcal{R}^t=\{R(\lambda_i, Z_i^t)\}_{\lambda_i\in \mathcal{I}^t}$ to be arbitrarily dependent --- for example, by using the same data to evaluate all models in $\mathcal{I}^t$ --- reusing the same data for the evaluation of different hyperparameters at later stages can invalidate the e-process validity unless this is properly corrected [1]."

[1] Wang et al. Anytime-valid FDR control with the stopped e-BH procedure. arXiv preprint arXiv:2502.08539. 2025 Feb 12

Hyperparameter Definition: We consider a broad definition of hyperparameters, including train-time, architectural, and post-processing parameters. For example, in the RL experiment (Sec. 5.1), "We produce $N = 20$ control policies by setting the hyperparameter $\lambda$ in the TD3+BC training objective." The revised text will clarify that aLTT applies to both train-time and inference-stage hyperparameters.

Xu et al.: We agree that aLTT relies on the sequential multiple hypothesis testing framework based on e-processes introduced by Xu et al. (2021). In this sense, aLTT is to Xu et al. (2021) what the original LTT is to multiple hypothesis testing with FWER control. In the original text we wrote "This work aims at improving the data efficiency of LTT by leveraging recent advances in sequential MHT based on e-processes (Vovk & Wang, 2021; Waudby-Smith & Ramdas, 2024; Xu et al., 2021)". In the revised text, this relation will be further clarified as "This work aims to improve the data efficiency of LTT by leveraging recent advances in sequential multiple hypothesis testing (MHT) based on e-processes (Vovk & Wang, 2021; Waudby-Smith & Ramdas, 2024; Xu et al., 2021). Unlike LTT, which employs standard p-value hypothesis testing, we propose an adaptive learn-then-test strategy based on the hypothesis testing framework proposed in Xu et al. (2021) to perform statistically valid hyperparameter selection for machine learning models and generate finite-sample performance certificates."

Alternative Data Acquisition: The $\epsilon$-greedy strategy generalizes to a top-$K$ approach, where the $K$ largest e-processes are selected with probability $1-\epsilon$, and $K$ random hyperparameters are chosen otherwise. We will revise the text accordingly and add an experiment (https://ibb.co/d46Q2yPK).
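To illustrate the generalization described above, a minimal sketch of a top-$K$ acquisition step (an illustrative assumption about the mechanics, not the authors' code):

```python
import numpy as np

def top_k_acquisition(E, K, eps, rng):
    """Return K hyperparameter indices to evaluate in this round:
    with probability 1 - eps exploit the K largest e-processes,
    otherwise explore K indices chosen uniformly at random."""
    E = np.asarray(E, dtype=float)
    if rng.random() < eps:
        return rng.choice(len(E), size=K, replace=False)
    return np.argsort(-E)[:K]
```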

Why p-values from e-processes: While for FDR control there exist procedures that directly operate with e-values, for FWER control e-values are first converted to p-values using e-to-p calibration, and then p-value-based FWER procedures are used [1].

[1] Vladimir Vovk and Ruodu Wang. E-values: Calibration, combination and applications.
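A minimal sketch of this conversion, assuming the standard e-to-p calibrator $p = \min(1, 1/e)$ applied to the running maximum of each e-process, with Bonferroni as the p-value-based FWER procedure (the specific procedure used in the paper may differ):

```python
import numpy as np

def fwer_selection_from_eprocesses(E_max, delta):
    """E_max[i] = running maximum of the i-th e-process.
    e-to-p calibration: p_i = min(1, 1 / E_max[i]) is an anytime-valid p-value,
    since Ville's inequality bounds P(sup_t E_i^t >= 1/delta) by delta.
    Bonferroni then rejects H_i (declares lambda_i reliable) if p_i <= delta / N."""
    E_max = np.asarray(E_max, dtype=float)
    p = np.minimum(1.0, 1.0 / E_max)
    N = len(p)
    return [i for i in range(N) if p[i] <= delta / N]
```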

Highlighted box page 3: We clarify that while data for each hyperparameter is i.i.d., risk estimates across hyperparameters may be dependent due to shared data reuse (see also first point).

Experiments:

  • Unless the experiments are conducted using synthetic data, the ground truth is generally unknown. In real-data experiments, the ground truth is estimated from large hold-out samples, which will be clarified in the revision.
  • The average number of tested prompts is 66 across tasks.
  • We consider tasks from the Instruction Induction dataset; in Appendix B.3 we report the performance per task, specifying which tasks we consider. In the revised version we will subsample some interesting good/bad prompt examples from the exhaustive list (https://ibb.co/1GV2MHyj).
  • In the revised text, we will clarify that this statement is made in an empirical and average sense. In fact, due to the randomness of testing, it is possible that the discovered sets are not exactly subsets of one another. However, we find that a stricter requirement leads to a smaller predicted set of reliable prompts (on average) and an increase in the minimal instruction length (on average).
  • Here we plot the failure probability instead of the minimal length (https://ibb.co/QSCN4sy).
Reviewer Comment

I sincerely thank the authors for their consideration of my questions and comments, which have been addressed.

I thank the authors for clarifying the connections between their proposed algorithm and existing methods, I suggest this connection be made crystal clear in order not to confuse readers coming from the broader ML community who may be unaware of testing by betting literature.

The contribution of the paper is a neat and useful application of sequential testing ideas which would be of interest to the community.

I still remain unconvinced on the statement that the proposed method applies to train-time quantities. Selecting policies post-hoc still requires training a policy for each hyperparameter value, which may be prohibitive. I suggest to smooth out this claim, since all experiments presented in the paper apply to post-training quantities.

I believe FWER could also be controlled by an equivalent e-Bonferroni correction without converting to anytime valid p-values? See Hartog et al. "Family-wise Error Rate Control with E-values".

Thank you for including example with a top-K version of the selection mechanism and example prompts! Some of them definitely do not seem relevant to the task at all, e.g. "choose the better word, but my friend doesn't" for "larger animal" task. I was somewhat surprised to see these prompts being evaluated at all. Could the authors expand on their findings regarding useful discoveries of prompts that "sound good" but actually are not selected by the testing mechanism? I think this could be a compelling use case.

I updated my score to reflect the rebuttal and updated claims.

Author Comment

Thank you for your valuable feedback and responsiveness:

  • We agree with the suggestion to rectify the claim regarding train-time quantities. We will clarify that aLTT naturally applies to hyperparameter selection when testing is performed post-training (as in the experiments). In contrast, if testing is conducted during training, aLTT can still be applied, but with the caveat that any evidence obtained from testing past versions of the model must be discarded, as it does not provide valid risk estimates for the latest model (since we are effectively testing a new model or hypothesis).
  • You are absolutely right! Hartog and Lei (2025) provide an FWER-controlling procedure that directly uses e-values. This is a very recent reference, and we will make sure to include it in the revised version, as it strongly supports the proposed approach based on e-value testing. Thanks a lot for bringing this up!
  • The reason for evaluating seemingly irrelevant prompts stems from the way candidate prompts are generated. We followed the forward generation approach from Zhou et al. (2023), where (i) a few-shot example is first provided to a larger LLM, and (ii) the model is then asked to generate prompts that effectively describe the given data. Given this approach, there is a non-negligible chance that some prompts may be less relevant—or even entirely irrelevant—due to imperfections in the larger LLM, despite using a strong model (Llama3.3 70B Instruct, released last December). That said, we fully agree with your point that our testing mechanism successfully filters out prompts that may sound appropriate but perform poorly in actual deployment. For instance, in the first_word_letter task—where the goal is to extract the first letter from a given word (e.g., “y” from “year”)—the prompt “Extract the first letter from each word” initially appears ideal but achieves a test score of only 0.67, leading to its rejection by our mechanism. In contrast, a similar prompt, “Output the first letter of the word,” achieves a test score of 0.89 and is selected. We will make sure to emphasize this important point in the revised version.
Final Decision

This paper extends the Learn Then Test (LTT) paradigm of hyperparameter selection towards an adaptive and on-the-fly approach called adaptive LTT (aLTT). The main benefits of moving away from the static p-value approach and towards a dynamic e-values/processes approach is improving the sample efficiency of such algorithms, making them more amenable to practical applications. To this end, the authors illustrate the improvements of aLTT in a number of diverse tasks, from prompt engineering to offline RL.

Reviewers were more or less unanimous in their positive feedback of this work. The core idea is simple, elegant and well motivated, and the paper is well-written such that all the reviewers were able to achieve a strong understanding. There were no real major concerns, and the authors seemed to address the minor concerns in the rebuttal phase.

Therefore taking all these points together, the paper merits inclusion into the program.