Multi-Objective Hyperparameter Selection via Hypothesis Testing on Reliability Graphs
Abstract
Reviews and Discussion
This paper introduces Reliability Graph-based Pareto Testing (RG-PT), a novel framework for multi-objective hyperparameter selection that provides formal reliability guarantees. The authors propose modeling the hyperparameter space as a directed acyclic graph (DAG) called a reliability graph (RG), where nodes represent candidate hyperparameters and edges encode expected reliability relationships. This approach extends existing methods like Learn-Then-Test (LTT) and Pareto Testing (PT) by incorporating structured knowledge about the hyperparameter space, which enables more efficient exploration during multiple hypothesis testing.
Strengths and Weaknesses
Strengths:
1. The paper provides formal guarantees for FDR control, extending the statistical rigor of existing hyperparameter selection methods to settings with structured knowledge.
2. The authors evaluate their method across diverse application domains, demonstrating consistent improvements over state-of-the-art methods.
3. The framework naturally applies to important real-world scenarios, particularly in prompt engineering for LLMs, where trading off reliability and cost is crucial.
4. The method elegantly incorporates prior information through the Bradley-Terry model, allowing domain experts to guide the hyperparameter selection process.
Weaknesses:
1. While the paper demonstrates that RG-PT works well, it is unclear how sensitive the performance is to the specific choice of DAG structure and depth parameter $D$. The ablation study in Appendix E.3 only partially addresses this concern.
2. The paper mentions computational complexity in Appendix C, but doesn't provide empirical runtime comparisons with LTT and PT, which would be valuable for practical applications.
3. Ironically, RG-PT itself introduces hyperparameters ($D$, $n_p$, $\tau$) that need to be selected. The paper provides values in Table 1 but doesn't fully explain how these were chosen.
4. The paper acknowledges the limitation to discrete hyperparameter spaces, but doesn't explore potential extensions to continuous spaces.
5. While FDR control is proven, there is no theoretical analysis of the statistical power of RG-PT compared to PT and LTT, which would strengthen the theoretical contribution.
Questions
1. How might the approach be extended to continuous hyperparameter spaces? Could techniques like discretization or Gaussian process modeling be incorporated?
2. The paper mentions that the choice of prior parameters doesn't affect the validity properties of RG-PT, but how do they impact its power? Is there a principled way to select these parameters?
3. How does the computational cost of RG-PT compare to LTT and PT in practice, especially for large hyperparameter spaces?
4. What is the sample complexity of RG-PT as a function of the DAG structure? Can you provide finite-sample guarantees rather than asymptotic ones, and prove that RG-PT achieves better sample efficiency than PT under specific conditions? This would require analyzing how the graph learning process affects the statistical efficiency of the subsequent hypothesis testing procedure, particularly when the number of hyperparameters is large relative to the sample size.
5. The learned reliability graphs potentially contain valuable domain knowledge about hyperparameter relationships. Can you provide a deeper analysis of these structures across different domains to extract actionable insights? For instance, in the LLM prompt engineering application, what linguistic patterns or semantic relationships are captured in the graph structure that could inform better prompt design principles beyond the specific experimental settings?
Limitations
The authors adequately address the limitations of their work, particularly noting the exclusive applicability to discrete hyperparameter spaces and the lack of theoretical results on power and sample efficiency. They also acknowledge potential future work directions, including optimizing the RG structure to maximize power and integration with sequential testing methods based on e-processes.
Paper Formatting Concerns
The paper follows the NeurIPS 2025 formatting guidelines. No major formatting issues were observed.
We thank the reviewer for their constructive and thoughtful feedback. Our responses to each of the concerns are provided below.
- Extension to continuous hyperparameter spaces:
As mentioned in lines 88–89 of the draft, a pre-selection step using methods such as LLM judges, Bayesian optimization, and Hyperband can be used to construct a discrete candidate set from a continuous hyperparameter space. For example, Bayesian optimization can efficiently explore the continuous domain using acquisition functions, and the most promising configurations can be selected to form the input to RG-PT [Snoek et al., 2012]. Similarly, Hyperband adaptively allocates resources to a large pool of randomly sampled configurations from a continuous domain, ultimately yielding a high-performing discrete subset [Li et al., 2017].
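To make the pre-selection step concrete, below is a minimal sketch (our illustration, not the paper's pipeline) of how Bayesian optimization over a continuous space can produce a discrete candidate set for RG-PT. It uses scikit-optimize; the objective `val_risk` and the two-dimensional search space are hypothetical stand-ins.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Real

def val_risk(x):
    # Hypothetical validation-risk oracle for a configuration
    # x = [learning_rate, temperature]; lower is better.
    lr, temp = x
    return (np.log10(lr) + 3.0) ** 2 + (temp - 0.7) ** 2

# Continuous search space: log-uniform learning rate, uniform temperature.
space = [Real(1e-5, 1e-1, prior="log-uniform"), Real(0.0, 2.0)]

res = gp_minimize(val_risk, space, n_calls=50, acq_func="LCB", random_state=0)

# Keep the k most promising evaluated configurations as the discrete
# candidate set handed to RG-PT.
k = 10
best = np.argsort(res.func_vals)[:k]
candidates = [res.x_iters[i] for i in best]
```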
- On the impact and selection of prior parameters (e.g., $D$, $n_p$):
We agree that these parameters influence the power (though not the statistical validity) of RG-PT. We have provided a detailed and principled procedure for selecting both the DAG depth $D$ and the prior weight $n_p$ in our response to Reviewer T6Dk.
- Computational complexity:
A detailed complexity analysis and comparison with LTT and PT is already provided in Appendix C of the paper. In practice, we found RG-PT to run efficiently even on large candidate sets (e.g., 100+ configurations), with runtimes under 5 minutes on a standard Apple MacBook Pro (M1 chip). We are happy to include empirical runtime comparisons in the revised appendix to complement the theoretical analysis in Appendix C.
- Sample efficiency and finite-sample behavior:
As demonstrated through various experiments in [Ramdas et al., 2017], DAG-structured testing yields higher power compared to linear testing when the graph encodes true logical dependencies among hypotheses. To illustrate this more rigorously, we next provide a straightforward analytical comparison in the context of fixed sequence testing (FST), which can be seen as a special case of graph-based (DAGGER) testing with a chain graph.
Suppose we wish to test hypotheses $H_1, \dots, H_n$. Let $m$ of these hypotheses correspond to non-null hypotheses, i.e., reliable hyperparameters, and the remaining $n - m$ be true nulls, i.e., unreliable hyperparameters. Assume all $p$-values $p_i$, one per hypothesis $H_i$, are mutually independent and identically distributed. In particular, let $F_0$ be the CDF of all $p$-values under the null, and $F_1$ be the CDF under the alternative, where $F_1(t) \geq F_0(t)$ reflects the higher concentration of small $p$-values under the alternative hypothesis. Accordingly, for a testing threshold given by $\delta$ (i.e., $p_i \leq \delta$ rejects the null), the type-I probability (of incorrectly rejecting the null) is $F_0(\delta)$, while the probability of correctly detecting a non-null hypothesis is $F_1(\delta)$.
Assume now a well-informed testing order whereby FST is applied in the order from $H_1$ to $H_n$, with the $m$ non-null hypotheses placed first. FST proceeds through these hypotheses, deciding for the alternative (reliable hyperparameter) if $p_i \leq \delta$. The probability of correctly rejecting all non-null hypotheses is given by $F_1(\delta)^m$.
Now consider a misspecified testing order in which $k$ null hypotheses appear among the first positions. Let us pessimistically assume that these $k$ nulls occur before all non-nulls. The test may stop at the first null encountered if its $p$-value exceeds $\delta$, which happens with probability $1 - F_0(\delta)$. Therefore, to reach and reject all non-nulls, we must reject all $k$ preceding nulls (each with probability $F_0(\delta)$) and then reject all $m$ non-nulls (each with probability $F_1(\delta)$). Thus, the probability of correctly rejecting all non-nulls is
$$F_0(\delta)^k \cdot F_1(\delta)^m.$$
This derivation shows that even one null placed early in the sequence reduces power by a factor of $F_0(\delta)$. In the worst case, with multiple early nulls, the power degradation becomes exponential in $k$.
Similar calculations can be done by considering a more complex structure in the hypothesis space via a DAG. In this case, when the DAG is well specified, the set of non-null hypotheses (reliable hyperparameters) corresponds to a rooted subtree of the DAG.
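As a sanity check of the calculation above, the following short Monte Carlo simulation (our illustration) compares the two orderings, assuming null $p$-values are Uniform(0,1), so $F_0(t) = t$, and alternative $p$-values are Beta(0.05, 1), so $F_1(t) = t^{0.05}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, delta, trials = 10, 2, 0.2, 50_000  # non-nulls, early nulls, threshold

def all_nonnulls_rejected(nulls_first):
    p_alt = rng.beta(0.05, 1.0, size=m)   # non-null p-values, F1(t) = t**0.05
    if nulls_first:
        p_null = rng.uniform(size=k)      # early null p-values, F0(t) = t
        seq = np.concatenate([p_null, p_alt])
    else:
        seq = p_alt
    # FST stops at the first p-value above delta, so all non-nulls are
    # rejected only if every p-value up to and including them is <= delta.
    return bool(np.all(seq <= delta))

well = np.mean([all_nonnulls_rejected(False) for _ in range(trials)])
miss = np.mean([all_nonnulls_rejected(True) for _ in range(trials)])
f1 = lambda t: t ** 0.05
print(f"well-ordered:  {well:.3f}  theory: {f1(delta) ** m:.3f}")
print(f"misspecified: {miss:.4f}  theory: {delta ** k * f1(delta) ** m:.4f}")
```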
- On extracting domain knowledge from the learned DAGs:
We agree with the reviewer that the learned reliability graphs may encode meaningful domain-specific structure, especially in rich settings such as prompt engineering. For instance, the DAG could implicitly capture relationships between prompt length, linguistic specificity, or formatting cues and their impact on performance. Extracting such patterns could indeed yield actionable design principles for future prompt development.
That said, our focus in this paper is on statistical reliability and multi-objective selection, rather than semantic analysis or pattern mining. A thorough analysis of graph interpretability across domains would require additional infrastructure (e.g., prompt annotation, linguistic taxonomy alignment), which is beyond the current scope. Nonetheless, we fully agree that this is a compelling direction for future work and will explicitly mention it in the conclusion section.
We thank the reviewer once again for the constructive feedback, and we hope that these additions and clarifications improve their view of the work and alleviate their concerns.
References:
- Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2951–2959.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2017). Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(1), 6765–6816.
- Ramdas, A., Barber, R. F., Wainwright, M. J., & Jordan, M. I. (2017). A unified treatment of multiple testing with prior knowledge using the p-filter. Annals of Statistics, 47(5), 2790–2821.
I have carefully reviewed the authors' rebuttal, which satisfactorily resolves my previous concerns. I recommend that the authors include these clarifications in the final version. As my score is already positive, I do not intend to change it.
This paper proposes RG-PT, a new framework for multi-objective hyperparameter selection with formal reliability guarantees. It builds upon LTT and PT by replacing the global linear ordering of candidate hyperparameters with a DAG, termed the reliability graph (RG). The RG is learned using BT ranking and non-negative Lasso regression based on prior information and held-out data. RG-PT applies DAGGER to perform FDR-controlling MHT. The method is validated across a wide range of tasks like prompt engineering, sequence-to-sequence translation, object detection, and telecommunications.
Strengths and Weaknesses
Strengths:
- The incorporation of a learned DAG structure to guide the hypothesis testing process represents a meaningful extension to the existing LTT and PT frameworks. The proposed framework also maintains formal FDR guarantees under arbitrary DAG structures using DAGGER, building on established statistical results.
- The experiments cover diverse tasks in NLP and vision, including prompt engineering, sequence-to-sequence translation, object detection, and telecommunications.
- The paper is generally well-written and easy to follow.
Weaknesses:
- In the depth assignment step of the RG learning, the BT model first generates scalar scores, and these scores are then clustered into $D$ clusters to assign depth. Although an ablation study over $D$ is done and shows that intermediate depths can improve power, this parameter is manually set and optimized via a grid sweep. This introduces an extra hyperparameter into RG-PT itself. Given that the method is intended for hyperparameter selection and automation, it is somewhat contradictory that a crucial internal parameter still relies on user intervention and manual tuning. I think the paper could be improved if the authors added more discussion of whether a data-driven strategy could determine $D$ adaptively, how sensitive the results are to the choice of $D$ across datasets, or whether the chosen $D$ generalizes across data splits.
- From my understanding, the central assumption for constructing the DAG is that a hyperparameter $\lambda$'s risk vector being predictable from $\lambda'$'s risk vector implies that $\lambda'$ is more reliable. My major concern is that non-negative Lasso only captures linear correlation, not a reliability dominance relationship. Is it possible that the linear predictability between two hyperparameters arises from shared noise, task structure, or other factors, rather than indicating a true reliability dominance relationship? This assumption may break the intended semantics of the DAG: edges are meant to indicate reliability dominance, not statistical dependency. Is there any theoretical justification or empirical validation for this assumption?
- Missing baseline: RG-PT is only compared against LTT and PT. The paper omits any comparison against non-statistical methods, which may offer better empirical trade-offs. Moreover, the paper does not compare against simpler but informative baselines that would help clarify where RG-PT's actual advantage comes from. Specifically, I would suggest the authors consider:
- Methods that use weak or uncertain prior signals to sort the candidates, followed by standard PT, without learning a strict DAG structure, and can often be implemented at much lower computational cost. Comparing RG-PT to such baselines would help reveal whether the use of a learned DAG actually contributes to improved reliability discovery or is simply a surrogate for rough sorting.
- PT variants with randomly shuffled orders, or uniform sampling from the candidates. Although such baselines do not provide statistical guarantees, they serve as important references. Without including them, it's difficult to assess whether RG-PT significantly improves naïve strategies.
- Given the reliance on the DAG structure for statistical validity and selection efficiency, the paper lacks evaluation or discussion of the structural sensitivity of the DAG. For example, can the authors compare different DAG construction pipelines or explore how perturbations to the DAG (e.g., pruning edges, changing cluster assignments) affect the final selected hyperparameters or FDR?
- (Minor) The font size in Figure 1(a) is small and difficult to distinguish.
- (Minor) It would be helpful to include a table of notations and definitions in the Appendix for easy reference.
Questions
See the weaknesses above, in particular regarding the interpretation of Lasso-based edges, the selection of the hyperparameter $D$, and the missing baselines.
Limitations
Yes
Final Justification
I maintain my original positive assessment of this paper.
Paper Formatting Concerns
N/A
We thank the reviewer for the thoughtful and detailed comments. We address each concern below and outline the specific clarifications and additions we will make in the revised manuscript.
- On the choice of graph depth $D$:
As noted by Reviewer T6Dk as well, we agree that this parameter plays a meaningful role in shaping the power of RG-PT. In our rebuttal to Reviewer T6Dk, we have now proposed and adopted a principled data-driven procedure for choosing $D$.
- On the semantics of edges inferred via non-negative Lasso:
We appreciate the reviewer’s insightful question. Indeed, our goal is to encode reliability dominance in the DAG edges, not merely predictive correlation. To clarify this, we now provide a structural motivation and explain why a linear model is a reasonable and practical choice.
Let us denote the RG as $G = (\Lambda, E)$, where $\Lambda$ is the set of candidate hyperparameters and $E$ is the set of directed edges. An edge $(\lambda' \to \lambda) \in E$ indicates that the reliability of $\lambda$ is dependent on that of $\lambda'$, i.e., $\lambda'$ is a ``parent'' of $\lambda$ in the DAG.
We assume that each configuration $\lambda$ has an unobserved binary reliability label $Z_\lambda \in \{0, 1\}$, where $Z_\lambda = 1$ indicates that $\lambda$ is reliable. We can model the relationship between these reliability variables using a stochastic Boolean structural equation
$$Z_\lambda = f_\lambda\big(\{Z_{\lambda'}\}_{\lambda' \in \mathrm{pa}(\lambda)}, \epsilon_\lambda\big),$$
where $f_\lambda$ is an unknown binary-valued function, and $\epsilon_\lambda$ is a noise term, which is generally correlated with the function $f_\lambda$'s output. The function $f_\lambda$ may, for instance, encode that hyperparameter $\lambda$ is likely to be reliable as long as a sufficiently large number of parent nodes are. Moreover, the noise term captures the possibility that a child configuration is unreliable ($Z_\lambda = 0$) due to random effects not captured by the function $f_\lambda$.
Although we do not observe the true reliability labels $Z_\lambda$, we do observe performance vectors $\hat{R}_\lambda$, such as per-task or per-example risks, which serve as continuous surrogates for reliability. We then approximate the above relationship using a linear regression model
$$\hat{R}_\lambda \approx \sum_{\lambda' \neq \lambda} \beta_{\lambda'} \hat{R}_{\lambda'},$$
with non-negativity constraints $\beta_{\lambda'} \geq 0$, which preserve the monotonicity of influence. This leads to the non-negative Lasso objective we solve to infer the edge set $E$.
This formulation does not claim that reliability is fundamentally linear in nature. Rather, it is a linearization of an unknown monotonic Boolean process, which may be justified as done in the literature on influence modeling [Wainwright & Chickering, 2006]. The sparsity induced by Lasso ensures that only the most predictive and interpretable relationships are retained. The non-negativity constraint encodes the domain prior that improving the parent should not reduce the reliability of the child.
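For concreteness, here is a minimal sketch of how parent selection via non-negative Lasso might look, assuming per-example risks are stored columnwise in a NumPy array; the function name and interface are ours, not the paper's.

```python
import numpy as np
from sklearn.linear_model import Lasso

def infer_parents(R, child_idx, candidate_idx, lam=0.05):
    """Infer parent edges for one configuration via non-negative Lasso.

    R: (n_examples, n_configs) array of per-example risks.
    child_idx: column index of the child configuration.
    candidate_idx: column indices of candidate parents (e.g., previous level).
    Returns the candidates with strictly positive coefficients.
    """
    X = R[:, candidate_idx]   # surrogate reliability signals of candidates
    y = R[:, child_idx]       # surrogate reliability signal of the child
    model = Lasso(alpha=lam, positive=True, fit_intercept=True)
    model.fit(X, y)
    return [candidate_idx[j] for j, b in enumerate(model.coef_) if b > 0]
```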
Ultimately, the FDR control of RG-PT remains valid regardless of any modeling choice made for the construction of the DAG. The DAG only affects the power of the procedure, not its validity.
- On missing baselines:
We thank the reviewer for highlighting the importance of benchmarking against simpler, heuristic, or non-statistical baselines. We note that this comparison has already been conducted in prior work: both LTT and PT evaluate their performance against non-risk-controlling baselines and demonstrate improved power under statistical control. Since our method consistently outperforms LTT and PT across all benchmarks, we implicitly improve over these naive baselines as well. Nevertheless, we appreciate the reviewer's suggestion and have added the following heuristic baselines in the object detection experiment.
- Chain graph PT: We use our proposed graph learning method to construct a chain graph by setting $D = |\Lambda|$, and then apply PT using this learned chain as the ordering.
- Randomly shuffled PT: We run PT on a chain graph in which the hyperparameters have been randomly shuffled, effectively removing prior structural information.
- Uniform sampling: We uniformly sample hyperparameters (without replacement) and return them as the selected subset, mimicking random selection without statistical guarantees.
The median object recall threshold achieved by chain graph PT, randomly shuffled PT, and uniform sampling was 0.61, 0.52, and 0.49, respectively, while for PT and RG-PT it was 0.56 and 0.68, respectively. The results show that randomly shuffled PT underperforms standard PT, while chain graph PT offers a modest improvement over PT but remains weaker than RG-PT, highlighting the benefits of using a learned DAG structure. Finally, uniform sampling performs erratically: the selected hyperparameters span the full performance range (reflecting the lack of selection precision), and the method exhibits a high average FDR of 0.54, confirming its lack of statistical reliability.
- On structural sensitivity of the DAG:
We agree with the reviewer that the quality of the learned DAG influences the power of the procedure. That said, the FDR guarantee remains intact regardless of DAG structure, as DAGGER is valid under arbitrary acyclic graphs.
Our approach assumes that there exists an unknown latent DAG that reflects true reliability dependencies. The goal of the learning procedure is to approximate this latent structure from available signals (BT scores, prior edges, and observed risk vectors). There are indeed many ways to learn such a DAG. We proposed a computationally tractable pipeline based on convex surrogates and empirical priors. Our ablations (e.g., on graph depth $D$) already explore aspects of structural variation. We will consider expanding the ablation scope (e.g., edge pruning and alternative clustering methods) in future work, though we note that defining ground truth reliability DAGs remains challenging in real-world datasets.
- Minor points: We will enlarge the font size in Figure 1(a) and add a table of notations to the Appendix for reader convenience.
We thank the reviewer once again for the constructive feedback, and we hope that these additions and clarifications improve their view of the work and alleviate their concerns.
References:
- Wainwright, M. J., & Chickering, M. (2006). Estimating the "wrong" graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7(9), 1829–1859.
I am satisfied with the authors' rebuttal. I am therefore inclined to stay with my original positive assessment of this paper.
The paper introduces RG-PT, a principled framework for multi-objective hyperparameter selection that elevates classical Learn-Then-Test and Pareto-Testing paradigms by embedding them in a reliability-graph structure. Each candidate configuration is a node in a DAG whose hierarchy and edges are learned from data-driven Bradley-Terry scores and non-negative Lasso, allowing the method to exploit transitive reliability relationships and to incorporate external priors. A DAGGER-based multiple-hypothesis procedure then traverses the graph, guaranteeing global FDR control while aggressively pruning implausible regions of the search space. Experiments on LLM prompt engineering and NMT decoding show that RG-PT achieves the same reliability guarantees as baselines.
Strengths and Weaknesses
Strengths:
- Reliability-Graph Formulation: The paper introduces a graph-based abstraction that generalises both Learn-Then-Test and Pareto Testing, letting the method exploit transitive reliability relations instead of a fragile total order;
- Strong Statistical Guarantees: By embedding the DAGGER procedure in the graph traversal, the approach proves global FDR control under arbitrary p-value dependence, offering firm, theory-backed reliability;
- Practical Impact Demonstrated: Experiments on LLM prompt engineering and machine translation show the method finds cheaper or higher-quality configurations than LTT/PT while respecting the same FDR budget, validating both efficiency and effectiveness;
- The paper's arguments and experiments are quite comprehensive.
Weaknesses:
- Users need to manually specify parameters such as the graph depth $D$ and the regularization strength of the non-negative Lasso, yet the paper provides no guidance for automatically selecting or tuning these values;
Questions
My main concern has already been listed in the weaknesses section, and I look forward to the authors’ response.
Limitations
Yes
Final Justification
I have thoroughly considered the authors’ response and find that it adequately addresses my earlier concerns. I recommend incorporating the clarifications into the final manuscript. Since my current evaluation is already positive, I will not revise my score.
Paper Formatting Concerns
I have no Paper Formatting Concerns
We thank the reviewer for their helpful review. We would like to first clarify that the theoretical FDR guarantee of RG-PT holds regardless of the choices of the parameters $D$, $n_p$, and $\tau$. These parameters affect the construction of the RG and hence influence statistical power, but not the validity of the testing procedure.
That said, we agree that practical guidance is important for users, and we plan to include in the revised manuscript a description of a specific procedure for selecting these hyperparameters. This is the method we employed in our experiments, and we will include it in the appendix of the revised version to aid future applications.
Graph Depth $D$: We select $D$ by computing the Silhouette score [Rousseeuw, 1987] for each candidate value and selecting the value that maximizes the average Silhouette score. The maximum candidate depth can be set to $|\Lambda|$, or to a fixed smaller value for computational efficiency. The Silhouette score quantifies both intra-cluster cohesion and inter-cluster separation, and is a well-established criterion for model-free cluster number selection [Kaufman & Rousseeuw, 2009]. This approach is fully data-driven, and fits naturally with the structure of RG-PT, where each cluster corresponds to a DAG level.
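A minimal sketch of this depth-selection rule, assuming scikit-learn and one-dimensional BT scores (k-means is one clustering choice; the exact clustering step may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_depth(bt_scores, d_max):
    """Pick the DAG depth D maximizing the average Silhouette score.

    bt_scores: (n_configs,) array of Bradley-Terry scores.
    Returns the best D and the cluster (level) assignment of each config.
    """
    X = np.asarray(bt_scores).reshape(-1, 1)   # 1-D scores as features
    best_d, best_score, best_labels = None, -np.inf, None
    for d in range(2, d_max + 1):
        labels = KMeans(n_clusters=d, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)    # cohesion vs. separation
        if score > best_score:
            best_d, best_score, best_labels = d, score, labels
    return best_d, best_labels
```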
Prior Weight $n_p$: We set $n_p$ using a predictive empirical Bayes procedure, which calibrates the trust in the prior information based on its ability to generalize to held-out data. Specifically, we follow the steps below:
- Create a set of candidate values for $n_p$.
- For each candidate $n_p$, compute the BT scores for the hyperparameters in $\Lambda$.
- Evaluate the predictive log-likelihood of the pairwise outcomes on a held-out validation dataset $\mathcal{D}_{\mathrm{val}}$ as
$$\ell(n_p) = \sum_{(i,j)} \Big[ y_{ij} \log \hat{P}_{n_p}(\lambda_i \succ \lambda_j) + (1 - y_{ij}) \log \big(1 - \hat{P}_{n_p}(\lambda_i \succ \lambda_j)\big) \Big],$$
where $y_{ij} = 1$ if $\lambda_i$ outperformed $\lambda_j$ in $\mathcal{D}_{\mathrm{val}}$, and 0 otherwise.
- Choose the $n_p$ that maximizes $\ell(n_p)$.
This procedure reflects standard practice in empirical Bayes model selection [Efron and Morris, 1973] and ensures that the prior contributes proportionally to its predictive usefulness.
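A compact sketch of this selection loop, where `fit_bt_scores` is a hypothetical stand-in for the prior-weighted BT fitting step:

```python
import numpy as np

def bt_prob(s_i, s_j):
    # Bradley-Terry probability that item i beats item j given positive scores
    return s_i / (s_i + s_j)

def select_prior_weight(candidates, fit_bt_scores, val_pairs):
    """Empirical-Bayes selection of the prior weight n_p.

    fit_bt_scores: callable n_p -> dict of BT scores per config (hypothetical).
    val_pairs: iterable of (i, j, y_ij) held-out comparisons, y_ij = 1 if i won.
    """
    def log_lik(n_p):
        s = fit_bt_scores(n_p)
        ll = 0.0
        for i, j, y in val_pairs:
            p = np.clip(bt_prob(s[i], s[j]), 1e-12, 1 - 1e-12)
            ll += y * np.log(p) + (1 - y) * np.log(1 - p)
        return ll
    return max(candidates, key=log_lik)
```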
Lasso Regularization $\tau$: We select the Lasso regularization parameter $\tau$ using stability selection [Meinshausen & Bühlmann, 2010], a principled technique for tuning regularization parameters. Our goal is to obtain a DAG structure that is both sparse (to maximize testing power and interpretability) and stable (i.e., robust to data perturbations). Stability here refers to the reproducibility of selected edges under subsampling. The procedure is as follows; a code sketch is given after the list.
- We generate multiple random subsamples of the calibration dataset, each containing a fixed proportion of the data (e.g., 50%), as in standard stability selection [Meinshausen & Bühlmann, 2010].
- For each candidate value of $\tau$, we fit the non-negative Lasso independently on each subsample, and follow the steps above to create the RG.
- For each edge, we record its selection frequency, i.e., the fraction of subsamples in which it appears in the inferred DAG.
- We then compute the edge stability score for a given $\tau$ as the average selection frequency over all edges. This metric reflects the reproducibility of the graph structure.
- Among all candidate values, we select the one that maximizes stability, subject to a sparsity constraint on graph complexity (e.g., a cap on the average node degree), to prevent overfitting and preserve statistical power [Shah & Samworth, 2013].
We will include these procedures in the appendix of the final version. This ensures that RG-PT can be applied in a fully automated and reproducible manner, without compromising its theoretical guarantees.
References:
- Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
- Kaufman, L., & Rousseeuw, P. J. (2009). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
- Efron, B., & Morris, C. (1973). Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68(341), 117–130.
- Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473.
- Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55–80.
Thank you for your reply. I have thoroughly considered the authors’ response and find that it adequately addresses my earlier concerns. I recommend incorporating the clarifications into the final manuscript. Since my current evaluation is already positive, I will not revise my score.
This paper introduces a new framework for multi-objective hyperparameter selection named Reliability Graph-based Pareto Testing (RG-PT). The method addresses the common challenge of balancing performance with cost, such as selecting prompt templates in LLMs. RG-PT models the relationships between hyperparameters as a Reliability Graph (RG), which is inferred from prior information and data using the Bradley-Terry model and non-negative Lasso. This structure allows for a more efficient exploration of the hyperparameter space compared to existing methods such as Learn-Then-Test (LTT) and Pareto Testing (PT). The framework uses the DAGGER algorithm to perform multiple hypothesis testing, which formally controls the false discovery rate (FDR). Experiments on tasks including prompt engineering, object detection, and sequence-to-sequence translation demonstrate that RG-PT outperforms LTT and PT by identifying better-performing hyperparameters.
Strengths and Weaknesses
Strengths
- The core idea of leveraging prior knowledge to build a Reliability Graph that reflects the structure of the hyperparameter space is compelling. This innovative approach represents a promising research direction for more efficient hyperparameter selection.
- The paper provides comprehensive experimental validation across a diverse set of applications, including prompt engineering, language translation, and object detection. These results convincingly demonstrate the superior performance and broad applicability of RG-PT compared to the LTT and PT baselines.
Weaknesses
- Several key technical concepts lack sufficient explanation or citation. For example, Equations (6) and (7) are presented without justification. To improve clarity and rigor, the paper should provide brief derivations or cite the specific sources from which these formulas are drawn.
- Proposition 3.1 establishes that RG-PT controls the FDR by inheriting the guarantees of the DAGGER algorithm, but it does not offer theoretical insight into why RG-PT outperforms LTT and PT. The paper would be significantly strengthened by a formal discussion of the statistical power or sample efficiency gains that the graph-based structure provides.
- The paper's central motivation is providing a formal statistical guarantee (FDR control), which is used to justify excluding methods like Bayesian optimization from the experimental comparison. However, the practical utility of this specific guarantee over other frameworks (e.g., regret minimization) is not clearly demonstrated within the context of the chosen experiments.
- The main text lacks a discussion on the computational complexity and scalability of building the Reliability Graph. While an analysis is present in the appendix, the main paper should address how the method scales with the number of hyperparameters on the Pareto front to clarify its feasibility for larger-scale problems.
Questions
- The paper claims that methods like Bayesian optimization and bandit-based approaches 'lack statistical guarantees' (Line 59). This statement requires clarification. While these methods may not offer FDR control, they are supported by theoretical guarantees, such as bounds on cumulative regret. The paper should more precisely define the type of statistical guarantee it focuses on (i.e., post-selection reliability via FDR control) and provide a clearer justification for why this framework is more suitable for the target applications than regret-minimization frameworks.
- In Line 117 ('A discovery is false if the selected hyperparameter A is actually unreliable'), for improved clarity and consistency with standard hypothesis testing terminology, consider replacing 'selected' with 'rejected'.
- The derivation of the p-value in Equation (6) is not explicitly provided. Citing the original source or providing a brief derivation would strengthen the paper's self-containedness.
- The term 'data-driven probability' used to describe Equation (10) is not standard. Please clarify its meaning and provide a justification for why this ratio of p-values can be interpreted as a probability of one hyperparameter being more reliable than another.
- The formulation for the pairwise count in Equation (11) appears to be a heuristic combination of data-driven and prior information. Please clarify if this is a standard approach and provide references if available, or justify the specific weighting scheme used.
- The paper's motivating example (Fig. 1a) has a clear, intuitive structure. In cases where no obvious reliability hierarchy exists among hyperparameters, how does the RG construction perform? For instance, would a lack of structure lead to a flat or fully-connected graph, and what would be the performance implications?
- The paper uses non-negative Lasso to select parents for a node (Line 245). Is it possible for all regression coefficients $\beta_{\lambda'}$ to be positive, and if so, would this mean a node has all potential parents from the previous level? Please clarify this edge case.
- The choices for the reliability threshold ($\alpha$) and the FDR level ($\delta$) are presented without justification. Are these standard values for these tasks, or were they chosen specifically for these experiments? Providing some context or guidance on how to set these crucial parameters in practice would be beneficial.
- The paper reports the achieved FDR for RG-PT (Lines 311-313), confirming it meets the target. For a more complete comparison, please also report the achieved FDR for the LTT and PT baselines in the experiments.
Limitations
N/A
Final Justification
I increased my score from 3 to 4 since the authors have responded to my questions and concerns in details. I am convinced that the proposed method is a sufficiently solid algorithm with good experiments supported.
Paper Formatting Concerns
N/A
We thank the reviewer for the thoughtful and detailed comments. We address each concern below and outline the specific clarifications and additions we will make in the revised manuscript.
- Lack of theoretical insight into why RG-PT outperforms LTT and PT
Please see point 4 of our response to reviewer LYAN.
- Clarification on claim that BO and MAB “lack statistical guarantees”
We will revise Line 59 to clarify that BO and MAB methods lack post-selection error rate control such as FDR or FWER guarantees. Under assumptions such as the correct specification of kernel models, BO and MAB can provide bounds on cumulative or simple average regret. These do not offer the type of assumption-free guarantees on the risk of the selected parameters that can be instead provided by FWER- or FDR-controlling methods.
To validate the practical impact of the different theoretical guarantees by BO and MAB, we have now implemented BO and MAB on the seq2seq task, comparing the achieved FDR with LTT, PT, and RG-PT. For BO, we used the scikit-optimize implementation with the UCB acquisition function. For MAB, we used a standard stochastic bandit algorithm, namely Thompson Sampling.
It was observed that only RG-PT, LTT, and PT satisfy the target FDR level $\delta$, while the achieved FDR on the test dataset for BO and MAB was 0.14 and 0.21, respectively.
- Terminology: “selected” vs “rejected” in Line 117
Thank you for this suggestion. While “rejected” aligns with classical hypothesis testing, we intentionally used “selected” to match the hyperparameter selection framing. We will clarify this distinction in the revised text for consistency with standard terminology.
- Derivation of the $p$-values in Equations (6) and (7)
Equation (6) is derived using Hoeffding's inequality, applied to the empirical mean of bounded loss values on the calibration data, similar to the approach in LTT [Angelopoulos et al., 2021] and PT [Laufer et al., 2022]. Additionally, the proof that Equation (7) forms a valid $p$-value can be found in Appendix A.2 of the PT paper [Laufer et al., 2022]. We will include a short derivation and cite these works explicitly to make the paper more self-contained.
- Interpretation of Equation (10) as a “data-driven probability”
We agree that this terminology is unclear and will revise it. This construction belongs to a broader class of metrics used in paired comparison and choice models [Cattelan, 2011]. A function $F$ is called a link function if it satisfies the symmetry condition
$$F(x) + F(-x) = 1$$
and is increasing in $x$. In pairwise comparison models, such a link function is used to compare two items with respective scores $s_i$ and $s_j$ (with higher values indicating better performance) by evaluating $F(s_i - s_j)$. This quantity can then be interpreted as the relative likelihood that item $i$ is preferable to item $j$. Examples of commonly used link functions include [Cattelan, 2011]:
- Thurstone model [Thurstone, 1927]: $F(x) = \Phi(x)$, where $\Phi$ is the normal CDF.
- BT model [Bradley & Terry, 1952]: $F(x) = 1/(1 + e^{-x})$.
- Transformed BT model (our choice) [Bradley & Terry, 1952]: the BT link applied to log-scores, $x = \log s_i - \log s_j$, which yields the linear ratio $P(i \succ j) = s_i/(s_i + s_j)$.
We chose the linear ratio (transformed BT model) form for its simplicity, numerical stability, and effective integration with the Lasso-based DAG construction. In the revised manuscript, we will clearly state this design choice and emphasize that alternative transformations from this family could be substituted without affecting the theoretical guarantees of RG-PT.
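A two-line numerical check of the equivalence between the logistic link on log-scores and the linear ratio form (our illustration):

```python
import numpy as np

s_i, s_j = 2.0, 0.5
logistic_on_logs = 1.0 / (1.0 + np.exp(-(np.log(s_i) - np.log(s_j))))
linear_ratio = s_i / (s_i + s_j)
assert np.isclose(logistic_on_logs, linear_ratio)  # both equal 0.8
```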
- Justification of Equation (11)'s weighting scheme
Equation (11) defines a convex combination of prior and data-driven comparisons to estimate the pairwise reliability strength between two hyperparameters. The mixing ratio encodes our confidence in the prior information relative to the observed data, where the prior weight $n_p$ serves as a pseudocount analogue.
This formulation is not ad hoc, but is directly inspired by the theory of conjugate priors in exponential family models [Diaconis and Ylvisaker, 1979]. In such models, the natural parameters of the posterior distribution are formed by additive combinations of prior and empirical sufficient statistics. For example, in the Beta-Binomial model, the posterior mean of a Bernoulli parameter $\theta$ is
$$\mathbb{E}[\theta \mid \text{data}] = \frac{a + s}{a + b + s + f},$$
where $a, b$ are prior pseudocounts and $s, f$ are the observed numbers of successes and failures. This additive structure generalizes to Dirichlet-Multinomial models (of which the BT model is a special case), where prior and empirical pairwise counts are combined to produce regularized estimates of comparison probabilities.
In our setting, the constructed count can be viewed as a smoothed estimate of the relative reliability of over , constructed analogously to the posterior update in exponential family models with conjugate priors, where the prior and observed sufficient statistics combine additively to determine the posterior natural parameter. The weighting in Equation (11) thus mimics a Bayesian update, where the prior and data each contribute proportionally to the final score. This principle is well-established in the Bayesian literature [Bernardo and Smith, 1994; Diaconis and Ylvisaker, 1979; Gelman et al., 1995], and we will make this connection explicit in the revised manuscript.
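To illustrate the pseudocount analogy, a small sketch (illustrative form only; the exact weighting in Eq. (11) may differ):

```python
def beta_binomial_posterior_mean(a, b, s, f):
    # Beta(a, b) prior combined with s observed successes and f failures.
    return (a + s) / (a + b + s + f)

def smoothed_pairwise_count(prior_prob, data_count, n_obs, n_p):
    # Convex combination of prior and data, with n_p acting as a pseudocount:
    # the prior contributes as if it were n_p extra comparisons.
    return (n_p * prior_prob + data_count) / (n_p + n_obs)
```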
- Behavior when there is no structure among hyperparameters
In scenarios where no prior structural knowledge is available, we explicitly set $n_p = 0$, resulting in an RG that is learned entirely from held-out data via the BT model and non-negative Lasso. This setting is described in Lines 211 and 221 of the manuscript. Even in this purely data-driven regime, RG-PT retains the formal FDR control guarantees of DAGGER.
Clearly, if the hypothesis space is completely unstructured, one cannot hope to learn useful information through this approach. That said, our experiments confirm that RG-PT still exhibits larger power than LTT and PT even when orderings among hypotheses are not hard-coded into the testing procedure. In particular, in the object detection task, we did not assume access to any prior reliability information and set $n_p = 0$. Despite this, RG-PT still outperformed both LTT and PT, demonstrating its ability to discover and exploit latent structure purely from empirical data. We will clarify this point more explicitly in the revised version.
- Edge case: all non-negative Lasso coefficients are positive
Yes, it is possible for a node to have all previous-level nodes as parents if all Lasso coefficients are positive. In such cases, the node will be tested only if all these parents are declared reliable. For example, the bottom node in Fig. 1 has all the nodes from the previous level as parents, illustrating this case.
- Justification for the reliability threshold $\alpha$ and the FDR level $\delta$
The threshold $\alpha$ defines what it means for a hyperparameter to be considered “reliable,” i.e., having risk below $\alpha$. Its choice depends on the downstream application and the desired level of caution. In our experiments, we selected $\alpha$ via preliminary validation sweeps to identify values that effectively separate low-risk configurations while maintaining adequate statistical power. Comparable values of $\alpha$ are commonly used in the applications considered in the LTT paper [Angelopoulos et al., 2021].
The parameter $\delta$ sets the desired control level for the FDR, specifying the maximum acceptable expected proportion of selected hyperparameters whose risk exceeds the target level $\alpha$. Following common practice in the MHT literature, including the LTT and PT papers, we set $\delta = 0.1$, corresponding to a confidence level of 90% for the reliability of the selected hyperparameters. We will clarify this rationale and include appropriate citations (e.g., [Benjamini and Hochberg, 1995]) in the revised manuscript.
- Report achieved FDR for LTT and PT
The empirical FDR achieved by RG-PT for all the tasks is already reported in Table 2 of the Appendix. In the revised manuscript, we will extend this table to include the achieved FDR values for LTT and PT as well. The results for LTT and PT were originally excluded because the FDR control of these two methods has already been validated experimentally in their respective papers. Importantly, the experiments all demonstrate that LTT, PT, and RG-PT satisfy the FDR constraint, as backed up by theory.
We thank the reviewer again for the constructive feedback. We believe the above clarifications and additions will significantly improve the clarity and rigor of the paper.
References:
- Cattelan, M. (2011). Models for paired comparison data: A review with emphasis on dependent data. Statistical Science, 26(3), 412–433.
- Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286.
- Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
- Diaconis, P., & Ylvisaker, D. (1979). Conjugate priors for exponential families. The Annals of Statistics, 7(2), 269–281.
- Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian Theory. Wiley.
- Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian Data Analysis. Chapman and Hall/CRC.
- Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300.
Thank you for the detailed responses to my questions. I encourage the authors to incorporate the explanations for Equations (6-7) and (10-11) into the revised manuscript. I have adjusted my score to reflect these helpful clarifications.
After reviewing the author's response and the feedback from other reviewers, I have no remaining significant concerns.
We thank all reviewers for their thoughtful feedback and for engaging in constructive discussions during the rebuttal phase. We are pleased to see that our responses have satisfactorily addressed the raised concerns across all reviews. In particular, we have added new theoretical analyses, clarified key assumptions, proposed principled procedures for parameter selection, expanded experimental baselines, and justified methodological choices with appropriate references and ablations.
We will incorporate all clarifications and improvements outlined in our responses into the final version of the paper. We appreciate the reviewers’ recognition of the novelty and practical value of our framework, and we hope the Area Chairs will find the strengthened version a compelling and technically solid contribution to NeurIPS 2025.
This work introduces a new framework for multi-objective hyperparameter selection named Reliability Graph-based Pareto Testing (RG-PT). The selection method balances performance with cost. Relationships between hyperparameters are modeled as a Reliability Graph (RG), which is inferred from prior information and data using the Bradley-Terry model and non-negative Lasso. The method is compared to competing baselines.
All reviewers voted for acceptance post rebuttal, for the following main reasons: 1/ The proposed method is sound and novel, and the approach is evaluated on a diverse set of applications, including prompt engineering, language translation, and object detection. 2/ The core idea of leveraging prior knowledge to build a Reliability Graph that reflects the structure of the hyperparameter space is compelling. 3/ The method exhibits strong statistical guarantees: by embedding the hypothesis testing procedure in the graph traversal, the approach proves global FDR control under arbitrary p-value dependence, offering firm, theory-backed reliability. 4/ The paper is generally well-written and easy to follow.