Adaptive Labeling for Efficient Out-of-distribution Model Evaluation
Supervised data often suffers from severe selection bias when labels are expensive. We formulate an MDP over posterior beliefs on model performance and solve it with pathwise policy gradients computed through an auto-differentiable pipeline.
Abstract
Reviews and Discussion
This paper studies model evaluation under distribution shifts. It formulates a planning problem to select inputs for labeling from a large pool of unlabeled data. The paper then proposes an auto-differentiable planning framework to compute reliable policy gradients and update the sampling policy accordingly. To demonstrate the utility of this framework, the paper considers one-step lookahead policy based on their framework and shows that it outperforms active learning-based heuristics and REINFORCE policy gradients.
Strengths
- Clear presentation: The paper is well-structured, with figures (e.g., Figure 2 and Figure 3) illustrating the adaptive sampling framework, which helps in understanding the proposed method.
- Detailed explanations: The authors provide detailed explanations of the UQ module in Section 3.1, as well as Soft K-subset sampling and Smoothing UQ module in Section 3.3. Even for readers unfamiliar with these terms, the explanations are clear enough to convey their meanings.
Weaknesses
- Computational inefficiency: Their method requires significant computational resources, especially when the posteriors aren’t in closed form. Due to these computational constraints, Section 4.2 only considers a simple toy setting and does not compare the results with active learning and REINFORCE policy gradients. Additionally, the paper only implements one-step lookahead based on their framework, leaving the question of how to efficiently conduct multi-step planning methods unresolved.
- Fixed and equal labeling cost assumption: The paper assumes that the modeler pays a fixed and equal cost for each label. It would be more interesting to consider variable labeling costs, which are more common in practice.
- Lack of theoretical guarantees: The paper does not provide any theoretical guarantees for their method.
Questions
- Why not evaluate how close the learned posterior is to the ground truth? Is the posterior always guaranteed to be a distribution concentrated around the ground truth?
- How should the parameters in Algorithm 3 be chosen? How should the sampling size K be selected, and how do different choices affect the results?
Limitations
See Weaknesses
We thank the reviewer for the thorough review and valuable feedback on our manuscript. Please refer to the global response for our detailed answers to the concerns raised and the changes we intend to make in the camera-ready version based on all the reviews. We also address the concerns raised point by point here.
Computational inefficiency: Their method requires significant computational resources, especially when the posteriors aren’t in closed form. Due to these computational constraints, Section 4.2 only considers a simple toy setting and does not compare the results with active learning and REINFORCE policy gradients. Additionally, the paper only implements one-step lookahead based on their framework, leaving the question of how to efficiently conduct multi-step planning methods unresolved.
Our work introduces a novel adaptive labeling formulation for evaluating model performance when existing labels suffer severe selection bias. As we highlight in Section 2, the adaptive labeling MDP is hard to solve because of continuous states and combinatorial actions. We make initial methodological progress by designing a one-step lookahead planning algorithm and demonstrate that even this approach can outperform existing active learning heuristics. However, this improvement comes at the cost of additional computational resources. Our formulation is thus appropriate in high-stakes settings such as clinical trials, where labeling costs far exceed computational costs, and we hope our new formulation spurs subsequent methodological innovations for multi-step lookaheads.
We conduct an additional experiment for planning with Ensemble+ as the UQ module to compare the performance of our algorithm with other active learning based heuristics. This comparison is conducted in a more complex setting than previously discussed in the paper. The results can be found in the attached PDF (included with the global response). As shown, Autodiff 1-lookahead and REINFORCE both outperform the active learning heuristic, while the performance of Autodiff 1-lookahead and REINFORCE is similar. Based on our extensive GP experiments in Section 4.1, we anticipate that our algorithm will perform better than REINFORCE in more complex scenarios with Ensemble+ as well. This presents an interesting research question for future exploration.
Fixed and equal labeling cost assumption: The paper assumes that the modeler pays a fixed and equal cost for each label. It would be more interesting to consider variable labeling costs, which are more common in practice.
Our framework can seamlessly accommodate variable labeling costs. Specifically, we define a cost function applied to the selected subsets and add a corresponding penalty term to the objective, where a penalty factor controls the trade-off between minimizing variance and the cost of acquiring samples.
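As a concrete sketch, written in generic notation (the symbols below are illustrative and not necessarily those used in the paper), the modified one-step objective could take the form

$$\min_{\pi_{\theta}} \; \mathbb{E}_{\mathcal{S} \sim \pi_{\theta}} \Big[ G\big(\mu_{t+1}(\mathcal{S})\big) + \lambda \, c(\mathcal{S}) \Big],$$

where $G(\cdot)$ is the posterior variance of the model's loss, $c(\mathcal{S})$ sums the per-example labeling costs of the selected subset $\mathcal{S}$, and $\lambda$ is the penalty factor controlling the variance/cost trade-off.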
Lack of theoretical guarantees: The paper does not provide any theoretical guarantees for their method.
We provide a theoretical explanation of why pathwise gradients outperform REINFORCE. In a tractable example we detail in the global response, we prove pathwise gradients can achieve a lower MSE than REINFORCE.
Why not evaluate how close the learned posterior is to the ground truth?
Note that the modeler never has access to the ground truth function. Similar to supervised learning losses that use observed labels as a proxy for the ground truth, our objective considers the modeler's belief over the ground truth based on the labeled data observed so far.
Even in our experiments (Section 4), the ground truth function is not known in closed form—it is only known through an evaluation dataset generated from it, and it is difficult to compare the posterior belief (a distribution over functions) directly to a single ground truth function. In addition to comparing the variance of the L2 loss, we also show the difference between the L2 loss estimated under the posterior belief and the true L2 loss estimated under the ground truth (through the available evaluation set) in Figures 5 and 7.
Is the posterior always guaranteed to be a distribution concentrated around the ground truth?
If our model is well-specified, that is, the prior contains the ground truth, then the posterior is asymptotically guaranteed to concentrate around the ground truth as we collect more samples. This follows from the well-known results of Doob [1]. Empirically, this means that if the ML uncertainty quantification methodology has enough capacity, the posterior belief will converge to the ground truth asymptotically. We will highlight this point in our revision.
How should the parameters in Algorithm 3 be chosen? How should the sampling size K be selected, and how do different choices affect the results?
Here, the sampling size K is simply the number of samples in the pool set for which we want to query labels. This size is assumed to be given as an input from the analyst, based on the acquisition cost and labeling budget.
Reference: [1] Doob, J. L. (1949). Application of the theory of martingales. In Actes du Colloque International Le Calcul des Probabilités et ses applications (Lyon, 28 Juin – 3 Juillet, 1948), pages 23–27
Thank you for your response to my comments. I have read your rebuttal and will keep my score.
This work formulates the process of data labeling as a Markov Decision Process (MDP). It introduces three key components: K-subset sampling, an Uncertainty Quantification (UQ) module, and a differentiable simulator. These components enable the application of the REINFORCE policy gradient algorithm to solve the MDP. The proposed method demonstrates significant performance improvements on both synthetic and real-world data.
Strengths
- The problem of data labeling is crucial in the field of machine learning. This work formulates the data labeling challenge as a Markov Decision Process (MDP) and addresses it using a Reinforcement Learning (RL) algorithm, providing valuable insights.
- This study adapts three key components to make the REINFORCE algorithm suitable for this problem.
- The proposed method achieves significant performance improvements on both synthetic and real-world data.
Weaknesses
The organization of Section 4 could be improved. I suggest introducing the overall description of the proposed method and the pseudo-code for Algorithm 1 after the introduction of the three key components.
Questions
I have no further questions. Please refer to the weaknesses mentioned above.
Limitations
The authors have discussed the limitations.
We thank the reviewer for the thorough review and valuable feedback on our manuscript. Below, we address the specific questions raised by the reviewer. Additionally, please refer to the global response for the changes we intend to make in the camera-ready version based on all the reviews.
The organization of Section 4 could be improved. I suggest introducing the overall description of the proposed method and the pseudo-code for Algorithm 1 after the introduction of the three key components.
We thank the reviewer for their advice on improving the presentation of the paper. As mentioned in our global response, we will incorporate all their suggestions in the final version of the paper.
Thank you for addressing my questions. I appreciate the effort put into clarifying my concerns and keep my score.
This paper focuses on sampling efforts on out-of-distribution (OOD) examples to minimize uncertainty in the prediction model's performance over T periods. The main idea is based on a novel view of MDP, which considers beliefs as states and samples as actions. The authors also demonstrate how to make the entire pipeline differentiable for optimization. Experiments show that the proposed framework achieves better sample efficiency in reducing variance.
Strengths
- The approach of considering beliefs as states and samples as actions to minimize uncertainty in the prediction model's performance over T periods is both interesting and novel. Efficiently addressing OOD samples is also a very important problem.
- Experimental results provide significant evidence that Autodiff can efficiently reduce variance.
Weaknesses
- I do not fully understand some parts. Can the authors provide a more detailed formulation of: (1) how G(u) is calculated? (2) Algorithm 2 returns \pi_{\theta}. Does the pseudo-label-based part also update the policy? How is this update connected to the update with ground truth labels acquired in line 6 of Algorithm 1?
- In lines 102-108 of the related work, the authors mention several active learning methods in batched settings, highlighting that they focus on classification settings while this paper mainly focuses on regression settings. Given that these methods are not used as baselines, I wonder how different the active learning settings in classification and regression are. Can methods in classification settings be easily adapted to regression settings?
- What is the size of the training dataset?
Questions
- In Line 270, what do you mean by "We just consider a naive predictor, i.e., ψ(·) ≡ 0 for simplicity."?
Limitations
The paper includes a section on limitations.
We thank the reviewer for the thorough review and valuable feedback on our manuscript. Below, we address the specific questions raised by the reviewer. Additionally, please refer to the global response for the changes we intend to make in the camera-ready version based on all the reviews.
I do not fully understand some parts. Can the authors provide a more detailed formulation of: (1) how G(u) is calculated? (2) Algorithm 2 returns \pi_{\theta}. Does the pseudo-label-based part also update the policy? How is this update connected to the update with ground truth labels acquired in line 6 of Algorithm 1?
(1) As shown in line 164, given any posterior belief, G is defined as the variance, under that belief, of the model's loss. Depending on the form of the loss and the belief, G can either be derived analytically or estimated by drawing sample functions from the belief and then calculating the empirical variance of the loss across these samples.
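As a minimal illustration of the sampling-based estimate (with hypothetical interface names such as `posterior.sample_function`; this is a sketch, not the paper's implementation):

```python
import numpy as np

def estimate_G(posterior, model, X_eval, n_draws=200):
    """Monte Carlo estimate of G: the variance, across draws f ~ posterior,
    of the model's mean-squared loss on a fixed evaluation pool X_eval.
    `posterior.sample_function()` is a hypothetical interface that returns
    one plausible ground-truth function."""
    losses = []
    for _ in range(n_draws):
        f = posterior.sample_function()                       # one draw from the belief
        losses.append(np.mean((model(X_eval) - f(X_eval)) ** 2))
    return np.var(losses)                                     # empirical variance across draws
```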
(2) We refer the reviewer to the diagram explaining our algorithm in the attached PDF (alongside the global response). Broadly, at each time step, the input to Algorithm 2 is the current posterior state, which is used to generate pseudo labels. These pseudo labels are then used to optimize the policy in Algorithm 2. Any update of the UQ module/posterior state within Algorithm 2 using these pseudo labels is temporary, and the posterior state is reverted to its value at the start of Algorithm 2 when Algorithm 2 ends. Next, we select a batch according to the policy optimized by Algorithm 2 and obtain the true labels for that batch. Finally, we update the UQ module/posterior belief using these true labels, resulting in the new posterior state for the next time step.
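Schematically, the interplay between the two algorithms could be sketched as follows (all interface names here are hypothetical; the real Algorithms 1 and 2 contain additional details such as the differentiable rollout):

```python
def adaptive_labeling(posterior, pool, policy, T, K):
    """Schematic one-step-lookahead adaptive labeling loop (a sketch, not the
    paper's exact pseudo-code)."""
    for t in range(T):
        # Planning (Algorithm 2): tune the policy on pseudo labels drawn from a
        # *copy* of the current posterior; the true posterior state is untouched.
        policy = optimize_policy_with_pseudo_labels(posterior.copy(), pool, policy, K)

        # Acting (Algorithm 1): select a real batch and pay for true labels.
        batch = policy.select_batch(pool, K)
        labels = query_true_labels(batch)

        # Permanent posterior update with the newly acquired real data.
        posterior = posterior.update(batch, labels)
        pool = [x for x in pool if x not in batch]
    return posterior
```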
In lines 102-108 of the related work, the authors mention several active learning methods in batched settings, highlighting that they focus on classification settings while this paper mainly focuses on regression settings. Given that these methods are not used as baselines, I wonder how different the active learning settings in classification and regression are. Can methods in classification settings be easily adapted to regression settings?
As mentioned in the global response, we refer the reviewer to our active learning-based baselines in Section 4.1. In this section, we adapt common active learning heuristics from classification settings, such as uncertainty sampling (static), uncertainty sampling (sequential), and random sampling, to the regression problems we focus on.
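For instance, uncertainty sampling carries over to regression by ranking pool points by predictive disagreement rather than classification entropy; a minimal sketch of this adaptation (hypothetical ensemble interface, not the paper's code):

```python
import numpy as np

def uncertainty_sampling_regression(ensemble_predict, X_pool, k):
    """Select the k pool points on which an ensemble of regressors disagrees most.
    `ensemble_predict(X)` is assumed to return an array of shape
    (n_members, n_points) with one prediction per ensemble member."""
    preds = ensemble_predict(X_pool)          # (n_members, n_points)
    disagreement = preds.var(axis=0)          # predictive variance per pool point
    return np.argsort(-disagreement)[:k]      # indices of the k most uncertain points
```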
What is the size of the training dataset?
We direct the reviewer to Section D for the experimental details, particularly noting the training dataset sizes mentioned on lines 485-486 and 511-512.
In Line 270, what do you mean by "We just consider a naive predictor, i.e., ψ(·) ≡ 0 for simplicity."?
Recall that our objective is to evaluate the performance of a given model ψ. For simplicity, we assume a naive predictor/model that outputs 0 for all inputs. We want to emphasize that the problem is just as complex as with a sophisticated model ψ, since the model merely enters our algorithm as an input through the evaluation objective.
Thank you for the detailed response. I will maintain my score.
This paper addresses the issue of selection bias in previously collected data when ground truth labels are expensive. It introduces a computational framework for adaptive labeling, aiming to provide cost-efficient model evaluations under severe distribution shifts. The problem is formulated as a Markov Decision Process (MDP), where states are defined by posterior beliefs about model performance. New batches of labels lead to "state transitions," resulting in sharper beliefs, with the goal of minimizing overall uncertainty on model performance. The proposed method optimizes the adaptive labeling policy using pathwise policy gradients via auto-differentiation through simulated rollouts, avoiding high-variance REINFORCE policy gradient estimators. The framework is adaptable to various uncertainty quantification methods and emphasizes the importance of planning in adaptive labeling. Empirical results on both synthetic and real datasets show that even a one-step lookahead policy significantly outperforms active learning-inspired heuristics.
Strengths
To my knowledge, formulating model evaluation as RL is a novel and potentially useful idea.
Weaknesses
The presentation is extremely confusing. See my questions.
Questions
- Line 102: What do you mean by "batched settings"?
- Line 107: What are lookahead policies?
- Line 134: What is the purpose of rewriting the likelihood function in that form? What does this likelihood function mean?
- Line 160: What is "F_t-measurable"?
- Eq 2: Is uncertainty equal to Var? In the intro, you defined epistemic and aleatoric uncertainty, but what's the uncertainty you're trying to minimize in Eq2?
- Line 168: What's the definition of A?
- Line 179: Missing subject.
- Line 180: What's the definition of g(W_i, \theta)?
- Figure 4: Missing y-axis label.
- Line 285: Providing intuition for empirical results before presenting the experimental results seems weird to me. I suggest stating the intuition after presenting the results.
- Figure 4 and 5: What is the difference in terms of purpose between them? What should readers take away from the comparison of the variance of the mean squared loss and the error of MSE on collected data? Line 314 refers to them but doesn't explain the details.
- Figure 6 and 7: Same issues with Figure 4 and 5.
Minor comments:
- The related work discussion is long and dense. Consider making it a separate section.
Limitations
Yes, the limitation is discussed.
We thank the reviewer for the thorough review and valuable feedback on our manuscript. Below, we address the specific questions raised by the reviewer. Additionally, please refer to the global response for the changes we intend to make in the camera-ready version based on all the reviews.
Line 102: What do you mean by "batched settings"?
In a batched setting, we select a group of inputs to be labeled in each step, rather than just a single input. This approach is particularly important because, in practical scenarios, labels are typically requested in batches rather than individually. Additionally, it makes the problem more challenging by inducing a combinatorial action space. We discuss this in Section 1 (Lines 47-48) and will elaborate further on the problem setup in the camera-ready version.
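To give a sense of the scale (with illustrative, hypothetical pool and batch sizes), selecting a batch of $K = 5$ points from a pool of $N = 1000$ unlabeled inputs already gives

$$\binom{1000}{5} \approx 8.25 \times 10^{12}$$

candidate actions at a single step, which is why we cannot enumerate the action space and instead rely on a differentiable relaxation.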
Line 107: What are lookahead policies?
Lookahead policies are a class of policies for approximately solving MDPs (Markov Decision Processes). These policies make decisions by explicitly optimizing over a shorter horizon of the MDP rather than the entire horizon. Depending on the length of this shorter horizon, they are referred to as 1-step or 2-step lookahead policies, and so on. We will include the following relevant references in the main paper.
[1] D. Bertsekas and J. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996
[2] Y. Efroni, M. Ghavamzadeh, and S. Mannor. Online planning with lookahead policies. Advances in Neural Information Processing Systems, 2020
[3] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, 2018.
Line 134: What is the purpose of rewriting the likelihood function in that form? What does this likelihood function mean?
We want to highlight that the only source of randomness in the label, given the input and the ground truth function, is the noise. Furthermore, this randomness depends solely on the ground truth function and the distribution of the noise.
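In generic symbols (not necessarily the exact notation of Line 134): if the label is generated as $Y = f(X) + \varepsilon$ with noise $\varepsilon$ independent of $X$, then the likelihood can be written as

$$p(y \mid x, f) = p_{\varepsilon}\big(y - f(x)\big),$$

so that, once $x$ and $f$ are fixed, all remaining randomness in the label comes from the noise distribution $p_{\varepsilon}$.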
Line 160: What is "F_t-measurable"?
F_t-measurable states that the posterior state is known at time t. This is a formal way of saying that the posterior state depends only on the prior and the data acquired up to time t.
Eq 2: Is uncertainty equal to Var? In the intro, you defined epistemic and aleatoric uncertainty, but what's the uncertainty you're trying to minimize in Eq2?
We aim to minimize the epistemic uncertainty, specifically the variance of the model's loss when the ground truth function is drawn from the current posterior belief. This variance represents the uncertainty due to a lack of knowledge about the underlying ground truth function, which is the epistemic uncertainty.
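In generic notation (the paper's symbols may differ), the quantity minimized in Eq. 2 is of the form

$$G(\mu_t) \;=\; \operatorname{Var}_{f \sim \mu_t}\big[\ell(\psi; f)\big],$$

where $\mu_t$ is the current posterior belief and $\ell(\psi; f)$ is the model $\psi$'s population loss (e.g., the L2 loss) when the ground truth is $f$; aleatoric noise affects each $\ell(\psi; f)$ but is not what this variance measures.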
Line 168: What's the definition of A?
A is a random variable distributed according to the sampling policy \pi_{\theta}. It is introduced to highlight the difference between the REINFORCE estimator and the pathwise gradient estimator.
Line 179: Missing subject.
Thanks for pointing this out, we will address it in the main paper.
Line 180: What is the definition of g(W_i, \theta)?
A random variable distributed according to \pi_{\theta} can always be expressed as g(W, \theta) for some function g, where W is a random variable whose distribution does not depend on \theta. We use a smoother approximation of g to enable differentiability with respect to \theta. For example, in Algorithm 2, we discuss how standard Gumbel random variables can be used to generate a softer version of K-subset samples.
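For illustration, one common relaxation in this spirit (a sketch of Gumbel-perturbed soft top-K; not necessarily the exact construction used in Algorithm 2) looks like:

```python
import torch

def soft_k_subset(logits, k, temperature=0.5):
    """Differentiable relaxation of drawing a K-subset from a pool.

    Hard sampling would perturb the logits with Gumbel noise and take the
    top-K entries; here the hard top-K is replaced by K successive tempered
    softmaxes, so the (approximately) K-hot output is differentiable in `logits`."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    scores = logits + gumbel                                  # reparameterized randomness W
    khot = torch.zeros_like(scores)
    for _ in range(k):
        probs = torch.softmax(scores / temperature, dim=-1)
        khot = khot + probs
        scores = scores + torch.log(1.0 - probs + 1e-20)      # softly exclude already-picked items
    return khot                                               # entries in [0, 1], summing to ~k
```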
Figure 4: Missing y-axis label.
Thank you for bringing this to our attention. We will add the y-axis labels in the paper. For your reference, we have mentioned the labels in the figure description.
Line 285: Providing intuition of empirical results before presenting the experimental results seems weird to me. I suggest stating the intuition after presenting the results.
Thank you for the suggestion. We will move the intuition after presenting results in Section 4.
Figure 4 and 5: What is the difference in terms of purpose between them? What should readers take away from the comparison on variance of mean squared loss and error of MSE on collected data? Line 314 refers to them but doesn't explain the details.
Figure 4 demonstrates that the variance of the model’s L2 loss is reduced, which is the objective of the algorithm. In Figure 5, besides comparing the variance of the L2 loss, we also compare the accuracy of the updated L2 loss estimation after acquiring new points. This illustrates that the objective is appropriate: by minimizing the variance, we also achieve a better evaluation of the model.
Figure 6 and 7: Same issues with Figure 4 and 5.
This has been addressed above.
There are many related works. Consider making it a separate section, as the current content is too dense.
Thanks for pointing this out. We will make it a separate section.
Thanks for the clarification. I'm increasing my rating to "Borderline accept." I'm not comfortable with increasing the rating further since the significance of the empirical results is not clear to me (even after reading the new results).
We thank all the reviewers for the valuable feedback. We appreciate the constructive comments and will incorporate all discussions below in the camera-ready version of the paper.
Summary of contributions: Our work introduces a novel problem formulation for evaluating model performance when existing data suffer severe selection bias, where state-of-the-art methods such as importance sampling are insufficient. Specifically, we formulate the adaptive labeling problem as a Markov Decision Process (MDP) over states defined by posterior beliefs on model performance. The rest of the paper highlights the potential of our novel formulation using simple methods derived from it. As we highlight in Section 2, the adaptive labeling MDP is hard to solve because of continuous states and combinatorial actions. We design a tractable single-step lookahead planning algorithm and demonstrate that even a small lookahead outperforms existing active learning heuristics.
Our results underscore the benefits of formulating the adaptive labeling problem as an MDP and designing specific RL algorithms to solve it. As Reviewer PJGF correctly points out, this comes at the cost of additional computational resources. Our formulation is thus appropriate in high-stakes settings such as clinical trials, where labeling costs far exceed computational costs, and we hope our new formulation spurs subsequent methodological innovations. We thank Reviewers GtV8, CyPE, and 4yoX for recognizing the unusual yet exciting dimension of having the formulation as our main contribution.
To implement the one-step lookahead, we move away from standard REINFORCE policy gradients, which suffer from high variance, and propose a novel pathwise gradient based on auto-differentiation. As we summarize in the attached PDF, our approach smooths the non-differentiable operations to enable backpropagation. Despite the bias incurred by smoothing, we observe that pathwise policy gradients outperform REINFORCE, buoyed by their low variance (Section 4). While a substantive methodological contribution, the computational efficiency of our algorithm is bottlenecked by that of the auto-differentiation framework, and we highlight promising future directions of research in Section 4.
Comparison with active learning: Reviewer CyPE asks for comparison with active learning heuristics, and we would like to point to Section 4.1 where we compared our algorithm to common active learning heuristics adapted from classification to regression settings, such as uncertainty sampling (static), uncertainty sampling (sequential), and random sampling.
Variable labeling cost: Our framework can seamlessly accommodate variable labeling costs. Specifically, we define a cost function applied to the selected subsets and add a corresponding penalty term to the objective, where a penalty factor controls the trade-off between minimizing variance and the cost of acquiring samples.
Experiments: We have conducted an additional experiment as suggested by Reviewer PJGF. Specifically, for planning with Ensemble+ as the UQ module, we compare the performance of our algorithm with other active learning based heuristics. This comparison is conducted in a more complex setting than previously discussed in the paper. Please refer to the attached PDF for the results. Based on the extensive GP experiments in Section 4, we expect that our results will extend to even more complex scenarios for Ensemble+.
Theoretical insights: Incorporating Reviewer PJGF's suggestion, we provide a theoretical explanation of why pathwise gradients can outperform REINFORCE, despite being biased due to smoothing. Recall that in our adaptive labeling framework, we deal with a combinatorial action space in which actions are drawn according to a policy \pi_{\theta}. The objective is to minimize the expected downstream uncertainty over actions drawn from this policy, which requires estimating the gradient of this objective with respect to \theta. We consider a tractable yet insightful setting in which both gradient estimators can be analyzed explicitly.
The REINFORCE estimator is the standard score-function Monte Carlo approximation of this gradient. The objective can also be rewritten in a reparameterized form, with the action expressed as a deterministic function of \theta and an independent noise variable. To enable the pathwise policy gradient, we replace the non-differentiable part of this reparameterization with a sigmoid approximation governed by a temperature parameter. The pathwise gradient estimator is then the Monte Carlo approximation of the gradient of this smoothed objective.
Theorem (informal): For suitable problem parameters, there exists a temperature, depending on those parameters, such that the mean-squared error (MSE) of the pathwise gradient estimator is smaller than that of the REINFORCE estimator. Moreover, as the temperature decreases and the number of samples grows, the MSE of the pathwise estimator converges to zero.
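For intuition only, here is a toy single-action analogue of this comparison (a Bernoulli action with a sigmoid-smoothed relaxation; all names, constants, and the downstream objective below are made up for illustration and are not the setting analyzed in the theorem):

```python
import torch

def downstream(a):                                  # toy downstream quantity (hypothetical)
    return (a - 0.3) ** 2

def reinforce_grad(theta, n=10_000):
    """Score-function (REINFORCE) estimate of d/dtheta E_{a ~ Bern(sigmoid(theta))}[downstream(a)]."""
    p = torch.sigmoid(theta)
    a = (torch.rand(n) < p).float()                 # hard Bernoulli samples
    # For Bernoulli(sigmoid(theta)): d/dtheta log p(a) = a - sigmoid(theta).
    return (downstream(a) * (a - p)).mean()

def pathwise_grad(theta, n=10_000, tau=0.1):
    """Pathwise estimate through a sigmoid-smoothed (logistic-noise) relaxation: biased but low variance."""
    theta = theta.clone().requires_grad_(True)
    u = torch.rand(n).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log(1.0 - u)       # Logistic(0, 1) reparameterization noise
    a_soft = torch.sigmoid((theta + noise) / tau)   # relaxed binary "action"
    grad, = torch.autograd.grad(downstream(a_soft).mean(), theta)
    return grad

theta = torch.tensor(0.0)
print(reinforce_grad(theta), pathwise_grad(theta))
```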
Presentation: We have taken to heart reviewer comments on improving our presentation. We outline a roadmap for enhancing the presentation of the paper below:
(1) Section 1: As suggested by reviewer GtV8, we will create a separate section for related works.
(2) Section 3: We will first explain all the major components of our algorithm before introducing Algorithms 1 and 2. Additionally, we plan to include the figure (from the attached PDF) in the main paper.
(3) Section 4: We will reorganize this section and will provide intuition after we present our experimental results, as suggested by Reviewer GtV8. In addition, we will make active learning based heuristics a separate subsection to convey the results better.
The paper studies the problem of model evaluation under distribution shift. It formulates adaptive labeling as a planning problem to select inputs for labeling from a large pool of unlabeled samples. To show the efficacy of this approach, it considers a one-step lookahead policy and shows that it outperforms several heuristics based on active learning and the REINFORCE policy gradient method. Although the reviewers found the problem studied in the paper interesting and the approach novel, there are serious concerns about the clarity of presentation, the computational inefficiency of the proposed approach, and the lack of theoretical guarantees. Despite all of the above, they still found the paper (slightly) above the bar. I recommend that the authors revise the paper, significantly improve its presentation, and properly address the comments/questions raised by the reviewers (including some of the rebuttal responses in the paper) for the next revision.