PaperHub
Overall: 7.3/10
Spotlight · 4 reviewers
Ratings: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Confidence: 3.3
Novelty: 3.3 / Quality: 3.3 / Clarity: 3.0 / Significance: 3.3
NeurIPS 2025

Sample-Adaptivity Tradeoff in On-Demand Sampling

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We study the tradeoff between sample complexity and round complexity in on-demand sampling.

Abstract

We study the tradeoff between sample complexity and round complexity in *on-demand sampling*, where the learning algorithm adaptively samples from $k$ distributions over a limited number of rounds. In the realizable setting of Multi-Distribution Learning (MDL), we show that the optimal sample complexity of an $r$-round algorithm scales approximately as $dk^{\Theta(1/r)} / \epsilon$. For the general agnostic case, we present an algorithm that achieves near-optimal sample complexity of $\widetilde O((d + k) / \epsilon^2)$ within $\widetilde O(\sqrt{k})$ rounds. Of independent interest, we introduce a new framework, Optimization via On-Demand Sampling (OODS), which abstracts the sample-adaptivity tradeoff and captures most existing MDL algorithms. We establish nearly tight bounds on the round complexity in the OODS setting. The upper bounds directly yield the $\widetilde O(\sqrt{k})$-round algorithm for agnostic MDL, while the lower bounds imply that achieving sub-polynomial round complexity would require fundamentally new techniques that bypass the inherent hardness of OODS.
Keywords

On-demand sampling, multi-distribution learning, adaptivity

Reviews and Discussion

Review
5

The paper studies the problem of adaptive data collection for multi-distribution binary classification. Multi-distribution refers to the learner being evaluated on the error for data coming from across $k$ distributions. The adaptive data collection is formalised by a procedure over $r$ rounds where in each round the learner can choose the number of samples to collect from each of the distributions (and this can be based on samples collected in previous rounds). The aim of the paper is to understand the trade-off between $r$ and the total number of samples needed for $\epsilon$ accuracy (on the multi-distribution error in a PAC sense). In the realizable case (where a perfect predictor exists), they provide upper and lower bounds that characterise the (near) optimal trade-off. In particular, $O(\log k)$ adaptive rounds are optimal to achieve the optimal sample complexity. In the more challenging agnostic case (where a perfect predictor does not exist), an algorithm is provided that achieves the near-optimal sample complexity in a number of rounds that is independent of $\epsilon$. The Optimization via On-Demand Sampling framework is introduced, whose connection to the multi-distribution problem motivates the difficulties in improving the round complexity of the agnostic case, as well as establishing general results on the trade-off between sample complexity and adaptivity.

Strengths and Weaknesses

Strengths:

  1. The paper is generally well-written and easy to follow (except a few parts – see weaknesses). The algorithms are well explained and the proof sketches provided in the main text are nice and provide some useful intuitions.
  2. Realizable case: An upper bound is established that captures a trade-off between adaptivity and sample complexity. In particular, I believe the main point that can be concluded from the upper and lower bounds is the optimality (sufficiency + necessity) of $\log k$ rounds to achieve the optimal sample complexity, completing the characterization of optimality for adaptive sample complexity.
  3. Agnostic case: An upper bound is established that achieves the near-optimal sample complexity with a round complexity that is independent of the accuracy $\epsilon$, improving on prior work whose algorithms had round complexity that depended polynomially on $\epsilon$.
  4. The algorithms are non-trivial extensions of algorithms from prior work.

Weaknesses:

  1. Some of the technical description is a bit unclear. There seems to be some missing notation in the description of Algorithm 1, e.g. $q_t^m$ is not defined (I guess this means $m$ samples from $q_t$?), $q_{t+1}$ / $q_t$ is not defined (only $q_{t+1}(j)$ is). Additionally, it is not clear where $p$ and $\varepsilon_A$ are used in the algorithm, which then makes the following discussion of the technical overview a bit unclear. In particular, my understanding is that the learner $A$ is called to learn a predictor for $q_t$ that minimizes the fraction of distributions (as weighted by $q_t$), where the weights $q_t(j)$ have been adjusted in previous rounds according to the errors being more or less than $\tau/2$, but I don't see how the equation below l.142 fits in. The proof sketch of Theorem 2 for $r=2$ provides some nice intuition, but the explanation of the $\mathrm{diff}_i$ as uniformly random permutations is not clear. It is not clear to me what is meant by a uniformly random permutation of a copy of $\Theta(d)$. Is $\mathrm{diff}_1$ a sample from a permutation of $[1, \ldots, d]$? Is $\mathrm{diff}_1$ itself a permutation? Though my understanding is that $\mathrm{diff}_i$ is the dimension of subspaces that more or less partition the space, i.e. an integer. Also, if possible, I think it would be clearer to avoid using $\Theta$ notation here and explicitly put in the values.
  2. The lower bound in Theorem 2 has a $k^{1/r}$ dependence while the upper bound is $k^{2/r}$. However, this is enough to establish the necessity of the $\log k$ rounds, which in my understanding is the main contribution this result allows to establish for the realizable case (and perhaps should be stressed more?), especially since known algorithms with optimal sample complexity already had a round complexity of $\log k$. Indeed, for the case where $r$ is a small constant, the upper bound from Theorem 1 is an interesting result, but it does not match the lower bound.
  3. The high-level connection between optimization via on-demand sampling and MDL is discussed in lines 260-265 (and again in Section 5.3); however, it is unclear how the results from each section are connected. What is the connection between Theorem 4 and Proposition 1? And with this absence of connection, it is a bit unclear how the results from Section 5 fit in with the rest of the paper. In particular, I understand how Theorem 9 provides some insight into how well lazy-hedge can perform for MDL, but it is unclear how Theorem 4 does this from an upper-bound perspective. More generally, while Section 3 has a clear structure that is easy to follow, it becomes a bit harder to follow the narrative through Sections 4 and 5. Also, in Section 4, a bit more discussion could be provided, as currently the results are simply stated without much discussion.
  4. Proposition 1 provides an algorithm with near-optimal sample complexity and a poly(k) round complexity, but no lower bound is provided.

Questions

  1. In the discussion following Theorem 1, I am not sure about the claim at l.130 that $r = \log k$ is the "most sample-efficient". Given the upper bound in Theorem 1, the most sample-efficient choice is $r \rightarrow \infty$ - I guess what was meant is that $r = \log k$ already allows one to recover the order of the optimal sample complexity. But really, if we ignore what we know to be the optimal sample complexity, the bound in Theorem 1 suggests that we should set $r$ to match the second term of the bound. And at a higher level, what it says is that the adaptivity level (the value of $r$) should be set as a function of $k$ and $d$, and in particular will behave somewhat differently when $k > d$ or $d > k$. Have you thought about this? And also, how might this be relevant to settings where $k$ may be unknown; how hard would it be to extend these results?
  2. In l.132, when $r = 5$ is chosen in Theorem 1, why is it $\sqrt{k}$ in the sample complexity and not $k^{2/5}$; is this a typo? I think there is also a typo at l.157: it should be $\sqrt{k}d$ not $\sqrt{d}k$, right?
  3. In weakness 4, it was mentioned that no lower bound is provided for the round complexity in the agnostic case. In Section 5.3, it was discussed that improving the round complexity requires a different approach. Could you comment on whether you believe the $\sqrt{k}$ round complexity could be improved, and if so, what the optimal round complexity could be?

Limitations

yes

Justification for Final Rating

The paper is well-written and presents a substantial set of results, significantly improving upon prior work. There are a few weaknesses to the work but they are sufficiently minor to not be an obstacle to acceptance.

Formatting Issues

NA

Author Response

Thank you for your detailed review! We start by clarifying the notations that you mentioned in the "weaknesses" part:

  • We will define the notations $q_t$ and $q_t^m$ in Algorithm 1. Thank you for your comment!

  • Parameters $p$ and $\epsilon_{\mathbb{A}}$ are used for setting the learning rate $\alpha$ (Lines 1 and 8) and the sample size $m$ for each round (Lines 2 and 5).

  • The equation below Line 142 is a consequence of Markov's inequality when $h_t$ has an $O(\tau p)$ error on the mixture distribution $q_t$.

  • The notation $\mathrm{diff}$ stands for a length-$k$ vector $(\mathrm{diff}_1, \mathrm{diff}_2, \ldots, \mathrm{diff}_k)$, which is in turn a uniformly random permutation of the following $k$ numbers: one copy of $c_1 \cdot d$, $\sqrt{k}$ copies of $c_2 \cdot d/\sqrt{k}$, and $(k - \sqrt{k} - 1)$ copies of $c_3 \cdot d / k$. (The constant factors $c_1, c_2, c_3 > 0$ are chosen such that these $k$ numbers sum up to at most $d$, and they are suppressed by the $\Theta(\cdot)$ notation on Lines 161--162.)
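For concreteness, here is a minimal Python sketch of the construction described above (illustrative only; the constants `c1`, `c2`, `c3` are placeholders, and $k$ is assumed to be a perfect square so that $\sqrt{k}$ is an integer):

```python
import numpy as np

def build_diff_vector(k, d, c1=0.5, c2=0.25, c3=0.25, rng=None):
    """Length-k vector diff: one copy of c1*d, sqrt(k) copies of c2*d/sqrt(k),
    and (k - sqrt(k) - 1) copies of c3*d/k, in a uniformly random order.
    The placeholder constants are chosen so that the entries sum to at most d."""
    rng = rng or np.random.default_rng()
    s = int(round(np.sqrt(k)))           # assumes k is a perfect square
    entries = [c1 * d] + [c2 * d / s] * s + [c3 * d / k] * (k - s - 1)
    return rng.permutation(entries)      # uniformly random permutation

# Example: k = 16 distributions, d = 100
diff = build_diff_vector(16, 100)
```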

We answer your questions in the following:

  1. Thanks for catching this! Yes, we meant that setting $r = \log k$ is sufficient for the $k^{1/r}$ factor to be $O(1)$, so larger values of $r$ will not help (modulo constant factors).

Regarding setting $r$ so that the two terms match, that is a very nice observation, and thanks for suggesting it! For example, when $d = O(1)$, setting $r = 3$ would be sufficient for the first term to be dominated by the second. What we had in mind was the $d \gg k$ regime (e.g., the setup of Theorem 16, the formal version of Theorem 2), in which case the first term would dominate.

We were unsure about the setting with an unknown $k$: how would the learner sample from the distributions if their number is unknown?

  2. Yes, it is a typo (or at least, a loose bound) on Line 132: $r = 4$ rounds should be sufficient. And yes, it should have been $d\sqrt{k}$ on Line 157---thank you for catching that!

  3. We believe that the right round complexity for the agnostic setting should be $O(\log k)$, mirroring the realizable setting. That is, our lower bounds for the OODS framework suggest a barrier to achieving this round complexity if the learning algorithm treats MDL as an optimization problem in a black-box way. A concrete starting point towards breaking this barrier would be the alternative MDL algorithms (e.g., the one of [Peng, COLT'24]) that use the samples in a more sophisticated way (beyond running ERM or evaluating empirical errors) and sidestep the OODS barrier.

Comment

Thank you for your detailed response. Most of my points have been cleared up. I still don't fully understand the eq. below line 142: is it saying the weighted distribution errors should be less than $p$ rather than just minimized? I would encourage the authors to add some more detail explaining what it means / represents in the final version. I also would encourage the authors to think about clarifying the part mentioned in the third weakness. For the unknown $k$, I guess you are right, it does not make sense. Otherwise, I am happy to keep my score.

Comment

Thank you for your suggestions! For the equation below Line 142, the following is a more detailed explanation, which we will incorporate into the final version:

Algorithm 1 calls the learner $\mathbb{A}$ to minimize the error of the predictor $h_t$ on a mixture of the distributions $D_1, D_2, \ldots, D_k$, which can be viewed as a weighted average of the errors of $h_t$ on the $k$ distributions. This would then imply that, on most of the $k$ distributions, the error of $h_t$ is low.

In more detail, the learner $\mathbb{A}$ learns a predictor $h_t$ for the mixture $q_t = \sum_{j=1}^{k} q_t(j) \cdot D_j$. By our choice of the sample size $m$ (on Line 2), $h_t$ has an error $\le \epsilon_{\mathbb{A}} < \tau p$ on $q_t$ (except with probability $\delta_{\mathbb{A}}$). Equivalently, if we sample a random index $j \in [k]$ according to $(q_t(1), q_t(2), \ldots, q_t(k))$, the expected error of $h_t$ on $D_j$---$\mathbb{E}_j[\mathrm{err}(h_t, D_j)]$---is at most $\tau p$. By Markov's inequality, we have $\Pr_j[\mathrm{err}(h_t, D_j) > \tau] \le \frac{\tau p}{\tau} = p$, which is equivalent to the inequality below Line 142.
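In one line, using the notation above (a restatement of the same argument, not a new claim):

$$\mathbb{E}_{j \sim q_t}\big[\mathrm{err}(h_t, D_j)\big] \le \epsilon_{\mathbb{A}} < \tau p \quad\Longrightarrow\quad \Pr_{j \sim q_t}\big[\mathrm{err}(h_t, D_j) > \tau\big] \le \frac{\tau p}{\tau} = p.$$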

Review
4

This paper works on reducing the round complexity of the multi-distribution learning problem. Multi-distribution learning (MDL) focuses on learning over multiple distributions with on-demand sampling. Recent work established near-optimal sample complexity for MDL. In this work, the authors aim to reduce the round complexity with lazy updates. They provide a trade-off between the round complexity and the sample complexity for the realizable case, and prove a sub-linear round complexity for the agnostic case.

Strengths and Weaknesses

Strengths:

  1. The major contribution of this work is to establish the trade-off between adaptivity and sample complexity. This represents a valuable contribution to related fields.
  2. Technically, the major contribution is a novel application of a variant of the AdaBoost algorithm with a particular notion of margin. The construction of the lower bound is also a significant technical contribution.

Weakness:

  1. The results in the agnostic setting are relatively weak due to the absence of statistical lower bounds.

Questions

  1. Can you provide some intuition for why the lower bound is not polynomial in $\epsilon$? Is there any method to combine your arguments with the classical $\widetilde{\Omega}\big(\frac{d+k}{\epsilon^2}\big)$ lower bound (without limitation on round complexity) to obtain a better rate?
  2. I wonder whether there is substantial difficulty in extending the upper bounds for the realizable setting to the agnostic setting. In particular, fixing $r \leq \sqrt{k}$, could you prove a sample complexity upper bound better than the trivial bound $dk/\epsilon^2$ for the agnostic setting?

Limitations

The theoretical results only apply to the specific parameter regimes.

Justification for Final Rating

My major concerns are resolved, so I would like to keep my evaluation as 4.

Formatting Issues

I did not find any significant formatting issues.

Author Response

Thank you for your review! We answer your questions in the following:

  1. While the lower bound in Theorem 2 is stated for constant $\epsilon$ (and thus does not have an $\epsilon$ dependence), it is easy to derive an $\Omega\big(\frac{1}{\epsilon} \cdot \frac{dk^{1/r}}{r\log^2 k}\big)$ lower bound for general $\epsilon$ via the standard argument below. (The dependence would be $1/\epsilon$ instead of $1/\epsilon^2$ since the hard instance that we use is realizable.) We focused on the constant-$\epsilon$ regime for brevity. We will add the following discussion to our paper and point to it after the statement of Theorem 2.

We start with the construction for a constant accuracy parameter $\epsilon_0 = 0.01$. Then, we obtain a new MDL instance by "diluting" each data distribution by a factor of $100\epsilon$: we scale the probability mass on each example by a factor of $100\epsilon$, and put the remaining mass of $1 - 100\epsilon$ on a "trivial" example $(\bot, 0)$, where $\bot$ is a dummy instance that is labeled with $0$ by every hypothesis. Learning this new instance up to error $\epsilon$ is equivalent to learning the original instance (before diluting) up to error $\epsilon / (100\epsilon) = 0.01 = \epsilon_0$. Intuitively, the example $(\bot, 0)$ provides no information, and the learning algorithm has to draw $\Omega(1/\epsilon)$ samples in expectation to see an informative example. This leads to a lower bound with a $1/\epsilon$ factor.
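In symbols, the dilution replaces each data distribution $D_j$ with a mixture (the notation $\widetilde{D}_j$ for the diluted distribution is introduced here only for illustration):

$$\widetilde{D}_j := 100\epsilon \cdot D_j + (1 - 100\epsilon) \cdot \delta_{(\bot, 0)}, \qquad \mathrm{err}(h, \widetilde{D}_j) = 100\epsilon \cdot \mathrm{err}(h, D_j) \text{ for every hypothesis } h,$$

since every hypothesis labels $\bot$ with $0$ and thus classifies $(\bot, 0)$ correctly; achieving error $\epsilon$ on $\widetilde{D}_j$ is therefore the same as achieving error $\epsilon_0 = 0.01$ on $D_j$.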

  2. Yes, Theorem 20 and the discussion in Appendix C.4 imply an $\widetilde O(\sqrt{k/C})$-round algorithm for agnostic MDL with sample complexity $\widetilde O(C \cdot (d + k) / \epsilon^2)$ for any $C \in [4, k]$. For example, setting $C = k^{0.8}$ gives an $\widetilde O(k^{0.1})$-round algorithm with sample complexity $\widetilde O(k^{0.8} \cdot (d + k) / \epsilon^2)$, which is generally better than the trivial bound of $dk/\epsilon^2$.
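For the stated example, the round and sample bounds follow by direct substitution:

$$\sqrt{k/C}\,\Big|_{C = k^{0.8}} = \sqrt{k^{0.2}} = k^{0.1}, \qquad C \cdot (d + k)/\epsilon^2\,\Big|_{C = k^{0.8}} = k^{0.8}(d + k)/\epsilon^2.$$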
Comment

Thank you for the detailed response. I would like to keep my score at the moment.

Review
5

This paper studies an adaptive learning process in the Multi-Distribution Learning (MDL) setting, and characterizes a tradeoff between the sample complexity and round complexity. Specifically, it studies to what extent adaptive sample collection can achieve a near-optimal sample complexity while maintaining a beneficial round complexity. Under the realizable setting, it characterizes the sample complexity as $dk^{\Theta(1/r)}/\epsilon$, where $r$ is the round complexity, using a variant of the classical AdaBoost algorithm. In the agnostic case, it shows that with $\widetilde{O}(\sqrt{k})$ rounds, a sample complexity of $\widetilde{O}((d+k)/\epsilon^2)$ can be achieved through a revised version of the algorithm of [ZZC+24] (by allowing lazy sampling with a cap). From a theoretical perspective, the paper also introduces the framework of Optimization via On-Demand Sampling, which naturally connects to multi-distribution learning problems and implies both upper and lower bounds on the round complexity with specified sample overhead.

Strengths and Weaknesses

The paper considers an important problem of multi-distribution learning under the adaptive setting, and provides a theoretical foundation for studying the tradeoff between sample and round complexity. It studies the problem in both realizable and agnostic settings, providing both upper bounds (through well-designed algorithms) and lower bounds (through rigorous analysis). The paper is very well presented with original and insightful discussions. The work could be very influential in adaptive learning settings, and could benefit related areas of research, including on-demand sampling, multi-armed bandit problems, and continual learning, to name a few.

Questions

Did the authors consider other boosting algorithms that might further reduce the sample complexity?

Limitations

Yes.

Justification for Final Rating

The paper is of high quality. During the discussion period, the authors provided further clarification on its related work. I have checked other reviews, and all reviewers showed support for acceptance. Hence, I believe the paper is definitely above the acceptance bar.

Formatting Issues

No concerns.

Author Response

Thank you for your review! For the realizable setting, in Appendix A (Lines 584--607), we discuss using other boosting algorithms based on recursive boosting [Schapire, 1990] and boost-by-majority [Schapire and Freund, 2012]. They yield similar quantitative bounds in terms of the tradeoff between sample complexity and round complexity, but can be favorable in the regime of constant rounds. For example, as mentioned in Lines 597--598, only 3 rounds of adaptive sampling suffice to achieve a sample complexity of $\widetilde{O}(d\sqrt{k}/\epsilon)$.

Review
4

This work studies on-demand sampling, where there are $K$ distributions and the goal of the learner is to learn a hypothesis that minimizes the worst-case error. They consider two classical settings in statistical learning: the realizable setting and the agnostic setting. They present both upper and lower bound results. In this work, they study the fundamental trade-off between sample complexity and round complexity, which is very important.

Strengths and Weaknesses

Strength: the studied trade-off is very fundamental.

Weakness: some parts about the presented Algorithm 1 are not clear. The details are in the box below.

Questions

  1. What is adaptive sampling, or how does one decide in each round whether the sampling is adaptive or non-adaptive? And what are full adaptivity and full non-adaptivity? I hope to see more background knowledge on adaptive sampling.

  2. Is my understanding right that the provided results in the realizable setting can only improve the existing non-adaptive sampling results? From the content below Theorem 1, it seems they cannot improve the results shown in [BHPQ17, CZZ18, NZ18].

  3. Can you give more explanation and intuition about the agnostic results, such as how you get rid of the $\epsilon$ in your new results? And why does the round complexity not need to depend on $\epsilon$? To me, the existing results make more sense.

  4. For Algorithm 1, how is the number of samples drawn from each of the $K$ distributions decided in each round? Can it vary among distributions?

  5. For the discussion below Theorem 1, it would look nicer if the two special cases, $r = \log k$ and $r = O(1)$, were presented in Table 1.

  6. Also, for the discussion below Theorem 1, it seems that when $r = \log K$, it cannot recover the existing results. Why claim it is almost recovered? I think it only improves the non-adaptive sampling results.

  7. Algorithm 1 takes the number of rounds $r$ as input; can it be chosen adaptively, for example, based on the past collected data?

  8. Maybe I missed something in Algorithm 1. What are $q_t^m$ and $D_j^n$?

  9. In Algorithm 1, in step 3, it seems we do not know $D_i$, right? So, how do we compute $q_1$?

  10. For Theorem 2, is it a minimax lower bound or a lower bound for this specific algorithm? If it is a minimax lower bound, it seems to contradict the existing results in [BHPQ17, CZZ18, NZ18] in the realizable setting.

Limitations

N/A

Justification for Final Rating

There is some space for improvement in terms of presentation. So, I keep my current score.

Formatting Issues

No

Author Response

Thank you for your review! We answer your questions in the following:

  1. The term adaptive sampling refers to the sampling being done in multiple rounds (see Lines 104--107). The learner does not get to "decide" whether each round is adaptive or not---every round counts toward the round complexity.

Put in a different way: If the learner wants the second round of sampling to be non-adaptive (i.e., independent of the samples drawn in the first round), they could have merged the first two rounds into a single round.

"Full adaptivity" means that there is no limit on the number of rounds. "Full non-adaptivity" means that there is only one round.

  2. The algorithms of [BHPQ17, CZZ18, NZ18] all require $r = \Theta(\log k)$ rounds. In comparison, our results for the realizable setting (Theorems 1 and 2) apply to a general value of $r$ (between $1$ and $\Theta(\log k)$) and not just the fully non-adaptive case. As mentioned on Lines 47--50, little was known in the regime where $2 \le r \ll \log k$.

  3. To see why the $\epsilon$-dependence can be removed, it would be most informative to check Proposition 2 and its proof sketch, which we outline in the following. The round complexity of the algorithm in [ZZC+24] has an $\epsilon$-dependence because the Hedge algorithm needs $(\log k) / \epsilon^2$ iterations, and each iteration requires one round of sampling. In comparison, the boxed version of our $\mathsf{LazyHedge}$ algorithm only samples when the weight assigned to one of the $k$ distributions substantially increases (say, by a factor of $2$). Since the weight of each distribution starts at $1/k$ and is at most $1$, each weight can double at most $\log k$ times. There are $k$ distributions in total, so the round complexity of the algorithm is at most $k \log k$, which is independent of $\epsilon$.
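To illustrate this counting argument only (this is not the paper's $\mathsf{LazyHedge}$ algorithm; the weight trajectory below is a stand-in for the real algorithm's internal state), a minimal Python sketch of the doubling trigger:

```python
import numpy as np

def count_lazy_rounds(weight_trajectory):
    """weight_trajectory: (T, k) array of Hedge weights over T iterations.
    A new round of sampling is triggered only when some distribution's weight
    has at least doubled since it was last sampled. Each weight starts at 1/k
    and never exceeds 1, so each of the k coordinates triggers at most
    log2(k) times; hence at most k*log2(k) rounds, independent of epsilon."""
    T, k = weight_trajectory.shape
    last_sampled = np.full(k, 1.0 / k)   # weights at the most recent sampling
    rounds = 0
    for t in range(T):
        doubled = weight_trajectory[t] >= 2 * last_sampled
        if doubled.any():
            rounds += 1                               # one fresh sampling round
            last_sampled[doubled] = weight_trajectory[t, doubled]
    return rounds
```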

  4. In each round $t$, the numbers of samples to be drawn from the $k$ distributions are computed from the mixing weights $q_t(1), q_t(2), \ldots, q_t(k)$. On Line 5 of Algorithm 1, $m$ samples are drawn from the mixture distribution $q_t = \sum_{i=1}^{k} q_t(i) \cdot D_i$. Drawing a single sample from $q_t$ is equivalent to the following two steps: (1) Sample $j \in [k]$ according to the distribution specified by $(q_t(1), q_t(2), \ldots, q_t(k))$, i.e., $j$ is set to each $j' \in [k]$ with probability $q_t(j')$; (2) Draw a sample from $D_j$. By repeating this $m$ times, Algorithm 1 obtains the number of samples that need to be drawn from each $D_i$.
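For concreteness, a minimal Python sketch of this two-step procedure (illustrative only; it simulates just the index draws, so the per-distribution counts follow a multinomial with parameters $m$ and $(q_t(1), \ldots, q_t(k))$):

```python
import numpy as np

def samples_per_distribution(q_t, m, rng=None):
    """q_t: length-k vector of mixing weights (non-negative, summing to 1).
    Returns counts[j] = number of the m samples in this round that should be
    requested from distribution D_j, following the two-step procedure above."""
    rng = rng or np.random.default_rng()
    counts = rng.multinomial(m, q_t)   # same as m independent draws of j ~ q_t
    return counts

# Example: uniform weights over k = 4 distributions, m = 100 samples this round
print(samples_per_distribution(np.ones(4) / 4, 100))
```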

  5. Thank you for this comment!

  6. Setting $r = \log k$ in Theorem 1 gives a sample complexity of $O\big(\frac{d\log k}{\epsilon} + \frac{k\log(k)\log(k/\delta)}{\epsilon}\big) = \widetilde O((d + k) / \epsilon)$, recovering the existing results in the first row of Table 1. (Note that the $k^{2/r}$ factor reduces to $O(1)$ when $r = \log k$.)
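To spell out the parenthetical remark (assuming, as is common, that the logarithm is base $2$):

$$k^{2/r}\Big|_{r = \log_2 k} = \left(2^{\log_2 k}\right)^{2/\log_2 k} = 2^2 = O(1).$$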

  7. As stated, $r$ is a parameter of the algorithm and is specified before drawing any samples. It is an interesting direction for future work to design learning algorithms that set $r$ adaptively, thereby using as few rounds as possible on each MDL instance.

  8. $q_t$ is the mixture distribution specified by the mixing weights $(q_t(1), q_t(2), \ldots, q_t(k))$, i.e., $q_t = \sum_{i=1}^{k} q_t(i) \cdot D_i$. For a distribution $D$ and positive integer $m$, $D^m$ denotes the product distribution formed by $m$ copies of $D$. That is, $S \sim D^m$ is a shorthand for drawing $m$ independent samples from $D$. We will clarify these notations---thank you for catching these!

  9. Note that $q_1$ is only used on Line 5, where Algorithm 1 draws $m$ independent samples from $q_1$. As mentioned in the answer to Question 4, the algorithm does not need to know $D_i$ to sample from $q_1$---to sample from $q_1 = \frac{1}{k}\sum_{i=1}^{k} D_i$, it suffices to draw an index $j$ from $\{1, 2, \ldots, k\}$ uniformly at random and sample from $D_j$.

  10. Theorem 2 is a minimax lower bound that holds for all learning algorithms. It does not contradict the results of [BHPQ17, CZZ18, NZ18], since the algorithms in these works all require at least $\log k$ rounds. At $r = \log k$, the lower bound in Theorem 2 reduces to $\Omega(d / \log^3 k)$, which is consistent with the prior results (namely, the $\widetilde O(d + k)$ sample complexity for constant $\epsilon$).

Comment

Thank you for the detailed response. I would like to keep my score.

Final Decision

The reviewers agree that the paper studies a fundamental tradeoff. Moreover, the paper is well-written and presents a substantial set of results. The algorithms are well explained, as are the proof sketches. Altogether, the results for the realizable and agnostic cases make an interesting and deep contribution that is very relevant to the field. I will therefore accept.