PaperHub
6.4 / 10
Poster · 5 reviewers (lowest 6, highest 7, standard deviation 0.5)
Ratings: 7, 6, 6, 6, 7
Confidence: 2.6
Correctness: 2.8
Contribution: 3.2
Presentation: 2.4
NeurIPS 2024

Sample-Efficient Agnostic Boosting

OpenReview · PDF
Submitted: 2024-05-12 · Updated: 2024-11-06
TL;DR

Improved sample complexity for agnostic boosting

Abstract

Keywords
boosting; sample complexity; learning theory; reinforcement learning

Reviews and Discussion

Review · Rating: 7

The paper studies the agnostic boosting problem. That is, the learner is given access to a weak learner that, given $m_0$ samples from any distribution $D'$, returns a hypothesis $h$ from some base set $\mathcal{B}$ such that $corr_{D'}(h) \geq \gamma\, corr_{D'}(h^*) - \varepsilon_0$ with probability $\delta_0$, where $h^*$ is the hypothesis in a hypothesis set $\mathcal{H}$ with the largest $corr_{D'}$. Furthermore, the learner is given access to samples from an unknown target distribution $D$. The goal of the learner is to produce a hypothesis $\bar{h}$ (not necessarily in $\mathcal{B}$ or $\mathcal{H}$) such that $corr_{D}(\bar{h}) \geq \max_{h\in\mathcal{H}} corr_{D}(h) - \varepsilon$ with probability $1-\delta$, using as few samples from $D$ and calls to the weak learner as possible. The paper presents a new learning algorithm for this problem which uses (log factors not included) $O(\log(|\mathcal{B}|)/(\gamma^3\varepsilon^3))$ samples and $O(\log(|\mathcal{B}|)/(\gamma^2\varepsilon^2))$ calls to the weak learner. This is an improvement in the number of samples needed from $D$ by a factor of at least $O(\gamma\varepsilon)$ over previously known results; the improvement costs a factor of $\log(|\mathcal{B}|)$ more calls to the weak learner compared to the best known algorithm, which uses $O(1/(\gamma^2\varepsilon^2))$ calls. However, they also have a second algorithm that uses $O(\log(|\mathcal{B}|)/(\gamma^3\varepsilon^3) + \log^3(|\mathcal{B}|)/(\gamma^2\varepsilon^2))$ samples from $D$ and $O(1/(\gamma^2\varepsilon^2))$ calls to the weak learner. The high-level explanation of the better sample complexity bound is that the best previously known algorithm ran $O(1/(\gamma^2\varepsilon^2))$ rounds and used $O(\log(|\mathcal{B}|)/(\gamma^2\varepsilon^2))$ new samples from $D$ in each round. By reusing samples from previous rounds and augmenting them, the new algorithm only needs $O(1/(\gamma\varepsilon))$ samples per round over $O(\log(|\mathcal{B}|)/(\gamma^2\varepsilon^2))$ rounds. They also extend their results to infinite classes $\mathcal{B}$. Furthermore, the paper shows how the new algorithms have implications for agnostically learning halfspaces and for boosting in reinforcement learning. They also evaluate their first algorithm against the previous best known algorithm and claim that they demonstrate an improvement empirically.
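
For quick reference, the bounds quoted in this summary can be tabulated as follows (the total for the previously best known algorithm is obtained by multiplying its round count by its per-round sample count; log factors are omitted throughout):

\begin{align*}
\text{previous best (as described above):}\quad & O\bigl(\log(|\mathcal{B}|)/(\gamma^4\varepsilon^4)\bigr) \text{ samples}, && O\bigl(1/(\gamma^2\varepsilon^2)\bigr) \text{ oracle calls},\\
\text{Algorithm 1:}\quad & O\bigl(\log(|\mathcal{B}|)/(\gamma^3\varepsilon^3)\bigr) \text{ samples}, && O\bigl(\log(|\mathcal{B}|)/(\gamma^2\varepsilon^2)\bigr) \text{ oracle calls},\\
\text{Algorithm 2:}\quad & O\bigl(\log(|\mathcal{B}|)/(\gamma^3\varepsilon^3)+\log^3(|\mathcal{B}|)/(\gamma^2\varepsilon^2)\bigr) \text{ samples}, && O\bigl(1/(\gamma^2\varepsilon^2)\bigr) \text{ oracle calls}.
\end{align*}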

Strengths

Originality: The authors claim the paper has three innovations: reusing samples from previous rounds by augmenting them smartly, using a martingale argument to avoid a uniform convergence result that they claim would yield an $O(1/\varepsilon^4)$ sample bound, and using a different branching idea compared to previous work. Given the description of the previous work that the authors provide, the ideas seem novel and different from previous approaches.

Quality: The main text contains no full proofs, so I have not checked the technical soundness of the paper; this is also the reason for my confidence score. I have included questions below on the proof sketch. In the introduction it is said that "Finally, in preliminary experiments in Section 7, we demonstrate that the sample complexity improvements of our approach do indeed manifest in the form of improved empirical performance over previous agnostic boosting approaches on commonly used datasets." Do you compare to more than one previous approach? Is it exactly Algorithm 1 that you run? As the experiments are preliminary, I would also suggest using a word like "suggest" instead of "demonstrate" in the sentence above. As also mentioned in the limitations, the theoretical work is the primary contribution of the paper.

Significance: The theoretical results seem significant and a nice improvement over previous work in the sample complexity, and an interesting step towards "matching" the sample complexity of agnostic learning, which has dependency $O(1/\varepsilon^2)$ instead of the current $O(1/\varepsilon^3)$ (if it is possible to achieve a sample complexity of $O(1/(\varepsilon^2\gamma^2))$ in the agnostic boosting case). As they state, the experiments are preliminary, and I also personally think that the theoretical results are more than enough to make the paper significant in itself.

Weaknesses

The only small weakness I can come up with is the $\log(|\mathcal{B}|)$ factor in the oracle complexity of Algorithm 1 (which the authors mention themselves), which is taken care of in Algorithm 2 for most parameter regimes of $\log(|\mathcal{B}|), \varepsilon, \gamma$ (not considering hidden log factors in the notation).

Questions

The comment given below Definition 1: do [KK09] and [BCHM20] also have the additive $\varepsilon_0$ term, so that they have to set $\varepsilon_0 = \varepsilon$? (So my question is: is the comment "necessary" also meant for [KK09] and [BCHM20] as well?)

Line 208: is it $-\phi'(H_t, h^{*})$, or what is $H$ in $-\phi'(H, h_t^{*})$?

Line 8 of Algorithm 1: what is $D'_t$? I guess never mind, since line 9 says $\hat{D}'_{t}$; is that also the case for Line 8? And on line 8, "with as" sounds a bit off to me (disclaimer: I'm not a native English speaker, so please take this into consideration).

Theorem 4: Is there no dependence on $\delta$ or $\delta_{0}$ in the sample complexity? Or are they suppressed in the notation, as the footnote suggests? (If this is the case, why is $\log|\mathcal{B}|$ not suppressed in the notation as well?)

Proof sketch of Theorem 4 questions start

Line 238: why $\eta \approx 1/\sqrt{T}$? (So $\eta = O(\sqrt{1/T}) = O(\gamma\varepsilon/\sqrt{\log|\mathcal{B}|})$ as in the statement of Theorem 4?)

Line 239.5: is $\varepsilon_{gen} \approx \varepsilon$?

Line 241: is it $\phi'_{D}(H_{t}, W_t/\gamma)$ in the last equation?

Line 244: is it $-\phi'_{D}(H_{t}, -\mathrm{sign}(H_t))$ in the last equation?

Lines 245-246: why is this relaxed criterion sufficient for the proof of Theorem 4? And could that be explained in lines 240-244 instead of the thing which is not tractable?

Proof sketch of Theorem 4 questions end.

Line 281-282 "there is an algorithm using that uses the weak learner W\mathcal{W}" sounds off to me (disclaimer: I'm not a native english speaker so please take this into considerations for this suggestion).

What is the reason for not comparing to [BCHM20] in the experiments?

Limitations

Yes.

Author Response

We thank the reviewer for their time, and their positive appraisal of our work. We address the reviewer’s questions below.

Epsilon0: Here, essentially following the precedent set by KK and BCHM, we state our main results (e.g., Theorem 4) for a fixed value of the weak learner's tolerance $\varepsilon_0$. This term is also present in KK (see Theorem 1 therein) and BCHM (see Theorem 8 therein). Only for the purpose of compiling the results in Table 1 do we set $\varepsilon_0 = \gamma\varepsilon$, so that the stated results correspond to the standard agnostic learning notion of $\varepsilon$ excess error and all algorithms, including ERM, may be compared.

Delta0/Delta in Theorem 4: $\delta_0$ does appear additively in the failure probability in Theorem 4. We suppress a $\log(1/\delta)$ factor in the sample complexity bounds. The reason for displaying $\log|\mathcal{B}|$ prominently stems from this quantity being the capacity of the hypothesis class. Although written with a logarithm, this quantity is not small, since in applications the hypothesis class is routinely chosen to be at least moderately expressive (say, with cardinality exponential in the feature space dimension).

Comparisons to BCHM: We note that the primary contribution in BCHM was a unifying theoretical framework connecting the online-statistical and realizable-agnostic divides. For the individual cases (i.e., the realizable or the statistical agnostic case), BCHM acknowledges that the bounds obtained do not match the best known ones. Similarly, given the nature of that work, BCHM does not provide any experiments or a reference implementation, and it was not clear to us whether the algorithmic framework is a practically viable one. This was our rationale for comparing our approach exclusively to Kalai-Kanade, which is also the state of the art in theory (as mentioned in Table 1).

Following the reviewer’s suggestion, we have now added comparisons to BCHM too, and expanded our benchmarking to 6 datasets. The algorithm outperforms both KK and BCHM on 18 of 24 instances. The results can be found in the PDF attached to the common response to all reviewers.

Remaining points: We will switch to 'suggests' as you recommend; this is a good suggestion. Lines 208, 241 and 244 should indeed be as you suggest; they should include $H_t$ as the first argument. Line 8 in the algorithm should also be $\widehat{D}'_t$, as you point out. Thanks!

Proof Sketch: Our intent for the proof sketch was to convey the plausibility of a $1/\varepsilon^3$ result, up to the exclusion of other factors. Hence, the $\lesssim$ inequalities only hold up to constants and polynomial factors in $\gamma$ and $\log|\mathcal{B}|$. So, while in the proof sketch $\eta \approx \frac{1}{\sqrt{T}} \approx \varepsilon$, the formal theorem and proof specify these quantities correctly, including the dependencies on $\gamma$ and $\log|\mathcal{B}|$. We should have been explicit about this expectation. (Indeed, $\varepsilon_{Gen} \approx \varepsilon$, as you note.) The relaxed branching criterion that accommodates the challenge mentioned in Line 244 is not straightforward in terms of logical flow (it requires case work in Lines 504-513), although it is entirely elementary in terms of its mathematical content. Hence, we were reluctant to dive into it, but we will use the extra page available post acceptance to sketch the argument succinctly.
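
For concreteness, the parameter scaling being discussed in this exchange (suppressing constants, and reading the number of rounds off the oracle-call count quoted in the review summary) is

\[
T \approx \frac{\log|\mathcal{B}|}{\gamma^2\varepsilon^2}
\qquad\Longrightarrow\qquad
\eta \approx \frac{1}{\sqrt{T}} \approx \frac{\gamma\varepsilon}{\sqrt{\log|\mathcal{B}|}},
\]

which collapses to $\eta \approx \varepsilon$ once the factors of $\gamma$ and $\log|\mathcal{B}|$ are dropped, as in the proof sketch.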

KK08: Kanade, Varun, and Adam Kalai. "Potential-based agnostic boosting." Advances in Neural Information Processing Systems 22 (2009).

BCHM20: Brukhim, Nataly, Xinyi Chen, Elad Hazan, and Shay Moran. "Online agnostic boosting via regret minimization." Advances in Neural Information Processing Systems 33 (2020): 644-654.

Comment

Thanks for your response and good luck!

Comment

Thank you for your prompt acknowledgement. We especially appreciate your fine-grained feedback about the proof sketch and your eye for notational inconsistencies, which we can only assume came at significant expense of your time.

Review · Rating: 6

The authors propose a method of classifier boosting that improves the sample complexity bound in agnostic settings. To achieve this, they propose an algorithm that carefully reuses samples across iterations for updating pseudo-labels. They also demonstrate the usefulness of the method via agnostic halfspace learning and reinforcement learning applications, as well as experimental results.

Strengths

  • Well motivated.
  • Significant technical novelties.

Weaknesses

  • It lacks a discussion comparing with prominent competitors in terms of the trade-off between sample and time complexities: SGD-like implementations of ERM.
  • The technical part is hard to follow, and it is difficult to understand which portion of the algorithm is new and important for the improvement in the sample complexity. In particular, one could use an intuitive explanation of the role of each nontrivial step in the algorithm, such as
    • the stochastic branching in Line 5,
    • the sampling of $\eta'$ in Line 5.II,
    • the stochastic relabelling in Line 5.II,
    • the branching condition in Line 9, and so on.
  • Also, the proof sketch (Section 5) could be more streamlined, unifying it with the aforementioned intuitive explanations.

Questions

  • On the note right after Theorem 6, the time complexity of ERM: Can we just employ an SGD-like algorithm + hinge loss to achieve the same order of the sample complexity as ERM with a manageable time complexity, like $O(n/\epsilon^2)$?
  • Section 6.2: How does it compare to the non-boosting results in the RL literature?
  • Section 7: The baseline accuracy in Table 3 barely degrades as the label noise increases. It even goes up on the Spambase data. How do you explain this phenomenon?
  • Also in Section 7:
    • What is the meaning of $\pm$ in the table? Standard deviation?
    • Did you separate the data for the accuracy estimation and the grid search?

Limitations

The limitations are adequately addressed.

Author Response

We thank the reviewer for their time, and their appreciation of the technical novelties of our work.

Comparisons to ERM via SGD: The boosting framework (including the celebrated AdaBoost algorithm) is founded on the premise that it is often easy to construct weak learners, both in theory and in practice, crucially in the absence of a computationally efficient ERM oracle. Indeed, when ERM is efficiently implementable for a given hypothesis class (e.g., logistic regression with cross-entropy loss), the boosting methodology (not just our work) offers no advantages; e.g., ERM has a better sample complexity per Table 1 (or a nearly matching one in the realizable/AdaBoost case). Note that the corresponding applications of boosting also follow this pattern: routinely used weak learner classes are non-convex and not amenable to SGD, e.g., decision stumps and binary parity functions. Similarly, for the classic problem of agnostically learning halfspaces (notice the 0/1 loss), neither our results nor those in KK08 and KKMS08 are achievable via SGD + hinge loss on linear classifiers. In fact, no fully polynomial time algorithm is known for this problem. Hence, we find this comparison not fully relevant to the problem at hand, and we sincerely request the reviewer to consider raising their score.
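
For readers who want to see concretely what the "SGD + hinge loss" baseline under discussion looks like, here is a minimal, purely illustrative sketch (our own example using scikit-learn, not an experiment from the paper); as the response above explains, such a baseline trains a linear classifier on a convex surrogate and does not yield the agnostic 0/1-loss guarantees considered in this work.

# Illustrative sketch only: a linear classifier trained by SGD on the hinge loss.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (placeholder for the real datasets).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# "ERM + hinge loss + SGD" on a linear hypothesis class.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)
print("0/1 test accuracy:", clf.score(X_te, y_te))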

Motivating the algorithm design: Thanks for bringing this up. Following your suggestion, we’ll meticulously carry out readability improvements for the section that motivates the algorithm’s design. We will reframe it to explain and motivate the algorithmic choices we make in a top-down manner, along with a description of the potential-based framework. This way the changes we make will be easier to highlight. Currently, we highlight the new components (including the data reuse scheme using relabelling and the new branching scheme, both crucial) of our approach in the contributions section (notably, Lines 84-141), but we understand that the description may be too dense and perhaps too early.

Non-boosting RL results: The closest point of comparison for RL using a supervised learning oracle (a stronger assumption than assuming a weak learning oracle, as we do) over the policy space comes from the Conservative Policy Iteration algorithm due to Kakade and Langford (KL02). It needs $1/\varepsilon^3$ samples in the episodic model and $1/\varepsilon^4$ samples in the $\nu$-rollout model; both of these are better than our results by a factor of $\varepsilon$, albeit in a stronger oracle model. Notice that this ERM-to-boosting gap is larger in BHS20, where the boosting results are worse by a factor of $\varepsilon^2$. In this sense, our results improve upon the best known results for boosting for RL, and reduce the gap between the boosting methodology for RL and supervised learning.

Section 7: A key point, as stated in the caption of Table 3, is that the label noise is only added during training. Hence, the Bayes-optimal classifier remains equally effective on the original dataset regardless of label noise. The fact that the learned classifier barely degrades with as much as 20% label noise is a testament to the strength of the agnostic boosting approach. The case of Spambase is an anomaly we do not fully understand. We suspect that this effect may be due to the regularization effect of label noise; the feature vectors in Spambase are notably sparse. The numbers following $\pm$ indeed represent the endpoints of a 95% confidence interval for each entry, each estimated using out-of-sample data.
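
To make the protocol explicit, here is a minimal sketch of the noise-injection step described above (our own illustration, not the authors' experiment code): noise corrupts only the training labels, and accuracy is then measured against the untouched test labels.

# Illustrative sketch of the evaluation protocol: label noise on training data only.
import numpy as np

def flip_labels(y, noise_rate, rng):
    """Flip each +1/-1 label independently with probability noise_rate."""
    flips = rng.random(len(y)) < noise_rate
    return np.where(flips, -y, y)

rng = np.random.default_rng(0)
y_train = rng.choice([-1, 1], size=1000)          # placeholder training labels
y_train_noisy = flip_labels(y_train, 0.20, rng)   # e.g., 20% training label noise
# A booster is then trained on (X_train, y_train_noisy) and evaluated on the
# clean, untouched test labels, so the Bayes-optimal accuracy is unchanged.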

KK08: Kanade, Varun, and Adam Kalai. "Potential-based agnostic boosting." Advances in Neural Information Processing Systems 22 (2009).

KL02: Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267-274. 2002.

BHS20: Brukhim, Nataly, Elad Hazan, and Karan Singh. "A boosting approach to reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 33806-33817.

KKMS08: Kalai, Adam Tauman, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. "Agnostically learning halfspaces." SIAM Journal on Computing 37, no. 6 (2008): 1777-1805.

Comment

Thanks for the thoughtful and informative rebuttal.

Let me clarify one thing: on the first point, specifically in the setting of Theorem 6, I agree that "ERM + hinge loss + SGD" with the weak hypothesis class does not converge to the optimal weak learner w.r.t. 0-1 loss.

However, "ERM + hinge loss + SGD" does converge to the Bayes optimum if we consider stronger hypothesis classes (possibly with regularization) that are convex and containing the Bayes optimum in their closures, as the hinge loss is a calibrated surrogate loss function. Since the proposed method also looks for a good hypothesis beyond the weak learners, I think it is interesting to compare it with "ERM + hinge loss + SGD" beyond the weak hypothesis learning.

Comment

Thank you for engaging with us. We think it has been helpful in pinpointing the source of disagreements regarding expectations from the present work.

We would like to reiterate that our work – applications included – is situated in the agnostic model, to which a substantial fraction of the literature on learning theory is dedicated. In the realizable case (even with noisy labels), when working with a hypothesis class $\mathcal{H}$, the reviewer's remarks are undoubtedly correct, and (S)GD + a calibrated loss does converge to the best in-class hypothesis with respect to the 0/1 loss (which is also the Bayes-optimal classifier by the realizability assumption). However, the premise of the agnostic learning model lies in the desire to compete (up to an error vanishing in the sample size) with the best in-class hypothesis, crucially in the absence of any distributional assumptions on y|x whatsoever. Here, the Bayes-optimal classifier can be arbitrarily complex (and not approximable by $\mathcal{H}$). Conversely, if the Bayes-optimal classifier were (approximately) in $\mathcal{H}$, the learning problem would belong to the (respectively, near) realizable setting. There are also results that attempt to weaken the realizability assumption. For example, FCG21 (which indeed works by GD on a convex surrogate, as the reviewer suggests) is a recent work of this sort that provides learning guarantees of the form $C \times (\text{best in-class accuracy})^{1/2} + \varepsilon$ with polynomial sample complexity, which is meaningful whenever the best in-class hypothesis has a small error (zero error corresponding to perfect realizability). Nevertheless, these bounds are incomparable to agnostic learning guarantees of the form best in-class accuracy $+\ \varepsilon$ (whether the learned hypothesis is in-class is an orthogonal matter, that of proper vs. improper learning), a stronger guarantee that also requires substantially more samples. In conclusion, near realizability of the Bayes-optimal classifier puts us in a different learning model altogether, with incomparable guarantees. However, we are happy to add a remark mirroring this point in the manuscript.

On the note right after Theorem 6, the time complexity of ERM: Can we just employ an SGD-like algorithm + hinge loss to achieve the same order of the sample complexity as ERM with a manageable time complexity, like $O(n/\epsilon^2)$?

Finally, we think there is very strong evidence to bet against the reviewer's original hunch (quoted above). There are known statistical query lower bounds (referenced in, e.g., FCG21), which apply to a very large class of algorithms including (S)GD regardless of the parameterization, requiring $\exp(\mathrm{poly}(\varepsilon^{-1}))$ queries for agnostic learning of halfspaces with Gaussian marginals (we operate with uniform marginals).

FCG21: Frei, Spencer, Yuan Cao, and Quanquan Gu. "Agnostic learning of halfspaces with gradient descent via soft margins." In International Conference on Machine Learning, pp. 3417-3426. PMLR, 2021.

Thanks once again for your time and feedback.

P.S. As a final remark – perhaps this is what the reviewer had intended – one can try to approximate the best in-class hypothesis (crucially, not the Bayes-optimal one, which can be arbitrarily complex) by an expressive hypothesis class (e.g., in an orthogonal basis of functions like the Fourier basis). But this is what we (and previous works including KK and KKMS) effectively do. Our proof proceeds by approximating halfspaces by low-degree terms in the Fourier basis to show that parity functions act as a weak learning class for the former (Line 654).

Comment

P.S. As a final remark – perhaps this is what the reviewer had intended – one can try to approximate the best in-class hypothesis (crucially, not the Bayes-optimal one, which can be arbitrarily complex) by an expressive hypothesis class (e.g., in an orthogonal basis of functions like the Fourier basis). But this is what we (and previous works including KK and KKMS) effectively do. Our proof proceeds by approximating halfspaces by low-degree terms in the Fourier basis to show that parity functions act as a weak learning class for the former (Line 654).

Yes, this is what I intended, at least in my last reply. And I am curious how, despite their similarity, the two compare to each other in terms of their sample/time complexities. But anyway, it has become clear that my original concern is not as impactful as I saw it. Thanks for the extended discussion so far. I will update my score.

Review · Rating: 6

This work proposes a new agnostic boosting algorithm whose sample complexity is shown to be polynomially better than that of previous methods. Compared to existing techniques, the advantage of the proposed algorithm comes mainly from the novel concept of reusing samples across multiple rounds of boosting. Through their analysis, the authors offer other theoretical insights that may be of independent interest.

Strengths

  1. Good writing style. Well above average, with few typos and overall high signal-to-noise ratio.
  2. Noteworthy rigour in the theoretical analysis. Catching the subtle stochastic dependencies missed by [KK09] is a good sign of diligence. (I did not check the details exhaustively, though, especially those in the appendices.)
  3. Substantial contribution. Up to some caveats, this work improves the sample complexity of agnostic boosting by a factor of $\varepsilon\gamma$, which is particularly valuable considering that it leaves the SOTA only another $\varepsilon\gamma$ factor away from ERM's complexity.
  4. Some empirical results are provided. This is a good practice, even if the main focus is theoretical.
  5. (Bonus) I tend to be sceptical towards citing web links, but [Pin20] was a nice one.

Weaknesses

I will start by the experiments. Note that I can appreciate the work for its theoretical contributions alone and see the experiments as a welcome extra. With that being said:

  1. Information on the computer resources used for the experiments seems to be missing. This is in contrast with Question 8 of the paper checklist (at 840):

    For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

    To that, the authors reply "Yes. Please see the appendix." (at 843 and 844), but I could not find this information in the appendix, the main text, or the supplementary material.

  2. Without information about the computational cost of the experiments, it is hard to understand the limitation to only three datasets. The experiments in [KK09] include 5 more datasets and, even though I can see the experiments brought by the authors here are larger (e.g., 30-fold rather than 10-fold cross-validation), hardware has got much better since [KK09].

  3. From the description of the experiments, it is clear that they generated much more data than what is present in the text. More of this data should be available, even if only in the supplementary material.

  4. At line 296, "less stark" is a bit too generous. The baseline performed consistently and significantly better than the proposed method in the presence of noise. While I praise the authors for being upfront about this, the phenomenon should be sufficient to motivate further experimentation and a deeper discussion.

Moving to the theoretical side, I believe the main issues lie in the presentation of the work. To be clear, this does not concern the writing style, which I appreciated, but rather the clarity and contextualization relative to prior works.

  1. While the intro is clear at almost all points, clarity seems to progressively decrease up to (the crucial!) section 5. I understand that much of this comes from the complexity of the subject itself. I understand the big challenge, but given the absolute importance of section 5, I believe the authors underperformed there. A similar note applies to the paragraph starting at 205: I am confident there is some significant "curse of knowledge" at play there. Still, a criticism such as this is too costly to substantiate, so it will not significantly affect my evaluation (directly).
  2. The contextualization should be better. A paper not only ought to make it possible for the reader to infer precisely which parts of the contributions are original, it should make the delineation easy. Especially for a paper that demands heavier background, it is crucial to be very clear about the extent of your contributions, not only by stating them explicitly, with absolute clarity, but also by explicitly indicating which ideas come from previous works. Below, I list the reasons why I believe what the authors offer (most notably the explicit enumeration of contributions in Section 1.1) is insufficient in this regard.
    1. I believe that the implications for agnostic learning for halfspaces should be presented as a case for the relevance of the problem, rather than as a contribution. The translation of results in the setting of the paper to that of agnostically learning halfspaces is already present in [KK09] and seems to require minimal extra effort. Still, the authors promote it as a contribution quite strongly, with multiple occurrences of the claim in the text and its highlight as Contribution 4 (at 131).
    2. A similar note may apply to the application to boosting for reinforcement learning (Contribution 5 at 134). I preferred to ask the authors directly about this.
    3. Similarly, I found it necessary to ask the authors about the exact nature of Contribution 3. See question below.
    4. The point mentioned in Contribution 2 (113) should be further discussed. If I understand correctly, when taking the oracle complexity into account, the bound from Theorem 8 remains incomparable to the one from [KK09] (see, e.g., Table 1), given the cubic dependence on $\log\lvert\mathscr{B}\rvert$. The authors should elaborate on this matter. For example, on one hand, understanding this logarithmic factor to be negligible (as the authors suggest in the caption of Table 1) weakens the case for Contribution 2, since it only improves the oracle complexity by this exact factor. On the other hand, taking it as important jeopardizes the contribution of Theorem 8, as the factor shows up (twice) in the final sample complexity bound.
    5. The authors should highlight more the main technical challenges arising from modifying the algorithm by [KK09]. In my opinion, the contribution would be fair even if the challenges were minor so, if authors thought this was the case (and I am not saying it is), they could be explicit about it and still make a good case for the paper.
    6. [BCHM20] looks a bit off within Table 1, since its bound is strictly worse than that of [KK09]. This is not a serious issue, but it is worth a brief comment.
  3. The need for $\varepsilon_0$ and $\delta_0$ is worth more discussion than what is offered around 166.

NOTE: I apologize if my attempt to remain objective ended up making me sound harsh.

Overall, I tend towards acceptance, but with reservations. I find it quite feasible that I misunderstood some point, and that the authors can clarify important points in the rebuttal, so there is a fair chance I end up raising my initial score.

Minor issues and suggestions

While reading the paper, I took some notes that the authors may find useful. The authors are free to ignore those.

  • You may want to have a look at the (very recent) works on parallel boosting [1] and [2]. I'm not sure the connection is deep, but they also employ some notion of "sample reuse".
  • [23] missing "it"
  • "excess of εε" is better than "εε-excess" in many ways, if you think about it
  • [Table 1] Put "inefficient" within parentheses
  • [Table 1] Rephrase the sentence starting with "$\mathscr{B}$ is the base...". It's quite confusing.
  • [All tables] Use booktabs for better tables. Also, the tips in its documentation can be very useful
  • [41 (and many other places)] In American usage, colons are usually followed by a capital letter
  • [66] "these" is ambiguous
  • [67] "A weak learner" \to "A γ\gamma-weak learner" (in particular, solves undefined usage shortly after)
  • [73] missing hyphen
  • [85] remove extra comma
  • Floating objects (figures, tables, etc.) ended up inconveniently far from their first reference in the text
  • [100] double "of"
  • [103] "making a step “backward” via adding in a negation of the sign of the current ensemble" is a bit too heavy to get this early in the text
  • [124] remove extra "to"
  • Having hyperlinks to algorithm lines would be nice
  • Number theorems and lemmas by section (e.g., "Theorem 4.2" rather than "Theorem 5")
  • [138] "MDP" is still undefined
  • [151] fix "boostng"
  • [161] A pedantic reader may need a quick definition of $\Delta(\cdot)$
  • [everywhere] use \colon rather than : to define functions (it gives the right spacing)
  • Consider TikZ for Figure 1 to have better control over the appearance (compared to GeoGebra) and vector graphics by default. Perhaps this extra control to can save some space. Here is a quick draft as an example:
\begin{tikzpicture}
    \draw[color=gray, thin, dotted] (-3.1, -0.6) grid (4.1, 4.1);

    \draw[->] (-3.2,0) -- (4.2,0);
    \draw[->] (0,-0.7) -- (0,4.2);

    \draw[color=red, domain=-2.075:2.6] plot (\x, {2 - \x}) node[right] {$2 - z$};
    \draw[color=blue, domain=-2.075:4, smooth] plot (\x, {(\x + 2) * exp(-\x)}) node[above] {$(z + 2) e^{-z}$};
\end{tikzpicture}
  • Many mathematical equations are left without proper punctuation. Those are an integral part of the text and should be treated as such.
  • [Definition 1] Consider using $\sup$ rather than $\max$ for robustness
  • [176] "a functions" \to "a function"
  • [Algorithm 1] "Agonstic" \to "Agnostic"
  • [Algorithm 1] Number all lines
  • [Algorithm 1, first line of line 5] probably missing a "uniform"
  • [198] Use \coloneqq
  • [208] no need for the apostrophe to pluralize
  • This is a bit more important: $h^*$ only gets defined at 482 but it's used many times earlier.
  • [237] double period
  • [239] The use of $\pm$ in the next equation is undefined and confusing
  • [240] "linearity" \to "the linearity"
  • [Table 3] omit the 0s to save space. E.g., "0.12" \to ".12". This should solve the overfull hbox.
  • [483] dangling "]" below
  • [577] "semi norm" \to "seminorm"
  • [578] "size" is ambiguous in this context
  • [Theorem 19] Missing \operatorname for "corr"
  • [582 and 583] Inconsistent parenthetization
  • [598] Missing \operatorname for "Cov"
  • [Random] In my humble opinion, this work would be better presented by first focusing on the method by [KK09] and building up from there, gradually discussing the modifications and their implications. Of course, this is also a matter of style, and the change could be quite costly at this point.

[1] Karbasi, A. and Larsen, K.G., 2024, March. The impossibility of parallelizing boosting. In International Conference on Algorithmic Learning Theory (pp. 635-653). PMLR.
[2] Lyu, X., Wu, H. and Yang, J., 2024. The Cost of Parallelizing Boosting. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (pp. 3140-3155). Society for Industrial and Applied Mathematics.

Questions

  1. What hardware was used for the experiments? How long did they take to run?
  2. Why include $T$ in the grid search of the experiments? Wouldn't it suffice to fix the highest value? The fact that the proposed method, in some sense, "rolls back to the best model" (see the last line of Algorithm 1) makes that seem even more reasonable.
  3. At 250, what do you mean by "the sign of $\Delta_t$ is indeterminate"?
  4. How original is Contribution 5? How much of it is present in previous works? How much effort is needed to fit your results into existing frameworks to obtain Theorem 7? (I already had a look at appendices G and H, but I am still not totally sure.)
  5. Similarly, how original is Contribution 3? Are the authors aware of any previous works applying similar strategies ($L_1$ covering number based bounds \to Theorem 20)? (To be clear, I do value the nice connection. I am simply trying to size precisely its originality.)
  6. At 116, what do you mean by "The resultant sample complexity improves known results for all practical (i.e., sub-constant) regimes of $\varepsilon$"? I would imagine that applications are actually prone to employ larger values of $\varepsilon$; above, say, the machine precision.

Limitations

I think my criticisms regarding the experiments (1-4) are pertinent here. Moreover, I got the impression that the authors are particularly capable of tackling the presentation issues. They demonstrated a good grasp of the subject and good writing skills. I suspect that, with some effort, they can make this and future papers excel in clarity and appeal.

Comment

Sign of $\Delta_t$: Given the equation below Line 249, if the $\Delta_t$'s were somehow non-negative, then since $\mathbb{E}_{t-1}[\Delta_t] < \Delta_{t-1}$, they would form a supermartingale sequence, and one could apply supermartingale concentration results. The comment is in place to emphasize that although $\Delta_t$ is martingale-like, by itself it is not a martingale, supermartingale, or submartingale.
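
For readers less familiar with the terminology invoked here, the relevant textbook definition (standard material, not from the paper) is

\[
(\Delta_t)_{t\ge 0} \text{ is a supermartingale w.r.t. a filtration } (\mathcal{F}_t)_{t\ge 0}
\iff
\mathbb{E}[\Delta_t \mid \mathcal{F}_{t-1}] \le \Delta_{t-1} \text{ for all } t,
\]

and supermartingales with bounded increments admit Azuma-Hoeffding-type concentration. The comment above notes that the $\Delta_t$ here satisfy a martingale-like recursion without being a martingale, supermartingale, or submartingale, so such off-the-shelf results do not directly apply.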

KK08: Kanade, Varun, and Adam Kalai. "Potential-based agnostic boosting." Advances in Neural Information Processing Systems 22 (2009).

BCHM20: Brukhim, Nataly, Xinyi Chen, Elad Hazan, and Shay Moran. "Online agnostic boosting via regret minimization." Advances in Neural Information Processing Systems 33 (2020): 644-654.

BHS20: Brukhim, Nataly, Elad Hazan, and Karan Singh. "A boosting approach to reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 33806-33817.

AGHM21: Alon, Noga, Alon Gonen, Elad Hazan, and Shay Moran. "Boosting simple learners." In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 481-489. 2021.

Author Response

We thank the reviewer for their thorough review, and for their positive comments regarding the rigor and the significance of our work.

Experiments: All experiments were run on a Google Cloud Engine VM instance with 2 vCPUs (Intel Xeon 64-bit @ 2.20 GHz) and 12 GB RAM. The Python code takes a bit under 2 hours to reproduce the results as presented in the table. Here, we would like to note that the code for the experiments and reference implementations of the algorithms in KK are not publicly available. Hence, for some datasets (e.g., pendigits, letters), we are unsure how KK converts these multiclass datasets (as listed on the UCI repository) to binary classification tasks. Given that a description of this is missing in the text of KK and its appendices, we skipped these. Within the remaining options, we chose datasets that were moderately sized, on which we could iterate fast to get a working version of the algorithm in KK. Similarly, BCHM is a more recent work on the subject, but does not report experimental results. We are happy to include the complete log obtained from the cross-validation trials. Finally, the case of Spambase is an anomaly we do not fully understand. We suspect that this effect may be due to the regularization effect of label noise; the feature vectors in Spambase are notably sparse.

Following the reviewer’s suggestion, we have now expanded our benchmarking to 6 datasets, and added comparisons to the algorithm from BCHM too. The algorithm outperforms both KK and BCHM on 18 of 24 instances. The results can be found in the PDF attached to the common response to all reviewers.

Variable T: If $S$, the number of fresh samples drawn every round, were a constant, the reviewer's comment would be unreservedly true. However, given a fixed dataset of size $m$, the choice of $T$, the number of rounds of boosting, also implicitly dictates $S$, the number of fresh samples one can draw every round, since $S \approx m/T$. Hence, given a fixed data budget, versions of the algorithm run for larger $T$ are not simply versions with smaller $T$ subjected to more rounds of boosting. This is true for the algorithm in KK too.
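
To make the coupling concrete, here is a tiny illustrative computation (hypothetical numbers, not the paper's settings) of how each grid value of $T$ fixes the per-round fresh-sample budget $S$ under a fixed data budget $m$:

# Illustrative only: under a fixed data budget m, the number of boosting rounds T
# determines the fresh samples S available per round, S ~ m / T.
m = 6000  # total labelled examples (hypothetical)
for T in (10, 25, 50, 100):  # candidate grid values for the number of rounds
    S = m // T
    print(f"T = {T:3d}  ->  S = m // T = {S} fresh samples per round")
# Because S shrinks as T grows, a run with a larger T is not simply a longer run
# with a smaller T, which is why T belongs in the grid search.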

Contributions vs Implications: This point is well made. We are happy to list the results on agnostic learning of halfspaces and boosting for reinforcement learning as implications of the improved agnostic boosting result instead of contributions, although here we must note that the result as presented in BHS is not entirely black-box. That this interpretation was intended originally too can be evidenced in ‘we apply’ and ‘by applying’ on lines 131 and 137, along with citations to works that first sketched out the connection.

Sample-complexity/Oracle-complexity Tradeoff in Contribution 2: We think the bounds in Theorems 4 and 8 each make their mathematical/theoretical utility evident. To make a case for their practical relevance, here is our interpretation of the presented results: almost invariably, both in practice and in theory, boosting is used with simple base hypothesis classes with a small value of $\log|\mathcal{B}|$ (see AGHM21 for cases and examples). With a modest inflation in running time – which is still polynomial in PAC learning terms – Theorem 4 guarantees an unconditionally better sample complexity. We certainly think of Theorem 4 as the main contribution of the present work (and hence it is positioned in the main paper). Now, let us fix $\gamma = \Theta(1)$ for simplicity of argument. If, however, this inflation in running time is unacceptable for some reason (due to, say, binding computational constraints or large datasets), Theorem 8 offers the same oracle complexity and running time as the previous works, yet an improved sample complexity as long as $\varepsilon \leq 1/\log|\mathcal{B}|$. With large datasets at hand, one is presumably aiming for small $\varepsilon$, and this condition is met. Similarly, in resource-constrained settings, where $\mathcal{B}$ is not very expressive so as to save on computation, $\log|\mathcal{B}|$ is small, making the truth value of this condition very plausible, if not probable. This is what we mean by sub-constant $\varepsilon$; it is a relative statement about the magnitude of $\varepsilon$ in comparison to other problem parameters – here, $\log|\mathcal{B}|$ – not an absolute one.

Technical Presentation: We agree that Lines 205-216 and Section 5 are too dense. We will meticulously carry out readability improvements for this section, which motivates the algorithm's design. The current rendition effectively lays out our proof strategy; instead, we will reframe it to explain and motivate the algorithmic choices we make in a top-down manner, along with a description of the potential-based framework. We were severely space-constrained, and in expanding these descriptions, we hope the additional page available post acceptance will aid us.

Technical Challenges: Currently, we highlight the new components of our approach (including the data reuse scheme using relabelling, avoiding uniform convergence on the boosted class a la BCHM, and the new branching scheme, all crucial) in the contributions section (notably, Lines 84-141), along with their necessity, but we understand that the description may be too dense and perhaps come too early.

BCHM vs KK: We note that the primary contribution in BCHM was a unifying theoretical framework connecting the online-statistical and realizable-agnostic divides, including the first result for online agnostic boosting. For individual cases (i.e., for the realizable or the statistical agnostic case), BCHM acknowledges the bounds obtained do not match the best known ones.

Parallel Boosting: We are happy to include a pointer to these results. Thanks!

Comment

Thanks for the thoughtful rebuttal. My impression is that the authors got particularly rigorous reviewers (and survived!) and, hence, there should be particularly valuable lessons to be learned from the feedback they got (you should be getting averages closer to 7 given how skilled you appear to be). Keep that in mind. Still, the authors should be proud of their work. Congratulations.

Comment

Thank you for your encouraging response. We are confident that incorporating your constructive feedback (which we sincerely value) will enhance both the readability and the appeal of the present work.

Review · Rating: 6

This paper continues the study of agnostic boosting for binary classification tasks. They provide a new agnostic boosting algorithm that, by re-using samples across multiple rounds of boosting, is more sample-efficient than existing algorithms. As corollaries, they derive improved sample complexities for agnostically learning half-spaces and boosting for RL. Finally, the authors run experiments and show that their agnostic boosting algorithm outperforms existing algorithms on commonly used datasets.

Strengths

  • The technical contributions are solid, the algorithmic ideas are interesting, and the proof techniques are non-trivial. In particular, I liked the idea of sample reuse across boosting steps.
  • Overall, I feel that this is a nice contribution to the literature on agnostic boosting.

Weaknesses

My main issue with this paper is its presentation.

  • Lemma 9 is referenced/linked a lot in the main text, however it is in the Appendix. I think it would be a good idea to put at least an informal version of Lemma 9 in the main text if it is that important.
  • In my opinion, Section 5 (sketch of analysis) is very dense. I feel that the sketch was a bit too detailed for it to be a sketch but also not detailed enough for it to be a full proof. I think it would be better to move Lemma 2 to the Appendix, expand more on the description of the algorithm (lines 205 - 216), make Section 5 more high-level while pointing out the exact techniques used to prove Theorem 4, and expand more on the implications of your results for halfspaces and RL.
  • Since the proposed agnostic boosting algorithm follows the potential-based framework of [KK09], I think it would be nice to give the readers a sense of how these algorithms work in general. For example, it would be nice to elaborate in the main text on what the "potential-based framework" for agnostic boosting mentioned in Line 102 actually is (providing an algorithmic template would be even better).
  • As is, Section 5 doesn't provide me much intuition about (1) where/how the sample reuse yields the savings and (2) why the authors chose the potential function they chose. Overall, I think it would be better to make Section 5 less technical, but provide the reader with more intuition about the algorithmic choices.

Questions

  • It would also be nice if you empirically compared your algorithm to the agnostic booster from [BCHM20].

Limitations

No Limitations

Author Response

We thank the reviewer for their time, and for their appreciation of the work’s technical contributions.

Technical Presentation: Thanks for the suggestion. We plan to implement the following changes to address your concerns:

  • Lemma 9 is indeed important both for understanding the algorithm and the analysis in that it outlines the invariant the sample reuse scheme maintains. In our current presentation, its informal statement comes too late: in equation (2) below line 239, and briefly on line 208. We’ll include an informal version of the same early in the paper, as you recommend.
  • We agree that Lines 205-216 are too dense. We will meticulously carry out readability improvements for this section, which motivates the algorithm's design. The current rendition effectively lays out our proof strategy; instead, we will reframe it to explain and motivate the algorithmic choices we make in a top-down manner, along with a description of the potential-based framework.
  • Apart from the reasons listed in Lines 182-192, the principal motivation for the potential lies in its second-order differential continuity and in the fact that Lemma 3 is true through it. We will work to include the proof of Lemma 3 in the main body. Regarding sample reuse, briefly: the algorithm in Kalai-Kanade performs $T = 1/\varepsilon^2$ rounds of boosting and uses $1/\varepsilon^2$ fresh samples for each (as is standard in learning theory), totalling $1/\varepsilon^4$. Also, we note that an indiscriminate reuse of all past samples results in a bound no better than that in Kalai-Kanade; this is effectively the recipe in BCHM. In contrast, while we serve $1/\varepsilon^2$ samples to the weak learner, due to a careful sample reuse scheme, only $1/\varepsilon$ of these need to be fresh (a small tally of these counts is sketched right after this list). We were severely space-constrained, and in expanding these descriptions, we hope an additional page available post acceptance will aid us.
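
To tally the counts from the last bullet above (suppressing $\gamma$ and $\log|\mathcal{B}|$ factors, as in the rest of this response):

\begin{align*}
\text{Kalai-Kanade:}\quad & \underbrace{1/\varepsilon^2}_{\text{rounds}} \times \underbrace{1/\varepsilon^2}_{\text{fresh samples per round}} \;=\; 1/\varepsilon^4,\\
\text{this work:}\quad & \underbrace{1/\varepsilon^2}_{\text{rounds}} \times \underbrace{1/\varepsilon}_{\text{fresh samples per round}} \;=\; 1/\varepsilon^3,
\end{align*}

while each round still serves on the order of $1/\varepsilon^2$ (largely reused and relabelled) samples to the weak learner.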

Comparisons to BCHM: We note that the primary contribution in BCHM was a unifying theoretical framework connecting the online-statistical and realizable-agnostic divides. For the individual cases (i.e., the realizable or the statistical agnostic case), BCHM acknowledges that the bounds obtained do not match the best known ones. Similarly, given the nature of that work, BCHM does not provide any experiments or a reference implementation, and it was not clear to us whether the algorithmic framework is a practically viable one. This was our rationale for comparing our approach exclusively to Kalai-Kanade, which is also the state of the art in theory (as mentioned in Table 1).

Following the reviewer’s suggestion, we have now added comparisons to BCHM too, and expanded our benchmarking to 6 datasets. The algorithm outperforms both KK and BCHM on 18 of 24 instances. The results can be found in the PDF attached to the common response to all reviewers.

KK08: Kanade, Varun, and Adam Kalai. "Potential-based agnostic boosting." Advances in Neural Information Processing Systems 22 (2009).

BCHM20: Brukhim, Nataly, Xinyi Chen, Elad Hazan, and Shay Moran. "Online agnostic boosting via regret minimization." Advances in Neural Information Processing Systems 33 (2020): 644-654.

Comment

Thanks for your response.

Comment

Thank you for your prompt acknowledgement. We hope that we have been able to allay a portion of your concerns with respect to our presentation and empirical evaluations.

Review · Rating: 7

The paper proposes a new agnostic boosting algorithm that improves the sample complexity of a previous algorithm by a factor of $\gamma^{-1}\epsilon^{-1}$, at the expense of an increase in the computational (oracle) complexity by a factor of the log-size or VC dimension of the base learning class. Here, $\epsilon$ is the desired additive closeness to a risk minimiser and $\gamma$ the multiplicative approximation ratio to a risk minimiser provided by the weak learner. Alternatively, the oracle complexity can be reduced to that of the previous algorithm at the expense of an increased sample complexity that is not directly comparable to that of previous algorithms.

Strengths

The theoretical contribution of a rate improvement in sample complexity for agnostic boosting is significant. This is achieved by a sophisticated procedure and corresponding analysis that draws from a range of theoretical insights, especially the extension to infinite base learning classes. The inherent complexity of this contribution is managed by a very rigorous and mature presentation.

Weaknesses

The proposed procedure has a very large number of parameters. I believe the proofs provide concrete valid settings for those parameters as a function of the desired PAC guarantees. However, it seems that the practical accessibility / usability of the procedure would benefit from a more explicit treatment of those choices and/or the elimination of some degrees of freedom, at least in the main text version of the algorithm.

Generally, despite the rigorous and detailed presentation, the authors should be aware that their paper is still a very challenging read, and I have some worry that the boosting literature is becoming increasingly impenetrable for the wider machine learning community. Any improvement of the general state of affairs that the authors can provide here would be strongly appreciated.

Minor points

  • The references in the first sentence should be revised. It does not make sense that the 1990 boosting article resolves a question first posed in work from 1994.
  • In line 97, it seems to me you mean "inevitably" instead of "invariable".
  • Typo in line 151: "boostng"

Questions

  • Why do the main results in the table focus on the finite base learning class when the extension to the infinite case via VC dimensions is available?
  • What exactly is the ERM baseline? The identification of an empirical risk minimiser in $\mathcal{B}$? If so, is it correct to give its complexity as $\infty$ in the table (at least for the case of finite $|\mathcal{B}|$)?
  • To what degree is the algorithm distribution-specific? The definition of the weak learner asks the sample complexity to be valid for all distributions.

Limitations

The limitations of the proposed procedure that go beyond those that are usually present in this branch of literature are the trade-off between sample and oracle complexity as well as a larger number of parameters. The first is discussed very explicitly while the second seems to be mostly a presentation issue as mentioned above.

Author Response

We thank the reviewer for their thorough review, and for their positive comments regarding the significance of our work and the presentation.

Large number of parameters: Our current presentation is set up to prioritize clean statements of learning guarantees; having independently denoted constants reduces the chances of conflating the impact of each in the proof. The reviewer is absolutely correct in identifying that this exposes too many knobs in the algorithm to a practitioner. To rectify this, we will include a guide to practical adaptations of the algorithm. As the reviewer notes, for many of these, the theory does provide strong clues (e.g., the link between $\eta$ and $\sigma$). Briefly, in practice, given a fixed dataset with $m$ data points, there are two parameters we think a practitioner should concern herself with: the mixing parameter $\sigma$ and the number of rounds of boosting $T$, while the rest can be determined (or at least guessed well) given these. The first choice, $\sigma$, dictates the relative weight ascribed to reused samples across rounds of boosting, while $T$, apart from determining the running time and generalization error, implicitly limits the algorithm to using $m/T$ fresh samples each round.
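
As a purely hypothetical illustration of this guidance (variable names and numbers are ours, not the paper's reference implementation), the two practitioner-facing knobs and the quantity they determine can be written as:

# Hypothetical sketch of the practitioner-facing choices described above.
def practitioner_settings(m, T, sigma):
    """m: dataset size; T: rounds of boosting; sigma: relative weight on reused samples."""
    S = m // T  # fresh samples per round, implied by the data budget m and the choice of T
    return {"rounds": T, "fresh_per_round": S, "reuse_weight": sigma}

# Example: a 6,000-example dataset, 50 rounds, modest weight on reused samples.
print(practitioner_settings(m=6000, T=50, sigma=0.1))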

We address the questions raised by the reviewer below.

Ostensible error in chronological order: This apparent mismatch was indeed well spotted. However, the 1990 paper resolving a question of the 1994 article is correct, in that it is an artifact of the journal publication process.

Finite classes in Table 1: Our decision to present the results for finite classes in Table 1 was primarily motivated by the desire to cause the least disruption of flow given that we state the fundamental theorem of statistical learning in its simplest form for finite classes too in the introduction (and Table 1 follows immediately after), and closely follow the precedent set by earlier work (e.g., the boosting for reinforcement learning we build and improve upon). We are happy to include the variants for infinite classes instead.

ERM: We take ERM to be a procedure that picks the best hypothesis in the hypothesis class $\mathcal{H}$, the same class against which our boosting algorithm's agnostic learning guarantee holds. As noted in the caption of Table 1, the base class $\mathcal{B}$, the class the weak learner outputs from, can in general be quite a bit smaller (e.g., decision stumps when the class $\mathcal{H}$ consists of decision trees); we account for this factor in the stated sample complexity. Given a weak learner for the class $\mathcal{B}$, it is not clear how an ERM oracle for $\mathcal{H}$ might be constructed, and hence the oracle complexity is marked as ``inefficient/infinity''. Alternatively, here we could list a running time for $\mathcal{H}$, but that is distinct from oracle complexity. Since boosting is motivated by settings where direct ERM is computationally intractable, we think the current attribution is fair. Thanks for raising this; it deserves a more explicit note.

Distribution specificity: As noted on Line 168, our boosting algorithm falls within the distribution-specific boosting framework put forward by KK08, since it does not modify the marginal distribution on feature vectors. We avoid this deliberation in the definition itself for simplicity of presentation, i.e., to avoid stating a "weak learner for feature distribution $\mathcal{D}_{\mathcal{X}}$" every time and "for any distribution $\mathcal{D}$ with marginal $\mathcal{D}_{\mathcal{X}}$" in the learning guarantees. In this, we also follow Kalai-Kanade's lead.

KK08: Kanade, Varun, and Adam Kalai. "Potential-based agnostic boosting." Advances in Neural Information Processing Systems 22 (2009).

Author Response (to all reviewers)

We thank all the reviewers for their comments and editorial suggestions. We hope that we have adequately addressed each reviewer’s questions in the space for one-to-one responses. In this unified response, we hope to highlight two changes of common interest.

Experiments: Following Reviewers 3LK5, m6d1, and xmA9's suggestions, we have now expanded our benchmarking to 6 datasets, and added comparisons to the algorithm from BCHM too. The algorithm outperforms both KK and BCHM on 18 of 24 instances. The results can be found in the attached PDF.

Here, we would like to briefly highlight that neither the benchmarking suite nor a reference implementation is publicly available for KK08, and that BCHM20 does not report experiments at all. The typical focus of agnostic boosting results has been on the theory. While our result also falls in this camp, we hope that the reference implementations we make available (including those of previous algorithms) will be of use to the community.

Readability Improvements: Upon revision:

  • We plan to add a guide to practical adaptations of the algorithm. For many hyperparameters, the theory does provide strong clues (e.g., the link between $\eta$ and $\sigma$). Briefly, in practice, given a fixed dataset with $m$ data points, there are two parameters we think a practitioner should concern herself with: the mixing parameter $\sigma$ and the number of rounds of boosting $T$, while the rest can be determined (or at least guessed well) given these. The first choice, $\sigma$, dictates the relative weight ascribed to reused samples across rounds of boosting, while $T$, apart from determining the running time and generalization error, implicitly limits the algorithm to using $m/T$ fresh samples each round.
  • We agree that Lines 205-216 are too dense. We will meticulously carry out readability improvements for this section, which motivates the algorithm's design. The current rendition effectively lays out our proof strategy; instead, we will reframe it to explain and motivate the algorithmic choices we make in a top-down manner, along with a description of the potential-based framework.
  • Currently, we highlight the new components of our approach (including the data reuse scheme using relabelling, the avoidance of uniform convergence arguments on the boosted class, and the new branching scheme, all crucial) in the contributions section (notably, Lines 84-141), but we understand that the description may be too dense and perhaps come too early. We were severely space-constrained, and in expanding these descriptions, we hope an additional page available post acceptance will aid us.

KK08: Kanade, Varun, and Adam Kalai. "Potential-based agnostic boosting." Advances in Neural Information Processing Systems 22 (2009).

BCHM20: Brukhim, Nataly, Xinyi Chen, Elad Hazan, and Shay Moran. "Online agnostic boosting via regret minimization." Advances in Neural Information Processing Systems 33 (2020): 644-654.

Final Decision

A nice paper with a worthy contribution, with all reviewers unanimously judging its contribution as good or excellent and recommending acceptance. I suggest adding the new experiments using the +1 page.

Reading the paper, it looked very much to me that the idea could be used not just for the convex surrogate used, but perhaps for the convex surrogate of any proper loss (in the theory of losses for class probability estimation; see the work of Reid and Williamson in several JMLR papers). It would be interesting to see how the sample demands evolve as a function of the properties of the partial losses (intuitively, there should be such a dependence, but I could not see it). It is interesting because this trick of reusing examples could be very useful beyond the context of the paper.