On Agnostic PAC Learning in the Small Error Regime
Improved sample complexity bounds for agnostic binary classification in the tau-based model!
Abstract
Reviews and Discussion
The paper continues a line of work studying the sample complexity of PAC learning, when it is parameterized by the best achievable (test) loss in the class, thus "interpolating" between realizable and agnostic PAC learning. The paper provides an algorithm which improves prior work in the small-yet-positive error regime, thus answering one of the questions posed by Hanneke, Larsen, and Zhivotovskiy (FOCS '24).
Strengths and Weaknesses
The paper continues an interesting recent line of work in learning theory. The paper makes a clear contribution, though arguably a slightly incremental one. Nonetheless, it seems that it will be of interest to experts in the relevant related works (which I am not, quite), and it offers some new ideas in its analysis which are likely applicable to future works on this topic. The authors also do a good job, in my opinion, of explaining where their techniques fall short, thus helping future work possibly overcome them and fully resolve the open problems posed by Hanneke et al.
The main drawback of this paper, in my opinion, is its very dense writing style, which at times makes it a bit hard to read. I recommend that the authors space out the mathematical definitions and nonstandard notation introduced throughout the analysis.
Overall I find this a solid theoretical contribution, based on some new analysis, leading to a slightly incremental result. Therefore I lean towards recommending acceptance.
Questions
- The authors mention that a feature of their algorithm is that it does not need to know τ in advance. Is this true for prior work on this model as well? This should be clarified.
Some minor comments:
- Throughout the paper, the grammar is occasionally slightly odd (e.g., the first sentence ending abruptly). I suggest the authors revise the writing, and, as mentioned earlier, even more so the math.
- Line 64: what is ? Should probably be fixed to .
- The abstract formatting here on OpenReview has a bug (it does not exit math mode appropriately); this should be fixed.
Limitations
Yes.
Final Justification
After taking into consideration the authors' response as well as other reviewers' opinions, I maintain my score.
Formatting Issues
We thank the reviewer for carefully reading the paper and for their detailed feedback. We address their points in turn.
The authors mention that a feature of their algorithm is that it does not need to know τ in advance. Is this true for prior work on this model as well?
We thank the reviewer for raising this important point. Previous algorithms in the literature are indeed also oblivious to the knowledge of τ. Our intent with our original description was to clarify that our improved algorithm preserves this desirable feature, but it is clear that we should have explicitly stated that many previous algorithms do so as well. We will update the next version of the paper to clarify this point.
The main drawback of this paper in my opinion is its very dense writing style, which at times makes it a bit hard to read. I recommend the authors spacing out mathematical definitions and nonstandard notation introduced throughout the analysis. ... Throughout the paper, the grammar is occasionally slightly odd (e.g, the 1st sentence ending abruptly). I suggest the authors revising the writing, and as mentioned earlier, even more so in terms of the math.
We admit that there is certainly room for improvement in the readability of the paper, as also noted by reviewers 1ZdR and e2wi. We found the paper somewhat difficult to write, especially Section 3, owing to the technicality of the analysis as well as the page limit of the submission. Should the paper be accepted to the conference, we will be sure to use the additional page of the camera-ready version to improve the spacing and presentation of notation throughout the paper, per the reviewer's suggestion. We also intend to include additional intuition and motivation in Section 3, as well as the algorithm's full pseudocode in the main body of the paper (as opposed to merely in the appendix, where it has currently been deferred). We believe these modifications will make the section considerably less dense and more legible for the reader. We thank the reviewer for raising this important point.
Line 64: what is ? Should probably be fixed to .
Correct, we will be sure to update this in the next version of the paper.
The abstract formatting here on openreview has a bug (does not exit math mode appropriately), should be fixed.
We thank the reviewer for bringing this to our attention. We will fix this mistake once we are permitted to update the abstract on OpenReview (unfortunately, we currently cannot).
I thank the authors for their response. Having read it as well as the other reviews, I remain positive toward recommending the acceptance of the paper.
This paper investigates the sample complexity of binary classification in the agnostic PAC model, focusing on a fine-grained analysis where the excess error is parameterized by τ, the error of the best-in-class hypothesis. The work directly addresses and resolves a key open question from a recent FOCS 2024 paper by Hanneke, Larsen, and Zhivotovskiy, which asked for the optimal error rates in the small error regime where τ is on the order of d/m. The authors propose a novel and computationally efficient learning algorithm based on a sophisticated aggregation of Empirical Risk Minimization (ERM) classifiers. The main result is an algorithm that achieves an error rate of 2.1τ + O(√(τd/m) + d/m), which matches existing lower bounds and is therefore optimal in the τ = O(d/m) regime. The algorithm introduces an innovative tie-breaking mechanism that uses an ERM classifier trained on the "region of disagreement" between two independent voting-based classifiers. This work significantly advances our understanding of the fundamental limits of agnostic learning.
Strengths and Weaknesses
Strengths: High Impact and Significance: The paper makes a substantial contribution by resolving a timely and important open problem posed in a paper from a top-tier conference (FOCS 2024). Characterizing the optimal sample complexity in this fine-grained agnostic model is fundamental to learning theory, and this work provides a near-complete answer.
Technical Novelty: The proposed learning algorithm is highly innovative. While it builds on prior techniques like sample splitting and voting, the introduction of a tie-breaker ERM trained on the region of disagreement is a clever and powerful idea. The analysis required to bound the error of this complex, multi-stage classifier in the agnostic setting is non-trivial and technically deep.
Addresses Multiple Open Problems: Beyond resolving the primary question about sample complexity, the paper makes significant progress on two other open problems from Hanneke et al. The proposed learner is based on majority voting and is computationally efficient (in terms of ERM oracle calls), demonstrating that such learners can be optimal in this setting.
Clarity and Presentation: The paper is exceptionally well-written. The introduction provides excellent motivation and clearly situates the work within the existing literature. The proof sketch in Section 3 is detailed and gives the reader strong intuition for the complex analysis that follows in the appendices.
Weaknesses:
The Multiplicative Constant: The primary limitation, which the authors are commendably upfront about, is that the leading constant on τ is 2.1, not 1. This prevents the result from completely settling the complexity of agnostic learning for the entire range of τ. While this is more of a direction for future work than a flaw, it is the main reason the problem is not fully resolved.
Algorithm Complexity: The final "best-of-both-worlds" algorithm is quite intricate, involving a three-way sample split, multiple recursive splitting schemes via Algorithm 2, two independent voting classifiers, a tie-breaker ERM, and a final model selection step on a hold-out set. A high-level block diagram illustrating this data flow could significantly improve the reader's ability to grasp the overall architecture of the learner.
Intuition Behind Constants: The proof sketch explains that the final 2.1τ term comes from adding the bounds of two different error events, one contributing roughly 1.1τ and the other τ. A bit more intuition in the main text about why the tie-breaking scheme leads to this specific decomposition would be beneficial.
Questions
Questions for Authors:
- The tie-breaker ERM is trained on a "hard" subset of data where the primary classifiers disagree or lack confidence. The analysis shows that this ERM still achieves an error close to the optimal h* on this conditional distribution. Could you provide more intuition for why a standard ERM, known to be suboptimal, performs so well on what seems to be a more difficult data distribution?
- The switch from 3 recursive calls in the initial approach (Section 3.1) to 27 calls in the improved approach (Algorithm 2) is a critical step in the analysis. Is there a deeper principle behind the number 27, or is it simply "a number large enough for the probability math to work out"? How sensitive is the final constant of 2.1 to this choice?
- The final learner uses a hold-out set to select between your proposed algorithm and the one from Hanneke et al. This is a standard and effective method. Did you explore possibilities for a more deeply integrated "single" algorithm that smoothly handles all regimes of τ without explicit model selection?
Limitations
No limitations
Formatting Issues
OK
We appreciate the reviewer’s time and thoughtful assessment of our paper. Below, we respond to the reviewer’s questions in order:
Algorithm Complexity: The final "best-of-both-worlds" algorithm is quite intricate, involving a three-way sample split, multiple recursive splitting schemes via Algorithm 2, two independent voting classifiers, a tie-breaker ERM, and a final model selection step on a hold-out set. A high-level block diagram illustrating this data flow could significantly improve the reader's ability to grasp the overall architecture of the learner. ... Intuition Behind Constants: The proof sketch explains that the final 2.1τ term comes from adding the bounds of two different error events, one contributing roughly 1.1τ and the other τ. A bit more intuition in the main text about why the tie-breaking scheme leads to this specific decomposition would be beneficial.
We thank the reviewer for suggesting the idea of a block diagram to provide a high-level overview of the steps in our final algorithm. Should the paper be accepted to the conference, we will try to find space for it in the final version of the paper (currently it is in Appendix B, removed from the main body due to space constraints). We will also try to include additional intuition explaining where the factors 1.1τ and τ come from.
We thank the reviewer for the great questions. We take them in order.
The tie-breaker ERM is trained on a "hard" subset of data where the primary classifiers disagree or lack confidence. The analysis shows that this ERM still achieves an error close to the optimal h* on this conditional distribution. Could you provide more intuition for why a standard ERM, known to be suboptimal, performs so well on what seems to be a more difficult data distribution?
Answer:
First a high-level explanation and then a more formal explanation. The ERM performance on the "hard" subset is actually as one would expect for ERM. However, the probability of it being asked to make a tie-break is small, thus the overall probability of the tie-breaker both being called and failing ends up being smaller than the overall probability of an ERM failing when trained just on the "plain" data.
More formally (omitting the $\log(1/\delta)$ terms in the following sketch): The probability of either of the two voting classifiers failing to classify with a good margin (more than a constant fraction of their voters being incorrect) is some $p = O(\tau + d/m)$, and thus the ERM deciding the tie-break will be trained on roughly $p \cdot m$ samples.
Now let $E$ be the event that either of the two voting classifiers fails to classify with a good margin, i.e., the case where the tie-breaker is used, so $E$ has probability $p$. Since the tie-breaker ERM $\hat{h}$ is trained on roughly $p \cdot m$ examples from the conditional distribution $\mathcal{D}_{|E}$, it has error $\mathrm{err}_{\mathcal{D}_{|E}}(\hat{h}) \le \mathrm{err}_{\mathcal{D}_{|E}}(h^\star) + O(\sqrt{d/(pm)})$, where we use that $h^\star$ is not necessarily a minimizer of the conditional distribution, so we can use it as an upper bound (we are here using the agnostic learning guarantee of ERM; see for instance [1], Theorem 6.7).
Now, by the law of conditional probability, we have that the probability that $E$ happens (i.e., either voting classifier has small margin) and the tie-breaker ERM fails is
$$\Pr[E \text{ and } \hat{h} \text{ errs}] = p \cdot \mathrm{err}_{\mathcal{D}_{|E}}(\hat{h}) \le p \cdot \mathrm{err}_{\mathcal{D}_{|E}}(h^\star) + O\left(\sqrt{pd/m}\right) \le \tau + O\left(\sqrt{pd/m}\right) \le \tau + O\left(\sqrt{\tau d/m} + d/m\right),$$
where the second-to-last inequality follows from $p \cdot \mathrm{err}_{\mathcal{D}_{|E}}(h^\star) = \Pr[E \text{ and } h^\star \text{ errs}] \le \mathrm{err}_{\mathcal{D}}(h^\star) = \tau$, and the last inequality from $p = O(\tau + d/m)$.
[1] Shalev-Shwartz, S., Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms.
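As a quick numerical sanity check of the final step of this calculation, the following standalone Python snippet (with illustrative values of d, m, and τ chosen here, not taken from the paper) verifies that $\sqrt{pd/m} \le \sqrt{\tau d/m} + d/m$ when $p = \tau + d/m$, which follows from subadditivity of the square root:

```python
import math

# Check that sqrt(p * d/m) <= sqrt(tau * d/m) + d/m when p = tau + d/m,
# using subadditivity of the square root: sqrt(a + b) <= sqrt(a) + sqrt(b).
for d, m in [(10, 10_000), (50, 10_000), (10, 1_000_000)]:
    for tau in [0.0, 0.5 * d / m, d / m, 10 * d / m]:
        p = tau + d / m
        lhs = math.sqrt(p * d / m)
        rhs = math.sqrt(tau * d / m) + d / m
        assert lhs <= rhs + 1e-12, (d, m, tau, lhs, rhs)
print("bound holds on all illustrative (d, m, tau) combinations")
```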
The switch from 3 recursive calls in the initial approach (Section 3.1) to 27 calls in the improved approach (Algorithm 2) is a critical step in the analysis. Is there a deeper principle behind the number 27, or is it simply "a number large enough for the probability math to work out"? How sensitive is the final constant of 2.1 to this choice?
Answer: The choice of more than 3 recursive calls is to improve the contribution of the dominant error event in the simple analysis down to 1.1τ, as it enables splitting the error into two events: (1) the event where both voting classifiers fail with a large margin, and (2) the event where one of the classifiers fails with a small margin, and the tie-breaker is called and fails.
From this split, we have to be able to say that the probability of one of the classifiers failing with a small margin (only slightly more than half of the voters being incorrect) is small (that is, the probability in the answer to Question 1 should be of the order stated there). Without going into too much detail, this requires a more fine-grained analysis of the margin error of the recursive algorithm - basically, we want the number of recursive calls to be such that, given a margin error of a certain size, one of the recursive calls fails with this margin and some of the voters in the remaining classifiers also fail - which a larger number of recursive calls guarantees.
The number 27, as foreseen by the reviewer, was chosen somewhat arbitrarily (just large enough). The general trade-off for a larger number of recursive calls is that the constant in front of the error term of order τ in the current version of the paper becomes smaller, but at the cost of a larger constant in the O(√(τd/m) + d/m) term in the final bound. This implies that we could not just choose the number of splits arbitrarily large to drive the constant on τ all the way down, so we chose 27, as it allowed for a constant fairly close to 2.
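For intuition, here is a schematic Python sketch of the shape of such a recursive splitting scheme, in the spirit of the three-recursive-call approach of Section 3.1 as described in the reviews; the block sizes, the auxiliary sample passed down the recursion, and the 27-call scheme of Algorithm 2 all differ in the paper, so this is purely illustrative:

```python
def recursive_voters(S, erm, cutoff=16):
    """Schematic recursion producing a list of ERM voters.

    `erm` maps a labeled sample to a hypothesis; `cutoff` and the equal
    three-way split are illustrative choices, not the paper's.
    """
    if len(S) <= cutoff:
        return [erm(S)]
    third = len(S) // 3
    B1, B2, B3 = S[:third], S[third:2 * third], S[2 * third:]
    voters = []
    # Three recursive calls, each on a different two-thirds of the sample,
    # so that each voter is trained without seeing one block of the data.
    for kept in (B1 + B2, B1 + B3, B2 + B3):
        voters.extend(recursive_voters(kept, erm, cutoff))
    return voters


def majority(voters, x):
    """Majority vote over the returned hypotheses (labels in {-1, +1})."""
    return 1 if sum(h(x) for h in voters) >= 0 else -1
```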
The final learner uses a hold-out set to select between your proposed algorithm and the one from Hanneke et al. This is a standard and effective method. Did you explore possibilities for a more deeply integrated "single" algorithm that smoothly handles all regimes of τ without explicit model selection?
Answer: We thought a bit about this; however, we did not figure out a better approach than the hold-out set method presented in the paper.
Dear Reviewer Y9zH,
What do you think about the answer by the authors? Have your questions been answered satisfactorily? Do you have any follow-up questions? Please do not leave the questions for the last minute.
I am nudging you because you are the only reviewer still rating this paper as borderline. It would be best if all the recommendations were aligned, given that you all already view the paper favorably. Everyone else is recommending 5 (Accept).
Thank you for your service
Dear Reviewer Y9zH,
This is the very last day to explain to the authors how you view the rebuttal that they have provided to your comments.
Please see also my confidential comment to you.
Sincerely,
The AC handling this submission
For binary classification in the setting of agnostic PAC learning, it is shown in Hanneke, Larsen and Zhivotovskiy [HLZ] that the ERM algorithm is sub-optimal, in the sense that there is an extra logarithmic factor compared to the lower bound on excess risk, where τ is the error incurred by the optimal hypothesis. [HLZ] proposed a learner that removes this factor in the regime where τ > d/m. This paper achieves the optimal rate when τ = O(d/m), thus partially (due to a multiplicative constant in front of τ) solving the previously unanswered problem.
Strengths and Weaknesses
Strength: the paper clearly presents the problem with necessary preliminaries and the arguments for their claims are logically sequenced. It helps answer the question on complexity of agnostic binary classification problem. While the algorithm is rooted in the previous works, it is well-adapted along with refined analyses, allowing them to obtain the desired rate.
Weakness: there might be a better way to present the section containing the proof sketches.
Questions
Although the explanations in Section 3 are clear, it might be better to present the main steps that lead to the ultimate result and provide some motivation for the quantities in each main step. Since the paper relies on a large amount of notation, it is not easy to keep track of the idea while reading through it. Also, in the first approach, have you considered switching the concrete numbers to letters, so that readers have a better sense of which choices of these constants lead to the undesired multiplicative factor?
Limitations
yes
Final Justification
I appreciate the authors' commitment to improving the readability of the paper in the future version. The concerns are resolved in their response. My evaluation remains positive, and I keep the rating unchanged.
Formatting Issues
no formatting issues found.
We thank the reviewer for carefully reading the paper and for their thoughtful feedback.
Regarding the readability of Section 3, we admit that there is certainly room for improvement. We found this section particularly challenging to write, due to the technicality involved in the analysis as well as the page limit of the submission. Should the paper be accepted to the conference, we will be sure to use the additional page of the camera-ready version to provide additional intuition and motivation in Section 3, per the reviewer's suggestion, as well as to include the algorithm's full pseudocode in the main body of the paper (as opposed to merely in the appendix, where it has currently been deferred). We believe these modifications will make the section considerably less dense and more legible for the reader. We thank the reviewer for raising this important point.
We will also examine replacing some concrete numbers in the analysis with additional variables, while being careful not to unintentionally exacerbate the previous issue by introducing additional notation for the reader to keep in mind.
Thank you for your response. I do not have any further questions at this time.
This paper studies PAC learning in the agnostic case, where the error is parametrized by \tau = err(h*_D), the optimal error of a hypothesis in the class H for each particular distribution D. Recent prior work by Hanneke, Larsen, and Zhivotovskiy in FOCS 2024 showed that, when taking \tau into account, ERM is not an optimal agnostic PAC learner. That paper gave a tight algorithm in the regime where \tau > d/m, where d = VC(H) and m is the sample complexity. This paper gives an algorithm in the regime where \tau = O(d/m). This paper's algorithm's correctness guarantee is error
c\tau + O(\sqrt{\tau d/m} + (d+\log(1/\delta))/m)
for c <= 2.1, whereas the lower bound (e.g. from L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Chapter 14) is the same, but with c = 1 (and Omega(stuff) instead of O(stuff)). The authors leave as an open question to close the gap in the "c" parameter.
This paper's approach is inspired by [Hanneke 2016]. The algorithm splits the sample into subsamples and runs ERM on some randomly chosen fraction of subsamples. More specifically, the algorithm splits the samples into three portions, S1, S2, S3. On S1, it randomly chooses subsamples on which to run ERM, and then aggregates the results into a statistic estimating the proportion of ERM hypotheses that agree on any particular (x,y). On S2, the algorithm does the same. On examples (x_3, y_3) in S3, the algorithm sees if the statistics from the previous steps (using S1 and S2) disagree --- if so, that (x_3, y_3) is not discarded. The final training portion is to run an ERM on the (non-discarded) examples in S3. This final ERM is used to break ties, if the classifiers from using samples S1 and S2 do not have high agreement (on an input x). The authors consider this approach (running the base classifier twice, and using a 3rd ERM classifier for disagreements) as a "main insight" (line 107).
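To make the described data flow concrete, here is a minimal, self-contained Python sketch of the split/vote/tie-break structure; the random-subsample ensembles, the number of voters, and the confidence threshold are placeholder assumptions of this sketch (the paper uses a recursive splitting scheme and its own parameters):

```python
import random

def vote_share(voters, x):
    """Fraction of voters predicting +1 on x (labels in {-1, +1})."""
    return sum(1 for h in voters if h(x) == 1) / len(voters)

def train(S, erm, n_voters=9, threshold=2/3):
    """Sketch of the split/vote/tie-break learner described above.

    `erm` is an ERM oracle mapping a labeled sample to a hypothesis.
    `n_voters` and `threshold` are illustrative placeholders.
    """
    third = len(S) // 3
    S1, S2, S3 = S[:third], S[third:2 * third], S[2 * third:]

    # Two independent ensembles of ERMs trained on random subsamples of
    # S1 and S2 (standing in for the paper's recursive splitting scheme).
    ens1 = [erm(random.sample(S1, len(S1) // 2)) for _ in range(n_voters)]
    ens2 = [erm(random.sample(S2, len(S2) // 2)) for _ in range(n_voters)]

    def confident_label(ens, x):
        """Majority label if the vote clears the threshold, else None."""
        s = vote_share(ens, x)
        if s >= threshold:
            return 1
        if 1 - s >= threshold:
            return -1
        return None

    # Keep only the "hard" examples of S3, where the two ensembles
    # disagree or lack confidence, and train the tie-breaker ERM there.
    hard = [(x, y) for (x, y) in S3
            if confident_label(ens1, x) is None
            or confident_label(ens1, x) != confident_label(ens2, x)]
    tie_breaker = erm(hard if hard else S3)

    def predict(x):
        l1, l2 = confident_label(ens1, x), confident_label(ens2, x)
        if l1 is not None and l1 == l2:
            return l1               # both ensembles agree confidently
        return tie_breaker(x)       # otherwise defer to the tie-breaker

    return predict
```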
In Appendix E, the paper combines this new algorithm with the learner in [Hanneke et al. 2024] by running the two algorithms on two separate samples, and using a third to pick the returned hypothesis with lower error. With high probability, this achieves the better of the two algorithms' (incomparable) error bounds.
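Correspondingly, a minimal sketch of this combination step; `algo_new` and `algo_hlz` are stand-ins for the paper's algorithm and the [Hanneke et al. 2024] learner, and the equal three-way split is an illustrative choice:

```python
def best_of_both(S, algo_new, algo_hlz):
    """Train each learner on its own third of the sample, then return the
    hypothesis with lower empirical error on the held-out third."""
    third = len(S) // 3
    h1 = algo_new(S[:third])
    h2 = algo_hlz(S[third:2 * third])
    holdout = S[2 * third:]

    def emp_err(h):
        return sum(1 for x, y in holdout if h(x) != y) / len(holdout)

    return min((h1, h2), key=emp_err)
```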
Strengths and Weaknesses
Quality: The paper's main focus, an agnostic learning algorithm with optimal error asymptotics in the regime where \tau = d/m, is well supported. The theorems and lemmas in this paper appear to be completely proven and technically sound.
Clarity: At a high level, the authors explain the proof approach well. First, the authors consider a simpler algorithmic approach achieving error 15 \tau + stuff. Next, the authors discuss how to modify the first algorithm to achieve error 2.1 \tau + stuff. These sections are broken down into subsections. However, at the subsection level, the explanations are dense and can be difficult to parse. The arguments are very notation heavy and it is difficult to see the overall flow of the argument in each subsection. The submission packs a lot of technical content into the main body of the paper, whereas I would prefer moving more technical content to the appendices in favor of additional connecting phrases within proofs, using more space to give intuitions for notable proof steps. As a result, it was difficult for me to verify the correctness of the proofs (line-by-line).
Significance and originality: The submission addresses an open problem from a recent FOCS 2024 paper, essentially extending those results (when \tau > d/m) into a different parameter regime (when \tau = d/m). The paper claims that their idea of running two classifiers, and using an ERM on their disagreement regions to break ties, is a new idea that may find further applications. The particular subsampling used builds on prior work, but may also be interesting for the community. I am optimistic about these techniques being interesting for the community, although I am not familiar enough with the area to affirmatively say that the results (i.e., minus the techniques) are "4:excellent".
Questions
---The technique referred to on line 111 (running one algorithm A twice on two samples S1, S2, and then using an ERM on sample S3 to break ties) seems like it could be reinterpreted as follows: We run an (overall accurate) algorithm A that isn't stable on all inputs. On stable inputs, accuracy of A implies accuracy of the aggregate algorithm. On non-stable inputs, we train a special classifier (ERM on sample S3) that has good accuracy on these inputs that are "hard" for a "stable" algorithm. Specifically, I suppose "stable" here means "many ERMs on subsamples will tend to output hypotheses that classify this point correctly".
Did ideas from the stability literature influence your approach? Are there other high-level connections between this problem/approach and stable algorithms? (For example, work on subsampling?) Are there plausible technical connections between this problem (which isn't overtly about algorithmic stability) and the stable algorithms/learning theory literature?
---I believe the biggest weakness of this paper is the writing (clarity), which may detract from the community's ability to build on the technical contributions of this paper. This seems to stem from
- the submission page limit
- trying to put fully rigorous arguments in the main body proof sketch (section 3), and
- not having enough intuition-explaining paragraphs in the proof sketch.
For example, something similar to Appendix B's "summary of approach" should appear in the main body of the paper (for at least the 15\tau algorithm). It would make many analysis sections, including the notation section (3.1, line 171) much easier to parse. One can bootstrap the algorithm from the notation, splitting algorithm, and analyses in the main body, but it would probably be easier for everyone if the whole pseudocode (of the 15\tau algorithm) were present, say at line 170, with an accompanying intuition explanation.
---Minor comments:
Line 81: typo
170: this intuition uses notation that won't be defined until the next page. Maybe it would help to include explanations of the sets represented by the notation, since this is its first use. For example, how does a reader say S_i, |_| out loud?
172: similarly, what is T? Why is it necessary for the algorithm to have T? It would help to explain these when they first appear, rather than differing the explanation until the recursion is mentioned later in the analysis.
186: I don't believe L_D (without alpha superscript) has been defined yet
189: I think Algorithm 1 could be better explained if a picture of a sample, divided up into its pieces, with the notation in the figure, accompanied the pseudocode. Similarly for Algorithm 2's splitting later. I understand that for space/time reasons, this was not important to include, but I think it would add to the final "full" paper (perhaps a version without page limits).
208: Was h^* defined somewhere?
210: What does S_i, intersection refer to?
232: maybe more precise to say "at least"
533: direction -> direct
691: typo in "Recall that..." (A to be)
692: spacing typo
Limitations
Yes
Final Justification
Having read the other reviews and the authors' rebuttals, I have no further questions, and I continue to recommend my original evaluation.
The authors responded positively to ideas suggesting improving the clarity of the paper. This is great. In addition, I assume the authors will make a reasonable effort to ensuring their talks/posters/etc. are also clear, as they did with this submission.
Formatting Issues
Several equation lines are longer than the width of the page. I assume this was to consolidate space for submission.
We thank the reviewer for carefully reading the paper and for their detailed feedback. We now address their questions in turn.
Did ideas from the stability literature influence your approach? Are there other high-level connections between this problem/approach and stable algorithms? (For example, work on subsampling?) Are there plausible technical connections between this problem (which isn't overtly about algorithmic stability) and the stable algorithms/learning theory literature?
Great questions. The reinterpretation/restatement of our algorithm is correct, and the "stability" perspective is an interesting way of interpreting the engine behind the algorithm's success. In particular, the majority votes (i.e., the ensembles trained on S1 and S2) intuitively serve the dual purposes of "covering each other's mistakes" and of providing a "certainty score" for their predictions (i.e., given by the number of constituent voters which agree with the majority). This certainty score allows the algorithm to identify difficult or uncertain points, for which the third classifier can break the tie.
Unfortunately, we cannot currently demonstrate that our approach is necessarily stable in the sense of [1], owing to the fact that the classifiers in the majority vote (trained on S1 and S2) have overlapping samples, meaning that all such classifiers could conceivably change by altering even one example in S1 or S2. Furthermore, the tiebreak classifier trained on S3 could then also change, despite the fact that no example in S3 was altered.
That said, we do believe that it is intuitively fair to think of our algorithm as a technique for stabilizing the alternative, which is ERM. We admit that we were not explicitly motivated by ideas from the stability literature, but we appreciate the reviewer bringing our attention to these intriguing connections. We will update the next version of the paper with a remark describing these previous points.
I believe the biggest weakness of this paper is the writing (clarity), which may detract from the community's ability to build on the technical contributions of this paper. This seems to stem from the submission page limit, trying to put fully rigorous arguments in the main body proof sketch (section 3), and not having enough intuition-explaining paragraphs in the proof sketch.
We thank the reviewer for the concrete suggestions for improving the readability of Section 3, and we admit that there is certainly room for improvement. We found this section particularly challenging to write, due to the technicality involved in the analysis as well as the page limit. Should the paper be accepted to the conference, we will most definitely use the additional page of the camera ready version to include the full algorithm pseudocode in the main body of the paper (as the reviewer suggests) and to generally improve the motivation, intuition, and pacing of Section 3.
We also thank the reviewer for the various minor comments/edits which they aptly noted – we will be sure to correct them in the next version of the paper.
[1] Bousquet, O., Elisseeff, A. (2002). Stability and Generalization.
Thanks to the authors for the rebuttal. Having read the other reviews and the authors' rebuttals, I have no further questions, and I continue to recommend my original evaluation.
The authors responded positively to ideas suggesting improving the clarity of the paper. This is great. In addition, I assume the authors will make a reasonable effort to ensuring their talks/posters/etc. are also clear, as they did with this submission.
This paper studies PAC learning in the agnostic case, where the error is parametrized by a quantity τ that corresponds to the error of the best hypothesis. A recent result by Hanneke, Larsen, and Zhivotovskiy [HLZ] showed that empirical risk minimization is suboptimal for agnostic learning if we treat the performance of the best hypothesis as a parameter; moreover, they proposed an algorithm that is optimal when τ > d/m. This paper complements the [HLZ] result by providing an efficient learner that is optimal when τ = O(d/m). The authors make progress on two more questions from [HLZ]. These are important results in learning theory and the paper has merit to be accepted in NeurIPS.
The authors provided good answers to the reviewers' questions and the evaluations continued to be positive. Three of the reviewers maintained their original "Accept" rating. Unfortunately, the fourth reviewer (Reviewer Y9zH) did not interact with the authors after their response, but on one hand that reviewer was positive about the paper to begin with, and on the other hand the response by the authors was reasonable.
So, overall, I think the paper stands out as it makes an important and interesting contribution for agnostic learning over the traditional ERM approaches. There was a good discussion and the authors have clarified several parts that the reviewers asked. Therefore, I am recommending spotlight acceptance.