PaperHub
6.1/10 · Poster · 4 reviewers
Ratings: 2, 3, 4, 4 (min 2, max 4, std 0.8)
ICML 2025

Sample-Optimal Agnostic Boosting with Unlabeled Data

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

Agnostic Boosting algorithms with sample complexity matching that of ERM, given additional unlabeled data (often this is free!)

Abstract

Keywords

boosting · agnostic learning · weak learning · sample complexity · semi-supervised

Reviews and Discussion

Review (Rating: 2)

This work proposes an agnostic boosting algorithm that seeks to improve sample complexity by incorporating unlabeled data. The central idea is to design a potential function whose gradient can be split into two distinct parts—one that depends only on the model’s output (reflecting the feature information) and another that depends solely on the labels. This separation enables the algorithm to use a large amount of unlabeled data to estimate the feature-related component, while relying on a smaller set of labeled data to estimate the label-related component. The authors provide a theoretical analysis showing that, under certain conditions and with the availability of additional unlabeled data, the number of labeled examples required can be reduced to levels comparable to those achieved by standard empirical risk minimization techniques. They also discuss extensions to improve the efficiency of unlabeled data usage and address challenges such as distribution shifts between labeled and unlabeled data. Experimental results on several datasets are presented to demonstrate the practical performance of the method.
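
Concretely, the estimation scheme can be sketched as follows (a minimal illustration with hypothetical names, not the paper's notation; `grad_label` stands for the gradient component depending only on labels, `grad_feature` for the component depending only on the model's output):

```python
import numpy as np

def estimate_potential_gradient(model, labeled_xy, unlabeled_x,
                                grad_label, grad_feature):
    # Label-dependent part: estimated from the small labeled sample.
    label_part = np.mean([grad_label(y) for _, y in labeled_xy])
    # Feature-dependent part: estimated from the large unlabeled sample.
    feature_part = np.mean([grad_feature(model(x)) for x in unlabeled_x])
    return label_part + feature_part
```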

Questions for the Authors

Could you explain intuitively what role unlabeled samples play in the algorithm (rather than just from the perspective of optimizing the potential function), and why their use helps reduce the reliance on labeled data? In addition, the experimental evaluation appears somewhat limited. Could you include further empirical comparisons with some recent boosting approaches?

Claims and Evidence

The paper’s claims are supported by its theoretical analysis. However, although some experiments are provided, they do not include comparisons with the latest agnostic boosting methods.

Methods and Evaluation Criteria

The evaluation criterion is well-suited for the problem, similarly to previous studies [Kanade & Kalai, 2009; Ghai & Singh, 2024].

Theoretical Claims

In the review process, I examined several key proofs in the paper to ensure their correctness. Overall, the proofs are written clearly, and I did not find any obvious errors.

Experimental Design and Analysis

The experimental section only compares against the method from [Kanade & Kalai, 2009] and does not consider other agnostic boosting approaches such as [Brukhim et al., 2020] and [Ghai & Singh, 2024]. Moreover, the experiments are conducted on only a few simple datasets, which is insufficient to demonstrate the effectiveness of the proposed method. Additionally, the proposed approach appears to incur higher computational overhead, yet the experiments do not include any running-time comparison.

Supplementary Material

I reviewed the supplementary material. Specifically, I examined Appendix A, which contains the detailed proofs for improved unlabeled sample efficiency and the data reuse scheme, and Appendix B, which provides additional analysis on the algorithm's robustness under covariate shift.

Relation to Prior Work

The paper builds on established potential-based boosting [Kanade & Kalai, 2009]. Its key innovation—decomposing the potential function’s gradient into label-dependent and feature-dependent components to leverage unlabeled data—aligns with ideas from semi-supervised learning. In doing so, it extends prior work by showing how unlabeled data can be used to improve sample efficiency in the agnostic setting. However, most of the proof techniques in this work build upon previous studies, resulting in relatively incremental contributions from a theoretical view.

Missing Essential References

Given that the paper appears to cite a comprehensive range of related works in the field, it is likely that the authors have covered the essential literature in agnostic boosting approaches.

Other Strengths and Weaknesses

Strength: This paper presents a modification of the potential-based agnostic boosting method by incorporating unlabeled samples, achieving improved complexity for labeled samples. The idea of leveraging unlabeled data is interesting and could offer valuable insights for further research in the field.

Weakness: The paper’s contribution is rather limited. Its main framework and proof techniques are based on the previous work on potential-based agnostic boosting [Kanade & Kalai, 2009] and leverage data reuse from [Ghai & Singh, 2024] to improve sample complexity. In terms of proof methodology, no novel techniques are introduced. Although the work achieves improvements in the complexity for labeled samples, it does not provide enough experimental evidence to demonstrate the advantages of such a semi-supervised approach in terms of effectiveness or efficiency.

Other Comments or Suggestions

The paper’s formatting could be further refined. For instance, Algorithm 2 is placed in the middle of page 7, which leaves excessive blank space at both the top and bottom of the page.

Author Response

We thank the reviewer for their comments.

Runtime overhead over PAB: We remark that both in theory and in practice, our algorithm is no slower than the PAB algorithm of Kanade et al. To achieve $\varepsilon$-excess error, the PAB algorithm performs $1/\gamma^2\varepsilon^2$ rounds of boosting, in each of which the weak learner is fed $\mathrm{VC}(\mathcal{B})/\gamma^2\varepsilon^2$ samples; the same is true for us (see the setting of parameters in the theorem statement). In terms of implementation, the PAB algorithm draws $\mathrm{VC}(\mathcal{B})/\gamma^2\varepsilon^2$ fresh samples from the population distribution; we sample half this number from a pool of previously sampled labeled data points, and the other half from freshly sampled unlabeled data points. Therefore, the running times of PAB and Algorithm 1 are remarkably similar.
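
For illustration, the per-round sampling contrast can be sketched as follows (hypothetical names; `m` stands in for the per-round budget $\mathrm{VC}(\mathcal{B})/\gamma^2\varepsilon^2$):

```python
import random

def pab_round_data(draw_fresh_labeled, m):
    # PAB: every round draws m fresh labeled samples from the population.
    return [draw_fresh_labeled() for _ in range(m)]

def alg1_round_data(labeled_pool, draw_fresh_unlabeled, m):
    # Algorithm 1: half the budget is subsampled from a fixed labeled pool,
    # the other half is freshly drawn unlabeled data.
    reused_labeled = random.choices(labeled_pool, k=m // 2)
    fresh_unlabeled = [draw_fresh_unlabeled() for _ in range(m - m // 2)]
    return reused_labeled, fresh_unlabeled
```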

Novelty of results and techniques: While our proof techniques build upon prior work (and the exposition is carried out so that this is as clear as possible, and so that our changes are conspicuous), we believe our contribution goes beyond incremental improvements. The conception that unlabeled data can specifically ameliorate the sample complexity gap in agnostic boosting is novel and opens new directions for semi-supervised boosting research. This connection wasn't previously recognized despite its simplicity in hindsight.

Next, we come to techniques: Despite employing a variety of other techniques, agnostic boosting algorithms (e.g., Kanade et al, Ghai et al) almost invariably use the Madaboost potential (Domingo et al), and in fact so do hundreds of other robust boosting algorithms. Our key innovation is orthogonal to other existing algorithmic techniques, which is why it can be layered on top of Kanade et al and Ghai et al. We introduce a new potential function which, crucially, via linearity of expectation can be written so that it decomposes into two parts, one estimable from labeled data and the other from unlabeled data. Furthermore, it is essential that the labeled-data part (formally, its derivative) does not depend on the ensemble whose value is being assessed; this allows us to reuse the same set of labeled data across all rounds of boosting (unlike Kanade et al) without needing a uniform convergence argument (as in BCHM) or a martingale argument (as in Ghai et al).
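
Schematically, the separability amounts to the following (illustrative notation, not the paper's exact potential):

```latex
% Schematic separability: if the relabeling weight decomposes additively,
%   w(x, y) = a(y) + b(H_t(x)),
% then by linearity of expectation
\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[w(x,y)\big]
  \;=\; \underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[a(y)\big]}_{\text{estimated on labeled data}}
  \;+\; \underbrace{\mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\big[b(H_t(x))\big]}_{\text{estimated on unlabeled data}}
% Since a(y) does not depend on the ensemble H_t, the labeled-data
% estimate can be reused across all rounds of boosting.
```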

This new potential function violates some key tenets of prior work. For example, the centerpiece of the Madaboost potential is that it (formally, its functional derivative) does not downweight points that are misclassified by the ensemble. This property is best captured in the definition of a conservative relabeling in Kanade et al. Intuitively, this makes sense; one does not want to withdraw focus from wrongly classified data points. Not only is this false for us, it is incompatible with the requirement of having a separable potential function, which we have just described. This can be seen in Figure 1: the derivative of the Madaboost potential on the negative domain is always $-1$, but for us this is not true, and our potential curves upward on $[-1, 0]$. Here, our proof technique, despite its seeming syntactic similarity, has been modified to handle this.
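
For reference, the flat negative-domain derivative of the Madaboost potential, which is exactly the property given up here, can be written as follows (as commonly stated in the literature following Domingo et al.; the paper's own potential is not reproduced here):

```python
import numpy as np

def madaboost_potential_grad(z):
    # Derivative of the Madaboost potential: constant -1 on the negative
    # domain, -exp(-z) on the positive domain. The paper's potential, by
    # contrast, is not flat on [-1, 0]: its derivative curves upward there.
    z = np.asarray(z, dtype=float)
    return np.where(z < 0, -1.0, -np.exp(-z))
```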

Finally, it is worth pointing out that the setting of covariate shift requires a number of changes to keep track of the progress being made in each round of boosting. Foremost among these is the potential function, which is no longer the population (expectation) version of a scalar potential. We keep track of the progress on the labeled and unlabeled distributions separately.

Range of experiments: The primary contribution of our work is theoretical, as Reviewer P5aa also notes. Our main innovation is the design of a potential function whose use in the potential-based boosting framework permits the use of unlabeled samples to make learning more sample-efficient. Since this is an improvement that can be composed with other innovations, like the sample-reuse scheme in Ghai et al, our experimental setup is an ablation study meant to verify the hypothesis that additional unlabeled samples can enhance learning performance in agnostic boosting. In it, Algorithm 1, which is a close modification of PAB from Kanade et al, is compared to PAB. We strongly feel that such ablation studies, which measure the improvement induced by individual algorithmic techniques, are the right way to make and measure progress on foundational problems, as opposed to a leaderboard approach. Finally, we remark that our datasets are of comparable sizes to those considered in Kanade et al.

Intuition: We hope that our self-contained 1.5-page proof can aid future readers. But crudely, the potential is such that, for unlabeled data, the relabel distribution shifts towards negative examples when $H_t(x)$ is positive and towards positive examples when $H_t(x)$ is negative. This acts as a regularizer against high-certainty predictions on data without labels.
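
As a toy illustration of this regularizing shift (the weighting below is hypothetical, not the paper's actual relabeling rule):

```python
def pseudo_label_weights(H_t_x):
    # Hypothetical weighting: lean the relabel distribution against the
    # sign of the current ensemble H_t(x). Positive H_t(x) shifts mass
    # toward the negative label, and vice versa.
    h = max(-1.0, min(1.0, H_t_x))   # clip ensemble output to [-1, 1]
    p_positive = 0.5 * (1.0 - h)     # decreases as H_t(x) grows
    return {+1: p_positive, -1: 1.0 - p_positive}
```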

Review (Rating: 3)

The main contribution of this paper is a new agnostic boosting algorithm. In particular, this algorithm is computationally efficient assuming oracle access to weak learners, and it improves the sample complexity of previous algorithms in a certain way. Moreover, the authors demonstrate applications of their new result to learning halfspaces and to reinforcement learning. Further, their theoretical results are supported by a few experiments.

Questions for the Authors

I will happily increase my score if the authors can clarify my previous question. In addition, do you think there is any connection between your method and the reduction of https://proceedings.mlr.press/v178/hopkins22a.html that also uses unlabeled data?

Claims and Evidence

I verified the correctness of the proof of their main theorem. Moreover, the improvement of their main result, findings about the covariate shift, applications, and experiments make sense to me.

Methods and Evaluation Criteria

The evaluation criteria are natural.

Theoretical Claims

I verified the correctness of the proof of their main theorem.

Experimental Design and Analysis

The experimental results make sense to me.

Supplementary Material

I took a look at the supplementary material. I mainly read the part related to Improving unlabeled sample efficiency.

Relation to Prior Work

Boosting is a well-known learning theoretic technique. Any good result in this context will be valuable. This paper improves upon the previous results in terms of sample complexity.

Missing Essential References

I think the paper contains most of the primary references. However, listing the references in the related work section is not the best way to discuss the related works. I suggest explaining a few more related ones in more depth. Additionally, I think it is good to cite https://proceedings.mlr.press/v178/hopkins22a.html as they have also used unlabeled data to prove their theorem in the agnostic regime.

Other Strengths and Weaknesses

Consider the following way of thinking: (1) Definition 2.1 implies the existence of a weak learner in the realizable setting. (2) We can boost weak learners in the realizable setting to get a PAC learner in the realizable setting. (3) We can now apply any agnostic-to-realizable reduction. For instance, we have: https://proceedings.mlr.press/v178/hopkins22a.html and https://arxiv.org/abs/1610.03592. What is the statistical advantage of your method? In terms of the novelty of using unlabeled data in proving theorems for the agnostic regime, while I agree that it is a new idea in this context, I do not consider it an "unexpected" result.

Other Comments or Suggestions

I think it is good to discuss the limitations of your approach in other settings, such as multiclass with an unbounded label space.

Ethics Review Issues

Author Response

We thank the reviewer for their comments.

Statistical Gains over Realizable-Agnostic Reduction: At the outset, we find this to be a loaded question (i.e., one that in our view has an unfair presupposition built in). If one were to discard computational constraints, one could perform ERM on the target class with $\mathrm{VC}(\mathcal{H})/\varepsilon^2$ samples and achieve an optimal sample complexity. This, via classical VC results, is unimprovable; for each $\mathcal{H}$, there is a distribution for which this is tight. The reduction outlined by the reviewer – which we are certainly happy to cite and highlight – also gets a sample complexity $\propto 1/\varepsilon^2$. Getting the class-specific dependencies right via this approach takes additional work and has been pursued in a recent preprint that we have recently been made aware of (which we cannot link here because they cite our work). However, note that this reduction requires pruning the hypothesis class by enumeration while considering all possible labellings of $\mathrm{VC}(\mathcal{H})/\varepsilon^2$ samples, and hence takes exponential time.

Over such computationally inefficient, and in the case of ERM, also statistically optimal, exponential time approaches, we offer no statistical improvements. But here we remark that neither does the celebrated Adaboost algorithm. Boosting is a question that arose in computational learning theory (i.e., in Valiant-lore as opposed to statistical learning aka Vapnik-land), and its primary purpose has since its origin been and continues to be computationally efficient reductions, which we also pursue. Like all known boosting results, and unlike exponential time approaches, our results make a polynomial number of calls to the weak learning oracle – in fact we are no worse than any other agnostic booster in this regard quantitatively – and perform polynomial work. Now it is natural to ask if these computational benefits come at a statistical cost, and this is the question we make progress on.

To summarize, the purpose of our work is to give computationally efficient boosting algorithms that are yet better than known ones in terms of sample requirements. It is thus unsurprising that we are no better statistically than exponential time approaches (such as ERM or exponential time agnostic-realizable reductions).

Novelty: Our intention was always to say that the consideration of unlabeled samples is new in the agnostic boosting context. Indeed, outside it, there is a massive body of work on semi-supervised learning. Here, we also thank the reviewer. The works of Hopkins et al and David et al are definitely worth pointing out in the broader statistical learning context; we will add a discussion of this line of work. However, both our context and our techniques, owing to the mandates of computational efficiency, are disjoint from Hopkins et al, to the best of our reading. Unlike Hopkins et al, we do not coarsen the hypothesis space by considering all possible labellings of some set of points. Our intuition and proof are based on synthesizing a potential function in the framework of Kanade et al, whose parts can be independently estimated on labeled and unlabeled data, respectively.

In light of this clarification regarding the positioning of our paper (and of boosting broadly), we would like to gently petition the reviewer to consider raising their score.

PS. This is a minor point in the grand scheme of things. But the realizable-boosting reduction, to the best of our reading, would also break the distribution-specific nature of the weak learners (discussed on page 3), since Adaboost does not preserve the marginal distribution over features.

Review (Rating: 4)

This paper explores how unlabeled samples can be useful in the task of agnostic boosting, in which access to a weak learner must be leveraged to construct a more accurate learning algorithm. The task of agnostic boosting has previously been considered only in settings with access to labeled samples, and the state-of-the-art algorithms have a sample complexity that scales with $1/\epsilon^3$, significantly behind the $1/\epsilon^2$ dependence achieved by standard ERM.

The authors present an algorithm that still has a $1/\epsilon^3$ dependence on the number of unlabeled samples, but only a $1/\epsilon^2$ dependence on the number of labeled examples, matching the sample complexity of ERM in terms of the number of labeled samples. This improvement is achieved through careful analysis and targeted modifications to existing potential-based agnostic boosting algorithms found in the literature.

In many settings, unlabeled samples can be obtained at a much lower cost than labeled data. From this point of view, this new algorithm represents a substantial advancement in our understanding of the sample complexity of agnostic boosting.

Questions for the Authors

  1. I'd appreciate an explanation of the experimental setup that addresses my confusion described above in the Methods and Evaluation section.

Claims and Evidence

The paper is primarily a theory paper, and all theorems and claims are backed up with rigorous proofs.

Methods and Evaluation Criteria

I was a bit confused by the description of the experimental setup, in particular how the samples were allocated for the two algorithms (did PAB receive fewer samples per round, or did it have fewer rounds of boosting, but the same number of samples per round?) I would appreciate if the authors could clarify exactly how the two algorithms use the labeled and unlabeled examples.

Theoretical Claims

I examined the proof of Theorem 3.1 and found it sound. While I didn't scrutinize all proofs in detail, the other results appear convincing due to their clear links to established techniques and findings in the boosting literature. However, I would defer to another reviewer who has read the remaining proofs thoroughly.

Experimental Design and Analysis

I did not carefully check the soundness of the experiments beyond reading the method description, so would defer to other reviewers on this point. I think that the paper is still quite strong without these experiments.

Supplementary Material

I consulted the appendix for additional experimental details to clarify my confusion about the description in the main text. However, the information provided there did not resolve my confusion.

Relation to Prior Work

This paper contributes to the extensive literature on agnostic boosting, a topic that has garnered significant interest in the ML community from both theoretical and practical perspectives. In my view, the paper offers a fresh perspective by demonstrating the substantial benefits that can be derived from unlabeled samples. To the best of my knowledge, this observation is novel in the context of agnostic boosting and has the potential to open up new avenues of research in the field.

Missing Essential References

I'm not aware of any missing references that need to be discussed.

Other Strengths and Weaknesses

  • Strengths: The novel use of unlabeled data and the surprising power drawn from it offers a fresh perspective to the boosting literature, and possibly other areas as well.
  • Weaknesses: The paper was quite dense to read through. While much of this is due to the technical argument, I think there are some small notational and expositional changes that could be made to improve the reader experience. I've made a few notes on small notational changes in the "other comments or suggestions" section.

Other Comments or Suggestions

Some minor typos/comments

  • italicized question in first paragraph of introduction (provide -> provides)
  • line 55: "agnostic boosting algorithm"?
  • line 206: "nonnegligeable" -> "non-negligible"
  • Typo in statement of Thm 3.1 ("T = O(/...)")
  • It is unclear from the statement of Theorem 3.1 that $\epsilon_0$ and $\delta_0$ are parameters for the weak learner. This should be restated explicitly.
  • I feel that the hidden log factors should be more explicitly presented. I suggest:
    • Adding parenthetical clarifications in the introduction.
    • Replacing the current $\mathcal{O}$ notation with $\tilde{\mathcal{O}}$, as the current symbol is easily mistaken for standard $O$, which caused initial confusion in interpreting the statements and proofs.
  • The identical titles for Algorithms 1 and 2 are confusing. Distinct, descriptive titles for each would improve clarity and readability.
  • In Section B of the appendix, I think there is an incorrect theorem reference on line 795, i.e. it should be Proof of Theorem 5.1
Author Response

We thank the reviewer for their thoughtful comments and thorough review. We will implement all of the editorial suggestions upon revision.

Experimental setup: Each dataset is split into a labeled sample pool and an unlabeled sample pool; the labels for the unlabeled sample pool are discarded. The precise ratio of this split is given in the footnote on page 8. The contract in our comparison is that Algorithm 1 and PAB have access, over the course of their entire execution, to the same pool of labeled samples. Algorithm 1 can additionally access the unlabeled pool. Our experimental setup is thus an ablation study meant to verify the hypothesis that additional unlabeled samples can enhance learning performance in agnostic boosting. The ability to achieve lower error when given access to the same pool of labeled samples indicates that the learning is now more label-sample-efficient.

Throughout its execution, PAB, the algorithm from Kanade et al, has access only to the labeled pool. PAB is an iterative boosting algorithm that does not reuse samples between rounds. Hence, the number of samples available in each round of boosting in PAB is the number of labeled samples divided by the number of rounds. In contrast, following the description in Algorithm 1, our algorithm can, via subsampling, reuse the same set of labeled samples across multiple rounds, although the unlabeled samples used in every round must be fresh.
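
In code-sketch form (the split ratio and round count are placeholders; the actual ratio is given in the page-8 footnote):

```python
import random

def make_pools(dataset, labeled_frac=0.2):
    # Split the dataset; labels in the second part are discarded.
    n = int(labeled_frac * len(dataset))
    labeled_pool = dataset[:n]                    # list of (x, y) pairs
    unlabeled_pool = [x for x, _ in dataset[n:]]  # labels thrown away
    return labeled_pool, unlabeled_pool

def pab_round_slice(labeled_pool, num_rounds, t):
    # PAB never reuses samples: round t gets a disjoint slice of the pool.
    k = len(labeled_pool) // num_rounds
    return labeled_pool[t * k : (t + 1) * k]

def alg1_round(labeled_pool, unlabeled_pool, m):
    # Algorithm 1: subsample the *same* labeled pool each round and pair it
    # with fresh draws from the unlabeled pool.
    return random.choices(labeled_pool, k=m), random.sample(unlabeled_pool, m)
```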

Review (Rating: 4)

This work designs boosting algorithms in the agnostic setting. Its main contribution is novel algorithms in a previously unexplored direction: can unlabeled samples reduce the number of labeled samples required for boosting? This paper gives a positive result by providing several algorithms that achieve different trade-offs between the number of labeled and unlabeled samples required. The first algorithm requires $\mathrm{VC}/\varepsilon^2$ labeled samples and $\mathrm{VC}/\varepsilon^4$ unlabeled samples and makes $1/\varepsilon^2$ calls to the weak learner (Theorem 3.1). The second algorithm improves this to $\log(B)/\varepsilon^2$ and $\log(B)/\varepsilon^3$ respectively, while making $\log(B)/\varepsilon^2$ calls to the weak learner (Theorem 4.1). The final algorithm retains this sample complexity while reducing the number of weak-learner calls to $1/\varepsilon^2$, as in Theorem 3.1.
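
In tabular form, the trade-offs stated above:

| Result | Labeled samples | Unlabeled samples | Weak-learner calls |
| --- | --- | --- | --- |
| Theorem 3.1 | $\mathrm{VC}/\varepsilon^2$ | $\mathrm{VC}/\varepsilon^4$ | $1/\varepsilon^2$ |
| Theorem 4.1 | $\log(B)/\varepsilon^2$ | $\log(B)/\varepsilon^3$ | $\log(B)/\varepsilon^2$ |
| Final algorithm | $\log(B)/\varepsilon^2$ | $\log(B)/\varepsilon^3$ | $1/\varepsilon^2$ |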

The authors also explore the implications of their results in (1) efficiently learning halfspaces (following similar observations by Kanade and Kalai (2009), leading to a faster running time provided unlabeled samples are available) and (2) reinforcement learning (where they reduce the number of reward-annotated episodes required).

Post-rebuttal update

I read the authors response, maintain my original rating, and support acceptance.

Questions for the Authors

None

Claims and Evidence

Yes, the claims in the paper are supported by formal proofs.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria seem meaningful.

Theoretical Claims

I skimmed all the proofs, but did not carefully verify their correctness.

Experimental Design and Analysis

No, I did not check the soundness of the experimental design, except checking that the baselines they use are meaningful.

Supplementary Material

I skimmed some of the proofs in the supplementary material, but did not check them for correctness.

Relation to Prior Work

The most closely related prior works to this paper are Ghai and Singh (2024) and Kanade and Kalai (2009). The key innovation in this work is to use unlabeled samples in boosting and, as a consequence, reduce the number of labeled samples required. This tradeoff is not explored by either prior work and, to the best of my knowledge, has not been explored earlier in the literature. In terms of the number of labeled samples used, this paper improves the previous state-of-the-art guarantee from $1/\varepsilon^3$ to $1/\varepsilon^2$, while at the same time ensuring that the total number of labeled and unlabeled samples matches the total number of labeled samples required by the previous state-of-the-art guarantee.

Missing Essential References

To the best of my knowledge, no essential references were missing, although I am not an expert on boosting methods.

Other Strengths and Weaknesses

None

Other Comments or Suggestions

The paper is generally well-written and easy to follow. A couple of suggestions are: First, it would be good to introduce $\gamma$ somewhere in the introduction. Currently, it is used in Section 1.1 but only defined later. Second, while they are very common, I think it would still be good to define the VC dimension and other technical terms either in the paper or the appendix.

Typos:

  1. In Theorem 3.1, the value of “T” has a typo. Missing 1 in the numerator.
  2. Theorem 4.1 has a typo around “calls to the weak learner, and samples”
Author Response

We thank the reviewer for their thorough review. Both the suggestions – introducing $\gamma$ early and defining the VC dimension – are well noted, and we will execute these upon revision.

Final Decision

This paper studies agnostic boosting with a focus on improving the label complexity. In particular, it leverages unlabeled samples for that purpose. The authors give an algorithm with improved label complexity, whose unlabeled sample complexity is similar to that of prior algorithms.
To achieve this, the authors carefully combine existing potential-based agnostic boosting algorithms from prior work. Two applications are presented (though the one on halfspace learning over the hypercube does not improve the state-of-the-art). Most reviewers agreed that this is a solid contribution. One reviewer argued that the technical contribution may not be sufficient, as both high-level ingredients of the result pre-existed in prior work. In summary, this appears to be a contribution that meets the bar for ICML.