PaperHub
Overall rating: 6.8 / 10 (Spotlight), 5 reviewers
Individual scores: 7, 7, 7, 7, 6 (min 6, max 7, std. dev. 0.4)
Confidence: 3.0, Correctness: 3.2, Contribution: 3.4, Presentation: 3.2
NeurIPS 2024

Learning Noisy Halfspaces with a Margin: Massart is No Harder than Random

Submitted: 2024-05-15, Updated: 2024-11-06
TL;DR

We give a simple efficient algorithm for learning halfspaces with Massart noise.

Abstract

We study the problem of PAC learning $\gamma$-margin halfspaces with Massart noise. We propose a simple proper learning algorithm, the Perspectron, that has sample complexity $\widetilde{O}((\epsilon\gamma)^{-2})$ and achieves classification error at most $\eta+\epsilon$ where $\eta$ is the Massart noise rate. Prior works (DGT19, CKMY20) came with worse sample complexity guarantees (in both $\epsilon$ and $\gamma$) or could only handle random classification noise (DDKWZ23,KITBMV23)--- a much milder noise assumption. We also show that our results extend to the more challenging setting of learning generalized linear models with a known link function under Massart noise, achieving a similar sample complexity to the halfspace case. This significantly improves upon the prior state-of-the-art in this setting due to CKMY20, who introduced this model.
Keywords
pac learning, learning halfspaces, massart noise, sgd, robust learning

Reviews and Discussion

Review
Rating: 7

The paper studies learning half-spaces in the Massart noise setting. That is, for $\gamma, \eta > 0$, a noise function $\eta(x)\leq \eta$ for all $x\in \mathrm{supp}(D_{x})$, and a half-space $w^*$ such that $P(|w^* \cdot x| \geq \gamma) = 1$ and $P(\mathrm{sign}(w^* \cdot x) \neq y) = \eta(x)$ for all $x\in \mathrm{supp}(D_{x})$ -- given such an instance and $\varepsilon>0$, the paper states the problem of finding a half-space $w$ such that $P[\mathrm{sign}(w\cdot x)\neq y]\leq \eta+\varepsilon$, using as few samples and as little computation as possible. The paper achieves a new bound of $\tilde{O}((\gamma\varepsilon)^{-2})$ on the number of samples needed to solve this problem (suppressing factors of $\log(1/\delta)$). This is an improvement over previous results, which use $\gamma^{-3}\varepsilon^{-5}$ and $\gamma^{-4}\varepsilon^{-3}$ samples. They also show it in a more general setting, for $\sigma$ odd and non-decreasing, where $P[\mathrm{sign}(w^*\cdot x)\neq y]=\frac{1-\sigma(w^*\cdot x)}{2}$ for all $x\in \mathrm{supp}(D_{x})$. They obtain a sample complexity bound of $\tilde{O}((\gamma\varepsilon)^{-2})$ here as well.

Strengths

Originality: The paper seems to combine many ideas from different papers and in this way get a novel result and improvement over previous results. Related work is well cited.

Quality: The main body of the paper includes a full proof for the Massart noise case and a proof sketch for the general setting. The proof of the Massart noise case, on a first quick read, seems OK. The proof sketch is OK; I have not read the appendix with the full proof, so I will not comment on the technical soundness of the proof. The authors also address the limits of their work and state interesting future work.

Clarity: The paper is very well written and explains very well the ideas behind the proof.

Significance: The improvement in sample complexity is a polynomial improvement in $(\gamma\varepsilon)^{-1}$, so I would say it is significant.

Weaknesses

Weaknesses: Nothing significant. Related to the question below: do the other papers also use $\eta$ in the error bound instead of $\mathbb{E}[\eta(x)]\leq \eta$? If yes, why not use $\mathbb{E}[\eta(x)]$?

Questions


Line 18-20: I really like how $\Pr_{x\sim D_{x}}$ was used, as it made the randomness explicit. I didn't at first get that the randomness in the $\Pr$ (line 20) was over $y\sim D_{y}(x)$; I don't know if other readers would have the same experience.

Line 72: what error does previous work in the Massart noise setting use? If it is a different one, why not use the same?

Line 102: $\eta(x)$?

Not consistent whether it is $\Pr_{\cdot}$ or just $\Pr$.

Table: Good table. Is the running time not interesting?

The real numbers $R$ and natural numbers $N$ didn't look like I'm used to seeing them. I don't know if this is on purpose; if it is, don't mind changing it. I just wanted to let you know in case something has compiled weirdly.

Line 221: should it be $B^d\times \{-1,1\}$?

Line 229: why isn't it a vector? Can you give the derivation of this expression?

Line 238: it says $h_{w'}=\mathrm{sign}((w \ldots$; shouldn't it be $h_{w'}=\mathrm{sign}((w' \ldots$?

Line 261: what does $l_\eta(w^*,x,h)$ mean? This happens in more places in this proof.

Line 263: Please show fact (1), or should it be $l_\eta(-y w^*\cdot x)$? If facts (1) and (2) follow from the fact stated at the beginning of the proof from [DGT19b], maybe consider moving these together.

Line 276: it says $w^{\star}$ instead of $w^*$.

Line 298-299: The third line in the equation is missing $x^t$.

Line 289, algorithm line 7: this makes $w^t$ and $x^t$ independent, right? Which is used in the proof of Lemma 3, in the last inequality of the large equation, combined with Lemma 2?

Line 307: why say anything about contradiction - doesn't it just follow from the inequality afterward?

Line 313-316: I guess it doesn't matter that the last batch isn't necessarily of size $T$? Or is it necessarily of size $T$?

Line 329: what is $B$?

Would it be interesting to make the computational cost independent of $d$ at the cost of polynomial factors in $1/(\gamma\varepsilon)$?

Limitations

They address the limitations of the work.

Author Response

Thank you for your careful reading, and all of your detailed feedback. We are glad that you found our explanations clear. Regarding improved dependences on $\mathbb{E}[\eta(\mathbf{x})]$, prior work has shown that such guarantees are likely computationally intractable for statistical query based algorithms (such as ours, and all previous Massart learners) under standard complexity assumptions: see, for example, the citations [CKMY20, DK22, DKMR22, NT22]. Therefore, to give a polynomial-time algorithm, we focus on achieving error $\eta + \epsilon$. We discuss this point in Lines 73-76 of our submission. This is consistent with previous works in the Massart model, re: your question about Line 72.

We now address the reviewer's other more specific questions.

Line 102: This should be $\eta$, as stated, as it is about the noise rate upper bound.

Table: All runtimes are nearly-linear in the input size, but we will add such a remark.

Line 229: This was a typo, and there should be an extra factor of $\mathbf{x}$ due to the chain rule, so it is a vector. Thank you for noticing this.

Line 261: This was a typo, it should say $\ell_\eta(-y\,\mathbf{w}^\star \cdot \mathbf{x})$, and we will fix this.

Line 263: Yes, as you point out, the expression should be about $\mathbf{w}^\star$.

Line 289: This is correct, and we will note this for clarity.

Line 307: Good catch, we will make this simplification.

Line 313-316: In our application of Algorithm 1 (Theorem 3), we ensured even divisibility.

Line 329: $B$ was intended to be the complement of $A$ (as in Lemma 2); we will clarify this.

Computational cost: We expect that making the runtime fully independent of $d$ is not achievable, as one needs to at least read in the samples. However, we do agree that it is interesting to see if fast dimensionality reduction techniques could be used to speed up the low-order terms of our algorithm's runtime; thank you for this suggestion.

[CKMY20] Sitan Chen, Frederic Koehler, Ankur Moitra and Morris Yau. Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability. NeurIPS 2020

[DK22] Ilias Diakonikolas, Daniel Kane. Near-Optimal Statistical Query Hardness of Learning Halfspaces with Massart Noise. Conference on Learning Theory, 2022

[DKMR22] Ilias Diakonikolas, Daniel Kane, Pasin Manurangsi, and Lisheng Ren, Cryptographic Hardness of Learning Halfspaces with Massart Noise. NeurIPS 2022

[NT22] Rajai Nasser, Stefan Tiegel. Optimal SQ Lower Bounds for Learning Halfspaces with Massart Noise. Conference on Learning Theory, 2022

Comment

Thanks for your response. My understanding of the change due to the technical issue, in terms of Table 1: the Massart noise column is now two columns, Massart with sample complexity $(\gamma\varepsilon)^{-2}$ and Massart GLM with sample complexity $\gamma^{-2}\varepsilon^{-4}$ for your row. If this is correct, please “fill in” the sample complexities that [DGT19b] and [CKMY20] get for Massart and Massart GLM (or other relevant references). To this end, please remind me if $\varepsilon$ is always less than $\gamma$.

Comment

Table 1 in our paper is currently only about Massart Halfspaces. We are happy to add one more column with the sample complexity of Massart GLM. This column will have the sample complexity of $\gamma^{-2}\epsilon^{-4}$ in our row, as the reviewer notes.

In regards to the bounds obtained by prior work:

  1. [DGT19b] do not study this model.
  2. [CKMY20] consider this model with an additional restriction of $\sigma$ being $L$-Lipschitz and $|\sigma(\mathbf{w}^*\cdot\mathbf{x})|\geq \gamma$ for all $\mathbf{x}$. In this regime, they obtain a sample complexity of at least $L^4\gamma^{-4}\epsilon^{-6}$, whereas our result translates to a bound of $L^2\gamma^{-2}\epsilon^{-4}$, which is a strict improvement.

Comparison between $\epsilon$ and $\gamma$:

In general, these parameters are incomparable, depending on the situation (one is a suboptimality guarantee, and one is a geometric assumption about the distribution). However, there are natural scenarios where one would prefer a worse $\epsilon$ dependence (as in our updated result) compared to a worse $\gamma$ dependence:

  1. $\gamma$ is a high-dimensional parameter, so it is natural to have it depend on the dimension $d$. In particular, for natural models (e.g. $D$ is TV-close to the uniform distribution over the unit sphere) we really do have margin about $1/\sqrt{d}$.
  2. $\epsilon$ measures error in the 0-1 output space, so it is natural to have it not depend on $d$ (i.e. treat it as a constant / free parameter).

Concretely, in the margin-free setting, state-of-the-art algorithms all apply pre-processing techniques [BFKV96, DV04, DKT21] that transform an underlying distribution to one with margin effectively equal to $\Omega(\text{poly}(1/d))$. In these cases, a worse dependence on $1/\epsilon$ may be preferred over a bad dependence on $1/\gamma$.
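As a rough sanity check of the $1/\sqrt{d}$ scaling mentioned in point 1 (a standard calculation, included here for illustration rather than taken from the paper): for a unit vector $\mathbf{w}^*$ and $\mathbf{x}$ uniform on the sphere $S^{d-1}$, we have $\mathbf{E}[x_i x_j] = \delta_{ij}/d$, so

$$\mathbf{E}_{\mathbf{x}\sim \mathrm{Unif}(S^{d-1})}\left[(\mathbf{w}^*\cdot\mathbf{x})^2\right] = \sum_{i,j} w_i^* w_j^*\,\mathbf{E}[x_i x_j] = \frac{\|\mathbf{w}^*\|_2^2}{d} = \frac{1}{d},$$

and the typical value of $|\mathbf{w}^*\cdot\mathbf{x}|$ is indeed on the order of $1/\sqrt{d}$.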

[BFKV96] Avrim Blum, Alan M. Frieze, Ravi Kannan and Santosh S. Vempala. A Polynomial-Time Algorithm for Learning Noisy Linear Threshold Functions. FOCS 1996

[DV04] John Dunagan and Santosh Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. STOC 2004

[DKT21] Ilias Diakonikolas, Daniel M. Kane and Christos Tzamos. Forster decomposition and learning halfspaces with noise. NeurIPS 2021

Comment

Ah, good. To this end: so [CKMY20] achieves sample complexity and runtime of $\Omega(L^4\gamma^{-4}\varepsilon^{-6})$? I think line 113 only says something about the runtime (sorry if I'm mistaken).

Comment

Yes, that was a mistake in our description -- Line 113 should say the sample complexity of [CKMY20] is $\Omega(L^4 \gamma^{-4} \epsilon^{-6})$, not its runtime. This is directly comparable to our updated sample complexity of $\widetilde{O}(L^2 \gamma^{-2} \epsilon^{-4})$. Both runtimes pay an overhead in the dimension $d$ compared to the sample complexity, since they perform vector operations. Thank you for this catch.

Comment

Thanks for answering my questions and clearing up my misunderstandings - good luck!

Review
Rating: 7

The paper considers the problem of PAC-learning $\gamma$-margin halfspaces under $\eta$-Massart noise. The paper provides an efficient algorithm achieving error $\eta+\epsilon$ with sample complexity $\tilde{O}(1/(\epsilon^2\gamma^2))$. The individual dependence of the sample complexity on $\epsilon$ and $\gamma$ appears to be optimal for efficient algorithms in the light of lower bounds provided in previous work.

The algorithm is a form of stochastic gradient descent where the loss function is the LeakyReLU loss. The gradient is carefully reweighted in order to yield the desired results.

The paper also provides an algorithm for a more general model, the generalized linear model (GLM), under Massart noise.

The paper is generally very well written.

Strengths

The problem of learning half-spaces under Massart noise is a fundamental problem that has received a lot of attention recently. This paper provides a simple algorithm with a low sample complexity that is probably optimal.

The paper also provides generalizations to the GLM models.

Weaknesses

I haven't found any significant weaknesses.

Typos:

  • Page 3, line 97: “observe that that if” -> “observe that if”

Questions

  • In Lemma 3, the parameters $\lambda$ and $T$ were chosen so that $\mathbb{P}[\mathcal{E}_T]\leq 1/2$, which necessitated the outer loop $j\in[N]=[\log_2(2/\delta)]$ to get success probability at least $1-\delta$. Wouldn't it be possible to choose the constant factors in $\lambda$ and $T$ differently to directly get $\mathbb{P}[\mathcal{E}_T]\leq \delta$?

Limitations

Since this is a theoretical paper, I do not see any potential negative societal impact.

Author Response

We appreciate that you found our paper well-written. Thank you for your typo suggestions; we will fix these in a revision. Re: your question about Lemma 3, we chose this parameter tradeoff because it yields only a $\log(\frac{1}{\delta})$ overhead in our runtime. As you suggest, it is also possible to directly obtain a failure probability of $\delta$ in one shot. However, this would require taking a number of steps polynomial in $\frac{1}{\delta}$, which is a worse overall tradeoff.
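To spell out the boosting arithmetic behind this choice (a standard calculation): with $N = \lceil\log_2(2/\delta)\rceil$ independent runs, each failing with probability at most $1/2$, the probability that every run fails is at most

$$\left(\frac{1}{2}\right)^{\log_2(2/\delta)} = \frac{\delta}{2} \leq \delta,$$

so the overhead over a single run is only a multiplicative $N = O(\log\frac{1}{\delta})$ factor.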

Comment

Thank you for the clarification. I maintain my score.

Review
Rating: 7

This submission studies the problem of learning halfspaces and generalized linear models in the Massart model under a margin assumption. In particular, for the case of halfspaces, the submission gives an efficient algorithm achieving (conjecturally optimal) error $\eta + \varepsilon$ using only $\tilde{O}(\gamma^{-2} \varepsilon^{-2})$ samples, where $\gamma$ is the margin parameter, and $\eta$ the upper bound on the noise rate. This matches what is known in the more benign RCN model. All previously known efficient algorithms in the Massart model require a number of samples that is polynomially worse in $\varepsilon$ and $\gamma$.

Strengths

The Massart model is a semi-random noise model motivated by the question of whether known (efficient) algorithmic approaches for the fully random model, in this case random classification noise (RCN), are overfitting to specific aspects of that model. The work of [DGT19] designed the first efficient algorithm to non-trivially learn halfspaces in this harsher noise model (without a margin assumption). The current work shows that under a margin assumption, this is possible with as few samples as known efficient algorithms for the more benign RCN model need. All prior works were loose by polynomial factors in $\gamma$ and $\varepsilon$.

On a technical level the submission combines previous algorithmic approaches with a regret-minimization scheme leading to an elegant analysis and ultimately a better sample complexity.

[DGT19]: I. Diakonikolas, T. Gouleakis, and C. Tzamos. Distribution-independent PAC learning of halfspaces with Massart noise.

Weaknesses

The presentation of the submission is generally solid but can be improved in several aspects. Below are comments aiming at this. Especially the comments related to the technical overview are important in order to understand why the submission is able to improve over previous work. I'm willing to increase my score to 6 or 7 if the authors agree to address the comments below and in particular, explain in their rebuttal how they would address the comments for the technical overview.

Technical Overview

I like the structure of the technical overview and that it tries to highlight differences with previous works (this also shows the expertise of the authors). Unfortunately, however, in several places the writing is unclear (see below). I find it nice that the proof for the halfspace result is so short that it fits in the main body. Nevertheless, I would find a more extensive technical overview with commentary much more illustrative than including the full proof (the many formulas make it hard to read and understand what is going on).

Here are some specific comments about the technical overview:

  • lines 129-136: It is not clear from the discussion why this approach incurs too high a sample complexity on a quantitative level.
  • lines 153-162: It is not at all clear from the discussion why the proposed update rule improves the dependence on $\gamma$.
  • lines 172-174: You say that this approach also works for the case of Massart halfspaces. Why did you not follow this approach? From the discussion it seems it would significantly simplify a part of the analysis. Is there anything else that breaks?

Introduction

  • lines 46-58: Before diving into the details of the fine-grained aspects, and to set the stage, it would be helpful to briefly and explicitly recall what is known in the general Massart model (without a margin assumption) -- e.g., that error $\eta$ is possible and this is likely optimal
  • when you say "learn up to error $\varepsilon$" in the above paragraph, do you mean error $\eta + \varepsilon$?
  • I believe the work [DKMR22] is not mentioned at all, this should be added
  • Similarly, when talking about impossibility results in the agnostic model, the work [Tie23] (see also [DKMR22]) should be mentioned
  • for some of the citations, the arxiv version is cited. Why not cite the conference version?

[DKMR22]: I. Diakonikolas, D. M. Kane, P. Manurangsi, and L. Ren, Cryptographic Hardness of Learning Halfspaces with Massart Noise

[Tie23]: S. Tiegel, Hardness of Agnostically Learning Halfspaces from Worst-Case Lattice Problems

Questions

See last part of above section.

Limitations

yes

Author Response

Thank you for your many helpful comments, and suggested references. We agree with all of your suggestions regarding the introduction and citations, and will fix them in a revision.

We agree additional care can be taken to clarify the technical overview, emphasizing conceptual points and quantitative gains over formulas and full proofs.

Lines 129-136: Re: the $\epsilon$ dependence, the number of iterations of both of our methods scales as $\frac{1}{\epsilon^2}$, but to implement each iteration of [CKMY20], they need to rejection sample from a region. The [CKMY20] termination condition only guarantees this region has at least $\epsilon$ probability mass, yielding a $\frac{1}{\epsilon}$ overhead. Re: the $\gamma$ dependence, this is similar to the following comment on cutting plane methods, which require certificates with much smaller failure probabilities than the first-order regret minimization approach we use. The [CKMY20] method is cutting plane-based, so their certificates are less sample efficient.
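Spelling out the $\epsilon$-accounting above as a rough back-of-the-envelope comparison (our summary of the discussion, not a statement from either paper's formal analysis): rejection sampling from a region of mass at least $\epsilon$ costs about $\frac{1}{\epsilon}$ fresh samples per accepted sample, so, focusing only on the $\epsilon$-dependence,

$$\underbrace{\frac{1}{\epsilon^{2}}}_{\text{iterations}}\times\underbrace{\frac{1}{\epsilon}}_{\text{rejection-sampling overhead}}=\frac{1}{\epsilon^{3}} \qquad\text{versus}\qquad \underbrace{\frac{1}{\epsilon^{2}}}_{\text{iterations}}\times 1=\frac{1}{\epsilon^{2}}$$

when each iteration consumes a single fresh sample.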

Lines 153-162: The key difference is that cutting plane methods require high-probability guarantees on separating oracles being valid (as they are not robust to occasional errors); standard probability boosting techniques require $\gtrsim \frac{1}{\gamma^2}$ samples for sub-Gaussian concentration to kick in. On the other hand, the projected gradient regret minimization method we use is tolerant to any unbiased estimator with a second moment bound, so it only needs one sample per iteration, leading to a better $\frac{1}{\gamma}$ dependence.

Lines 172-174: At the time of submission, we made the presentation decision to use a reweighting which adds $\gamma$ to the denominator rather than implementing the push-away operation (in the halfspace case), as it is conceptually simpler and adds less overhead to the proof. In light of the error we noted in the meta-comment, this distinction is no longer relevant, and in our revision we will include our new corrected proof for Massart GLMs.

We plan to add discussion of these important points to the technical overview, and more generally include more comparisons to previous approaches, mentioning in greater detail why they incurred suboptimal sample complexities. In particular, we will spell out the bottlenecks to improving cutting plane methods (such as [CKMY20]'s approach) in more detail, as well as other previous approaches in the literature. We hope that this response was clarifying, and that our revision plan elevates our paper in your viewpoint.

[CKMY20] Sitan Chen, Frederic Koehler, Ankur Moitra and Morris Yau. Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability. NeurIPS 2020

Comment

Thank you for the detailed response! I think the above discussion makes the technical contribution of the submission over previous work much clearer. I have increased my score accordingly. I also appreciate (and acknowledge) the authors official rebuttal regarding the error and fix in the Massart GLM case.

Review
Rating: 7

This paper focused on the fine-grained analysis of learning $\gamma$-margin halfspaces with Massart noise. The authors designed a new certificate vector $g$ by dividing the gradient vector of the leaky ReLU loss by $|w^\top x| + \gamma$, and showed that when the hyperplane $h_w(x)$ has large 0-1 loss $\ell_{0,1}(w)\geq \eta + \epsilon$, the certificate vector aligns well with the direction of $w - w^*$, i.e., $g^\top(w - w^*)\geq \epsilon$; hence a perceptron-like algorithm enjoys fast convergence to the optimal hyperplane $h_{w^*}(x)$. In the end, the authors showed that the proposed Perspectron algorithm requires only $O((\gamma\epsilon)^{-2})$ samples, matching the sample complexity of learning margin halfspaces with RCN. The authors also extended the technique to the Massart GLM problem and achieved a similar $O((\gamma\epsilon)^{-2})$ sample complexity.
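To make the reweighted update concrete, here is a minimal Python sketch of the kind of certificate-rescaled SGD step described above. This is our own illustration, not the authors' Algorithm 1: the helper names (`perspectron_sketch`, `leaky_relu_slope`), the loss convention, the step size, the projection, and the stopping rule are all simplified placeholders.

```python
import numpy as np

def leaky_relu_slope(t, eta):
    # Slope of LeakyReLU_eta(t) = (1 - eta) * t for t >= 0 and eta * t for t < 0,
    # one common surrogate in the Massart-noise literature (the paper's exact
    # loss may differ in its details).
    return (1.0 - eta) if t >= 0 else eta

def perspectron_sketch(samples, eta, gamma, steps=1000, lr=0.1, seed=0):
    """samples: list of (x, y) pairs with ||x|| <= 1 and y in {-1, +1}."""
    d = len(samples[0][0])
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x, y = samples[rng.integers(len(samples))]  # draw one sample per step
        x = np.asarray(x, dtype=float)
        t = -y * float(np.dot(w, x))                # argument of the surrogate loss
        # Certificate-style direction: surrogate gradient rescaled by |<w, x>| + gamma.
        g = leaky_relu_slope(t, eta) * (-y) * x
        g /= abs(float(np.dot(w, x))) + gamma
        w -= lr * g
        norm = np.linalg.norm(w)
        if norm > 1.0:                              # keep iterates in the unit ball
            w /= norm
    return w
```

The key departure from plain SGD on the surrogate loss is the division by $|w\cdot x| + \gamma$, which is the reweighting the review refers to as the certificate vector.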

Strengths

I enjoyed reading this paper due to its clarity and fluency in presentation. The authors did a good job explaining the intuitions and ideas. The result of this paper is very interesting to me, as I have always thought learning halfspaces with Massart noise is much harder than with RCN, and this paper provided a surprising result that shows learning with Massart noise can be achieved with similar sample complexity. The method (the new certificate vector) that the authors proposed is also interesting, and can be informative for future research.

Weaknesses

No serious weaknesses in this paper, but there are several typos here and there. For example, line 263 should be $w^*$, there should be an $x$ at the end of line 270, etc.

Questions

I am not very familiar with Massart GLMs. If I understand correctly, it seems the goal of a Massart GLM is still to find a classification hyperplane, but under this Massart GLM model, the Massart noise $\eta(x)$ is bounded by a function of $w^*\cdot x$. If so, isn't the Massart GLM a sub-class of the Massart model, as we have further constraints on $\eta(x)$ (in addition to simply restricting $\eta(x)\leq \eta$), hence perhaps implying that the Massart GLM is simpler than the Massart model?

Limitations

The authors have addressed the limitations adequately.

Author Response

Thank you for your encouraging review, and we will make sure to address your typo catches. Re: your question about Massart GLMs, we will definitely provide more clarifying discussion on this point, as it is somewhat confusing. In particular, in a Massart GLM (following Definition 2), the optimal decision rule (evaluated by zero-one loss) is given by a halfspace, because the error rate is always $\le \frac{1}{2}$. However, the model is more general than Definition 1, because it allows a non-uniform maximum noise rate depending on $\mathbf{w} \cdot \mathbf{x}$ (in the halfspace model, the maximum noise rate is uniformly $\eta$). In summary, both of the following are true: Definition 1 is a special case of Definition 2, but the Bayes optimal prediction rule under Definition 2 is a halfspace (our algorithm also returns a halfspace).

Comment

I thank the authors for their response and corrections. I would like to maintain my evaluation.

Review
Rating: 6

The paper considers the problem of learning halfspaces with a margin, under the Massart noise model. The Massart noise model generalizes the Random Classification Noise (RCN) model: while in the RCN model each label is flipped with fixed probability $\eta$, in the Massart noise model the probability of flipping the label can be a function of the covariates, bounded above by $\eta$. The paper asks the question of whether one can design learning algorithms under the Massart noise model matching the sample complexity under the RCN model, and answers it positively. Specifically, the paper proposes a proper learning algorithm with sample complexity $1/(\epsilon^2 \gamma^2)$, matching the state of the art under the RCN model ($\epsilon$ is the error of the algorithm, $\gamma$ is the margin). The results are also extended to the case of generalized linear models.

Strengths

  • The paper makes a concrete improvement to the sample complexity of learning halfspaces under Massart Noise, a classic problem in learning theory.

  • The proposed algorithm is natural and simple, and the main ideas of the paper are explained well.

Weaknesses

The set of people interested in the fine-grained complexity of learning under Massart Noise might not be very broad. So the significance of the results seems moderate.

Questions

NA

Limitations

Yes

Author Response

Thank you for your reviewing efforts; we are glad you found our algorithm natural and simple, and that it was explained well. We mention that beyond our main technical result, from a conceptual standpoint, our paper advances a line of work giving faster learning algorithms under realistic noise models, which is likely to be of broad general interest. We are optimistic that the insights of our paper may lead to follow-up developments of simpler and more noise-robust algorithms in more general settings, e.g., learning noisy multi-index models (which includes fine-tuning a neural network as a special case).

Comment

I thank the authors for their response. After reading the response as well as other reviews, I am inclined to keep my original score.

Author Response

We thank the reviewers for their positive feedback on our paper!

We would like to point out a technical issue in the current proof of Theorem 4 for learning Massart GLMs, which we found after the submission of our paper. The issue can be resolved through a concise extension of the proof of Theorem 3 (which we provide below), albeit with a worse sample and runtime complexity scaling as $\tilde{O}(1/(\epsilon^4 \gamma^2))$. This sample complexity is still an improvement over the comparable prior work [CKMY20] on Massart GLMs across all parameters; see the discussion in Lines 107-115.

Our main statement and proof on Massart halfspaces (Theorem 3) are correct and remain unchanged.

Despite the additional factor of $1/\epsilon^2$ in the sample complexity of learning Massart GLMs, the qualitative value of our result remains: a simple SGD-based algorithm simplifies and improves the best known algorithms for Massart GLMs with known activations.

Our revised approach for Massart GLMs is in line with our overall message, by showing that the Perspectron algorithm (Algorithm 1), after a slight parameter modification, achieves state-of-the-art guarantees for a more general noise model, in addition to halfspaces (Definition 1). This emphasizes the value of our simple algorithmic approach.

Technical issue with current proof

Lemma 6, Part 2 is incorrect. It is only true when $\mathbf{w}$ is a unit vector. The issue stems from the fact that the added "Push-away" term is potentially unbounded when $\|\mathbf{w}\|$ is small. Intuitively, it is impossible to induce a large margin in an unnormalized small direction.

Working Fix

As a consequence of this error, our analysis for Massart GLMs using the Push-away operation fails to go through. We propose a simple fix achieving a sample complexity of $\tilde{O}(1/(\gamma^2\epsilon^4))$. This still improves over prior work, but does not match our halfspace result.

We now present the details of our fix. Instead of the push-away operator, we use $|\mathbf{w}\cdot\mathbf{x}| \to |\mathbf{w}\cdot\mathbf{x}|+\frac{\gamma\epsilon}{2-\epsilon}$ as the modified denominator of the separating hyperplane in Lemma 5, thus bounding the norm of the step in our iterative method by $O(1/(\epsilon\gamma))$. Combining this with the iterative method leads to the new sample complexity bound. We will add the complete proof in the final version and can also attach a pdf containing it if requested by the reviewers. We now present the proof of correctness for the new separating hyperplane, by highlighting the changes to the steps in the proof of Lemma 5.
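Before stating the lemma, a quick check of the claimed $O(1/(\epsilon\gamma))$ step-norm bound (assuming the usual normalizations $\|\mathbf{x}\|\le 1$ and $\sigma(\cdot)\in[-1,1]$, so that $|\sigma(\mathbf{w}\cdot\mathbf{x})-y|\le 2$): since the modified denominator is at least $\alpha\gamma$ with $\alpha=\frac{\epsilon}{2-\epsilon}$,

$$\left\|\frac{\sigma(\mathbf{w}\cdot\mathbf{x})-y}{|\mathbf{w}\cdot\mathbf{x}|+\alpha\gamma}\,\mathbf{x}\right\| \le \frac{2}{\alpha\gamma} = \frac{2(2-\epsilon)}{\epsilon\gamma} = O\!\left(\frac{1}{\epsilon\gamma}\right).$$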

Lemma. Let $D$ be an instance of the $\sigma$-Massart GLM model with margin $\gamma$ and $\ell_{\text{0-1}}(\mathbf{w})\geq \text{opt}_{\text{RCN}}+\epsilon$. Then, we have that

$$\mathbf{E}_{(\mathbf{x},y)\sim D}\left[\frac{\sigma(\mathbf{w}\cdot\mathbf{x})-y}{|\mathbf{w}\cdot \mathbf{x}|+\alpha\gamma}\,\mathbf{x}\right]\cdot (\mathbf{w}-\mathbf{w}^*)\geq \epsilon, \qquad \text{where } \alpha=\frac{\epsilon}{2-\epsilon}.$$

Proof. We borrow notation from the proof of Lemma 5. The only change from Lemma 5 is in how we lower bound $g(\mathbf{x})$. The analysis for the case $\mathbf{x}\in A$ is identical, as it does not depend on the normalization used in the denominator of the separating hyperplane. We now analyse the case $\mathbf{x}\notin A$. There are two subcases.

First, suppose $|\mathbf{w}\cdot\mathbf{x}|\leq |\mathbf{w}^*\cdot \mathbf{x}|$. Let $c(\mathbf{x}):=\frac{|\mathbf{w}^*\cdot\mathbf{x}|}{|\mathbf{w}\cdot\mathbf{x}|}$. In this case, we have that

$$g(\mathbf{x})\geq\beta(\mathbf{x})\cdot\frac{|\mathbf{w}\cdot \mathbf{x}|+|\mathbf{w}^*\cdot\mathbf{x}|}{|\mathbf{w}\cdot\mathbf{x}|+\alpha\gamma}=\beta(\mathbf{x}) \cdot \frac{1+c(\mathbf{x})}{1+\alpha\frac{\gamma}{|\mathbf{w}^*\cdot\mathbf{x}|}c(\mathbf{x})}\geq \beta(\mathbf{x})\cdot\frac{1+c(\mathbf{x})}{1+\alpha c(\mathbf{x})}\geq \beta(\mathbf{x})(2-\epsilon),$$

where the third inequality follows from $|\mathbf{w}^*\cdot\mathbf{x}|\geq \gamma$, and the final inequality follows from $\frac{1+c}{1+c\alpha}\geq 2-\epsilon$ for any $c\geq 1$ and $\alpha=\frac{\epsilon}{2-\epsilon}$. Thus, $g(\mathbf{x})\geq |\sigma(\mathbf{w}^*\cdot\mathbf{x})|+\beta(\mathbf{x})-\epsilon$.
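For completeness, the elementary inequality used in the last step can be verified directly: for $\alpha=\frac{\epsilon}{2-\epsilon}\in(0,1)$, the map $c\mapsto\frac{1+c}{1+\alpha c}$ has derivative $\frac{1-\alpha}{(1+\alpha c)^2}>0$ and is therefore increasing in $c$, so for all $c\geq 1$,

$$\frac{1+c}{1+\alpha c}\ \geq\ \frac{2}{1+\alpha}\ =\ \frac{2}{1+\frac{\epsilon}{2-\epsilon}}\ =\ 2-\epsilon.$$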

Finally, consider the case where $|\mathbf{w}\cdot\mathbf{x}|\geq |\mathbf{w}^*\cdot \mathbf{x}|$. Here, we have that

$$g(\mathbf{x})=(|\sigma(\mathbf{w}\cdot\mathbf{x})|+\beta(\mathbf{x}))\cdot\frac{|\mathbf{w}\cdot\mathbf{x}|+|\mathbf{w}^*\cdot\mathbf{x}|}{|\mathbf{w}\cdot\mathbf{x}|+\epsilon\gamma}\geq |\sigma(\mathbf{w}^*\cdot\mathbf{x})|+\beta(\mathbf{x}),$$

as $|\mathbf{w}^{*}\cdot \mathbf{x}|\geq \gamma$.

Thus, we have proven that $g(\mathbf{x}) \ge |\sigma(\mathbf{w}^*\cdot\mathbf{x})|\pm \beta(\mathbf{x}) - \epsilon$ pointwise, where the $\pm$ depends on whether $\mathbf{x} \in A$. We now repeat the steps from Lemma 5 to complete the proof, by comparing $g(\mathbf{x})$ to the zero-one error at $\mathbf{x}$, as done in Lemmas 1 and 2.

Final Decision

This paper improves the state-of-the-art sample complexity for learning halfspaces with Massart noise, where the noise rate on each example is upper bounded by $\eta$. They match the $\eta + \epsilon$ learning guarantee established in previous works [DGT19, CKMY20] with an improved sample complexity, and also give some extensions of their result to the more challenging GLM + Massart noise setting studied by prior work. All of the reviewers agreed this advances the state of the art in this area with a simple and clean approach.