A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise
Reviews and Discussion
The paper considers the problem of PAC learning halfspaces with margin in the presence of Massart noise. The paper provides an algorithm that balances sample and computational efficiency well. Specifically, the dependence of the sample complexity on both 1/γ and 1/ε is near-optimal.
Strengths
- The studied problem is fundamental in the area of PAC learning, and the paper makes significant progress on it.
- The paper is well written, and the main techniques are well explained.
Weaknesses
- Results might be somewhat weaker than presented (see questions), and some phrasings in this context are too vague (for example "there is evidence that...").
- The natural agnostic extension of the problem is not discussed.
Questions
Questions:
- I'm not sure about the statement "...we essentially settle this question..." in line 71. As far as I understand, the optimal computational sample complexity might be the known lower bound (which is quadratic in 1/γ and in 1/ε separately) and not Θ̃(1/(ε²γ²)). Either way, the provided upper bound is impressive enough to recommend acceptance.
- Is there a specific reason not to discuss the agnostic case? Is it usually considered in Massart noise problems?
Suggestions:
Consider defining the notation used in the abstract.
Limitations
Yes.
We thank the reviewer for the time and effort in reading our paper and the positive assessment. We respond to each point raised by the reviewer below.
(Weakness 1): Results might be somewhat weaker than presented (see questions), and some phrasings in this context are too vague (for example "there is evidence that...").
Response: We thank the reviewer for this feedback. We will make our statements of prior information-computation tradeoffs precise. In the submitted version, we did not discuss these results in detail due to space limitations. Please see response below for more details.
(Weakness 2): The natural agnostic extension of the problem is not discussed.
Response: For distribution-free agnostic PAC learning, the learning problem we study is known to be computationally intractable (even for weak learning). Specifically, it is NP-hard [1] for proper learning and cryptographically/SQ-hard for improper learning [2,3]. These hardness results have historically been one of the motivations for studying the Massart model.
(Question 1): I'm not sure about the statement "...we essentially settle this question..." in line 71. As far as I understand, the optimal computational sample complexity might be the known lower bound (which is quadratic in 1/γ and in 1/ε separately) and not Θ̃(1/(ε²γ²)). Either way, the provided upper bound is impressive enough to recommend acceptance.
Response: As the reviewer points out, the best known lower bound on the computational sample complexity of the problem is quadratic in each of the two parameters of interest separately, but it does not quite match our upper bound of Õ(1/(ε²γ²)). While we believe that a lower bound of order 1/(ε²γ²) exists, this remains an open problem. Finally, we note that even for the easier model of Random Classification Noise (RCN), the best known efficient algorithm has sample complexity quadratic in both 1/ε and 1/γ (established recently in [DDK+23a]).
(Question 2): Is there a specific reason not to discuss the agnostic case? Is it usually considered in Massart noise problems?
Response: See Response to Weakness 2 above.
Suggestions: Consider defining the notation used in the abstract.
Response: Thank you. We will add this in the final version.
[1] Hardness of Learning Halfspaces with Noise, Venkatesan Guruswami, Prasad Raghavendra, FOCS 2006
[2] Complexity Theoretic Limitations on Learning Halfspaces, Amit Daniely, STOC 2016
[3] Hardness of agnostically learning halfspaces from worst-case lattice problems, Stefan Tiegel, COLT 2023
Thank you for addressing my comments and questions.
The paper considers the problem of PAC-learning γ-margin halfspaces under η-Massart noise. The paper provides an efficient algorithm achieving error η + ε with sample complexity Õ(1/(ε²γ²)). Since previous work provided evidence that an inverse-quadratic dependence on ε is necessary for efficient algorithms, the algorithm appears to be near-optimal in terms of sample complexity, up to logarithmic factors.
The algorithm relies on an iterative stochastic gradient descent approach, where at each iteration a new loss function is defined which determines the gradient step of the next iteration. The paper proves that if T iterations are performed (where T is a sufficiently large multiple of 1/(ε²γ²)), then with constant probability, one of the halfspaces obtained in the T iterations has error at most η + ε. By drawing some extra independent samples and computing the empirical error of each of these halfspaces, one can find a good halfspace.
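To illustrate the two-phase structure described above, here is a minimal sketch (not the authors' actual procedure: the surrogate-loss gradient below is a simplified stand-in for the paper's reweighted loss, and all function names are placeholders):

```python
import numpy as np

def leaky_relu_grad(w, x, y, lam=0.3):
    """Gradient in w of LeakyReLU_lam(-y * <w, x>).
    This is only a stand-in surrogate loss: the paper's algorithm additionally
    reweights each sample's loss using a separately maintained vector."""
    t = -y * np.dot(w, x)
    slope = (1.0 - lam) if t >= 0 else lam
    return slope * (-y) * x

def empirical_error(w, data):
    """Fraction of held-out examples misclassified by sign(<w, x>)."""
    return np.mean([np.sign(np.dot(w, x)) != y for x, y in data])

def learn_halfspace(sample_stream, T, step_size, validation_set):
    """Two-phase structure described in the summary above:
    (1) online SGD, one fresh sample per iteration, producing T candidate halfspaces;
    (2) selection of the candidate with the smallest empirical error on held-out data."""
    d = len(validation_set[0][0])
    w = np.zeros(d)
    candidates = []
    for _ in range(T):
        x, y = next(sample_stream)              # one fresh labeled example per step
        w = w - step_size * leaky_relu_grad(w, x, y)
        w = w / max(1.0, np.linalg.norm(w))     # project back onto the unit ball
        candidates.append(w.copy())
    errors = [empirical_error(c, validation_set) for c in candidates]
    return candidates[int(np.argmin(errors))]
```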
The paper is generally well written.
Strengths
The paper provides a near-optimal algorithm (in terms of sample complexity) for learning γ-margin halfspaces under Massart noise.
Weaknesses
I haven't found significant weaknesses in the paper.
Typos:
- Page 2, line 34: “We say that that the distribution” -> “We say that the distribution”
- Page 5, line 195: If I am not mistaken, the function g is twice the gradient.
Questions
- The authors claim that the computational complexity of the algorithm is linear in the number of samples. However, it seems to me that the complexity of step (5) scales with the number of candidate hypotheses times the number of validation samples. It does not seem trivial to implement step (5) in linear time.
- While previous work shows lower bounds that are quadratic in 1/γ and in 1/ε separately, it doesn't necessarily follow that Ω(1/(ε²γ²)) is a lower bound as well.
Limitations
Since this is a theoretical paper, I can't see any potential negative societal impact.
We thank the reviewer for the time and effort in reading our paper and the positive assessment. We respond to each point raised by the reviewer below.
(Question 1): The authors claim that the computational complexity of the algorithm is linear in the number of samples. However, it seems to me that the complexity of step (5) scales with the number of candidate hypotheses times the number of validation samples. It does not seem trivial to implement step (5) in linear time.
Response: Yes, the reviewer is correct. We will fix the statement about the runtime where appropriate. As the reviewer notes, the running time incurs an extra multiplicative factor (up to logarithmic factors). That is because the number of SGD iterations T is set to Θ̃(1/(ε²γ²)) and we need to evaluate each of the T candidate hypotheses to return the best one.
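To make the accounting explicit (writing d for the dimension, T for the number of SGD iterations, and m for the number of validation samples; this notation is introduced only for this response), the total runtime is roughly

$$
\underbrace{O(T\,d)}_{\text{SGD updates}} \;+\; \underbrace{O(T\,m\,d)}_{\text{evaluating the } T \text{ candidates on the validation set}},
$$

and it is the second term that contributes the extra multiplicative factor beyond a runtime linear in the number of samples.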
(Question 2): While previous work shows lower bounds that are quadratic in 1/γ and in 1/ε separately, it doesn't necessarily follow that Ω(1/(ε²γ²)) is a lower bound as well.
Response: We agree with the reviewer about prior work on hardness, and we will adjust our phrasing. In more detail, the best known lower bound on the computational sample complexity of this problem is the maximum of two terms, and applies to SQ algorithms and low-degree polynomial tests. The first term is the information-theoretic sample complexity of roughly 1/(εγ²) [MN06], and the second term, which is quadratic in 1/ε, follows from [DDK+23a, DDK+23b].
While we have good reason to believe that these hardness results can be improved to give a lower bound of Ω̃(1/(ε²γ²)), this remains an open problem.
I would like to thank the authors for their response. Since they addressed my minor comments, I'm increasing my score to 7.
This paper studies the problem of PAC learning γ-margin halfspaces under η-Massart noise: to PAC learn any distribution for which there exists a vector w* of bounded norm that has a margin of at least γ, i.e., for every x in the support we have |⟨w*, x⟩| ≥ γ, and where each label disagrees with sign(⟨w*, x⟩) with probability η(x), for a function η(·) bounded by η (with an overload of notation).
This is a classical problem in learning theory. Information-theoretically, one needs roughly 1/(εγ²) samples, while for computationally efficient algorithms it is widely believed that an inverse-quadratic dependence on ε (i.e., a 1/ε² factor in the sample complexity) is necessary.
The main result of this paper is an efficient algorithm with nearly optimal sample complexity Õ(1/(ε²γ²)), essentially closing the question.
Strengths
- The paper studies a problem that is important to the learning theory community and provides a state-of-the-art result. The previous best algorithm required significantly more samples.
- The paper is well-written.
Weaknesses
I do not see any major weaknesses.
Questions
None.
Limitations
This is a theoretical work with no societal impact.
We thank the reviewer for the time and effort in reviewing our paper and the positive feedback.
The paper studies the problem of learning a γ-margin halfspace with Massart noise and provides the first computationally efficient algorithm achieving 0-1 error η + ε with sample complexity Õ(1/(ε²γ²)), which nearly matches the corresponding lower bounds and improves upon the previously best known sample complexity of [Chen et al. '20]. The main contribution of the paper is a novel "global" (as opposed to previous conditioning-based) optimization approach that solves this problem directly using gradient descent. Towards this, a novel convex loss is defined using an independent reweighting vector v as a parameter, with the margin γ used as a bound on the scaling parameter. The gradient with respect to the solution vector w is used in an SGD loop, while the parameter vector is updated by the gradient step and its projection onto the unit ball is the updated solution. This clever formulation leads to the gradient decomposing into the "correct" direction of optimization plus an estimation error. The former decreases the distance to the true solution, while the latter is bounded using standard concentration arguments.
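Schematically, this corresponds to a projected online SGD update of the form (the exact loss ℓ_{v_t}, step size, and the rule for choosing the reweighting vector v_t are as defined in the paper; the display below is only a paraphrase of the structure described above):

$$
w_{t+1} \;=\; \operatorname{proj}_{\{\|w\|_2 \le 1\}}\Big( w_t \;-\; \mu\, \nabla_w\, \ell_{v_t}(w;\, x_t, y_t)\big|_{w = w_t} \Big),
$$

where the reweighting vector $v_t$ is held fixed while differentiating.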
Strengths
- Novel loss formulation and gradient update step which can be directly used via SGD.
- Simple algorithm and analysis.
- Near optimal computational bound on an important problem.
Weaknesses
Result is specific to a particular noise model and it is not clear if the techniques are more broadly applicable.
Questions
In Algorithm 1, the quantity defined inside the loop seems to be a constant independent of the iteration counter t, so it can be fixed outside the loop.
Limitations
Yes
We thank the reviewer for the positive feedback and the provided questions. We respond to each point raised by the reviewer below.
(Weakness 1): Result is specific to a particular noise model and it is not clear if the techniques are more broadly applicable.
Response: We would like to point out that the Massart (or bounded noise) model is essentially the strongest label noise model that allows for polynomial-time algorithms in the distribution-free PAC setting. In particular, if the label noise is fully adversarial (agnostic model), it is computationally hard to achieve any non-trivial error guarantees for the class of margin halfspaces (aka to achieve "weak learning"); see [1,2,3]. Moreover, we believe that the problem we study and the results themselves are interesting in their own right (the algorithmic complexity of the problem has been a longstanding open question; the NeurIPS'19 paper on the topic received the best paper award at that conference and has subsequently had a significant impact). That said, while our focus has been on this particular problem, we believe that the technical analysis of our algorithm could be of broader interest. Specifically, we feel that our white-box analysis of online SGD is novel and could be useful elsewhere. Moreover, the reweighting scheme that we employ may be useful in other problems, as it provides a method to convexify the 0-1 loss.
(Question): In Algorithm 1, the quantity defined inside the loop seems to be a constant independent of the iteration counter t, so it can be fixed outside the loop.
Response: Thank you for pointing this out. We will move this quantity outside the main loop.
[1] Hardness of Learning Halfspaces with Noise, Venkatesan Guruswami, Prasad Raghavendra, FOCS 2006
[2] Complexity Theoretic Limitations on Learning Halfspaces, Amit Daniely, STOC 2016
[3] Hardness of agnostically learning halfspaces from worst-case lattice problems, Stefan Tiegel, COLT 2023
Thank you for the rebuttal and the clarification. I will keep my rating.
This paper essentially resolves the computational sample complexity of learning γ-margin halfspaces under the Massart noise model. Here, computational sample complexity refers to the number of samples required by polynomial-time algorithms, as opposed to general statistical estimators which may be computationally intractable. Previous works have given rigorous evidence, in the form of lower bounds against large classes of algorithms such as SQ algorithms, of an information-computation gap for learning γ-margin halfspaces. Without restrictions on computation, the sample complexity is roughly 1/(εγ²) for achieving zero-one loss of η + ε, where η is the uniform bound on the Massart noise. For a large class of efficient algorithms, however, the required sample complexity is at least quadratic in 1/ε.
They show that this sample complexity is essentially tight by designing a polynomial-time algorithm for learning γ-margin halfspaces. Their algorithm is based on a novel and technically insightful choice of the loss sequence for online SGD, which results in a surprisingly simple and efficient algorithm with near-optimal sample complexity.
Strengths
This is a well-written paper that stands out for being both technically insightful and simple. A simple and efficient algorithm (online SGD with a specific choice of loss sequence) achieves optimal sample complexity and lends itself to a clean analysis. It's hard to ask for more than this.
The core idea behind the algorithm is a clever choice of the loss sequence with respect to which one runs online SGD. Previous works have already employed the LeakyReLU loss as a convex surrogate for the zero-one loss and achieved suboptimal upper bounds on the sample complexity. LeakyReLU with parameter λ is defined by LeakyReLU_λ(t) = (1 − λ)·t for t ≥ 0 and λ·t for t < 0. Applying this to the halfspace learning setting, a straightforward calculation shows that E[LeakyReLU_λ(−y⟨w, x⟩)] = E[|⟨w, x⟩|·(Pr[sign(⟨w, x⟩) ≠ y | x] − λ)], where (x, y) is a sample from the γ-margin halfspace distribution and w is a candidate halfspace. Note the resemblance to the shifted zero-one loss Pr[sign(⟨w, x⟩) ≠ y] − η (when λ = η in LeakyReLU). By the η-Massart noise assumption, this shifted loss is at most 0 for the optimal halfspace and at least ε for any halfspace with zero-one loss at least η + ε.
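For reference, the calculation alluded to above is the following one-line case analysis (using the standard LeakyReLU convention stated above; the paper may normalize the loss slightly differently):

$$
\mathbf{E}\big[\mathrm{LeakyReLU}_{\lambda}(-y\langle w, x\rangle) \,\big|\, x\big]
= (1-\lambda)\,|\langle w, x\rangle|\, p_w(x) \;-\; \lambda\,|\langle w, x\rangle|\,\big(1 - p_w(x)\big)
= |\langle w, x\rangle|\,\big(p_w(x) - \lambda\big),
$$

where $p_w(x) = \Pr[\operatorname{sign}(\langle w, x\rangle) \neq y \mid x]$: a misclassified point contributes $(1-\lambda)\,|\langle w, x\rangle|$ and a correctly classified one contributes $-\lambda\,|\langle w, x\rangle|$.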
The key difference between this shifted zero-one loss and the LeakyReLU is the |⟨w, x⟩| weighting term. In particular, if we reweight each sample's LeakyReLU loss by 1/|⟨w, x⟩|, we recover the shifted zero-one loss, which is unfortunately non-convex in w. The authors overcome this issue by considering a family of bounded and convex loss functions indexed by a reference vector v. Each ℓ_v is simply the LeakyReLU loss reweighted by (roughly) 1/|⟨v, x⟩|. It is precisely this decoupling of the halfspace parameter w and the reweighting parameter v that leads to the guarantees of the algorithm. The sequence of reweighting parameters v_t, each of which leads to a different loss ℓ_{v_t}, is chosen adaptively by online SGD.
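In other words, the loss family has (up to the exact truncation via the margin γ used in the paper) the schematic form

$$
\ell_{v}(w;\, x, y) \;\approx\; \frac{\mathrm{LeakyReLU}_{\lambda}\big(-y\,\langle w, x\rangle\big)}{|\langle v, x\rangle|}\,,
$$

which is convex in $w$ for every fixed $v$ and $\lambda \le 1/2$ (the weight $1/|\langle v, x\rangle|$ does not depend on $w$), whereas plugging in $v = w$ would give back the non-convex shifted zero-one loss.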
Weaknesses
I did not find any significant weaknesses, only minor comments regarding the presentation.
- The expression "(vector) v is independent of w" (lines 174, 176) is confusing. I think it's easy to confuse "independence" with statistical independence. It would be helpful to clarify this by adding that the reweighting term is constant with respect to the parameter w, and the reweighting term remains the same when taking the gradient of the loss with respect to w. This is already implicit in the mathematical expressions, but providing additional explanation would benefit readers.
Questions
- Does the near-optimal online SGD algorithm fall within the class of SQ algorithms? If not, could there still exist non-SQ algorithms that achieve sample complexity with subquadratic dependence on 1/ε?
Limitations
N/A
We thank the reviewer for the time and effort in reviewing our paper and for the positive feedback. Below we provide specific responses to the points and questions raised by the reviewer.
(Weakness 1): The expression "(vector) v is independent of w" (lines 174, 176) is confusing. I think it's easy to confuse "independence" with statistical independence. It would be helpful to clarify this by adding that the reweighting term is constant with respect to the parameter w, and the reweighting term remains the same when taking the gradient of the loss with respect to w. This is already implicit in the mathematical expressions, but providing additional explanation would benefit readers.
Response: We thank the reviewer for this suggestion. We will clarify this point in the final version of the paper.
(Question 1) Does the near-optimal online SGD algorithm fall within the class of SQ algorithms? If not, could there still exist non-SQ algorithms that achieve sample complexity with subquadratic dependence on 1/ε?
Response: The online SGD algorithm (using a sufficiently large batch size) can indeed be efficiently implemented as an SQ algorithm. As we explain below, our algorithm can be formulated in this way. Therefore, the previously known SQ lower bound covers the algorithm developed in our paper. In more detail, this can be seen as follows: in each iteration of our algorithm, the update rule calculates the gradient -- which is the expectation of a bounded function of a labeled example -- and this function can be viewed as the query function in the SQ model.
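Schematically (the function g below is notation introduced only for this response), the step-t update only requires an estimate of

$$
\nabla_w\, \mathbf{E}_{(x, y)\sim D}\big[\ell_{v_t}(w_t;\, x, y)\big]
\;=\; \mathbf{E}_{(x, y)\sim D}\big[\, g_{w_t, v_t}(x, y)\,\big],
$$

where each coordinate of the bounded function $g_{w_t, v_t}$ can be issued as a statistical query, so the whole algorithm fits the SQ framework.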
We also note that prior work established the same information-computation tradeoff for the class of low-degree polynomial tests (in addition to SQ algorithms). Historically, lower bounds against SQ algorithms and low-degree polynomials have been viewed as strong evidence of hardness. That said, these are restricted models of computation and it is in principle possible that an algorithm can surpass them.
Thank you for your response!
We thank the reviewers for their time, effort, and feedback. We are encouraged by the positive comments of the reviewers and that our paper was appreciated for: (i) its significant contribution (SyXT, JiXV, 2xNf); (ii) its technical novelty (g5LR, SyXT); and (iii) its writing quality (g5LR, JiXV, 2xNf). We respond to specific comments below.
This paper advances the state of the art on efficiently learning large-margin halfspaces with η-Massart noise, with target error η + ε; specifically, its upper bound's quadratic dependences on 1/γ and 1/ε are essentially not improvable in view of existing lower bounds. The reviewers also deemed the paper's technique of using online learning and convexifying the losses interesting.
The authors are encouraged to incorporate the discussion with the reviewers to clarify the running time analysis and the sense of optimality of the sample complexity bounds in the final version.