PaperHub
Rating: 6.0 / 10 (Poster; 4 reviewers; min 5, max 7, std 0.7)
Individual ratings: 6, 7, 5, 6
Confidence: 2.5 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.5
NeurIPS 2024

Learning a Single Neuron Robustly to Distributional Shifts and Adversarial Label Noise

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

We study the problem of learning a single neuron with respect to the $L_2^2$-loss in the presence of adversarial distribution shifts, where the labels can be arbitrary, and the goal is to find a "best-fit" function. More precisely, given training samples from a reference distribution $p_0$, the goal is to approximate the vector $\mathbf{w}^*$ which minimizes the squared loss with respect to the worst-case distribution that is close in $\chi^2$-divergence to $p_{0}$. We design a computationally efficient algorithm that recovers a vector $ \hat{\mathbf{w}}$ satisfying $\mathbb{E}_{p^*} (\sigma(\hat{\mathbf{w}} \cdot \mathbf{x}) - y)^2 \leq C \hspace{0.2em} \mathbb{E}_{p^*} (\sigma(\mathbf{w}^* \cdot \mathbf{x}) - y)^2 + \epsilon$, where $C>1$ is a dimension-independent constant and $(\mathbf{w}^*, p^*)$ is the witness attaining the min-max risk $\min_{\mathbf{w}:\|\mathbf{w}\| \leq W} \max_{p} \mathbb{E}_{(\mathbf{x}, y) \sim p} (\sigma(\mathbf{w} \cdot \mathbf{x}) - y)^2 - \nu \chi^2(p, p_0)$. Our algorithm follows the primal-dual framework and is designed by directly bounding the risk with respect to the original, nonconvex $L_2^2$ loss. From an optimization standpoint, our work opens new avenues for the design of primal-dual algorithms under structured nonconvexity.
Keywords
Distributionally-Robust Optimization · Nonconvex Optimization · Primal-Dual Algorithm

Reviews & Discussion

Review (Rating: 6)

The authors consider the problem of fitting a single neuron with respect to distributional uncertainty modeled by a $\chi^2$ divergence. The authors prove convergence to within a constant factor of the optimal value. The proof of this result proceeds by calculating lower and upper bounds on a duality gap. Unfortunately, prior work shows that finding an algorithm that converges to optimality is in general computationally infeasible, so this is the best result one could hope for.

Strengths

Learning with a single neuron (aka the perceptron algorithm) is a well-studied method in machine learning that sheds light on the behavior of more complicated methods. The authors seek to understand this method in the context of distributional uncertainty modeled by a chi-squared divergence.

Weaknesses

  1. Main concern 1. The steps and intermediate quantities of Algorithm 1 were not sufficiently explained. For instance: the step size $a_i$ is chosen to always be larger than 1. Why does the algorithm not diverge with such a choice? What is the "interpolated quantity" $g_i$? Can you explain the choices for $v$ and $w$ in lines 4 and 6?

  2. Main concern 2. The set $\mathcal{P}(p_0)$ is not sequentially compact. Thus one cannot conclude the existence of the maximizer $q_w$ defined in the second equation below line 74 just from compactness. The existence of this maximizer seems to be related to properties of the $\chi^2$ divergence. Could you elaborate?

Questions

Does your algorithm actually converge? The classical perceptron algorithm is not guaranteed to converge when $\mathrm{OPT} \neq 0$.

Do you have a stopping criterion for your algorithm?

Other comments:

In the introduction/title/abstract: when you discuss "robustness" and "adversarial", you should make it clear that you don't mean adversarial robustness.

1st equation below line 74: the $(= \Lambda_{\sigma,\rho}\ldots)$ portion is very confusing.

2nd equation below line 74: why is there another expectation over $(\mathbf{x}, y) \sim \rho$? This expectation was already computed in the definition of $L_\sigma$ in the line above.

lines 84-88: it seems your algorithm only considers "reweightings" of the sampled points for $\hat{\rho}$. Does this suffice for modelling the entire $\chi^2$ ball around $\hat{p}_0$?

lines 144-145: "This condition is ... requires convergence". It's really not obvious that this condition holds at initialization, or that convergence would imply that it holds for all iterates.

line 204: What is $h(r)$?

line 212-213: Shouldn't Assumption 2.3 always be true because the sample data set is finite?

line 250: $M$ was not defined.

line 273: Why would one consider looking at $\sum_{i=1}^k a_i \mathrm{Gap}(w_i, \hat{p}_i)$? Can you better motivate this quantity?

lines 274-276: this was hard to understand

lines 214-216: this comment was also non-obvious.

Limitations

Yes. The authors acknowledge that their theorems don't prove convergence.

Author Response

On our problem formulation and the perceptron algorithm:

We would like to clarify that our goal is not to understand the perceptron algorithm. We design a new method that is provably efficient and accurate for a much more challenging learning problem than the perceptron algorithm was designed to address. We would further like to note the distinction between the learning problem and the algorithm to solve it.

The perceptron algorithm efficiently learns Linear Threshold Functions with large margin in the realizable setting. We consider a far more challenging problem: we have adversarial label noise (the agnostic setting) and distributional shifts. Moreover, our activations include ReLU, leaky ReLU, exponential linear unit (ELU), and normalized SoftPlus (i.e., $\sigma(0) = 0$), but not the sign activation that the perceptron deals with.

On the “convergence” of Algorithm 1:

Recall that OPT refers to the error in $L_2^2$ loss achieved by the best-fitting solution to the data. As we discussed in Section 1 (Lines 24–34), no algorithm can find a solution that achieves $\mathrm{OPT} + \epsilon$ error in polynomial time, even if the $\mathbf{x}$-marginal distribution is Gaussian and even without considering distributional robustness [GKK19; DKZ20; GGK20; Dia+21; DKR23].

The best guarantee one can hope to achieve for any polynomial-time algorithm is to find an $O(\mathrm{OPT}) + \epsilon$ error solution. We prove that our algorithm can find such a solution (in polynomial time). Although our algorithm is not guaranteed to converge to a point, in the sense that $\lim_{k \rightarrow \infty} \mathbf{w}_k$ might not exist, we prove that our algorithm efficiently finds an $O(\mathrm{OPT}) + \epsilon$ solution in $k$ iterations, where the number of iterations $k$ is given in Theorem 3.1. In other words, the sequence generated by our algorithm converges to a set, in the sense that asymptotically all iterates lie in a set of $O(\mathrm{OPT})$ solutions, which are the target solutions for this problem. When we use the word "convergence" in our paper, we mean this weaker notion of convergence.

On stopping criteria:

We know that Algorithm 1 finds a good solution after $k$ iterations; so if we have an upper bound $K$ for $k$, we may simply terminate the algorithm after $K$ iterations. A direct upper bound on $k$ can be obtained from Theorem 3.1 via $D_0 \le W^2 + \nu_0 c_1 / (1536 N \beta^2)$, which follows directly from Line 262.

On the existence of the maximizer $q_{\mathbf{w}}$:

Even without sequential compactness, the existence of optima is a well-known result for a wide range of discrepancy-based ambiguity sets (e.g., Wasserstein distance, KL divergence, chi-square divergence). In fact, Lemma C.1 gives a closed-form formula for $q_{\mathbf{w}}$. The key property we used is that the function $p \mapsto L_\sigma(\mathbf{w}, p; p_0)$ is strongly concave, because the chi-square divergence $\chi^2(p, p_0)$ is strongly convex in $p$.
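For intuition, χ²-penalized inner maximizations of this type admit closed forms quite generally. The sketch below is ours, not the paper's Lemma C.1: it uses the convention $\chi^2(p, p_0) = \sum_i (p_i - p_0(i))^2 / p_0(i)$ over a finite sample, in which case the worst-case reweighting is a clipped linear tilt of the base distribution, with the normalizing threshold found by bisection.

```python
import numpy as np

def chi2_worst_case_weights(losses, p0, nu, tol=1e-10):
    """Maximize sum_i p_i * losses_i - nu * chi2(p, p0) over the probability
    simplex, with chi2(p, p0) = sum_i (p_i - p0_i)^2 / p0_i.

    Stationarity (KKT) gives p_i = p0_i * max(1 + (losses_i - lam) / (2 nu), 0),
    with the multiplier lam chosen so that the weights sum to one.
    (Hypothetical helper; the paper's exact formula may differ.)
    """
    def total(lam):
        return np.sum(p0 * np.maximum(1.0 + (losses - lam) / (2.0 * nu), 0.0))

    # total(lam) is nonincreasing, with total(lo) >= 1 >= total(hi) at the
    # bracket below, so bisection on lam converges.
    lo, hi = losses.min() - 2.0 * nu, losses.max() + 2.0 * nu
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return p0 * np.maximum(1.0 + (losses - lam) / (2.0 * nu), 0.0)
```

Higher-loss samples receive larger weight, and larger ν pulls the weights back toward the base distribution, which is the strong concavity in $p$ referred to above.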

On the motivations behind the steps and intermediate quantities of Algorithm 1 and its analysis with Gap functions:

The accumulated gap $\sum_{i=1}^k a_i \mathrm{Gap}(\mathbf{w}_i, p_i)$ is by now a standard quantity for tracking convergence of primal-dual methods, widely used in other works that study convex-concave problems [SWD21; Dia+22d; Son+22; MDH24]. When the problem is convex-concave, this quantity is nonnegative. Such a property does not hold in our setting, as the loss is generally nonconvex. Instead, we proved in Lemma 3.2 that the gap is bounded below by a quadratic function of the distance to the optimum outside the set of target $O(\mathrm{OPT})$ solutions, which is crucial for carrying out our argument ensuring contraction of the distance to target solutions.

On the algorithm side, each step of the algorithm is fully motivated by the analysis, as detailed in Appendix D, which is not a feature of the algorithms in the prior works. For instance, the "extrapolated gradient" $\mathbf{g}_i$ is motivated in Lemma D.2 to facilitate the telescoping in Eq. (32); the choice of $\mathbf{v}$ in line 4 of the algorithm is naturally motivated by the argument in the proof of Lemma 3.4, as we explained in Section 1.3 (Lines 156–162).
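To illustrate the extrapolation device in isolation, here is a toy primal-dual loop. It is a hypothetical simplification, not Algorithm 1: leaky-ReLU activation, constant step sizes, a crude simplex projection on the dual side, and a fixed iteration budget. The point is only how an extrapolated gradient combines the current and previous primal gradients while the dual variable reweights the samples.

```python
import numpy as np

def sigma(z):
    """Leaky ReLU with slope 0.1 (one of the activations the paper covers)."""
    return np.where(z > 0, z, 0.1 * z)

def dsigma(z):
    return np.where(z > 0, 1.0, 0.1)

def primal_grad(w, p, X, y):
    """Gradient in w of sum_i p_i (sigma(w . x_i) - y_i)^2; an O(nd) matvec."""
    z = X @ w
    return X.T @ (2.0 * p * (sigma(z) - y) * dsigma(z))

def primal_dual_sketch(X, y, nu, step_w, step_p, K, theta=1.0):
    """Toy primal-dual loop with gradient extrapolation, run for K iterations."""
    n, d = X.shape
    w = np.zeros(d)               # initialize at w = 0, as in the paper
    p = np.full(n, 1.0 / n)       # dual variable: a reweighting of the samples
    p0 = p.copy()
    g_prev = primal_grad(w, p, X, y)
    for _ in range(K):
        g = primal_grad(w, p, X, y)
        w = w - step_w * (g + theta * (g - g_prev))  # extrapolated gradient
        # dual ascent on sum_i p_i ell_i - nu * chi^2(p, p0), then a crude
        # projection back onto the probability simplex
        ell = (sigma(X @ w) - y) ** 2
        p = np.maximum(p + step_p * (ell - 2.0 * nu * (p - p0) / p0), 0.0)
        p = p / p.sum()
        g_prev = g
    return w, p
```

This illustrative loop carries none of the paper's guarantees (the step sizes $a_i$ and the interpolation sequences are what the analysis in Appendix D actually pins down), but it shows why the extrapolation term is the natural knob for telescoping arguments.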

On sufficiency of reweighting of samples:

Because the algorithm needs to work with an empirical version of the problem, we had to prove concentration of the risk, i.e., analyze the connection between the empirical target distribution $\hat{p}^*$ and the population target distribution $p^*$. To give more context: the definition of our Gap function relies upon the empirical target distribution, and our dual updates $\hat{p}$ try to estimate it. On the other hand, the population target distribution defines the risk that the solution to Problem 1.3 tries to minimize. Their connection is captured in Lemma 2.5 and Appendix C: we show that with enough samples from $p_0$, reweighting of those samples suffices to model the entire $\chi^2$ ball around $p_0$. This result is nontrivial, as uniform convergence results do not apply, because $\hat{p}^*$ is not the uniform distribution over samples drawn from $p^*$.

Please refer to the global comment for a discussion about robustness and adversarial label noise. We thank the reviewer for pointing out the typos in Line 74 and Line 204. Other technical comments about the presentation:

  • Line 250: $M$ is defined in Eq. (6).
  • Lines 214–216: We will cite Claim E.2, where this point was explained formally.
  • Lines 142–145: We initialize at $\mathbf{w} = 0$, so the condition $\|\mathbf{w}\| \le 2\|\mathbf{w}^*\|$ holds at initialization. We proved in Claim 3.5 that this condition holds for all iterations.
  • Assumption 2.3: We would argue that this assumption is mild but nontrivial. The boundedness parameter $S$ appears in the iteration and sample complexities. To assert that we have a polynomial-time algorithm, it is important that $S$ does not depend exponentially on the target error $\epsilon$ or the input dimension $d$.
Comment

On the "convergence" of algorithm 1: I think clarifying this point in the paper would be helpful to the reader

intermediate quantities in algorithm 1: Please add a discussion of this motivation to the main text of your paper, even though these quantities have appeared in prior work. It's essential for readers who are new to the field

On sufficiency of reweighting of samples: Make sure this is stated clearly in the main text. It seems very important for your argument!

The appendix: It seems that in several places, central arguments appear only in the appendix, and a reference to this fact in the main text is omitted. Please make sure this does not occur!

Assuming these changes, I'm increasing the score.

Comment

Thank you very much for your helpful feedback and suggestions. We commit to incorporating the suggested changes to improve the presentation of the paper, and we appreciate your positive response.

Review (Rating: 7)

This paper addresses the problem of learning a single neuron in a distributionally robust setting, where both the input distribution and labels can be adversarially perturbed. The authors propose an efficient primal-dual algorithm that achieves a constant-factor approximation to the optimal loss, with respect to the worst-case distribution within a chi-squared divergence neighborhood of the reference distribution.

The proposed algorithm iteratively updates both the weight vector and a reweighting of the empirical distribution. The analysis relies on carefully bounding a gap function related to the primal-dual objective. A key technical lemma bounds the nonconvex parts of the objective, allowing the analysis to go through despite the nonconvexity.

Strengths

Learning a single neuron is an important problem towards understanding deep learning. This paper extends prior works to the setting of distributionally robust loss. The authors give a concrete primal-dual algorithm to optimize it, and provide rigorous theoretical generalization guarantees for the algorithm. The algorithm is computationally efficient, with polynomial ($O(d/\epsilon)$) sample complexity. The analysis introduces novel techniques for handling nonconvexity in primal-dual methods, which, to me, is interesting. Their theory also applies to various activation functions, not restricted to ReLU.

Weaknesses

It seems that [Dia+22a] and [Dia+22b] also studied robustly learning a single neuron. I am not familiar with that literature, so I cannot give a very accurate judgement on the degree of technical novelty of this paper.

Questions

Please see the weaknesses part.

Limitations

They discussed the limitations. This is a pure theory paper so I do not see potential negative societal impact.

Author Response

Thank you for your question. While [Dia+22a] and [Dia+22b] have made significant contributions to learning a single neuron robustly to adversarial label noise, our work additionally addresses robustness to distributional shifts, which introduces unique and complex challenges not covered in those papers. In fact, our algorithmic results are new in the DRO setting even without the robustness guarantee (i.e., in the realizable setting).

  • Distributionally Robust Optimization (DRO) setting:
    Unlike the settings considered in [Dia+22a] and [Dia+22b], the DRO framework involves optimizing over the worst-case distribution within a specified ambiguity set. The results in [Dia+22a], [Dia+22b], [Dia+20], and [Wan+23a] crucially rely on distributional assumptions that we only make for the target (loss-maximizing) distribution. Once we introduce distributional shifts, it is unrealistic to assume that each distribution in the ambiguity set satisfies even basic concentration properties. This is a major challenge for our algorithm, as we cannot rely on these properties to bound the loss over iterations. To address this, we prove an additional structural result (Lemma 3.4), which we believe is of independent interest.

  • Algorithmic Approach:
    All prior works [Dia+22a], [Dia+22b], [Dia+20], [Wan+23a] dealt with minimization problems that can be addressed by running (vanilla/projected/stochastic) gradient descent on either a convex surrogate or the $L_2^2$ loss directly. In our case, we have a nonconvex-concave, non-bilinearly coupled problem, and each of these characteristics comes with its own technical challenges, as discussed in the introduction (Section 1.3); see also point (2) in our response to reviewer ehoc, copied below:

(2) Optimization algorithm analysis (Lemmas 3.2 and 3.3):
We follow the general approach of primal-dual methods from the literature for clarity. However, as noted in the introduction, standard primal-dual techniques are not applicable here because:

  • Negative Gap: Our gap function can be negative because the loss is nonconvex. We handle this by proving a lower bound for the gap function (Lemma 3.2).
  • Nonlinear Coupling: The coupling between primal and dual variables is nonlinear, posing a challenge even for strongly concave problems in recent studies (e.g., [MDH24]).
  • Absence of Lower Bounding Hyperplanes: Convex methods use lower bounding hyperplanes, which are not available in nonconvex settings. We address this by proving a new structural result (Lemma 3.4), which is vital for our analysis and noteworthy on its own.

We appreciate the opportunity to address these points and are open to further clarifications or suggestions.

Comment

Thank you for your detailed response. I tend to keep my score.

Review (Rating: 5)

The authors study distributionally robust optimization of the $\ell_2^2$ loss for a single $d$-dimensional neuron with a monotone Lipschitz activation function, over sample distributions $p$, with a chi-square regularizer that ensures $p$ is close to a given $p_0$. Their main contribution is an efficient primal-dual algorithm which recovers a weight $w$ close to the optimal weight (in $\ell_2^2$ norm) in $O(d)$ iterations, as long as the optimal distribution $p^*$ satisfies certain sub-exponential concentration properties.

Strengths

The problem is natural and practically important, since distribution shift issues occur in practice, and before we study large neural nets it's important to be able to answer the question for a single neuron. The proofs are fairly detailed, which is often missing in many conference papers.

Weaknesses

The crux of the paper is Theorem 3.1, which bounds the error in each iterate of their primal-dual algorithm. The proof of the theorem lies in bounding the primal-dual gap, and while the argument is interesting, the ideas are elementary and do not seem to be new to the area. I am happy to be corrected if I have missed any significantly new intuition or novelty in their arguments in Lemmas 3.2 or 3.3 (which are the crux of their proof).

Questions

Can this be extended to two single-neuron GANs? The min-max structure might be a simple yet tractable extension of your single-neuron setting.

Limitations

No societal impact, theory result.

Author Response

Thank you for your comments.

On the intricacies of proving Theorem 3.1 (Main Theorem):

Broadly speaking, the proof of this theorem relies on two technical contributions as we discussed in Section 1.3:

  1. Concentration of the target distribution:
    Recall that the target distribution is the solution to the min-max problem, while the reference distribution is where we draw samples from. We emphasize that, since the algorithm must work with an empirical version of the problem, we had to prove the concentration of risk on the target distribution. This is crucial for transferring assumptions (Assumptions 2.1 and 2.2) from the population target distribution to the empirical target distribution, which the definition of our Gap function relies on and which our dual updates $\hat{p}$ try to estimate. The challenge is that typical uniform convergence results do not apply because our assumptions are only on the population target distribution. The empirical target distribution depends on samples from the empirical reference distribution, making it hard to transfer these properties. We address this by proving an interesting connection between the distributions in Lemma 2.5.

  2. Optimization algorithm analysis (Lemmas 3.2 and 3.3):
    We follow the general approach of primal-dual methods from the literature for clarity. However, as noted in the introduction, standard primal-dual techniques are not applicable here because:

    • Negative Gap: Our gap function can be negative because the loss is nonconvex. We handle this by proving a lower bound for the gap function (Lemma 3.2).
    • Nonlinear Coupling: The coupling between primal and dual variables is nonlinear, posing a challenge even for (strongly) convex-concave problems in recent studies (e.g., [MDH24]).
    • Absence of Lower Bounding Hyperplanes: Convex methods use lower bounding hyperplanes, which are not available in nonconvex settings. We address this by proving a new structural result (Lemma 3.4), which is vital for our analysis and noteworthy on its own.

Challenges of learning neural networks:

Thank you for the question. Analyzing two single-neuron GANs is an interesting problem that may share similar techniques but is outside the scope of this work.

We would like to emphasize that learning a single neuron in the agnostic model, even without distributional robustness, is an ongoing research challenge. Notably, our results in the DRO setting are new even for the realizable setting (except for the special case of linear regression). Our work handles a class of problems nearly as large as those addressed without distributional robustness in a recent ICML paper [Wan+23a], demonstrating the significance and breadth of our approach.

We appreciate the opportunity to address these points and are open to further clarifications or suggestions.

Review (Rating: 6)

This work considers labels with noise and possible distribution shifts of the data. The authors aim to minimize the model's loss on a worst-case distribution from a set of distributions close to the reference distribution, which is defined as the ambiguity set. The activation functions used to produce labels are nonconvex, including ReLU, leaky ReLU, exponential linear unit, and normalized SoftPlus. In such a situation, distributionally robust optimization can achieve a stationary point, which is insufficient for learning a ReLU neuron. To close the gap, this work proposes the first polynomial sample and time algorithm for learning a neuron in a distributionally robust setting for a broad class of activations.

To this end, the authors make two assumptions about the target distribution's covariates. Then, the loss function on the target distribution is claimed to be sharp. The proposed algorithm produces a sequence of primal-dual pairs $(w, p)$ given a sequence of positive step sizes. The difference between the optimal $w$ and the empirical $w$ is called the gap. The gap function is bounded, which further leads to the convergence of the algorithm, as shown in Theorem 3.1.

Strengths

This work considers labels with noise and possible distribution shifts of the data. Existing works do not consider both of these concerns.

The problem is with practical interests.

The authors propose a novel algorithm to solve the problem.

Weaknesses

  1. The variable $\hat{p}_i$ in Algorithm 1 (line 261 on page 7) is not updated in other iterations.

  2. The analysis about time complexity and memory cost is missing.

  3. It is not clear where the label noise is discussed in Theorem 3.1 on page 6.

  4. How to control distribution shift is not discussed in Theorem 3.1.

Questions

  1. What is the time complexity and memory cost of the algorithm?

  2. Where do authors consider label noise in Theorem 3.1?

  3. How to control distribution shift is not discussed in Theorem 3.1.

Limitations

The time and memory cost is not provided. This may limit applications in real-world settings.

Author Response

Thank you for your comments.

Line 261, page 7:
The variable $\hat{p}_i$ in Algorithm 1 (line 261 on page 7) is updated in Line 7 of the current iteration and used in Line 5 of the subsequent iteration.

Memory cost and runtime:
The memory cost and runtime analysis is standard for the area and often omitted beyond the number of iterations: the memory needed to store the samples is $O(nd)$ and the variables maintained by the algorithm require $O(n + d)$, so the total memory cost is still $O(nd)$. The runtime is the number of iterations $k$ given in Theorem 3.1 multiplied by the per-iteration complexity $O(nd)$.

Adversarial label noise:
We explain this in the common response.

Accounting for distribution shift:
The problem formulation of the DRO model itself already accounts for the worst-case distribution shift. We do not discuss it explicitly because it is a standard model (see [Bla+24, Section 1]). We will add a brief commentary on how the DRO model accounts for distributional shift: here we are bounding the gap (Line 135) with respect to the worst-case distribution $\hat{p}^*$ from the ambiguity set, which corresponds to the worst-case distributional shift.

Ethical review:
Our paper only presents theoretical results, and we are confused by the need for an ethical review on Human rights (including surveillance). We are wondering if this was perhaps selected in error or if the reviewer could clarify the concern.

We appreciate the opportunity to address these points and are open to further clarifications or suggestions.

Comment

I have reviewed the authors' response and will maintain my current score.

Author Response (Global Comment)

We thank all the reviewers for their valuable time and feedback. We are encouraged by the reviewers’ positive comments about novelty (3Jnc) and importance of the problem (ehoc, aZHR).

Here we address some common concerns and give more detailed responses to the reviewers’ specific concerns below.

Adversarial label noise versus adversarial robustness

Our paper explores the problem of learning a single neuron robustly in the presence of adversarial label noise and distributional shifts. It is important to distinguish between different models of robustness.

Adversarial label noise:

There are several ways to model label noise. In this paper, we consider the standard agnostic setting, where the goal is to find the best-fit function (in our case, in terms of the $L_2^2$ loss) to arbitrary given labels, as explained in Lines 24–26 of our paper. In the case of boolean functions, [KSS92] proved that the agnostic setting is equivalent to learning with maliciously corrupted labels.

Adversarial robustness:

In modern deep learning literature (e.g., [GSS14]), robustness to perturbed inputs at test time is often referred to as adversarial robustness. This type of robustness guards against small perturbations to the inputs of neural networks that lead to incorrect model outputs. While our paper does not address this model directly, we acknowledge the distinction and will add a sentence to clarify our focus. Additionally, there are existing works (e.g., [SND18]) that discuss the relationship between distributionally robust optimization (DRO) and adversarial robustness.

References:
[GSS14] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Final Decision

This paper studies PAC learning of a single neuron under distributional shifts of covariates and adversarial label noise. Prior works have established hardness results for distribution-independent learning and for exact learning (where approximation ratio must be 1). This paper gives a polynomial-time algorithm that achieves a constant-approximation ratio guarantee when the underlying distribution is well-behaved. A major technical novelty of the algorithm is utilizing the primal-dual toolkit to track a potential function and showing that it is decreasing over the iterations.

The paper is overall well written and reviewers agree that it carries out novel technical contributions and the results look sound. The proposed approach may inspire a line of research on bridging optimization and structural nonconvexity.