PaperHub

NeurIPS 2024 · Poster · 3 reviewers
Overall rating: 6.3/10 (min 6, max 7, std 0.5); individual ratings: 6, 6, 7
Confidence: 3.0 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 2.7

On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries

Links: OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We study the power of differentiable learning compared to the statistical query (SQ) and correlation statistical query (CSQ) methods for the problem of learning sparse support of size P out of d coordinates with d>>P.

Abstract

The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($\mathsf{SQ}$), which we call Differentiable Learning Queries ($\mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of $\mathsf{DLQ}$ for learning the support of a sparse function over generic product distributions. This complexity crucially depends on the loss function. For the squared loss, $\mathsf{DLQ}$ matches the complexity of Correlation Statistical Queries $(\mathsf{CSQ})$—potentially much worse than $\mathsf{SQ}$. But for other simple loss functions, including the $\ell_1$ loss, $\mathsf{DLQ}$ always achieves the same complexity as $\mathsf{SQ}$. We also provide evidence that $\mathsf{DLQ}$ can indeed capture learning with (stochastic) gradient descent by showing it correctly describes the complexity of learning with a two-layer neural network in the mean field regime and linear scaling.
Keywords

Statistical Query Complexity, Differentiable Learning, Sparse Functions, Leap Exponent

Reviews and Discussion

Official Review
Rating: 6

This paper studies the problem of learning juntas from the perspective of restricted statistical query algorithms. Concretely, for some d-dimensional product distribution \mu_x^d and some conditional distribution \mu_{y|z}, the learner is given access (via samples or, in this paper, certain queries) to the distribution of (y,x) where x ~ \mu_x^d and y|x has distribution \mu_{y|z} where z is some tuple of P unknown coordinates of x. This set of coordinates that influence y is called the support of the distribution, and the goal is to learn the support.

A Q-restricted statistical query algorithm is one that only makes queries to E[phi(y,x)] for functions phi in Q. Examples include SQ and CSQ; this paper introduces, for any fixed loss function \ell, another class DLQ_\ell, where the allowable queries are gradients of \ell(f(x,w),y) (w.r.t. w) for arbitrary functions f.
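As a concrete illustration of this query model, here is a minimal sketch; the specific choices of \mu_x (uniform hypercube), \mu_{y|z} (a noisy parity), the model f, and the loss are placeholders and not taken from the paper.

```python
import numpy as np

# Minimal illustration of the setup above. The particular choices of mu_x
# (uniform hypercube), mu_{y|z} (noisy parity), model f, and loss are
# arbitrary placeholders, not taken from the paper.
rng = np.random.default_rng(0)
d, P, n = 100, 3, 50_000
support = rng.choice(d, size=P, replace=False)           # the unknown coordinates

X = rng.choice([-1.0, 1.0], size=(n, d))                 # x ~ mu_x^d
flip = rng.choice([1.0, -1.0], size=n, p=[0.9, 0.1])     # label noise
y = np.prod(X[:, support], axis=1) * flip                # y | x depends only on the support

# One DLQ_ell query: E[ grad_w ell(f(x, w), y) ], here with f(x, w) = <w, x>
# and the squared loss, so grad_w ell = (f(x, w) - y) * x.  At w = 0 this is
# the correlation query -E[y x], i.e. a CSQ-type query.
w = np.zeros(d)
grad_query = np.mean(((X @ w) - y)[:, None] * X, axis=0)
print(np.abs(grad_query[support]).max())   # ~0 up to sampling error: a degree-3 parity is invisible to correlation queries
```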

The main result of this paper is a characterization, for any query class Q in {SQ, CSQ, DLQ}, of the query complexity of the support learning problem, when P, \mu_x, \mu_{y|z} are fixed but d is growing: up to lower-order terms it has the form d^k for a "leap exponent" k that can be determined from \mu_x, \mu_{y|z}, and the query class Q. As a consequence, the paper gives separations and equivalences among these classes: notably, CSQ is equivalent to DLQ for \ell_2 loss (Proposition 6.2a), but SQ is equivalent to DLQ for \ell_1 loss (Proposition 6.1a).
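For intuition, in the hypercube/CSQ special case of the prior work (an informal illustration, not a statement from this paper): a degree-$P$ parity correlates with no proper subset of its coordinates, while a "staircase" reveals its coordinates one at a time,

$$f_{\mathrm{parity}}(x) = x_1 x_2 \cdots x_P \;\Longrightarrow\; \text{leap } P, \qquad f_{\mathrm{stair}}(x) = x_1 + x_1 x_2 + \cdots + x_1 x_2 \cdots x_P \;\Longrightarrow\; \text{leap } 1.$$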

Additionally, the paper has some results showing that if one performs SGD on a two-layer neural network in the mean-field regime and with loss function \ell, then DLQ_\ell (rather than SQ or CSQ) is what generically characterizes success of the dynamics (more precisely, Leap(DLQ_\ell) = 1 is what characterizes success with O(d) samples).

Strengths

  • This paper extends prior work by Abbe, Boix-Adsera, and Misiakiewicz to a more general setting (specifically, the prior work considered the special case of CSQ and the uniform distribution on the hypercube). While CSQ has long been motivated by the fact that it captures gradient descent, a key takeaway of this paper is that this is only for \ell_2 loss, and for other losses a different characterization is needed. This seems like a useful observation for the community.
  • The paper also contrasts the power of adaptive queries versus non-adaptive queries, and gives a characterization of query complexity for both.
  • The paper is written carefully and rigorously.

Weaknesses

  • I spent a while trying to parse the notation for the proof of Theorem 5.1(a). Ultimately I think it is not super complicated, but some intuitive, high-level explanation of the construction would have been very helpful in addition to the formal definitions. Also, the proof outline explains the boilerplate elements that appear in any SQ lower bound argument but gives no intuition/proof sketch for the crucial and presumably novel inequality (25).

Questions

  • Does the characterization extend beyond just {SQ, CSQ, DLQ} to any query class Q, or if not what breaks down?
  • The equivalence between SQ and DLQ for ell_1 loss rests on a universal approximation result. I am curious if there is intuition for why the analogous approximation result for ell_2 does not hold? Is it just that there is no non-linearity in the gradient? If this is the intuition, it seems to me that in some sense "most" loss functions besides ell_2 ought to induce an equivalence between DLQ_ell and SQ? Proposition 6.2(c) is interesting because it gives a loss function that does not degenerate to SQ nor CSQ: is there no universal approximation here because it's a polynomial?
  • I assume that CSQ and SQ provably do not capture SGD on neural networks, but I didn't see a formal statement to this effect; does this follow from the separations in Section 6?
  • This is a minor technical confusion: in lines 505-506, I see why the null responses are compatible with nu^\sigma, but since nu_{S^\star} is not in the hypothesis class (since it's a partially "decoupled" distribution), I don't see why this immediately gives failure of the learning algorithm? I would have thought one needs two hypotheses in the hypothesis class to be indistinguishable. Of course I think this should also follow from (26) but I just wanted to check if I am misunderstanding.

Limitations

Yes

Author Response

We thank the reviewer for their appreciation and their valuable feedback! Below we address the weaknesses and questions.

Addressing the weakness: Thank you for pointing that out. We will add a more intuitive summary of the proof, especially how Equality (25) is derived and how the Leap exponent appears in the RHS.

We define the null distribution as the distribution with matching marginals $(y,\boldsymbol{x})$ where the label only depends on the part of the support of $\boldsymbol{x}$ discoverable with leap $k_*-1$. Hence, to make any progress, any new detectable set added to the support needs to contain at least $k_*$ new coordinates. In the second moment, to have a non-zero contribution, we need the permutation to map the support to a detectable subset, i.e., to a subset containing at least $k_*$ new coordinates, which has probability $d^{-k_*}$ over the permutation.
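Schematically (generic notation, not the exact statement in the paper): for a uniformly random permutation $\pi$ of $[d]$, a fixed set $A \subset [d]$ with $|A|=O(1)$, and distinct coordinates $i_1,\dots,i_{k_*}$,

$$\Pr_{\pi}\big[\pi(i_1),\dots,\pi(i_{k_*}) \in A\big] = \frac{|A|(|A|-1)\cdots(|A|-k_*+1)}{d(d-1)\cdots(d-k_*+1)} \le \Big(\frac{|A|}{d-k_*}\Big)^{k_*} = O\big(d^{-k_*}\big),$$

which is the counting step behind the $d^{-k_*}$ factor above.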

Questions:

  1. Yes, the characterization generalizes to any query class $\mathcal{Q}$ with the class-specific system of detectable subsets, where the set of test functions $\Psi$ is simply defined as $\{ y \mapsto \phi(y,\boldsymbol{u}) : \phi \in \mathcal{Q}, \boldsymbol{u} \in \mathcal{X} \}$. The lower bound is proved in this general setting. However, note that our upper bound uses the fact that (13) can be obtained using a query $\phi \in \mathcal{Q}$. We can add a further property on the class $\mathcal{Q}$ so that this is the case (note that the lower bound in Theorem 5.1 is not achievable for all classes $\mathcal{Q}$, for example if $\mathcal{Q}$ only contains $\phi(y,\boldsymbol{x}) = y$; however, we consider these cases to be less interesting).

  2. Yes, the general intuition behind the universality is correct. Roughly speaking, to match the $\mathsf{SQ}$ complexity for any link distribution, we need the functions ($\mathcal{Y} \to \mathbb{R}$) of the form $\Psi_{\mathsf{DLQ}}$ for a loss $\ell$, as in Eq. (14), to be dense in $L^2(\mu_y)=\Psi_{\mathsf{SQ}}$. For the squared loss, $\Psi_{\mathsf{DLQ}}$ only contains affine functions of $y$ and is therefore not dense in $L^2(\mu_y)$ (except when $y$ is supported on two points, where $\mathsf{SQ}=\mathsf{CSQ}$); see the short computation below.

The intuition for Proposition 6.2.(c) is indeed similar: $\Psi_{\mathsf{DLQ}}$ is a set of polynomials and is not dense in $L^2(\mu_y)$ (except when the support of $y$ is finite and the dimension of the span of the polynomials in $\Psi_{\mathsf{DLQ}}$ equals the number of points in the support, i.e., all polynomials of degree at most $P-1$, where $P$ is the support size). However, it is more powerful than $\Psi_{\mathsf{CSQ}}=\{ y \mapsto y \}$, which only contains the identity, and there exist distributions such that the $\mathsf{DLQ}$ exponents are strictly smaller than the $\mathsf{CSQ}$ exponents.
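To spell out the computation referenced in point 2 above (a schematic in generic notation, for a model output $u=f(\boldsymbol{x},w)$ and, for the $\ell_1$ loss, at points $u \neq y$):

$$\partial_w\,\tfrac{1}{2}\big(u - y\big)^2 \;=\; (u - y)\,\partial_w u, \qquad\qquad \partial_w\,\big|u - y\big| \;=\; \mathrm{sign}(u - y)\,\partial_w u .$$

For the squared loss, the dependence on $y$ is affine no matter which model is queried, so the induced test functions of $y$ remain affine (CSQ-like). For the $\ell_1$ loss, varying the threshold $u$ generates step functions $y \mapsto \mathrm{sign}(u-y)$, whose linear span is dense in $L^2(\mu_y)$, which is the universality mentioned above.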

  3. We do not quite understand the question on "$\mathsf{SQ}/\mathsf{CSQ}$ provably not capturing SGD". In general, the question of whether $\mathsf{SQ}$-type lower bounds capture the complexity of SGD algorithms (and other algorithms) is a subtle and difficult one. First, SQ lower bounds are proved under "adversarial noise" (worst-case responses in the tolerance interval $[-\tau,\tau]$) instead of statistical noise (when samples are used to evaluate the expectations). Second, $\mathsf{SQ}$ algorithms only constrain the access to the data distribution, not how the responses are combined to produce the next query and the output (for example, in Section 7, $\log d$ online SGD steps are enough to discover the support by looking at the gradient evaluations; however, regular SGD will require $O(d)$ online SGD steps for the weights of the neural network to be updated and fit the target function).

In general, it is known that $\mathsf{SQ}$ cannot capture the complexity of learning algorithms that use samples (e.g., separation results between PAC and $\mathsf{SQ}$ learning), even for SGD on neural networks. Indeed, [Abbe et al. 2021b] shows that with enough machine precision and for any PAC algorithm, one can construct a neural network such that SGD emulates it by implementing sample extraction. However, such a neural network is highly irregular. It is conjectured, however, that $\mathsf{SQ}$ lower bounds capture the complexity of a large class of regular algorithms (e.g., regular neural networks such as fully connected neural networks). Note that proving general computational lower bounds is difficult, and researchers often only prove lower bounds against restricted classes of algorithms such as $\mathsf{SQ}$.

Another subtlety concerns the reuse of samples between different queries. [Dandi et al. 2024] shows that by re-using samples, SGD on the square loss can beat the $\mathsf{CSQ}$ lower bound. This requires choosing hyperparameters (batch size and step size) such that two gradient steps interact to transform two $\mathsf{CSQ}$ queries into one $\mathsf{SQ}$ query. In general, we expect that sample reuse will not be able to beat $\mathsf{SQ}$ lower bounds on regular architectures (as discussed above), while $\mathsf{CSQ}$ (and $\mathsf{DLQ}$) will capture online SGD or multi-pass SGD where the hyperparameters are not chosen carefully.

Many of these issues are open questions, which do not (yet) have satisfactory answers. We discuss some of them in the introduction and the discussion section (Section 8). We will expand these discussions in the revised version.

  4. Yes, your understanding is correct. Using Eq. (26) we get that there are two members of $\mathcal{H}^d_\mu$ whose query responses are compatible with the null $\nu_{\mathcal{S}_*}$, and hence are also mutually compatible. This then establishes the failure of the learning algorithm.
Comment

Thanks for the detailed response. I maintain my score.

Official Review
Rating: 6

This paper studies the complexity of learning sparse functions (i.e., juntas) using statistical and gradient queries. For that, it introduces Differentiable Learning Queries (DLQ) that can be seen as a generalization of correlation statistical queries beyond square loss.

With this notion, the paper gives a characterization of the query complexity of learning the support of a sparse function over generic product distributions. The characterization is in terms of the leap exponent introduced in earlier work of Abbe et al.

Finally, the paper also introduces a cover exponent that characterizes non-adaptive query complexity.

Strengths

The paper studies the well-motivated question of the query complexity of learning juntas, a long-established question in ML theory. The paper builds upon the prior work of Abbe et al. and provides a generalization beyond CSQ and the square loss. I find the results strong at a technical level. To my knowledge, the main theorems are novel.

The work proves general relations between SQ, CSQ and DLQ.

The paper also provides experimental validation of the theoretical results, though it's somewhat limited.

The paper is well written.

Weaknesses

Figure 1 is an experimental study. I believe it would be nice to expand this into a small section that goes beyond learning $y_2$ (Eq. (2)) and the two loss functions used here. I also can't find more details: what is the architecture of this two-layer net?

Abbe et al. [2023] has some nicer numerical results; I'd recommend checking it out.

Compared with Abbe et al. [2023], the paper only tackles the case where the target function is a junta. It does not extend much beyond sparse functions.

Minor comments follow:

  • Line 18: remove “that is”
  • Line 24: it should be clarified that Abbe et al. [2022, 2023] study learning juntas using neural networks via SGD

Questions

Any upper bound results or conjectures on how to match the lower bound in section 5?

Limitations

The paper lists several limitations.

Author Response

We thank the reviewer for their appreciation and their valuable feedback! Below we address the weaknesses and questions.

Simulations:

Following the reviewer's suggestion, we will include additional figures (beyond just the function $y_2$) in a separate section, as well as details on the architecture and algorithm used in the simulations. We will further improve their presentation to make them more striking. In all simulations, we use the two-layer neural network stated in Eq. (15), with a sigmoid activation function. We set the biases to 0 and train $a_j$ and $w_j$ simultaneously with vanilla one-pass SGD and stepsize $\propto 1/d$.
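For reference, a minimal sketch of such a training loop (the width, number of steps, and the target below are illustrative placeholders, not the exact values used in the paper's figures):

```python
import numpy as np

# Minimal sketch of the simulation setup described above (sigmoid activation,
# zero biases, simultaneous training of a_j and w_j with one-pass SGD and
# stepsize ~ 1/d).  Width, step count, and the target are placeholders.
rng = np.random.default_rng(0)
d, m = 100, 512                              # input dimension, hidden width
n_steps = 20 * d                             # O(d) online SGD steps (linear scaling)
eta = 1.0 / d                                # stepsize proportional to 1/d
support = rng.choice(d, size=3, replace=False)

a = rng.normal(size=m) / m                   # second-layer weights (mean-field 1/m scaling)
W = rng.normal(size=(m, d)) / np.sqrt(d)     # first-layer weights; biases fixed to 0
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

for _ in range(n_steps):                     # vanilla one-pass SGD, batch size 1
    x = rng.choice([-1.0, 1.0], size=d)      # fresh hypercube sample
    y = np.prod(x[support])                  # a sparse (junta) target on 3 coordinates
    h = sigmoid(W @ x)                       # hidden activations
    f = a @ h                                # f(x) = sum_j a_j * sigmoid(<w_j, x>)
    g = np.sign(f - y)                       # derivative of the l1 loss |f - y| in f
    grad_a = g * h
    grad_W = g * (a * h * (1.0 - h))[:, None] * x[None, :]
    a -= eta * grad_a                        # update a_j and w_j simultaneously
    W -= eta * grad_W
```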


Only tackling sparse functions: In this paper, we focus on the setting of learning sparse functions (juntas). Note that learning juntas is a popular model in learning theory, and has been the testbed for many investigations on the power of different classes of algorithms. For example, learning sparse functions on the hypercube has been studied in the context of $\mathsf{SQ}$ [Kearns 1998], differentiable learning [Abbe et al. 2021b], and SGD on regular neural networks [Abbe et al. 2022, 2023, Telgarsky 2022, Margalit 2023].

[Abbe et al. 2023] considers both the uniform distribution on the hypercube and isotropic Gaussian data, which are associated with two symmetry groups: the permutation group and the orthogonal group, respectively. In this paper, we focus on the permutation group. However, compared to [Abbe et al. 2023], which only considers the hypercube and adaptive $\mathsf{CSQ}$, we vastly generalize their results by allowing: 1) general product measurable spaces, 2) general $\mathcal{Q}$-restricted $\mathsf{SQ}$ algorithms, including $\mathsf{SQ}$ and $\mathsf{DLQ}_{\ell}$, 3) both adaptive and non-adaptive algorithms. We show that in this general setting, the complexity of $\mathsf{SQ}$ algorithms is tightly captured by the leap and cover complexities. This demonstrates the generality of the leap complexity first introduced in [Abbe et al. 2022, 2023]: it is induced by the permutation group structure rather than any other property of the hypercube, and it extends beyond $\mathsf{CSQ}$ to any $\mathcal{Q}$-restricted $\mathsf{SQ}$ algorithm by changing which subsets of coordinates can be detected.

We further emphasize that key takeaways from our paper are fully independent of [Abbe et al. 2022, 2023, Damian et al. 2023, 2024]: (1) the introduction of $\mathcal{Q}$-restricted $\mathsf{SQ}$ algorithms, in particular $\mathsf{DLQ}$, which generalizes $\mathsf{CSQ}$ algorithms and is novel to the best of our knowledge; (2) functions with high $\mathsf{CSQ}$ leap [Abbe et al. 2023] can still be learned at much lower complexity by simply changing the loss function, with a tight characterization for a given loss function and target distribution using $\mathsf{DLQ}_\ell$, as well as in the mean-field regime for SGD on two-layer neural networks.

The other setting considered in the literature is the case of isotropic Gaussian data: isoLeap with multi-index functions and $\mathsf{CSQ}$ [Abbe et al. 2023], the information exponent for single-index functions and $\mathsf{CSQ}$ [Damian et al. 2023], and the generative exponent for single-index functions and $\mathsf{SQ}$ [Damian et al. 2024]. We believe that our results can be extended to this setting (and more general symmetry groups), and we plan to report on these extensions in future work. See also the discussion in Section 8.

Finally, note that our results apply to isotropic Gaussian data where the multi-index function is coordinate aligned (e.g. when the coordinate system is known in advance).


Questions:

Theorem 5.1 shows algorithms that match the lower bounds for $\mathsf{SQ}$, $\mathsf{CSQ}$, and $\mathsf{DLQ}_\ell$. The algorithm is straightforward: in the definition of the detectable subsets, condition (13) corresponds to one query evaluation; these queries can then be combined following the leap and cover complexities. See also Remark 5.2 and the proof of the upper bound in Appendix B.4, which shows how to trade off tolerance and number of queries (resulting in an extra $\log(d)$ factor). Further note that these upper bounds require knowledge of the link function. However, this can be relaxed following a procedure similar to [Damian et al. 2024b] (see Remark 5.3).
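A schematic of how such queries might be combined greedily (the detectability check of condition (13) is abstracted into a callable; names and control flow are illustrative, not the paper's exact procedure):

```python
from itertools import combinations

def recover_support(d, leap, is_detectable, support_size):
    """Greedy sketch of combining detectability queries along the leap:
    starting from the empty set, repeatedly add the smallest group of at most
    `leap` new coordinates whose union with the current set is detectable
    (each call to `is_detectable` stands for one query evaluation as in (13))."""
    S = set()
    while len(S) < support_size:
        extension = None
        for k in range(1, leap + 1):                              # smallest leaps first
            for T in combinations(sorted(set(range(d)) - S), k):  # ~ d^k candidate queries
                if is_detectable(S | set(T)):
                    extension = set(T)
                    break
            if extension is not None:
                break
        if extension is None:                                     # no detectable extension: stop
            break
        S |= extension
    return S
```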

Comment

Thanks for responding to my original review! I maintain my rating.

Official Review
Rating: 7

This submission studies the computational complexity of learning low-dimensional functions (juntas) and examines a new type of statistical query termed DLQ, which captures the difficulty of learning using gradient-based queries. The authors identify quantities that dictate the computational lower and upper bounds based on the properties of the link function and show that the DLQ complexity may match either SQ or CSQ depending on the loss function. This generalizes the earlier leap exponent of (Abbe et al. 2023), which is specific to CSQ models. Finally, in a simplified hypercube setting, the authors show that a mean-field neural network can learn functions with DLQ-leap exponent 1 using $n=O(d)$ samples, which matches the DLQ complexity.

Strengths

This submission makes a number of contributions to the learning of low-dimensional functions on product distributions:

  • The proposed DLQ "interpolates" between SQ and CSQ depending on the structure of the loss function. This provides insights into the class of low-dimensional functions that can be learned by gradient descent algorithms.

  • The authors also distinguish between adaptive and non-adaptive queries, providing a separation between algorithms with one gradient step versus consecutive steps.

  • The mean-field analysis extends (Abbe et al. 2022) to different loss functions and confirms that gradient descent training of a two-layer neural network learns juntas with DLQ leap exponent 1.

Weaknesses

I have the following concerns/questions:

  1. The class of functions with the prescribed DLQ leap exponent is rather opaque, even in the simple hypercube setting. For Gaussian single-index models, the SQ-leap exponent is completely characterized in (Damian et al. 2024), and it is known that all polynomial links have SQ exponent 1 or 2. But in the current submission, it is not clear what functions have DLQ leap exponent 1. The authors provided one example in Proposition 6.2(c) for one specific loss and link function, but there is no general principle that determines the DLQ hardness.
    Intuitively, if the output is discrete, then the detectable set for SQ versus CSQ may not differ too much, so it is not easy to see that the SQ/DLQ-learnable function class is significantly larger on hypercube input. Can the authors clarify?

  2. The current submission does not take the rotational symmetry into account, and the query dependence in the lower bound is polynomial as opposed to logarithmic in the Gaussian case (Damian et al. 2022). For algorithms that are not axis-aware, such as SGD from rotationally invariant initialization, do we expect a lower bound that is stronger in terms of query dependence?

  3. Assumption 2.1 needs more explanation beyond "it always holds when $\mathcal{X}$ is discrete". Can the authors comment on how to translate this assumption to a condition on the link function when the input is continuous, such as Gaussian?

  4. How do the authors relate the number of queries in the SQ lower bound to the sample size in the mean-field analysis?


The authors’ response addressed most of my concerns. I have updated my evaluation accordingly.

Questions

See Weaknesses.

Limitations

See Weaknesses.

Author Response

We thank the reviewer for their appreciation and their valuable feedback! Below we address the weaknesses/questions in order:


Intuition behind easily learnable functions: The simple characterization of the generative exponent in [Damian et al. 2024] is due to the setting they consider: 1) Gaussian data, 2) single-index models, 3) $\mathsf{SQ}$ algorithms. This allows them to define their generative exponent in terms of the (one-dimensional) Hermite coefficients of the Radon-Nikodym derivative. In our case, we have 1) generic product spaces, 2) multi-index functions, 3) $\mathcal{Q}$-restricted $\mathsf{SQ}$ algorithms. The generality of our setting does not allow for such a simple characterization. We further believe that despite the generality of our results, the characterization remains surprisingly simple: the class of queries $\mathcal{Q}$ defines a system of detectable subsets, while the leap and cover complexities describe how to combine these subsets to achieve the optimal query complexity (similarly to [Abbe et al. 2023] for the $\mathsf{CSQ}$ leap on the hypercube).

In the case of $\mathsf{SQ}$, $T(y)$ in (13) can be replaced by the Radon-Nikodym derivative (see Proposition A.1) and an equivalent characterization can be obtained by looking at its decomposition in the tensor product basis (analogous to the Hermite polynomials in [Damian et al. 2024]). However, for general $\mathcal{Q}$-restricted $\mathsf{SQ}$, no such simplification exists. Because of this generality, no simple principle allows us to characterize the $\mathsf{DLQ}$ exponent besides Definitions 1 and 2. Hence, Theorem 5.1 should be thought of as a meta-theorem where one has to characterize the detectable sets for each given setting (see below for illustrations and the response to reviewer Grg1). But we again emphasize that this abstract characterization captures precisely the complexity of $\mathcal{Q}$-restricted algorithms for learning juntas.

As for the question on the $\mathsf{SQ}$ and $\mathsf{CSQ}$ detectable sets not being too different on hypercube data: unfortunately, this intuition is not correct. In particular, on hypercube data, even with discrete outputs, there are functions with $\mathrm{Leap}_{\mathsf{SQ}}=1$ but $\mathrm{Leap}_{\mathsf{CSQ}}=P-1$, where $P$ is the support size, i.e., we can get arbitrarily large separations between these complexities even on hypercube data. See for example the proof of Proposition 6.2.(c) for such an instance.

For $\mathsf{DLQ}$, we show an example that is strictly between $\mathsf{SQ}$ and $\mathsf{CSQ}$, when the loss function is a polynomial. However, for non-polynomial losses, we expect in general that $\mathsf{SQ}=\mathsf{DLQ}$ (see Proposition 6.2).

Let us add that for the example in the proof of Proposition 6.2.(c), our results in Section 7 show that it can be learned in linear scaling by SGD with the $\ell_1$ loss (or a smoothed version to satisfy our technical conditions) on a two-layer neural network, while we expect online SGD on the squared loss to require $d^{P-2}$ iterations. See the response to Reviewer zev8.


Rotational symmetries and clarification of the $(q,\tau)$ relationship: Note that in [Damian et al. 2024], they show an analogous lower bound $q/\tau^2 \geq d^{k/2}$, where $k$ is the generative exponent. They further show that if $\tau^2 > d^{-k/2}$, an exponential number of queries is required. However, one should be careful in interpreting lower bounds for SQ algorithms: for example, such an exponential lower bound exists for $\mathsf{CSQ}$ (i.e., with adversarial noise), while [Damian et al. 2023] presents an algorithm with a polynomial number of queries and $\tau = \Theta(1)$ with statistical noise (batch-size-1 online SGD). Interestingly, their second lower bound for low-degree polynomials shows that there is an information-to-computation gap in their model: $\Theta(d)$ samples are optimal information-theoretically, while $\Omega(d^{k/2})$ samples are necessary for a low-degree polynomial algorithm to succeed. Such an information-to-computation gap does not appear in the axis-aligned case (permutation symmetry), even when considering Gaussian data (both are $\Theta(\log(d))$).

If we consider a rotationally equivariant algorithm to learn a sparse function on the hypercube, the issue is subtle. [Margalit 2023] (and generalizations of their results) shows that the information-theoretic threshold becomes $\Theta(d)$ (instead of $\Theta(\log(d))$). However, we believe that an information-to-computation gap does not appear: one can use the input $\boldsymbol{x}$ to first learn the axes (which requires $O(d)$ samples) and then fit a sparse function on these coordinates (which requires $O(\log d)$ samples). An open problem would be to show that a regular equivariant algorithm, such as SGD on a two-layer neural network, cannot emulate this axis-alignment first stage and would inherit the information-to-computation gap from Gaussian data.


Clarification on Assumption 2.1: Indeed this assumption requires a longer discussion, which we will include in the revised version. In short, Assumption 2.1 is necessary for our results to hold in the case of continuous functions, as argued in [Vempala, Wilmes, 2019]: with a finite and noiseless class of functions $\mathcal{H}$, it turns out that $\log |\mathcal{H}|$ queries are sufficient with $\tau^2 = \Theta(1)$ under a weak "non-degeneracy" assumption. Hence, enough noise in the link function is required to prevent these algorithms (which require highly non-regular measurable functions). Intuitively, Assumption 2.1 is a quantitative measure of "enough noise in the concepts". E.g., it is easy to check that Assumption 2.1 is always satisfied for a link of the form $y=f_*(\boldsymbol{z})+\varepsilon$, for any bounded $f_*$ and $\varepsilon \sim \mathsf{N}(0,\sigma^2)$ with $\sigma>0$, for an arbitrary product measure on $\mathcal{X}^d$.

We continue in common rebuttal space for the last point.

Author Response (continued)

This is a continuation of the response to Reviewer atP9; other reviewers need not read it.

Relation between the $\mathsf{DLQ}_\ell$ complexity and the number of online SGD samples:

In Section 7, we show that $\mathsf{DLQ}_\ell$ leap-1 functions correspond exactly to the class of functions that can be learned in the mean-field regime and linear scaling (i.e., $O(d)$ online SGD steps with batch size 1, which corresponds to $O(d)$ samples). This is similar to [Abbe et al. 2022, 2023], but now beyond the square loss and $\mathsf{CSQ}$. As noted in lines 286 to 297, this corresponds to $O(d^2)$ queries (each gradient step corresponds to $d$ queries, which is the size of the gradient vector). This is different from the optimal $\Theta(d)$ queries in Theorem 5.1. This discrepancy was already noted in [Abbe et al. 2022, 2023], and is further discussed in Section 8. The reason is that SGD on two-layer neural networks is rotationally equivariant and cannot emulate the optimal algorithm in Theorem 5.1 (this can also be seen in terms of the $\Theta(d)$ information-theoretic lower bound, which implies that $\Omega(d)$ steps are necessary). A possibility is to change the architecture of the two-layer neural network to make it "axis-aware", as mentioned in Section 8 (see also the response to Reviewer Grg1).

Despite this discrepancy, we included Section 7 to show the relevance of the $\mathsf{DLQ}$-leap complexity in a practical setting (SGD on a regular two-layer neural network), as well as to provide a concrete separation result for regular SGD between different losses: SGD on the $\ell_1$ loss will vastly outperform online SGD on the square loss as soon as the $\mathsf{SQ}$ leap is 1 but the $\mathsf{CSQ}$ leap is much bigger (e.g., the example in Proposition 6.2.(c)).

Comment

Another clarification regarding Assumption 2.1 which some reviewers were concerned about:

We phrased the assumption this way since we wanted to be as general as possible while maintaining rigor. The assumption always holds for $y=f_*(\boldsymbol{z})+\mathsf{N}(0,\sigma^2)$ with $\sigma^2>0$, which is a more restricted continuous model and perhaps the one of greatest interest; we will emphasize this better in the paper. We note that an alternative proof, which we will include in the final version, can avoid this assumption for $\mathsf{DLQ}$ with analytic losses (including $\mathsf{CSQ}$), but not for general $\mathsf{SQ}$ or $\mathsf{DLQ}$, as noted in Valiant [2012], Song et al. [2017], Vempala and Wilmes [2019], Andoni et al. [2014].

Essentially all related $\mathsf{SQ}$ lower bounds require this assumption: (i) sometimes implicitly [Damian et al. 2024, Feldman et al. 2018], (ii) by virtue of $\mathcal{X}$ being discrete [Diakonikolas et al. 2022, Feldman et al. 2017, Abbe et al. 2023] or by only considering $\mathsf{CSQ}$ [Damian et al. 2023, Abbe et al. 2023], or (iii) by restricting to $y=f_*(\boldsymbol{z})+$ noise [Abbe et al. 2023]. Even more generally, see the hypothesis testing literature: [Hopkins 2018, Kunisky et al. 2019] for low-degree lower bounds and [Perry et al. 2018] for contiguity lower bounds.


Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, pages 11–20. IEEE, 2012.

Le Song, Santosh Vempala, John Wilmes, and Bo Xie. On the complexity of learning neural networks. Advances in neural information processing systems, 30, 2017.

Santosh Vempala and John Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and SQ lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.

Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning sparse polynomial functions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 500–510. SIAM, 2014.

Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao. Statistical algorithms and a lower bound for detecting planted cliques. Journal of the ACM (JACM), 64(2):1–37, 2017.

I. Diakonikolas, D. Kane, L. Ren, and Y. Sun. SQ lower bounds for learning single neurons with Massart noise. Advances in Neural Information Processing Systems, 2022.

Samuel Hopkins. Statistical inference and the sum of squares method. PhD thesis, Cornell University, 2018.

Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio. In ISAAC Congress (International Society for Analysis, its Applications and Computation), pages 1–50. Springer, 2019.

Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality and suboptimality of PCA I: Spiked random matrix models. The Annals of Statistics, 46(5):2416–2451, 2018.

Final Decision

The authors present Differentiable Learning Queries (DLQ), a new theoretical framework for differentiable learning, comparing it to the existing SQ and CSQ frameworks. This clarifies that the complexity of differentiable learning is distinct from both, which is a useful observation for the community. They give a characterization of the power of DLQ for learning the support of a sparse function. All reviewers viewed the paper favorably. One of the reviewers increased their rating after discussion.