6.0

/10

Poster4 位审稿人

最低6最高6标准差0.0

3.8

置信度

正确性3.3

贡献度2.5

表达3.3

NeurIPS 2024

Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent

Yiwen Kou,Zixiang Chen,Quanquan Gu,Sham M. Kakade

OpenReview PDF

提交: 2024-05-15更新: 2024-11-06

摘要

The $k$-sparse parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-sparse parity problem with sign stochastic gradient descent, a variant of stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that this approach can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube ($k\le O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, matching the established $\Omega(d^{k})$ lower bounds of Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network capable of correctly solving the $k$-parity problem. We then demonstrate how a trained neural network with sign SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors. To the best of our knowledge, this is the first result that matches the SQ lower bound for solving $k$-sparse parity problem using gradient-based methods.

关键词

k-ParityOptimizationGeneralization

评审与讨论

审稿意见

评分: 6置信度: 42024-07-11

This paper studies the problem of learning the $k$ -sparse parity problem over the $d$ -dimensional boolean hypercube, using a two-layer neural network. The main result is that a specific modification of online SGD, called "sign SGD," can learn the k-parity problem $n = \tilde O(d^{k-1})$ samples and a network width of $2^{\Theta(k)}$ ; the number of scalar queries to the model is $m\cdot n \cdot d = \tilde O(d^k)$ , and thus this algorithm matches the SQ lower bound of $d^k$ .

优点

The k-sparse parity problem has attracted much recent interest in the deep learning theory community, with many prior works focusing on both lower bounds (via the SQ/CSQ framework) and learning guarantees for neural networks. This paper makes good progress by showing that the SQ lower bound of $d^k$ queries can indeed be achieved by a gradient-based algorithm.
The paper is well written and easy to understand, and I was able to follow the proof sketch.
I have skimmed the appendices, and the proofs appear to all be correct.

缺点

My main issue with this paper is the choice of the Sign SGD algorithm. Unlike vanilla SGD, Sign SGD is not invariant to the choice of the basis, and is implicitly taking advantage of the structure of the problem being k-parity (where in the ground truth solution each neuron weight is either -1, 0, or 1). For example, the Sign SGD algorithm here would fail if the task was instead k-parity in an unknown basis. Many of the prior works discussed in Tables 1 and 2, however, are rotationally invariant.
In particular, the paper (Glasgow, 2023) had the more ambitious goal to show that vanilla SGD, with minimal modifications, can learn the 2-parity problem. The algorithm in that paper is rotationally invariant and will succeed if the target was XOR in an unknown basis. I thus find the comparisons to (Glasgow, 2023) in the paper, such as "our result matches the best sample complexity for solving the XOR (i.e., 2-parity) problem (Glasgow, 2023)" (line 84) to be a bit misleading.
On a more minor note, I also find the choice of activation $\sigma(z) = z^k$ to be unrealistic, as this again is essentially hardcoding the fact that the target problem is $k$ -parity. As seen in Tables 1 and 2, most of the prior work uses the ReLU activation.

问题

Can you include more intuition in the main text for Lemma 5.3, specifically why a batch size of $\tilde O(d^{k-1})$ is sufficient for gradient concentration?

局限性

N/A

作者回复

2024-08-07

Thank you for your support and constructive feedback. We address your questions and provide clarifications below.

Q1. My main issue with this paper is the choice of the Sign SGD algorithm. Unlike vanilla SGD, Sign SGD is not invariant to the choice of the basis, and is implicitly taking advantage of the structure of the problem being k-parity (where in the ground truth solution each neuron weight is either -1, 0, or 1). For example, the Sign SGD algorithm here would fail if the task was instead k-parity in an unknown basis. Many of the prior works discussed in Tables 1 and 2, however, are rotationally invariant.

A1. Our choice was motivated by several factors outlined in Remark 3.2: it normalizes the gradient, ensures uniform neuron updates towards identifying parity, and effectively nullifies noisy coordinate gradients. While effective for the standard k-parity problem with a standard basis, we acknowledge its limitations for unknown bases. To address this, we could consider alternatives like normalized gradient descent with an adaptive learning rate, or incorporating momentum into Sign SGD. These approaches could potentially achieve rotational invariance while retaining the benefits of Sign SGD. It's worth noting that our use of Sign SGD is primarily due to the polynomial activation function; normal SGD could potentially be applied if we can extend our analysis from polynomial activation to other activation functions. For instance, we could potentially extend our approach to sigmoid or ReLU activation through polynomial function approximation. However, the main challenge lies in identifying an appropriate functional decomposition and accurately characterizing the approximation error during training. We will discuss this rotation invariant issue in our limitations section, along with potential future research directions to address it.

Q2. In particular, the paper (Glasgow, 2023) had the more ambitious goal to show that vanilla SGD, with minimal modifications, can learn the 2-parity problem. The algorithm in that paper is rotationally invariant and will succeed if the target was XOR in an unknown basis. I thus find the comparisons to (Glasgow, 2023) in the paper, such as "our result matches the best sample complexity for solving the XOR (i.e., 2-parity) problem (Glasgow, 2023)" (line 84) to be a bit misleading.

A2. We appreciate your observation regarding the differences between our work and Glasgow (2023) [1]. Their paper studied vanilla SGD learning the 2-parity problem, which is rotationally invariant. Our work focuses on solving the general k-parity problem with sign SGD, which doesn't include XOR in an unknown basis. We have highlighted the differences in optimizers in Tables 1 and 2. Given these differences, we will revise our statement on line 84 to more accurately reflect that we match the best sample complexity only for the standard basis case. Additionally, we will use 'Sign SGD' in the title and abstract to better reflect the specific nature of our method.

[1] Glasgow, Margalit. "SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem." ICLR 2024.

Q3. On a more minor note, I also find the choice of activation to be unrealistic, as this again is essentially hardcoding the fact that the target problem is k-parity.

A3. We chose $\sigma(z) = z^k$ activation function primarily to facilitate our theoretical analysis and derive clean results for solving the k-parity problem. Since our work is the first to achieve the SQ lower bound for learning k-parity, we believe that even using a monomial activation function is significant enough. For future work, we plan to explore non-monomial activations that can be approximated by polynomial functions. This approach could lead to a more general activation function for solving a broader class of problems while maintaining theoretical traceability.

Q4. Can you include more intuition in the main text for Lemma 5.3, specifically why a batch size of O(d^{k-1}) is sufficient for gradient concentration?

A4. The batch size of O(d^{k-1}) is crucial for gradient concentration due to the specific structure and learning dynamics of the k-parity problem. Here's an intuitive explanation:

Lemma D.1 shows that for each coordinate of the $r$ -th neuron, the gap between population gradient and stochastic gradient is bounded by $\epsilon_1 \cdot \\|\mathbf{w}\_r\\|\_{2}^{k-1}$ , where $\epsilon_1 = \tilde O(B^{-1/2} + d^{(k-3)/2}B^{-1})$ . This implies that increasing the batch size B reduces the approximation error.
Lemma 5.3 and Corollary 5.4 demonstrate that we need $\epsilon_1 = \tilde O(d^{-(k-1)/2})$ to ensure the stochastic sign gradient matches the population sign gradient. Solving $B^{-1/2} + d^{(k-3)/2}B^{-1} = d^{-(k-1)/2}$ yields a sufficient batch size of O(d^{k-1}) for gradient concentration.
The condition $\epsilon_1 = \tilde O(d^{-(k-1)/2})$ is crucial because the absolute value of the population gradient on the signal coordinate is approximately $\tilde O(d^{-(k-1)/2}) \cdot \\|\mathbf{w}\_r\\|\_{2}^{k-1}$ at initialization. To guarantee that the gradient sign doesn't flip, the approximation error must be smaller than this absolute value.

We recognize the value of including this intuition in the main text. Given the one extra page for camera ready, we will add a discussion explaining these points after Lemma 5.3, helping readers better understand the connection between the problem structure, our method, and the resulting sample complexity. For a more detailed technical analysis of the gradient concentration and the derivation of the O(d^{k-1}) batch size, we refer readers to Appendix D, specifically the proof of Lemma D.1, which provides a rigorous proof of the concentration bounds and explains how we arrive at this batch size condition.

评论- Response

2024-08-09

Thank you to the authors for your detailed response, and for agreeing to add a discussion of the limitations of SignGD to the paper and revise comparisons to prior work. While I still believe that the choice of algorithm and activation limit the impact of this paper (this sentiment seems to be shared by other reviewers), I agree that matching the SQ lower bound with a gradient-based algorithm on a standard neural network is a nice contribution, and I thus raise my score to "Weak Accept."

2024-08-09

Thank you for raising your score. We will add the discussion and revise comparisons in the revision. We appreciate your recognition of our theoretical contribution and your valuable feedback throughout this process.

审稿意见

评分: 6置信度: 42024-07-12

This paper considers the problem of learning a $k$ -parity function on the $d$ -dimensional hypercube using SGD on a two-layer neural network. They consider a specific choice of activation ( $\sigma (x) = x^k$ ) and training dynamics (correlation loss, online sign SGD, fixed second layer weights) and show the following: for $m = 2^{\Theta (k)}$ neurons, batch size $B = \Theta (d^{k-1} \text{polylog} (d))$ , then $T = \Theta (log (d))$ steps of online sign SGD achieves small $0/1$ -test error. In particular, the total computation time scales as $\Theta (mTB) = \Theta (d^{k} \text{polylog} (d))$ . This is noteworthy as it matches the lower bound for learning parities from a large class of learning algorithms, namely statistical queries.

优点

The paper is well written, clear, and a very pleasant read (thank you!). Remarks and discussions provide background and comparison with other works, and help outline the contributions.
The theoretical analysis and statements are clean, and the proof quite compact and elegant, thanks to some simplifying assumptions in the model..
The fact that SGD on a (quite regular) neural network can match the runtime complexity of the SQ class of algorithms is a nice message. There is a growing literature trying to understand the power of learning with SGD on neural networks, compared to other classes of algorithms, and I believe this paper is a useful addition to the literature.

缺点

I am not sure the paper offers striking novel insights compared to existing literature, or that the analysis is particularly interesting in terms of proof techniques. This is the reason for my rating, but I remain open to changing it during the discussion period.

I think it was already quite widely believed that SGD on NNs can match the SQ lower bound (this is indicated in many papers on learning sparse functions on the hypercube/multi-index functions on Gaussian data). The case of $k$ -parities was not written/proven before, and I think this paper makes the necessary effort to settle it. However, I believe some existing analysis can be extended to this case (e.g., [Abbe2023a], [Margalit et al., 2023]), even though they will have drawbacks specific to their choice of simplifications.
The analysis and proofs are quite nice, but it feels like a game of “there exists a specific choice of hyperparameters so that an analysis goes through”. Given that the paper doesn’t put forth particularly novel insights/phenomena, the analysis of a more vanilla training setting (non-monomial activation, more general loss, and standard GD) would have felt more substantial.
There are several questions about this setting that remain completely not understood (e.g., what happens when reusing batches) which further contributes to the impression that this work is incremental.

问题

About the conclusion on going beyond SQ: the paper “On the power of differentiable learning versus PAC and SQ learning” shows that it is indeed possible with enough machine precision: we can emulate Gaussian elimination. Of course, this is not a very interesting result, as it uses a highly non-standard network that encodes an algorithm in its architecture. There are other noise-tolerant algorithms that do slightly better than SQ, it would be interesting to show that regular architectures prevent their emulation (besides low machine precision).

局限性

Yes

作者回复

2024-08-07

Thank you for your support and helpful comments. Below, we address the questions.

Q1. I think it was already quite widely believed that SGD on NNs can match the SQ lower bound (this is indicated in many papers on learning sparse functions on the hypercube/multi-index functions on Gaussian data). The case of k-parities was not written/proven before, and I think this paper makes the necessary effort to settle it. However, I believe some existing analysis can be extended to this case (e.g., [Abbe2023a], [Margalit et al., 2023]), even though they will have drawbacks specific to their choice of simplifications.

A1. Extending existing analyses to the Boolean k-parity case is not straightforward. [Abbe2023a]'s results [1] rely on Gaussian input and Hermite polynomial techniques, which don't directly translate to discrete Boolean input space. In addition, [Margalit et al., 2023]'s proof [2] depends on specific feature growth patterns of w_{sig} in xor problem, which requires non-trivial extensions for k-parity. In contrast, our paper addresses the Boolean input k-parity problem directly, introducing a feature-learning technique that illustrates the divergence between feature and noise coordinates of different neurons. By settling this case, we provide a rigorous framework for understanding SGD's behavior in this important discrete setting, bridging a gap in the literature.

[1] Abbe et al. "Sgd learning on neural networks: leap complexity and saddle-to-saddle dynamics." The Thirty Sixth Annual Conference on Learning Theory. PMLR, 2023.

[2] Glasgow, Margalit. "SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem." The Twelfth International Conference on Learning Representations.

Q2. The analysis and proofs are quite nice, but it feels like a game of “there exists a specific choice of hyperparameters so that an analysis goes through.” Given that the paper doesn’t put forth particularly novel insights/phenomena, the analysis of a more vanilla training setting (non-monomial activation, more general loss, and standard GD) would have felt more substantial.

A2. While our chosen hyperparameters may seem specific, they allow us to provide rigorous proof that matches the SQ lower bound, which is the main contribution of our paper. The insights gained from this analysis—particularly the separation of feature and noise coordinates—are novel and potentially applicable to more general settings. Our approach can be extended to other scenarios: non-monomial activations could be approximated by polynomial functions; for different loss functions, we could divide the learning process into phases, with the first phase mimicking our current analysis where neural network outputs are small and different neurons update independently; and our analysis could potentially be extended to sign SGD with momentum, which has shown connections to adaptive optimizers like Adam in recent research [1~4]. These extensions demonstrate the flexibility and potential broader applicability of our analytical approach.

[1] Balles et al. "Dissecting adam: The sign, magnitude and variance of stochastic gradients." ICML 2018.

[2] Bernstein et al. "signSGD: Compressed optimization for non-convex problems." ICML 2018.

[3] Zou et al. "Understanding the generalization of adam in learning neural networks with proper regularization." ICLR 2023.

[4] Xie et al. "Implicit bias of adamw: ℓ∞ norm constrained optimization." arXiv 2024.

Q3. There are several questions about this setting that remain completely not understood (e.g., what happens when reusing batches) which further contributes to the impression that this work is incremental.

A3. While some aspects, like batch reuse, remain unexplored, we believe this doesn't diminish the significance of our contribution. Our results use the online learning setting, widely adopted in k-parity problem studies [1-3], aligning our work with existing literature and enabling direct comparisons. We use fresh samples in each iteration to provide a clear gradient estimation error analysis, demonstrating the separation between feature and noise coordinates. Exploring batch reuse effects could be a valuable future direction, potentially employing more advanced concentration techniques or implicit bias analysis. This might lead to more efficient algorithms or tighter bounds. We view these open questions as opportunities for future research, building upon the foundation our work provides.

[1] Barak et al. "Hidden progress in deep learning: Sgd learns parities near the computational limit." NeurIPS 2022.

[2] Edelman et al. "Pareto frontiers in neural feature learning: Data, compute, width, and luck." NeurIPS 2023.

[3] Glasgow, Margalit. "SGD Finds then Tunes Features in Two-Layer Neural Networks with near-Optimal Sample Complexity: A Case Study in the XOR problem." ICLR 2024.

Q4. About the conclusion on going beyond SQ: the paper “On the power of differentiable learning versus PAC and SQ learning” shows that it is indeed possible with enough machine precision: we can emulate Gaussian elimination $\dots$ it would be interesting to show that regular architectures prevent their emulation (besides low machine precision).

A4. Thank you for your insightful comment and for sharing the paper on emulating Gaussian elimination with non-standard networks. Your suggestion to investigate whether standard architectures prevent the emulation of noise-tolerant algorithms that outperform SQ is a valuable research direction. It can reveal fundamental limitations of common neural network designs and deepen our understanding of the relationships between architecture, learning algorithms, and computational complexity. We will include this discussion in our future work section.

2024-08-10

I thank the authors for their detailed and careful response.

While I still believe that [Abbe et al, 2023] can be modified to the hypercube (note that they also have feature/noise coordinates separation in their analysis) and [Glasgow, 2023] can be generalized, I acknowledge that this is not relevant to judging the present paper. Furthermore, the analysis is significantly different because of the use of sign SGD. As mentioned by reviewer 9WvP, it might be helpful to outline the use of signed SGD in the abstract (which, as mentioned in the response, can be an interesting analytical approach).

I have no further questions at this point!

2024-08-10

Thank you for your feedback and support. We appreciate your acknowledgment of our analysis's novelty and will emphasize the use of signed SGD in the abstract.

审稿意见

评分: 6置信度: 32024-07-12

The paper addresses the k-sparse parity problem, a fundamental one in computational complexity and algorithmic theory, by using stochastic gradient descent (SGD) on two-layer fully-connected neural networks. The authors demonstrate that SGD can efficiently solve the problem on a d-dimensional hypercube with a sample complexity that matches the established lower bounds of Statistical Query (SQ) models.

优点

The paper provides a solid theoretical foundation by matching the SQ lower bound for learning k-sparse parity problems, which is a significant achievement in this area. Given the recent interest in SQ and CSA bounds and the amount of works on single and multi-index models folllowing the work of Abbe et al, I find these results pertinent.

缺点

The results, both in their scope and sytle, are more aligned with theoretical computer science (CS) than machine learning (ML). While the mathematical contributions are significant, the immediate applicability to practical ML problems is very limited.

问题

A variant of the parity problem that is not on the hypercube would be if one uses y = sign(z1 z2 z3) with z_mu = x*w_mu, with w_mu a hidden direction on the hypercube. I was wondering how different would the SG bounds and the results of the algorithm be in this case.

局限性

N/A

作者回复

2024-08-07

Thank you for your support and valuable feedback. We address your questions as follows.

Q1. The results, both in their scope and style, are more aligned with theoretical computer science (CS) than machine learning (ML). While the mathematical contributions are significant, the immediate applicability to practical ML problems is very limited.

A1. While our results are more aligned with theoretical computer science in style, they still make contributions to the machine learning community. Our work provides insights into the power and limitations of gradient-based learning methods. Furthermore, sign SGD based adaptive optimizers have been recently developed to train large models [1, 2, 3].

The theoretical analysis of sign SGD's behavior on the k-parity problem sheds light on how neural networks learn complex patterns, which can motivate the development of more efficient training algorithms. While not immediately applicable to all ML problems, our results provide a rigorous foundation for future work that could bridge theory and practice more directly. In addition, since most of the relevant work discussed in our paper has been published in leading machine learning venues (e.g., NeurIPS, ICLR, ICML), we believe that NeurIPS is an ideal venue for our work.

[1] Bernstein, et al. "signSGD: Compressed optimization for non-convex problems." ICML2018.

[2] Chen, et al. "Symbolic discovery of optimization algorithms." NeurIPS 2023.

[3] Liu, et al. "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training." ICLR 2024.

Q2. A variant of the parity problem that is not on the hypercube would be if one uses y = sign(z1 z2 z3) with $z\_{\mu} = \mathbf{x}*\mathbf{w}\_{\mu}$ , with $\mathbf{w}\_{\mu}$ a hidden direction on the hypercube. I was wondering how different would the SG bounds and the results of the algorithm be in this case.

A2. Thank you for your insightful question. The variant you propose introduces interesting modifications to the standard k-parity problem. Let's analyze this in stages:

When $\mathbf{w}\_{\mu} \in \\{e_1, ..., e_d\\}$ (standard basis vectors), the problem reduces to the standard 3-parity problem. In this case, the SQ bounds and the results of our algorithm would be O(d^3), consistent with our findings for k=3.
For general $\mathbf{w}\_{\mu} \in \\{0,1 \\}^d$ , the problem becomes more complex as the hypothesis space expands. This would likely worsen both the SQ bounds and the algorithm's performance compared to the standard case. However, with additional information about the formula, it's possible to decompose the problem into monomials to obtain tighter bounds. For example, $(\mathbf{x}\cdot \mathbf{w}_1)(\mathbf{x}\cdot \mathbf{w}_2)(\mathbf{x}\cdot \mathbf{w}_3) = (x_j+x_k)x_lx_m = x_j x_l x_m + x_k x_l x_m$ . This monomial decomposition enables us to apply the CSQ lower bound with leap complexity from [1].
For analysis of our algorithm, we can start by examining the population gradient descent. With some additional information about the formula of w_mu, we can decompose the population gradient into different parts using the same technique in point 2 and then analyze the weight update as in Equations 5.3 and 5.4 of our paper. We can then apply the approximation error between the population gradient and the stochastic gradient (Lemma 5.3) to derive the final results. Such an approach allows us to extend our current techniques to more complex settings.

In conclusion, while the exact bounds and algorithm performance for this variant require detailed analysis, we expect both the SQ bounds and the required sample complexity to be highly dependent on the specific form of $\mathbf{w}\_{\mu}$ . This variant presents an interesting direction for future research, potentially bridging our results with more general machine-learning problems.

[1] Abbe, et al. "SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics." COLT 2023.

审稿意见

评分: 6置信度: 42024-07-13

This work considers learning k-sparse parities with two layer neural networks trained with stochastic signed gradient descent, where the second layer is fixed at initialization and the first layer is trained. It is shown that with the activation $\sigma(z) = z^k$ , a network with width $2^{\Theta(k)}$ can solve the k-sparse parity problem with $\tilde{O}(d^{k-1})$ samples in $O(\log d)$ iterations, almost matching the Statistical Query lower bound for this problem in terms of the number of queries.

优点

The theoretical analysis of the paper is conceptually simple and clean, and there is sufficient discussion on the intuition behind the proof techniques in the main text, easily conveying the ideas to the reader.

缺点

While the goal of this work is to understand the training dynamics of modern neural networks, given the use of a polynomial activation and signed gradients, it is not clear to what extent the intuitions from this work extend to more mainstream algorithms, such as ReLU networks trained with SGD.

Further, it is possible to more explicitly discuss some limitations of the work. For example, it appears that the algorithm is not rotationally invariant, therefore is using the knowledge of the coordinate system, in comparison with the work of Glasgow (2023) which does not require this knowledge (for the simpler XOR problem). It might also give readers a more accurate picture if “signed SGD” is used in the title and abstract instead of “SGD”.

问题

Is there a way to trade off compute (iterations/neurons) against samples in upper bounds while maintaining the optimal number of queries?
The authors mention non-i.i.d. data as room for future research. There are perhaps even more accessible settings to be investigated. For example, can these results be extended to non-isotropic data, similar to the setting of [1]?
The summation notation in Line 183 might need clarification. I might have missed this, but I didn't see the definition of $s$ used here.

[1] Nitanda et al. "Improved statistical and computational complexity of the mean-field Langevin dynamics under structured data." ICLR 2024.

局限性

Some limitations could be more explicitly discussed as described in the “Weaknesses” section.

作者回复

2024-08-07

Thank you for your support and constructive feedback. We address your questions as follows.

Q1. While the goal of this work is to understand the training dynamics of modern neural networks, given the use of a polynomial activation and signed gradients, it is not clear to what extent the intuitions from this work extend to more mainstream algorithms, such as ReLU networks trained with SGD.

A1. Our study focuses on efficiently solving the k-parity problem using gradient descent, employing polynomial activation and sign SGD for theoretical analysis. Although our setting is specific, core insights like separating features and noise may extend to broader settings. Moreover, recent research has shown connections between sign-based methods and adaptive optimizers like Adam [1-3], indicating relevance to practical algorithms. Importantly, our results match the statistical query lower bound, thus providing valuable theoretical benchmarks and analytical techniques for learning k-parity problems. Our findings have the potential for extension to ReLU activations through polynomial function approximation. The main challenges in extending this to ReLU activation lie in identifying an appropriate functional decomposition and accurately characterizing the approximation error during training. While we primarily use sign SGD due to our polynomial activation function, standard SGD may also be applicable.

[1] Balles et al. "Dissecting adam: The sign, magnitude and variance of stochastic gradients." ICML 2018.

[2] Bernstein et al. "signSGD: Compressed optimization for non-convex problems." ICML 2018.

[3] Zou et al. "Understanding the generalization of adam in learning neural networks with proper regularization." ICLR 2023.

Q2. Further, it is possible to more explicitly discuss some limitations of the work. For example, it appears that the algorithm is not rotationally invariant, $\ldots$ It might also give readers a more accurate picture if “signed SGD” is used in the title and abstract instead of “SGD”.

A2. Our algorithm is not rotationally invariant, which is a limitation compared to gradient descent. We have acknowledged it in the comparison tables 1 & 2. In the revision, we will expand our limitations section to explicitly discuss this as well, and use 'signed SGD' in the title and abstract. While these limitations exist, our approach provides valuable insights into efficiently solving the k-parity problem, demonstrating the power of SGD. Furthermore, adaptive optimizers with sign SGD have been recently developed to train large models [1, 2]. We believe our work could potentially extend to momentum-based sign SGD algorithms. This could be an interesting direction for future research, potentially leading to even more efficient algorithms for the k-parity problem.

[1] Chen, Xiangning, et al. "Symbolic discovery of optimization algorithms." Advances in neural information processing systems 36 (2023).

[2] Liu, Hong, et al. "Sophia: A scalable stochastic second-order optimizer for language model pre-training." arXiv preprint arXiv:2305.14342 (2023).

Q3. Is there a way to trade off compute (iterations/neurons) against samples in upper bounds while maintaining the optimal number of queries?

A3. Yes, there is a way to trade off computation resources against sample size while maintaining the optimal query complexity. Edelman et al. [1] demonstrated that sparse initialization and increased network width lead to improvements in sample efficiency for solving the k-parity problem. We believe this technique could be adapted to our method to achieve better sample complexity at the cost of using more neurons. Specifically, using sparse initialization and a larger number of neurons would likely reduce the stochastic approximation error of the gradient under the same batch size. Lemma 5.3 of our paper shows that this estimation error is significantly influenced by the weights' norm, which decreases with sparse initialization. This suggests we can use a smaller batch size during training while maintaining performance. This trade-off would preserve the optimal query complexity while reducing the sample size, at the expense of an increased number of neurons. However, further analysis would be needed to determine the exact relationship between network size, sample complexity, and overall query complexity in our approach. This could be an interesting direction for future work.

[1] Edelman et al. "Pareto frontiers in deep feature learning: Data, compute, width, and luck." NeurIPS 2023.

Q4. The authors mention non-i.i.d. data as room for future research. There are perhaps even more accessible settings to be investigated. For example, can these results be extended to non-isotropic data, similar to the setting of [1]?

A4. Thank you for highlighting this important direction and pointing out the related work. We will add a discussion on extending our results to non-isotropic data settings [1] in the revision. To address the k-sparse parity problem under linear transformations (which would result in non-isotropic data), we believe that sign gradient descent with momentum could be a promising approach. Momentum-based methods have shown effectiveness in handling ill-conditioned optimization landscapes, which often arise from non-isotropic data distributions. Specifically, momentum could help in two ways:

It could accelerate learning along the directions of consistent gradient, which may correspond to the important features in the transformed space.
It could help overcome local irregularities in the loss landscape introduced by the linear transformation. However, a rigorous analysis would be needed to determine how the sample complexity and convergence guarantees would change as the linear transformation matrix varies.

[1] Nitanda et al. "Improved statistical and computational complexity of the mean-field Langevin dynamics under structured data." ICLR 2024.

2024-08-11

Thank you for your detailed responses. After reading them and the other reviews, I am happy to keep my original rating and recommend acceptance.

2024-08-11

Thank you for your continued support.

最终决定Accept (poster)

2024-09-25

This paper constructs a depth-2 neural network with polynomial activations such that a variant of SGD solves the sparse parity problem nearly as well as any Statistical Query algorithm. This work is part of an extensive line of work over the past years. While results of qualitatively the same flavor have been shown for several related problems, the reviews agreed after the discussion that this work meets the acceptance threshold.