We thank the reviewer for the constructive comments. We are grateful that the reviewer recognized the novelty of the work. We also understand your concerns. We would like to address your concerns and questions one by one below.

W1: “... However, it implicitly requires that the input dimension is bigger than the number of samples . In fact, if that is not the case, is not full rank, and the statement becomes trivial …”

A: We apologize for not making this point clear in the submission. In short, we do not require . Technically speaking, it is the “effective” NTK condition number (largest eigenvalue divided by the smallest non-zero eigenvalue) that controls the convergence rate. When is not full rank, this relates to the smallest non-zero eigenvalue of . We would like to explain this in detail below:

Let’s consider the case , where the concern arises. Why it is the smallest non-zero eigenvalue?

– In this case, the Gram matrix is just the NTK of the linear model, and is not full rank. Note that this linear model is in the under-parameterized regime. As we know, for the linear model, the NTK has the same spectrum as the Hessian of least square loss , except the zero eigenvalues. This Hessian is expected to have full rank, and the least square loss is convex. As is well-known, the condition number is defined as , which is equivalent to , with as the smallest non-zero eigenvalue of NTK.

– How about the linear network and ReLU network? In the infinite width limit, they are essentially linear models, with the model gradient (instead of original input ) as the feature. These linear models are over-parameterized, and have a hyper-plane as the solution set . Intuitively, the optimization (by gradient descent) occurs in sub-spaces that are perpendicular to . In addition, the zero eigenspace of NTK are also perpendicular to . Hence, the zero eigenvalues of NTK contribute nothing to the optimization. Therefore, the condition number of optimizational interest is the “effective” condition number (i.e., ignoring the zero eigenvalues).

In summary, our results do not require . Our results essentially rely on the “effective” condition number of NTK. We will clarify this point in the revision.

W2: “Theorem 5.3 holds for a general two layer network with the outer layer being fixed”.

A: This setting is commonly used in literature, including very good papers, for example Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. “Gradient descent provably optimizes over-parameterized neural networks”. We also observe that many novel findings originally come out in simple settings, which does not affect its completion by later works. We believe that this simple setting is a good starting point. We are aware that for deeper networks it requires much complicated analysis. It is believable the same results hold for deep networks, as partially evident by Figure 2 and Theorem 5.2.

W3: “the numbers reported in Figure 2(b) when are a bit suspicious” & Q1: “Can the authors comment on the points reported in Figure 2(b) when ?”

A: In the experiment to compute the numbers in Figure 2(b), we actually evaluated NTK based on batches of size . This is due to the limitation by our computational resource, which could not compute and store larger Jacobian matrices. The reported numbers are averaged over the batches. In this case, the Gram matrix is always full rank, hence it is not surprising that is finite. One can anticipate that the condition numbers at may increase with larger batch size, and perhaps be infinite if it exceeds . We will clarify this setting in the revision.