1.It is unclear why the result of Lemma 3.1 is important. In my understanding, it is very technical and serves for the proof of the main results. The writing lacks intuition and presenting the proof is not helpful for understanding.

2.Theorems 3.1 and 3.2 hold only when is not very large. What will happen after that? When is close to 0, it means that even without any training, but the risk has almost the same guarantee. Does that make sense?

3.The presentation of formulas should be improved. The in Line 207 is different from the defined in equation (9)? What role does it play in this setting? What are the "one hot vectors of the token" defined in Line 230? What’s the dimension of this vector? And how do you get equation (11)? In Theorem 3.1, what are and ?

4.I don’t quite get the message in the experiment part. In Section 4.1, the theoretical fits seem like linear? But the theoretical result is not? How do you choose in the graph? It is very strange to claim the approximation is good without specifying the parameters. Moreover, It says the approximation is good for , but the experiment is only done for ? In Section 4.2, still I don’t get how the empirical line is drawn, and why the approximation can reflect anything. Why can’t we just treat the trend as linear and use a linear line to approximate it instead of the proved ?

In general, although there are many parts to be clarified, I can almost understand the theoretical results. However, the experiment part should be further explained and the paper should be further polished to convey the message clearly.