Transformer Learns Optimal Variable Selection in Group-Sparse Classification
Abstract
Reviews and Discussion
This paper investigates the theoretical understanding of transformers, which have shown success in various applications but lack a clear theoretical foundation. The study focuses on how transformers can learn a statistical model with group sparsity, where only a subset of input variables (groups) influences the label. The research shows that a one-layer transformer can use attention mechanisms to select relevant variables and ignore irrelevant ones for classification. Additionally, it demonstrates that a well-pretrained one-layer transformer can adapt to new tasks with limited samples and achieve good accuracy. This study provides insights into how transformers learn structured data effectively.
Strengths
- This paper investigates the theoretical understanding of transformers, an important and interesting problem that may greatly benefit the deep learning community.
- This paper is well-written and has a good motivation.
- The conclusion that a one-layer transformer can use attention mechanisms to select relevant variables and ignore irrelevant ones for classification makes sense to me.
Weaknesses
- Though the problem is interesting, the conclusion, a one-layer transformer can use attention mechanisms to select relevant variables and ignore irrelevant ones for classification, seems to be well-known in the deep learning community and has already been verified by the original paper (Attention Is All You Need).
- The theoretical understanding of transformers given by this paper mainly focuses on the well-pretrained one-layer transformer. However, it will be much more interesting to study the multi-layer transformer since the scaling law shows that the loss of transformer scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
- The study focuses on how transformers can learn a statistical model with group sparsity, where only a subset of input variables (groups) influences the label. However, the concept of group sparsity is not properly defined, which makes it hard for me to fully understand it. Further, what is a similar “group-sparse” structure, as mentioned in the Introduction section? What kinds of “group-sparse” inputs are similar?
Questions
Please refer to the weaknesses section.
Thank you for your constructive feedback! We address your comments as follows:
Q1: Though the problem is interesting, the conclusion, a one-layer transformer can use attention mechanisms to select relevant variables and ignore irrelevant ones for classification, seems to be well-known in the deep learning community and has already been verified by the original paper (Attention Is All You Need).
A1: We would like to clarify that the main focus of this paper is to theoretically study the capacity of transformers to solve a classic statistical problem, namely linear classification with group sparsity (we have added explanations of the problem in lines 105-120 of our updated manuscript). Although this capability of transformers is intuitive, there is still no formal mathematical demonstration that a transformer model can indeed be trained to solve such problems. In this paper, we establish a precise theoretical analysis rigorously confirming that a one-layer transformer can effectively solve linear classification with group sparsity by efficiently attending to the relevant features. To the best of our knowledge, [1] does not provide any theoretical analysis, and our results are the first to theoretically demonstrate this. Establishing rigorous theoretical guarantees to back up people's empirical beliefs is one of the major contributions of our work.
Q2: The theoretical understanding of transformers given by this paper mainly focuses on the well-pretrained one-layer transformer. However, it will be much more interesting to study the multi-layer transformer since the scaling law shows that the loss of transformer scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
A2: There might be some misunderstanding regarding our paper. We would like to clarify that Theorem 3.2 (Theorem 2.2 in our updated manuscript) provides a rigorous and thorough theoretical analysis of the pre-training process of transformers from zero initialization. Moreover, Theorem 4.2 (Theorem 3.2 in our updated manuscript) studies the capacity of transformers when transferring to downstream tasks by providing the sample complexity. Therefore, our theoretical analysis covers both the pre-training stage and fine-tuning stage of one-layer transformers.
Additionally, we acknowledge that studying the properties of multi-layer transformers is both important and interesting. Existing theoretical analyses on the training dynamics of transformers are mainly limited to one-layer models [2, 3, 4, 5, 6], and even one-layer transformers have not yet been thoroughly understood from a theoretical perspective. Compared to existing results, our work is already one step towards more practical settings from the following two aspects:
-
In our setting, the attention and value parameters are trained simultaneously, while in [2, 3] only one set of parameters is trained and the other remains fixed throughout the training process. Additionally, [4] implemented a two-stage training strategy, in which the two sets of parameters are trained in separate stages, each stage updating one set while keeping the other fixed.
-
In our setting, the parameters are initialized from zero, without any specific warm start that would imply prior knowledge of the data structure. In contrast, [5] requires the initialization of the value vector to strictly align with the ground truth, and [6] requires the initialization to satisfy a series of technical assumptions that cannot be satisfied by random or zero initialization.
Based on the discussions above, our work studies a more challenging (and slightly more practical) scenario of training both the attention and value parameters of one-layer transformers from zero initialization, compared with existing works. We believe our work is a cornerstone of this line of research and can serve as a foundation for more practical results.
Q3: The study focuses on how transformers can learn a statistical model with group sparsity, where only a subset of input variables (groups) influences the label. However, the concept of group sparsity is not properly defined, which makes it hard to fully understand. Further, what is a similar “group-sparse” structure, as mentioned in the Introduction section? What kinds of “group-sparse” inputs are similar?
A3: Thank you for pointing out the absence of a definition and illustration of the concept of group sparsity. We have included a formal definition of the "group-sparse" learning problem in our updated manuscript (lines 105-120) for your reference. In brief, in a standard linear classification problem, the label of a feature vector is determined by the sign of its inner product with a pre-defined ground-truth coefficient vector. Suppose, in addition, that the feature coordinates are divided into predefined disjoint groups. We then refer to the learning problem as "group-sparse" linear classification if the ground-truth coefficient vector is supported on a single group, i.e., all of its nonzero entries lie within one of the groups.
Besides, we provide further clarification of the connection between the preceding definition of group sparsity and Definition 3.1 (Definition 2.1 in our updated manuscript) of our data model. We consider the setting where all groups have the same size, so the feature vector can be arranged into a matrix whose $j$-th column $x_j$ collects the variables from the $j$-th group. Let $j^*$ denote the index of the label-relevant group, and let $v$ denote the vector obtained by restricting the ground-truth coefficient vector to this group, so that the label is given by $\mathrm{sign}(\langle v, x_{j^*} \rangle)$. Based on this illustration, it is clear that Definition 3.1 (Definition 2.1 in our updated manuscript) defines a linear classification problem with group sparsity.
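To make the data model concrete, here is a minimal Python sketch of how such group-sparse data could be generated; the dimensions, group size, and function names below are illustrative choices, not the exact configuration used in our paper or experiments.

```python
import numpy as np

def sample_group_sparse_data(n, d, D, j_star, v, rng):
    """Sample n points whose label depends only on the j_star-th group.

    Each data point is a d x D matrix whose columns are the D variable
    groups; all entries are i.i.d. standard Gaussian, and the label is
    sign(<v, x_{j_star}>), so only group j_star is label-relevant.
    """
    X = rng.standard_normal(size=(n, d, D))   # n inputs, D groups of d variables each
    y = np.sign(X[:, :, j_star] @ v)          # label determined by the relevant group only
    return X, y

rng = np.random.default_rng(0)
d, D, j_star = 4, 6, 0                        # illustrative sizes (not the paper's)
v = rng.standard_normal(d)                    # ground-truth vector on the relevant group
X, y = sample_group_sparse_data(n=8, d=d, D=D, j_star=j_star, v=v, rng=rng)
print(X.shape, y[:5])
```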
In addition, we thank you for pointing out the unclear presentation of the phrase “similar group-sparse structure”, which we have replaced with “the same group sparsity pattern”. The phrase “the same group sparsity pattern” means that the index of the label-relevant group of the downstream data distribution is the same as that of the pre-training distribution; this is formally defined in Definition 4.1 (Definition 3.1 in our updated manuscript).
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[2] Zihao Li, Yuan Cao, Cheng Gao, Yihan He, Han Liu, Jason Matthew Klusowski, Jianqing Fan, and Mengdi Wang. One-Layer Transformer Provably Learns One-Nearest Neighbor In Context. NeurIPS, 2024.
[3] Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak. Transformers as support vector machines. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023.
[4] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. ICML, 2023.
[5] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. NeurIPS, 2022.
[6] Hongkang Li, Meng Wang, Sijia Liu, and Pin-Yu Chen. A theoretical understanding of shallow vision transformers: Learning, generalization, and sample complexity. ICLR, 2023.
The authors addressed most of my concerns, so I will raise my score to "6: marginally above the acceptance threshold".
Dear Reviewer,
We appreciate your decision to raise your score and we are delighted that our explanation has resolved your questions and concerns. Thanks once again for your dedicated efforts in reviewing our paper!
Best regards,
Authors
Dear Reviewer,
Thank you for taking the time to review our paper. As the deadline for updating the manuscript is approaching, we would like to follow up with you regarding your comments. We believe that our response has addressed all your concerns. In our revision, following your suggestions, we have also added a concrete definition of group sparsity (lines 105 - 120 in the revised manuscript). We sincerely hope that you could check our response and revision, and reevaluate our work taking the following points into consideration:
-
Our work establishes rigorous theoretical guarantees on a classic statistical problem (linear classification with group sparsity) to back up people’s empirical belief (that transformers can use attention mechanisms to select relevant variables).
-
Our work develops novel analytical tools to theoretically study transformer models, and is an important step towards theoretical investigations of deeper and more complex transformer architectures.
If you have any further questions, please let us know, and we will try our best to address them. Thank you.
Best regards,
Authors
Dear Reviewer,
We believe we have thoroughly addressed all your comments and concerns. However, we have not yet received any feedback following our response and revision. With the deadline for feedback only one day away, we sincerely hope you can review our response and revision at your earliest convenience. Thank you.
Best regards,
Authors
This paper analyzes the variable selection of the Transformer theoretically.
Strengths
The authors present a novel theoretical analysis of how a transformer can learn to select variables for group-sparse classification, which is interesting and novel.
Weaknesses
- Given your previous definition of the notations, please define this common notation in advance; otherwise, readers may get confused by it. What is a 'variable' in the paper? Is it a feature or an attribute of the data that defines the label? What is the quantity in Theorem 3.2-2? Is it the weight corresponding to label 1, or the value vector (the output of the value matrix in the transformer)?
- The proof sketch should be placed in the supplementary material instead of the main paper.
- The experiments are conducted on synthetic data; however, with higher-dimensional images (e.g., 3×224×224), things could change. I have concerns about these overly simple experiments.
- Regarding Lemma 5.1, I do not see why the stated identity holds and why the matrix is diagonal. In NLP, it seems that a feature is correlated with its position and would correspond to the output; is that too strong an assumption?
- Can you give a more detailed demonstration of why the inequality holds (or is it just a definition?) and why Lemma 5.2 can be incorporated into Lemma 5.3? Also, I do not think Lemma 5.2 holds, as its underlying assumption is that the transformer naturally has low error. Lemma 5.2 reads more like a strong assumption or definition to me.
- How do you define group sparsity? How is this different from standard classification?
I would like to raise my scores if more details are provided.
Questions
See weaknesses.
Q4: The experiments are conducted on synthetic data; however, with higher-dimensional images (e.g., 3×224×224), things could change. I have concerns about these overly simple experiments.
A4: To address your concerns, we have conducted additional experiments on high-dimensional Gaussian data and CIFAR-10 image data, as detailed in Appendix F and Appendix G of our updated manuscript respectively.
In Appendix F, we present the experimental results on high-dimensional Gaussian data. We observe that, even though the number of groups and the number of variables in each group are relatively large, the one-layer transformer can still attend to the label-relevant group. In particular, the row of the attention score matrix corresponding to the label-relevant group $j^*$ takes the largest values among all rows. Besides, the trained value vector aligns well in direction with the ground-truth vector, as indicated by the plot of the cosine similarity curve. These observations demonstrate that our theoretical findings still hold in the high-dimensional scenario.
In Appendix G, we consider a binary classification task on the images labeled “Frog” and “Airplane” in the CIFAR-10 dataset. For each label, we randomly sample 500 images. Besides, we extend the original images from dimension 3×32×32 to 3×224×224, with the extended pixels filled by Gaussian noise. In other words, each extended image is composed of 49 patches, where the original true image appears in one patch and the other 48 patches are Gaussian noise. We provide some examples of the extended images in our updated manuscript. This design aligns with our focus on “variable selection”, as a one-layer transformer can only correctly classify the images by attending to the true image patch. Moreover, we also run logistic regression on the vectorized original images (with no noise patches). Our results show that, by efficiently attending to the true image patch, one-layer transformers achieve better test accuracy than logistic regression. These experimental results further validate our theoretical findings.
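For illustration, a minimal sketch of how such an extended image could be constructed is given below, assuming a 7×7 grid of 32×32 patches with the true image placed in a random patch and the remaining patches filled with Gaussian noise; the helper name and noise scale are hypothetical.

```python
import numpy as np

def extend_image(img, grid=7, noise_std=1.0, rng=None):
    """Place a (32, 32, 3) image into one patch of a (224, 224, 3) canvas.

    The other grid*grid - 1 patches are filled with Gaussian noise, so a
    classifier must attend to the correct patch to recover the label.
    """
    if rng is None:
        rng = np.random.default_rng()
    p = img.shape[0]                                  # patch size, 32 for CIFAR-10
    canvas = noise_std * rng.standard_normal((grid * p, grid * p, img.shape[2]))
    r, c = rng.integers(grid), rng.integers(grid)     # random patch location for the true image
    canvas[r * p:(r + 1) * p, c * p:(c + 1) * p, :] = img
    return canvas

rng = np.random.default_rng(0)
fake_cifar_img = rng.standard_normal((32, 32, 3))     # stand-in for a normalized CIFAR-10 image
extended = extend_image(fake_cifar_img, rng=rng)
print(extended.shape)                                 # (224, 224, 3)
```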
Q5: Regarding Lemma 5.1, I do not see why the stated identity holds and why the matrix is diagonal. In NLP, it seems that a feature is correlated with its position and would correspond to the output; is that too strong an assumption?
A5: First, we would like to clarify again that all conclusions in Lemma 5.1 (Lemma 4.1 in our updated manuscript) are rigorously derived solely from our assumptions on the data distribution in Definition 3.1 (Definition 2.1 in our updated manuscript) and on the scale of the parameters in Theorem 3.2 (Theorem 2.2 in our updated manuscript).
Since we assume in Definition 3.1 (Definition 2.1 in our updated manuscript) that the features are generated from a Gaussian distribution, then by the independence between the Gaussian features and the fixed positional encodings, and by the symmetry of the Gaussian features (specifically, we leverage the key property that the absolute value of a zero-mean Gaussian random variable is independent of its sign; this property is rigorously proved in Lemma E.1 on page 37 and is utilized in the proofs of Lemmas E.2, E.3, C.6, C.7, and C.8 in our updated manuscript), we can rigorously derive that the relevant cross terms remain zero throughout training. This is why the stated identity holds and the matrix is diagonal. The detailed proof is given in the proof of Proposition C.1 on page 17 of our updated manuscript (Proposition A.1 on page 16 of our original manuscript) for your reference.
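For completeness, the symmetry property used in this step can be stated as the following standard fact about zero-mean Gaussians (written here in generic notation rather than the exact form of Lemma E.1):

```latex
% Standard symmetry property of a zero-mean Gaussian (generic notation).
% For g ~ N(0, sigma^2), the magnitude |g| and the sign sgn(g) are independent:
\[
\mathbb{P}\big(|g| \le t,\ \mathrm{sgn}(g) = s\big)
  = \tfrac{1}{2}\,\mathbb{P}\big(|g| \le t\big)
  = \mathbb{P}\big(|g| \le t\big)\,\mathbb{P}\big(\mathrm{sgn}(g) = s\big),
\qquad t \ge 0,\ s \in \{\pm 1\},
\]
% which follows from the symmetry of the Gaussian density about zero.
```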
We acknowledge that the derivation of the diagonal structure indeed relies on the independence between the features and the positional encodings, which is induced by our assumption of Gaussian features. However, similar assumptions that features are generated from a Gaussian or Gaussian mixture distribution are widely adopted in recent theoretical works [1, 2, 3, 4]. We would also like to clarify that our proof technique for Theorem 3.2 (Theorem 2.2 in our updated manuscript) does not rely on this matrix being exactly diagonal. Intuitively, as long as the diagonal blocks dominate the remaining blocks, we can still reach the desired conclusion that the attention concentrates on the label-relevant group.
Thank you for your detailed feedback!
Before we address your questions in detail, we want to clarify that this paper's main focus is to provide a precise theoretical characterization when applying transformers to a classic statistical problem, which is linear classification with group sparsity. All the problem settings and assumptions for the theoretical derivation are outlined in Definitions 3.1, 4.1 (Definitions 2.1, 3.1 in our updated manuscript), and Theorems 3.2, 4.2 (Theorems 2.2, 3.2 in our updated manuscript). Our theoretical conclusions, including the results of Lemmas 5.1, 5.2 (Lemmas 4.1, 4.2 in our updated manuscript), are rigorously derived based on Definition 3.1 (Definition 2.1 in our updated manuscript) and the assumptions in Theorem 3.2 (Theorems 2.2 in our updated manuscript), and do not rely on any other assumptions or definitions. We will now address your comments as follows:
Q1: Given your previous definition of the notations, please define this common notation in advance (otherwise, readers may get confused by it).
A1: We provide the definitions of all mathematical notations at the end of the introduction section. However, we still thank you for pointing this out: after double-checking the definitions of all notations, we found that we had indeed omitted the definition of this particular notation, and we have added it in lines 92-94 of our updated manuscript for your reference.
Q2: What is the 'variable' in the paper? Are they some features or some attributes of the data that define the label?
A2: Yes, your understanding is correct. The word 'variable' refers to the features of each data point. Specifically, each entry of the input matrix is a single variable, and each column $x_j$ is defined as a group of variables. Informally, “group sparsity” means that, among all the groups of variables, only one group determines the label of the data point; this group is called the label-relevant group.
Please note that the main goal of this paper is to study the capability of transformers in solving a classic statistical problem. Therefore, we use the terminology “variable selection”, which is a popular terminology in statistics.
Q3: What is the quantity in Theorem 3.2-2? Is it the weight corresponding to label 1, or the value vector (the output of the value matrix in the transformer)?
A3: Thank you for pointing out this unclear notation. It denotes the first block of the value vector, consisting of the first entries of the value vector (those matching the dimension of a variable group). We have clarified this definition and specified its dimension in Theorem 3.2 (Theorem 2.2 in our updated manuscript).
Furthermore, we would like to illustrate why we separate the value vector into two blocks. For each group of variables, i.e., each column $x_j$, we concatenate it with a positional encoding, which yields a new concatenated column. Only the first entries of this concatenated column (the original features) are correlated with the label, since the label is determined by the label-relevant group alone. As the value vector has the same dimension as a concatenated column, only its first block interacts with the label-relevant features. Therefore, we separate the value vector into two blocks, with the first block matching the dimension of a variable group. The directional alignment between this first block and the ground-truth vector then plays a key role in achieving correct classification.
Q6: Can you give a more detailed demonstration of why the inequality holds (or is it just a definition?) and why Lemma 5.2 can be incorporated into Lemma 5.3?
A6: As we have clarified previously, the conclusions regarding the inequalities in Lemma 5.2 (Lemma 4.2 in our updated manuscript) are rigorously derived solely from our assumptions on the data distribution in Definition 3.1 (Definition 2.1 in our updated manuscript) and on the scale of the parameters in Theorem 3.2 (Theorem 2.2 in our updated manuscript). The complete proof of Lemma 5.2 is given on pages 24-28, and the inequalities in question are established on lines 1424-1445, for your reference.
Here we also briefly outline how these inequalities are developed. First, based on the assumptions on the data distribution, we derive both an upper bound and a lower bound on the corresponding iterative update rule. Both bounds are Euler discretizations of the same ODE, only with different coefficients, and the closed-form solution of this ODE explains why the upper and lower bounds grow at the same rate. The rigorous analysis of the discretized iterative inequalities (1) and (2) is given in Lemma E.19 (Lemma C.19 in the original manuscript). Since we derive matching upper and lower bounds, we can plug them into both the lower and upper bounds in Lemma 5.3 (Lemma 4.3 in our updated manuscript), and thereby obtain matching lower and upper bounds for the quantity analyzed there.
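As a purely generic illustration of this sandwiching technique (the ODE below is illustrative and is not claimed to be the exact one appearing in our proofs):

```latex
% Generic illustration (not the exact ODE from the paper): suppose a nonnegative
% sequence satisfies, for some step size \eta > 0 and constants 0 < c_1 \le c_2,
%   a_{t+1} \ge a_t + \eta\, c_1 e^{-a_t}  and  a_{t+1} \le a_t + \eta\, c_2 e^{-a_t}.
% Both recursions are Euler discretizations of \dot{a} = c\, e^{-a}, whose solution is
\[
a(t) = \log\big(1 + c\,t\big), \qquad a(0) = 0,
\]
% so, for sufficiently small \eta, the sequence a_t is sandwiched between
% \log(1 + c_1 \eta t) and \log(1 + c_2 \eta t) up to lower-order terms, i.e.
% a_t = \Theta(\log t). Matching upper and lower discretizations thus pin down
% the growth rate of the iterates.
```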
Q7: How do you define group sparsity? How is this different from standard classification?
A7: Thank you for pointing out the absence of a definition and illustration of the concept of group sparsity. We have included a formal definition of the "group-sparse" learning problem in our updated manuscript (lines 105-120) for your reference. In brief, in a standard linear classification problem, the label of a feature vector is determined by the sign of its inner product with a pre-defined ground-truth coefficient vector. Suppose, in addition, that the feature coordinates are divided into predefined disjoint groups. We then refer to the learning problem as "group-sparse" linear classification if the ground-truth coefficient vector is supported on a single group, i.e., all of its nonzero entries lie within one of the groups. This requirement on the ground-truth vector is exactly what distinguishes group-sparse classification from standard classification.
Besides, we provide further clarification of the connection between the definition above of group sparsity and Definition 3.1 (Definition 2.1 in our updated manuscript) of our data model. We consider the setting where all groups have the same size, so the feature vector can be arranged into a matrix whose $j$-th column $x_j$ collects the variables from the $j$-th group. Let $j^*$ denote the index of the label-relevant group, and let $v$ denote the vector obtained by restricting the ground-truth coefficient vector to this group, so that the label is given by $\mathrm{sign}(\langle v, x_{j^*} \rangle)$. Based on this illustration, it is clear that Definition 3.1 defines a linear classification problem with group sparsity.
[1] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. JMLR, 2024.
[2] Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D Lee. Transformers provably learn sparse token selection while fully-connected nets cannot. ICML, 2024.
[3] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. NeurIPS, 2022.
[4] Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. arXiv preprint arXiv:2410.01774, 2024.
Dear Reviewer,
Thank you for your detailed and constructive comments. We have carefully addressed your questions in our response. In particular, we would like to emphasize again that we are confident that all the lemmas and theorems are established based on rigorous proofs. Please refer to our response above for our detailed explanations about Lemmas 5.1, 5.2 and 5.3 (Lemmas 4.1, 4.2 and 4.3 in our revised manuscript). Moreover, following your suggestion, we have also added experiments on high-dimensional (3x224x224) data, and the results demonstrate that our theoretical conclusions still hold in such high-dimensional cases.
We are confident that our response and revision have addressed your concerns in detail. We would greatly appreciate it if you could review our response and the revised manuscript and reconsider your evaluation. We are happy to answer any further questions you may have. Thank you.
Best regards,
Authors
Dear Reviewer,
Thank you for your efforts in reviewing our paper. We are confident that we have addressed all your questions and concerns. However, we have not received any feedback from you since we submitted our response to your initial review. With the deadline for the discussion period approaching, we sincerely hope you can take a moment to review our response and revision. Thank you.
Best regards,
Authors
I appreciate the efforts made by the authors; the detailed explanation has addressed my concerns, and I will raise my score to 6.
Dear Reviewer,
We are pleased to hear that our explanation addressed your questions and concerns. Thank you for raising your score and for your efforts in reviewing our paper!
Best regards,
Authors
This paper investigates one-layer transformers trained on a specific data model, where the input variables are generated in multiple groups while the true label is determined by the variables from a single group. Based on these simplifications and assumptions, it theoretically demonstrates that a one-layer transformer can almost entirely attend to the variables from the label-relevant group. Moreover, it provides tight lower and upper bounds for the population cross-entropy loss of a one-layer transformer trained by gradient descent. It also shows that a well-pretrained one-layer transformer can be efficiently transferred to a downstream task sharing a similar “group-sparse” structure, and it further provides an improved generalization error bound for one-layer transformers fine-tuned by SGD, which surpasses that of linear logistic regression applied to vectorized features. The numerical experiments support the theoretical findings.
Strengths
The paper is well written and easy to follow.
The theoretical analysis is thorough and the results align with the group sparsity assumption, although I did not check all the math and proofs in detail.
In a sense, it provides new insights from sparsity analysis into the workings of attention-based models.
Weaknesses
(1) Too many data assumptions and model simplifications. According to Definition 3.1, each patch x_j is i.i.d. Gaussian, and the label is determined by a given vector v (which can be learned from samples later). These data assumptions make group attention trivial.
(2) The model is a one-layer transformer, which may not capture attention-based models to a sufficiently satisfactory extent. With these simplifications, the contribution of this paper is limited, especially considering the prior work of Jelassi et al. (2022).
Questions
The input x_j is concatenated with the position encoding; does Theorem 3.2 still hold if addition is used instead of concatenation?
Q3: The input x_j is concatenated with the position encoding; does Theorem 3.2 still hold if addition is used instead of concatenation?
A3: Theoretical studies usually consider concatenation with positional encodings for simplicity, as in [2, 5]. Intuitively, we suspect that similar results still hold if addition were adopted instead. However, different settings can introduce entirely distinct dynamical systems, making it difficult to provide a rigorous theoretical guarantee at this stage. We believe that exploring various settings of positional encoding is a promising and interesting direction for future work.
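For illustration, the following sketch contrasts the two choices of positional encoding; the dimensions and the specific encodings below are arbitrary illustrative choices, not those of our paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 4, 6                                   # illustrative: d variables per group, D groups
X = rng.standard_normal((d, D))               # columns are the variable groups (tokens)
P = np.eye(D)                                 # a simple fixed positional encoding, one column per position

# Concatenation (the option analyzed in the discussion above): each token gains D extra coordinates.
X_concat = np.vstack([X, P])                  # shape (d + D, D); features and positions stay in separate blocks

# Addition (the alternative raised in the question): requires a d-dimensional encoding.
P_add = rng.standard_normal((d, D))           # hypothetical d-dimensional positional encoding
X_add = X + P_add                             # shape (d, D); features and positions are now mixed

print(X_concat.shape, X_add.shape)
```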
[1] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. JMLR, 2024.
[2] Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D Lee. Transformers provably learn sparse token selection while fully-connected nets cannot. ICML, 2024.
[3] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. NeurIPS, 2022.
[4] Spencer Frei and Gal Vardi. Trained transformer classifiers generalize and exhibit benign overfitting in-context. arXiv preprint arXiv:2410.01774, 2024.
[5] Eshaa Nichani, Alex Damian, and Jason D. Lee. How Transformers Learn Causal Structure with Gradient Descent. ICML, 2024.
Thank you for your constructive feedback! We address your comments as follows:
Q1: Too many data assumptions and model simplifications. According to Definition 3.1, each patch x_j is i.i.d. Gaussian, and the label is determined by a given vector v (which can be learned from samples later). These data assumptions make group attention trivial.
A1: We would like to emphasize the focus of this paper is to theoretically examine the performance of transformers when applied to a classic statistical problem, which is linear classification with “group sparsity”. As a case study on this classic problem, we believe it is reasonable to consider Gaussian data. Similar assumptions have been widely adopted in recent theoretical works on transformers [1, 2, 3, 4].
Please note that the assumptions do not make the learning task trivial. The classic statistical problem we consider has been one of the hot topics in high-dimensional statistics for years. Our work gives the first study on how a transformer can be trained to solve such a task, which we believe is highly nontrivial. Moreover, we would like to point out that the Gaussian data assumption does not trivialize the problem either: we assume that all features are i.i.d. Gaussian random variables, which means that the data input alone provides no information about the label-relevant group of variables the transformer should select. Therefore, we believe that the problem we study is highly nontrivial.
Q2: The model is a one-layer transformer, which may not capture attention-based models to a sufficiently satisfactory extent. With these simplifications, the contribution of this paper is limited, especially considering the prior work [3].
A2: As we clarified in A1, the main focus of this paper is to study the capacity of transformers in addressing a classical statistical problem. Our contribution is to establish a rigorous theoretical guarantee that a one-layer transformer can effectively solve linear classification with group sparsity by efficiently attending to relevant features. While we acknowledge that studying the properties of multi-layer transformers is both important and interesting, even one-layer transformers have not yet been thoroughly understood from a theoretical perspective. Therefore, we believe our work provides valuable insights into the theoretical understanding of transformers.
Besides, we thank you for pointing out the potential confusion regarding the comparisons between [3] and this paper. We agree that such a discussion is essential and we hope that the following discussion can clarify your concerns:
-
[3] considers a learning task where tokens (more precisely, image patches) form groups, and [3] relies on an assumption that there are a large number of patches in each group to make sure the classification task is solvable. In comparison, our work studies a more classic statistical problem, and we do not require the size of each group to be large.
-
[3] considers a one-layer vision transformer built from an activation function, a value vector, and an input matrix to the softmax, where the input is a sequence of tokens with each column denoting a token. Unlike the common design of transformers, the entries of the softmax input matrix are directly treated as the trainable parameters and trained with gradient descent, so the attention scores do not depend on the input tokens. In contrast, we compute the softmax attention scores from the tokens themselves through a trainable coefficient matrix (reparameterizing the product of the key and query matrices), which aligns with the general design of transformers. Besides, the initialization of the value vector in [3] is assumed to strictly align with the direction of the ground-truth vector, which is a strong and impractical assumption. In comparison, we consider general zero initialization.
-
While [3] provides an upper bound on the number of iterations needed to achieve a given population loss, we establish matching upper and lower bounds on the number of iterations required to reach an arbitrarily small population loss. Furthermore, we present a sample complexity analysis for transfer learning, which surpasses the guarantee for linear logistic regression on vectorized inputs from PAC learning theory. In contrast, [3] does not include such a sample complexity analysis.
We have also included the discussion above in Appendix B (page 14) of our updated manuscript for your reference.
Thank you for the thorough response. My point is that, from Definition 2.1 (updated version), if any two patches x_i and x_j are generated from the same label, then x_i and x_j should be highly related, indirectly leading to group sparsity. These assumptions will make your proof much easier. Compared to [3], your assumptions are much simpler, but you can obtain tighter bounds, as you claimed. Correct me if I am wrong here.
Dear Reviewer,
Thanks for your follow-up comments. We would like to clarify that while your comment
“if any two patches x_i and x_j are generated from the same label, then x_i and x_j should be highly related, indirectly leading to group sparsity”
accurately describes the intuition of [3], it does not apply to our problem setting. Please note that in our setting, only one of the patches is related to the label, and the other patches are all independent Gaussians. Therefore, the model cannot rely on correlations among patches (the patches are not correlated) to learn variable selection. This highlights the key difference between the setting studied in [3] and our work. You are right that our setting is 'cleaner' (motivated by a classic statistical problem), but we believe that the results are not necessarily simpler to prove. Establishing tighter bounds is indeed one of the contributions of our work.
Thank you.
Best regards,
Authors
[3] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. NeurIPS, 2022.
Dear Reviewer,
Thank you very much for your supportive comments. We believe that your concerns and questions have been addressed in our response above. We have also revised the paper and clarified the classic statistical problem of linear classification with group sparsity, which we believe better explains the motivation behind our work. Please let us know if you have any further comments or questions. Thank you.
Best regards,
Authors
If that is the case, I need to lower my rating below average.
(1) The definition of S in Eq. 2.2 should not include Z for a general transformer architecture in NLP, since W is already the attention matrix. And W should be in R^{D \times D} if you have D patches for each input instance Z.
(2) If only one of the patches is related to the label, it means that one patch is highly related to the label. In Theorem 2.2, you claim that S_{j^*, j} is close to 1 for all j \in [D]. If S is a normalized softmax, then only a particular j \in [D] can satisfy that. Is there a typo here?
Open to discuss if there is any misunderstanding here.
Dear Reviewer,
Thank you for your follow-up comments. We believe that these two points are both misunderstandings. We will address each of your questions in detail below.
Regarding your first question, we would like to clarify that your statement
”the definition of S in Eq. 2.2 should not include Z for a general transformer architecture in NLP, since W is already the attention matrix”
is a huge misunderstanding.
We first clarify that W is not the 'attention matrix'. W is a parameter matrix that reparameterizes the product of the query parameter matrix and the key parameter matrix. Please note that, in practice, a self-attention layer commonly computes its attention scores by applying a softmax to the pairwise query-key inner products of the tokens,
where Z is the sequence of patches/tokens with each column a patch/token, and the queries and keys are obtained by applying the query parameter matrix W_Q and the key parameter matrix W_K to Z.
Our paper considers a slightly simplified version in which the product of the key and query parameter matrices is reparameterized by the single matrix W. This ensures that in our model, the softmax scores indeed directly depend on the tokens, which is the case in practice. Such a reparameterization has been widely considered in a series of recent works on theoretical studies of transformers [1, 2, 4], and we are sure that our setting is correct. The resulting attention score matrix S, as defined in Eq. 2.2, is therefore computed from both the token matrix Z and the trainable matrix W; this is the self-attention layer we consider.
Please also note that [3] considers a more simplified model in which the softmax scores are calculated directly from a trainable matrix (so they no longer depend on the tokens), and this matrix is directly updated by gradient descent. Compared to this setting in [3], our setting is much closer to the classic definition of self-attention, and this is a strength of our paper.
Your comment
”In Theorem 2.2, you claim that S_{j^*, j} is close to 1 for all j \in [D]. If S is a normalized softmax, then only a particular j \in [D] can satisfy that. Is there a typo here?”
is also a misunderstanding.
We are sure that Theorem 2.2 is stated correctly, and the result for S_{j^*, j} holds for all j \in [D]. Please note that in our definition, each column of S is normalized to have a sum of 1. Our result shows that the j^*-th entry of the j-th column of S is close to 1, for all j \in [D]. This does not contradict the normalization. Again, we are sure that this formulation is correct: normalizing each column of S is exactly what our definition of S in Eq. 2.2 specifies, and similar notations have been considered in [1, 2].
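To make this point concrete, here is a small numerical illustration (with arbitrary numbers, not taken from our experiments) of a column-normalized softmax score matrix in which the j^*-th row is close to 1 in every column:

```python
import numpy as np

D, j_star = 4, 1                         # illustrative: D tokens, label-relevant position j_star
logits = np.zeros((D, D))                # pre-softmax scores; entry (i, j) scores key i for query j
logits[j_star, :] = 5.0                  # every query assigns a large score to position j_star

S = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)   # softmax applied to each column

print(np.round(S, 3))                    # row j_star is ~0.98 in every column
print(S.sum(axis=0))                     # yet each column still sums to exactly 1
```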
It is true that in our setting, one patch is correlated with the label. However, this does not mean that the problem we study is easy. As we have clarified, compared with [3], one of the strengths of our work is that we consider the case where the softmax scores directly depend on the tokens. Even though the intuition is to utilize the correlation between the label and one of the patches, the fact that the tokens themselves appear inside the softmax essentially makes every term in the gradients correlated with the label-relevant features. Establishing concrete guarantees under this more complicated setting is one of the contributions of our paper. Moreover, as you acknowledged, our analysis establishes much tighter bounds, which is also an important technical contribution of our work.
We hope that our response above addresses your concerns. If you have any further questions, please let us know.
[1] Zixuan Wang, Stanley Wei, Daniel Hsu, and Jason D Lee. Transformers provably learn sparse token selection while fully-connected nets cannot. ICML, 2024.
[2] Yu Huang, Yuan Cheng, and Yingbin Liang. In-context convergence of transformers. ICML, 2024.
[3] Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. NeurIPS, 2022.
[4] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. JMLR, 2024.
For a self-attention transformer, if you have linear mappings for K and Q, then in general there is also a linear mapping for V from the input X. This is how a one-layer transformer maps the input X into K, Q, and V.
(1) From your Eq. 2.2, I do not see a separate linear mapping applied to Z, so I take it that you do not need explicit linear mappings for K and Q from the input Z inside S.
(2) I misunderstood; you normalize by column.
Good. I will keep my original rating, which is above average.
Dear Reviewer,
Thank you for your response and for maintaining your original score. We are pleased to see that the misunderstandings have been resolved, and we greatly appreciate your support for our paper.
Best regards,
Authors
Is it possible to add another experiment?
Set j* = 1 and generate synthetic data, then train the model and plot S as in Fig. 2. I would like to see the attention map when j* = 1 for all the synthetic data.
Thanks,
Dear Reviewer,
Thanks for your further question. Since the revision period has ended, we are unable to update the manuscript to include more experimental results. However, we can assure you that conducting such additional experiments is very straightforward, and we have already obtained the results.
Below, we directly report the attention score matrices obtained from your suggested experiments. These results are the counterpart of Figure 2: all experimental setups are the same except that we now set j* = 1.
-
Heatmap of the attention score matrix for the first setting (a 6 × 6 matrix):

| | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| row 1 (j* = 1) | 0.964 | 0.962 | 0.957 | 0.968 | 0.952 | 0.966 |
| row 2 | 0.004 | 0.004 | 0.004 | 0.003 | 0.004 | 0.004 |
| row 3 | 0.010 | 0.011 | 0.014 | 0.009 | 0.009 | 0.011 |
| row 4 | 0.004 | 0.004 | 0.004 | 0.004 | 0.005 | 0.004 |
| row 5 | 0.010 | 0.011 | 0.013 | 0.010 | 0.020 | 0.008 |
| row 6 | 0.008 | 0.008 | 0.008 | 0.006 | 0.009 | 0.008 |
-
Heatmap of the attention score matrix for the second setting (a 4 × 4 matrix):

| | column 1 | column 2 | column 3 | column 4 |
| ------ | ------ | ------ | ------ | ------ |
| row 1 (j* = 1) | 0.987 | 0.987 | 0.987 | 0.987 |
| row 2 | 0.005 | 0.005 | 0.005 | 0.005 |
| row 3 | 0.003 | 0.003 | 0.003 | 0.003 |
| row 4 | 0.005 | 0.005 | 0.005 | 0.005 |
It is evident that self-attention effectively attends to position j* = 1 in both cases, supporting our theoretical findings. We will add these results, together with more detailed discussions, in the camera-ready version of the paper.
Thank you once again for your efforts in reviewing our paper!
Best regards,
Authors
Dear Reviewers,
We appreciate your detailed and constructive comments on our paper. We are glad to hear that all reviewers agree our paper provides interesting and novel insights for the theoretical understanding of transformers. We have addressed all your questions in detail in our individual responses and have updated the manuscript accordingly. We would greatly appreciate it if you could review our responses and the revised manuscript to see whether you are satisfied, and let us know if you have any further comments or suggestions. Below, we give an overview of the major changes we have made in the revised manuscript.
In particular, we appreciate the valuable comments from Reviewer BJ8K and Reviewer tEUD regarding the absence of a formal definition of “group sparsity” in our original manuscript. We have included a formal definition of "group sparsity" from lines 105 to 120 in the revised manuscript, and we would like to clarify that "group sparsity" is a popular concept in statistics and the goal of our work is to study the capability of simple one-layer transformers in solving the classic statistical problem of “linear classification with group sparsity”. We have also provided further explanations to clarify the connections between “linear classification with group sparsity” and our data model in Definition 2.1 (Definition 3.1 in the original manuscript). We believe that the formal definition and the additional explanations can address your concerns and clarify the motivation and significance of our study.
Additionally, we have conducted further experiments on synthetic data as well as real data, to demonstrate our theoretical conclusions on high-dimensional data, as suggested by Reviewer BJ8K. The results on synthetic and real data are given in Appendices F and G respectively. We would like to remark that all the experiment results align with our theoretical conclusion, offering more robust empirical evidence to support our findings.
We look forward to your further feedback and comments, and we are happy to address any remaining concerns.
Best regards,
Authors
This work provides a theoretical analysis of transformers, demonstrating their ability to learn structured data with group sparsity. It shows that a one-layer transformer trained via gradient descent can use attention mechanisms to focus on relevant variables while ignoring irrelevant ones. Additionally, a well-pretrained transformer can adapt efficiently to downstream tasks with limited samples, offering insights into how transformers excel in structured data learning.
This paper is borderline at this stage. Some of the reviewers have highlighted a few issues with the paper, and the authors seem to have addressed many of them in the rebuttal. I would encourage the authors to go through the reviews carefully and improve their submission for the next version of this paper.
Accept (Poster)