PaperHub
Rating: 6.0/10 · Poster · 3 reviewers (min 5, max 7, std 0.8); individual ratings: 7 / 6 / 5
Confidence: 3.7 · Correctness: 2.7 · Contribution: 2.3 · Presentation: 3.0
NeurIPS 2024

Fine-grained Analysis of In-context Linear Estimation: Data, Architecture, and Beyond

Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We study the loss landscape of in-context learning for single-layer linear attention and state-space models under a general linear task model, while delineating the effect of distributional alignments (e.g., RAG), low-rank constraints, and LoRA.


Keywords
In-context learning, linear attention, state-space model, optimization, RAG, LoRA

Reviews and Discussion

Review
Rating: 7

The paper studies the in-context learning capabilities of linear attention (ATT) and linear state space layers (SSM) on a linear regression task. It shows that for both ATT and SSM there exists a parametrization such that they perform as well as one step of preconditioned gradient descent (PGD) with optimal preconditioning in expectation, and that they cannot perform better.
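
For concreteness, a minimal numpy sketch of the 1-step PGD predictor referred to here (the dimensions and the identity preconditioner are illustrative placeholders, not the optimal choices characterized in the paper):

    import numpy as np

    # Minimal sketch of one step of preconditioned gradient descent (PGD) for
    # in-context linear regression. P is a free d x d preconditioner; the paper's
    # result concerns its optimal choice in expectation, which is not computed here.
    rng = np.random.default_rng(0)
    d, n, sigma = 10, 30, 0.1

    beta = rng.standard_normal(d)                    # task vector
    X = rng.standard_normal((n, d))                  # in-context features
    y = X @ beta + sigma * rng.standard_normal(n)    # in-context labels
    x_query = rng.standard_normal(d)                 # query feature

    P = np.eye(d)                                    # placeholder preconditioner

    # One GD step from w = 0 on the loss (1/2n) * ||X w - y||^2, preconditioned by P,
    # gives w_1 = (1/n) P X^T y; the in-context prediction is x_query^T w_1.
    w1 = P @ (X.T @ y) / n
    y_hat = float(x_query @ w1)
    print(y_hat, float(x_query @ beta))              # prediction vs. noiseless target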

The authors then propose a model for retrieval-augmented generation where in-context examples and the final query are correlated. For PGD, they provide an approximate expression for the optimal preconditioning weights and the related loss and relate it to the optimal weights in the noiseless, i.i.d. case, showing how the correlations between in-context examples and query translate into an increased effective sample size.

Next, the authors give an upper bound on the optimal expected loss (for 1-step PGD <=> ATT <=> SSM) in a low-rank adapter setting.

Experiments validate the theoretical results.

A final experiment shows some initial results hinting that the SSM model studied (H3) performs slightly better than linear attention in an online setting and with a non-stationary task.

Strengths

In terms of results, the main original contribution is the analysis of the model with correlated training examples and query points. However, the analysis of the SSM model in the appendix seems non-trivial and should therefore also be considered as a significant element of originality.

The section on related work is well-written and covers relevant work.

The problem setup is very clean, making it very easy to read and understand the results. All results are clearly stated with the necessary assumptions and accompanied by clear explanations.

The questions studied are highly relevant to contemporary machine learning research and applications of AI.

Weaknesses

  • The main result (12) is not stated as a theorem, and the derivation in the appendix is not completely rigorous. In particular, it is not clear under which conditions the approximation in l.558 is justified. Unless the conditions for the approximations to be valid are stated more clearly, it is difficult to extrapolate from the experimental results to the full claim (A2).

  • The result on LoRA is not interpreted, and it is not immediately clear what the implications are from the statement of the result. I am not sure if the result implies claim (A3).

  • The additional experimental results (Fig. 3) seem somewhat preliminary and detached from the rest of the paper. Based on these results, it is not clear that the part of claim (A1) claiming an advantage of H3 is fully supported by the evidence.

Questions

  • Under which assumptions is the approximation in (12) valid? What is the nature of the approximations made in the proof of the statement (high probability, asymptotic, ...)?

  • Can you elaborate on how claim (A3) follows from (14)?

  • Define $\alpha$-correlated

  • l.342: wildly studied -> widely studied (?)

Limitations

The authors address the limitations of their work and acknowledge that their analyses are "not precise and fully formal".

Author Response

We would like to thank the reviewer for recognizing the relevance of our work to contemporary ML research and AI applications. Indeed, our focus is developing a theoretical understanding of in-context learning (ICL) that is insightful for practical settings. Below, we respond to the concerns raised by the reviewer point by point.

W1/W2: We have clarified these in our Common Response to all reviewers.

W3: We acknowledge this concern and will better explain Figure 3. Proposition 1 establishes the equivalence of linear attention (ATT) and SSM under Assumptions 1 and 2, namely, (i) all ICL examples within the prompt share the same task vector, (ii) they are conditioned on the task and query vector, and (iii) the model is trained to predict the query at a fixed time $n+1$. The point of Fig. 3 is that, when these assumptions do not hold, SSM and ATT are not necessarily equivalent, and also that SSM can implement a more general predictor, namely weighted PGD.

  • In Fig 3(a,b,c) we train ATT and SSM using the average loss where we treat all examples within the prompt as a query. That is, rather than fitting only the last token (index $n+1$), we fit all timesteps, predicting time $i+1$ using examples from $1$ to $i$ (a minimal sketch of this objective is given at the end of this response). This is in line with how LLMs are trained seq2seq in practice as well as the ICL experiments of [Garg et al. NeurIPS'22]. Under this setting, Fig 3(a,b) demonstrate that SSM and ATT have different behavior, whereas Fig 3(c) demonstrates that SSM achieves a strictly lower risk, which is in line with our statement (as weighted PGD is more general than PGD). Here, the intuition is that SSM can intelligently weight different positions in the prompt to better minimize the average risk.

  • Finally, Fig 3(d) showcases another setting where Proposition 1 does not hold: the task vector $\beta$ is not shared and varies along the prompt (i.e., a temporally-drifting distribution). In this setting, SSM outperforms ATT for the same reason. The difference becomes visible as $n$ grows and there is more room for weighting the context window via weighted PGD.

We will revise the current text so that these points come across more clearly.
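
As referenced above, a minimal sketch of the average (seq2seq-style) ICL objective, using a generic placeholder weight W in place of a trained attention/SSM layer (dimensions and values are illustrative only):

    import numpy as np

    # Minimal sketch contrasting the last-token loss with the average loss over all
    # prompt positions. W is a placeholder d x d weight, not a trained model.
    rng = np.random.default_rng(1)
    d, n, sigma = 10, 30, 0.1

    beta = rng.standard_normal(d)                     # shared task vector
    X = rng.standard_normal((n + 1, d))               # rows 0..n hold examples 1..n+1
    y = X @ beta + sigma * rng.standard_normal(n + 1)
    W = np.eye(d)                                     # placeholder weight

    # Last-token loss: predict the (n+1)-th label from the first n examples.
    pred_last = X[n] @ W @ (X[:n].T @ y[:n]) / n
    loss_last = (pred_last - y[n]) ** 2

    # Average loss: at each position, predict label i+1 from examples 1..i, then average.
    per_position = []
    for i in range(1, n + 1):
        pred_i = X[i] @ W @ (X[:i].T @ y[:i]) / i
        per_position.append((pred_i - y[i]) ** 2)
    loss_avg = float(np.mean(per_position))

    print(loss_last, loss_avg)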

Q1: We clarified this in the Common Response: this approximation is valid when $\alpha=\mathcal{O}(1/\sqrt{d})$, $n/d=\mathcal{O}(1)$, and $d$ is sufficiently large. We will include a full theorem stating the precise risk in the revised manuscript.

Q2: (A3) follows from Lemma 1 and Eq. (14). Recall that $\Sigma$ and $\Sigma^{new}$ are the effective covariances of the source and target distributions. Here, effective covariance refers to the product of the task and feature covariances. Lemma 1 studies the landscape of low-rank parameterized attention: Eq. (13) is a strict generalization of the risk formula $\mathcal{L}_\star$ of Theorem 1, where the model reduces the risk precisely along the top-$r$ eigendirections. Similarly, Eq. (14) considers the fine-tuning setting and shows that LoRA can improve the risk of an initial model by updating the attention weights along the top-$r$ eigendirections with the maximal gain. Additional discussion is provided in the Common Response.

Q3: The notion of $\alpha$-correlation is defined in Eq. (11), where $\mathbb{E}[\cos(x_i,x)]=\alpha$. Thus, $\alpha\in[0,1]$ is the Pearson correlation coefficient. We will revise the manuscript to ensure that the term "$\alpha$-correlated" is clearly linked to its definition, especially in the introduction section.

Q4: Thanks for identifying the typo; we have corrected it.

We hope that the clarifications provided in our responses have addressed the concerns highlighted by the reviewer. We would be happy to respond to further feedback.

Comment

Thank you for your response. My concerns have been addressed and I updated my ranking to Accept.

Comment

Many thanks for your encouraging assessment and positive recommendations! Your suggestions have been very valuable, and we will revise and further improve our manuscript based on them.

Review
Rating: 6

The authors examine the capabilities of Transformers with linear attention in performing in-context learning (ICL) by implementing a linear estimator through gradient descent. The existing studies mostly consider IID task and feature vectors and fully parameterized attention weights. This work expands on these studies by analyzing:

  1. The landscape of 1-layer linear attention and 1-layer H3 (a state-space model), proving both can implement 1-step preconditioned gradient descent.

  2. New risk bounds for retrieval augmented generation (RAG), showing benefits from distributional alignment.

  3. The optimal risk for low-rank parameterized attention weights, illustrating how LoRA adapts to new distributions.

Strengths

The authors provide a comprehensive theoretical analysis of the landscape for ICL, extending existing results to more practical settings. It offers new insights into the benefits of distributional alignment and the capabilities of low-rank parameterization in attention mechanisms. The theoretical findings are supported by experimental results, enhancing the credibility of the conclusions.

Weaknesses

  1. The analysis is limited to single-layer linear attention and linear distribution data, which might not fully capture the complexities of multi-layer architectures.

  2. The RAG and LoRA analyses are not precise and fully formal.

Questions

  1. How does the distributional alignment in RAG quantitatively affect the sample complexity of ICL? Are there specific metrics or case studies that illustrate these benefits in practical applications?

  2. What are the limitations of LoRA in your cases? I believe LoRA significantly underperforms compared to fully parameterized models. If there are no limitations compared with the fully parameterized models because of the linear distribution data, can you discuss the relationship between the data distribution and LoRA?

  3. The paper focuses on single-layer (linear) models. How do the findings extend to multi-layer (linear) architectures?

  4. The paper discusses LoRA adaptation for distribution shifts. Are there practical examples or case studies where LoRA has been successfully applied to handle real-world distribution shifts?

  5. The study primarily considers linear regression tasks. How would the insights gained from this study apply to more complex tasks, such as natural language processing or computer vision tasks that involve non-linear relationships?

Limitations

As mentioned in the Weaknesses section: (1) the analysis is limited to single-layer linear attention and linear distribution data, which might not fully capture the complexities of multi-layer architectures; (2) the RAG and LoRA analyses are not precise and fully formal.

Author Response

We appreciate the reviewer's constructive feedback and the recognition of the credibility of our findings. Below, we address the questions and concerns raised by the reviewer.

W1: While our study focuses on single-layer architectures similar to previous work, it nonetheless presents novel and insightful conclusions for various practical settings pertaining to in-context learning (ICL). We have briefly summarized the contributions in the Common Response. Although there exist works investigating the ICL performance using multi-layer models [Ahn et al., Von Oswald et al.], analyzing multi-layer architectures still introduces challenges, particularly when exploring the optimization landscape rather than existence or approximation results. We acknowledge this as an important area for future research.

W2: Given that this concern appears to be common, we have derived the exact formula for the RAG setting and provided empirical evidence supporting its validity in the Common Response. We have also included additional discussion regarding the LoRA setting. The agreement between our theoretical analysis and empirical results in the figures (Fig. 1 for RAG in the provided pdf file and Fig. 2(c) for LoRA in the submission) further validates our findings.

Q1: The benefit of RAG has been empirically studied in the existing literature [Lewis et al., Nakano et al., Izacard et al.], which allows for selecting relevant demonstrations from a collection of instances. In our work, we model it within a linear data setting in Eq. (11) such that the query token $x$ is correlated with the in-context features $(x_i)_{i=1}^n$. We derive results showing that the $\alpha$-correlated RAG data model achieves a reduction in sample complexity by a factor of $(\alpha^2 d+1)$. This implies that highly relevant demonstrations require significantly fewer in-context samples to achieve comparable performance compared to unrelated/fixed demonstrations.
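
As a purely illustrative calculation of the scale of this factor (the numbers here are hypothetical, not taken from the paper): with $d=100$ and $\alpha=0.3$,
$$\alpha^2 d + 1 = 0.3^2\cdot 100 + 1 = 10,$$
i.e., roughly an order of magnitude fewer in-context samples would suffice relative to the $\alpha=0$ baseline.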

Q2: It is acknowledged that LoRA underperforms fully parameterized models, and our results reflect this. As demonstrated in Eq. (14), setting $r=d$ returns the optimal risk when the model is fully parameterized, highlighting a discrepancy when $r\neq d$ or when $\Sigma^{old}\neq\Sigma^{new}$. Fig. 2(c) also empirically validates this by varying the rank of the LoRA weights; a rank of 20 corresponds to full parameterization, achieving the lowest risk. Reducing the rank $r$ increases the test risk, as expected.

Q3: Multilayer architectures will be more challenging due to their non-convex loss landscape. The most related work is by Ahn et al., who characterize the critical points of ICL with multi-layer linear attention. Extrapolating from our and their results, we suspect that certain critical points of the $L$-layer SSM would correspond to implementing an $L$-step weighted PGD algorithm. Ahn et al. also make stringent assumptions on the data (besides the IID data model, they also require specific covariance structures as shown in their Table 1). It would be interesting to see if our general data assumptions can be imported to the multilayer models.

Q4: LoRA [Hu et al.] is a popular technique proposed to adapt large models to downstream tasks in a parameter-efficient fashion. Typically, downstream tasks have different distributions from pretrained tasks, and traditional methods such as fine-tuning the entire model over the new task are expensive and inefficient. LoRA uses fewer parameters to adapt the model to new tasks without modifying the model weights, making it both memory- and parameter-efficient. In practice, it has been widely applied and shown to realize significant improvements (e.g., Llama3). Our focus is on studying its theoretical aspects for ICL settings.

Q5: As noted in our discussion of relevant literature in the submission, Mahankali et al.'s work has demonstrated that even for nonlinear in-context tasks, a linear attention model still implements one step of gradient descent on the linear regression objective. This suggests that for nonlinear tasks, a single-layer linear attention model may still achieve the same optimal loss as optimal PGD, and the challenge lies in finding the optimal preconditioning weight. While we have not yet explored nonlinear settings, we recognize this as an important direction for future research.

We hope this response adequately addresses the reviewer's concerns. We are committed to enhancing our submission based on their feedback.

Review
Rating: 5

This paper studies how transformers can use ICL to solve linear regression problems. It is shown that state-space models and transformers are both capable of performing linear regression as well as gradient descent (which implements the least squares solution). There are results about LoRA and RAG.

Strengths

The presentation is clear.

Weaknesses

The analysis of vanilla regression is not novel, I think, and has been done for instance in Theorem 1 of Ahn et al. (transformers learn to do preconditioned gradient descent), where it is even shown that a pretrained transformer (with GD) learns to do this in a similar setting (which is stronger than showing that it can learn it). Maybe for SSMs this result is new, but I am not sure how strong of a contribution this is considering that it only establishes the existence of such a solution.

Questions

What is the significance of the RAG result? It makes sense that giving the learner access to $\beta$ through $X$ directly improves the error. A comparison to some reasonable baseline might be a nice story if it came about that transformers are able to leverage this kind of side information particularly well. What I mean is to compare this with a loss like $\Vert y-X\beta\Vert^2-\Vert X\beta\Vert^2+\Vert\beta\Vert^2$ which is aware of this kind of side information.

What is the significance of the LoRA result? A decomposition of the optimal error in terms of the singular values of a covariance is shown and then it is observed that if the covariance changes in a specific way (I think the singular vectors have to remain fixed), then LoRA can help. What role does Lemma 1 play? This is not about LoRA, but rather for an entirely low rank model.

Why does linear attention perform better for shorter contexts than it was trained on (Figure 3a)? How does Figure 3b show that H3 is better at generalizing to a longer context length if the plot is discontinued at the context it was trained on? I noticed that the performance with varying $\beta$ in Figure 3d is better than the i.i.d. noiseless case. Can you elaborate on that? Is this for a fixed (that is, rather than varying) $\beta$?

Limitations

Yes

Author Response

We thank the reviewer for the detailed comments and questions on our submission.

W1: Novelty of contribution and prior art. Ahn et al. analyzes the loss landscape of linear transformers. Their results only apply to special IID data models (see their Table 1) but they also characterize critical points of multilayer attention. Also the more recent [Zhang et al.] studies gradient flow under the IID data model with isotropic task covariance.

Similar to Ahn et al., we characterize the loss landscape. While our Theorem 1 may appear similar to theirs, our work makes multiple novel contributions discussed under the Common Response. To recap,

  • We provide the first analysis establishing the optimization-theoretic equivalence of single-layer SSM/H3, linear attention, and optimal PGD under suitable data settings.
  • Comparable works assume IID data whereas our theory allows for correlated data under Assumptions 1 and 2.
  • Comparable works assume full parameterization ($d\times d$ weights). This is not realistic in practice. Our Lemma 1 characterizes the landscape of low-rank attention for the first time.

W2: "The results only show the existence of the solution." As explained above, our study extends beyond showing the existence of a solution and thoroughly investigates the optimization landscape of ICL across various architectures and data settings.

Q1: Significance of RAG. In a typical RAG setup, given a query, one retrieves relevant demonstrations to create in-context prompts, which often leads to improved performance compared to utilizing query-independent demonstrations for ICL. With the objective of providing a theoretical justification for this phenomenon observed in practice, we consider the data model in Eq. (11), where the query token $x$ and the in-context features $(x_i)_{i=1}^n$ are defined to be relevant, with $\alpha$ being the Pearson correlation coefficient among them (motivated by RAG practice). However, it's important to note that the query and in-context features are all independent of the task vector $\beta$. Thus, contrary to the reviewer's statement, we are not providing the learner access to $\beta$ through $X$. As a key contribution, under the data model in Eq. (11), we theoretically quantify the improvement in the ICL sample complexity realized via RAG. As shown in Eq. (12), the improvement factor is $(\alpha^2 d+1)$. Higher $\alpha$ values lead to greater sample efficiency benefits.

Fair comparison baseline: In Eq. (11), we sample $(x_i)_{i=1}^n$ via $\mathcal{N}(\alpha x,(1-\alpha^2)I)$ to ensure that the features and labels for different correlation coefficients $\alpha$ have the same norm, i.e., $\mathbb{E}[\|x_i\|^2]=d$ and $\mathbb{E}[y_i^2]=d+\sigma^2$. With this normalization, $\alpha=0$ serves as the baseline, and increasing $\alpha$ reduces the sample complexity of ICL, as validated in Fig 1(b). Finally, under the RAG setting where $X$ and $\beta$ are independent, the loss $\|y-X\beta\|^2-\|X\beta\|^2+\|\beta\|^2$ suggested by the reviewer would yield similar results, subject to some normalization.
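
A minimal numpy sketch of this sampling and normalization (dimensions and values are illustrative, not the paper's experimental configuration):

    import numpy as np

    # Monte Carlo check that the alpha-correlated sampling above keeps the feature and
    # label second moments independent of alpha. Illustrative values only.
    rng = np.random.default_rng(0)
    d, n, alpha, sigma, trials = 50, 40, 0.4, 0.5, 2000

    feat_sq, label_sq = [], []
    for _ in range(trials):
        x = rng.standard_normal(d)                       # query feature
        beta = rng.standard_normal(d)                    # task vector, independent of x
        Z = rng.standard_normal((n, d))
        X = alpha * x + np.sqrt(1 - alpha**2) * Z        # x_i ~ N(alpha * x, (1 - alpha^2) I)
        y = X @ beta + sigma * rng.standard_normal(n)
        feat_sq.append(np.mean(np.sum(X**2, axis=1)))
        label_sq.append(np.mean(y**2))

    print(np.mean(feat_sq), d)               # E[||x_i||^2] ~= d
    print(np.mean(label_sq), d + sigma**2)   # E[y_i^2]     ~= d + sigma^2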

Q2: Significance of LoRA. We study low-rank attention and LoRA adaptation because these are typically what is used in real applications. Lemma 1 shows that attention with low-rank parameterization learns to recover the $r$-dimensional data-task eigenspace corresponding to the top eigenvalues to achieve the optimal loss. Based on this insight from Lemma 1, we derive the LoRA result by considering the distance between the old and new covariance eigenspectra. Also see the Common Response.

Q3: Questions on Fig.3.

  • Why linear attention performs better for shorter contexts. In Fig. 3(a)-(c), we train the model to minimize the average risk over all in-context instances, not just the last/query token as in other results. According to Theorem 1, the optimal weight $W_\star$ varies with the context length $n$, so the model must learn to optimize $W$ across different lengths. Thus, a model trained on shorter in-context windows can outperform one trained on longer windows when tested on shorter contexts.
  • How Fig. 3(b) shows that H3 is better for longer context... Inspired by this question, we tested both the linear attention and SSM models (from Fig. 3(a) and (b)) over unseen context lengths up to 100. The results are provided in Fig. 2 of the provided pdf file in the Common Response. These results clearly show that, compared to linear attention, SSM generalizes much better to unseen context lengths, supporting our claim. Additionally, these results indicate that while models trained on shorter in-context lengths perform better over shorter ranges, their performance does not compare as well over longer ranges. We will include these new experiments.
  • ...performance with varying $\beta$ in Fig. 3(d) better than the IID case… Nice question! The reviewer is correct in suggesting an IID baseline. We have identified that the experimental setup led to varying expected norms of the label $\mathbb{E}[y_i^2]$ due to the way $\beta_i$ is defined in the experiment section. To recap, given two random vectors $\beta_1$ and $\beta_2$ that follow the standard Gaussian distribution, the task vector at the $i$th position is defined by $\beta_i=\alpha_i\beta_1+(1-\alpha_i)\beta_2$. Then the expected norm of the label $\mathbb{E}[y_i^2]=\alpha_i^2\mathbb{E}[\|x^\top\beta_1\|^2]+(1-\alpha_i)^2\mathbb{E}[\|x^\top\beta_2\|^2]=(\alpha_i^2+(1-\alpha_i)^2)d$ varies with $i$ (a quick numerical check is sketched below). Based on this finding, we have updated the black curve of Fig. 3(d) in the paper and the revised figure is shown in Fig. 3 of the provided pdf file. The updated curve now provides a reasonable IID baseline. We appreciate the reviewer's helpful comment and will revise the paper accordingly.
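
A quick Monte Carlo check of the label-norm calculation above (values are illustrative; a_i stands for the mixing coefficient $\alpha_i$ at a single position):

    import numpy as np

    # Verify E[y_i^2] = (a_i^2 + (1 - a_i)^2) * d for beta_i = a_i*beta_1 + (1 - a_i)*beta_2
    # with beta_1, beta_2, x all standard Gaussian and independent. Illustrative values only.
    rng = np.random.default_rng(0)
    d, trials, a_i = 50, 50000, 0.3

    b1 = rng.standard_normal((trials, d))
    b2 = rng.standard_normal((trials, d))
    x = rng.standard_normal((trials, d))
    beta_i = a_i * b1 + (1 - a_i) * b2
    y = np.sum(x * beta_i, axis=1)                      # y_i = x^T beta_i (noiseless)

    print(np.mean(y**2), (a_i**2 + (1 - a_i)**2) * d)   # both ~= 29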

We thank the reviewer for the detailed comments. Although the reviewer noted a lack of novelty and recommended rejection, we hope our responses highlight the novelty and clarify any misunderstandings.

Comment

I had a misunderstanding about the RAG section. Please disregard what I said. I thought $x$ was sampled as $\mathcal{N}(\alpha\beta,(1-\alpha)I)$.

Because of the clarifications on the experiments, and the contributions relating to SSMs (equivalences to PGD), I change my score from 3 to 5.

However, I still feel the following about the Transformer results:

  1. The LoRA discussion holds only when the new and old $\Sigma$ are jointly diagonalizable and this makes the message a little weak, but it is possible that this is truly the only benefit from LoRA on a single layer.
  2. Using the present model to study RAG still seems somewhat simplistic to me.

These are difficult to fix, since this is a critique of the model itself. The results themselves seem reasonable to me.

Comment

We thank the reviewer for reassessing our work and for raising their rating. We would like to provide further clarifications on their comments:

  1. For low-rank parameterization, our main technical result is Lemma 1. This result does not rely on the assumption of diagonalization and establishes a natural generalization of the fully-parameterized results (Theorem 1) to the low-rank setting. In the LoRA analysis, we assume diagonalization mostly to be able to provide closed-form, interpretable bounds on its impact in terms of the eigenspace. We anticipate one can establish similar upper bounds on the impact of LoRA in terms of the angle between the eigenspaces of the old and new covariance matrices. Finally, it is worth noting that the analysis of general covariance is challenging even for vanilla ICL (fully parameterized), as both Ahn et al. and Mahankali et al. make assumptions about the covariance of $\beta$ and/or $x$, while we allow for arbitrary $x$, $\beta$ covariances in our main results.
  2. Our RAG model is inspired by real RAG systems [Karpukhin et al., Lewis et al.], which utilize cosine similarity of the document (feature) embeddings. However, we agree with the reviewer that exploring more complex RAG models, either with general covariance models or allowing for nonlinear feature/document representations learned via a separate retriever model, would be valuable. This is certainly an exciting direction for future work and we appreciate their suggestion.
Author Response

Common response to all the reviewers

We thank the reviewers for their constructive comments and insightful questions. We are glad that Reviewer HGXo acknowledges the credibility of our conclusions and Reviewer EE6j notes the high relevance of our study to contemporary machine learning research and applications of AI. Here, we would like to restate our key contributions and respond to other shared concerns raised by the reviewers.

Our main contributions are as follows:

  1. We establish the theoretical and empirical equivalence among optimizing single-layer linear attention, single-layer SSM, and implementing optimally-preconditioned gradient descent (PGD). While previous works (e.g., Ahn et al.) have noted the equivalence between linear attention and PGD, to the best of our knowledge, our work is the first to elucidate the equivalence between SSM and PGD.

  2. Our Proposition 1 extends the equivalence among attention, SSM, and PGD to more general and realistic settings, subsuming but going beyond the independent data scenario (as in Ahn et al. and our Theorem 1). Two key contributions are:

  • Our contribution on SSM is entirely novel and relies on establishing an optimization-theoretic equivalence between gating (within the SSM) and linear attention.
  • By considering dependent data (e.g., in RAG) and low-rank parameterizations (e.g., for LoRA adaptation) — factors not assumed or analyzed in previous studies — we enhance the understanding of model behavior under more complex yet highly practical settings.
  3. The alignment between theoretical predictions and empirical results demonstrates the accuracy and value of our theoretical insights.

To proceed, we address the shared concerns by the reviewers:

  • The exact formula of the RAG analysis: We recognize that the analysis of the RAG setting in our submission is not fully precise due to the complexity involved in the high-order (up to sixth-order) moments of $x$, $X$, and $\beta$. To address this main concern, we have recalculated the exact formulations for the RAG data setting. In particular, the final solution takes the following exact form:
$$W_\star=cI\quad\text{and}\quad \mathcal{L}_\star=d+\sigma^2-cnd\left(\alpha^2(d+1)+1\right),$$
where
$$c=\frac{\alpha^2(d+1)+1}{\alpha^4 n(d+2)(d+4)+\alpha^2(1-\alpha^2)(d+2)(d+2n+3)+(1-\alpha^2)^2(d+n+1)+\sigma^2\left(\alpha^2(d+1)+1\right)}.$$
Here $\alpha=0$ corresponds to the i.i.d. setting (a direct numerical transcription of this formula is sketched at the end of this response). Note that Eq. (12) in our submission provided an approximate solution assuming that $\alpha=\mathcal{O}(1/\sqrt{d})$, $d/n=\mathcal{O}(1)$, and large enough $d$. Additionally, we have also updated the RAG figure (Fig. 1(b)) based on the exact formula provided above, and the results are shown in Fig. 1 of the provided pdf file. The new theoretical predictions now perfectly align with empirical observations. We will incorporate these updates in the final version of the paper.

  • The interpretation of the LoRA results: Our LoRA results in Eq. (14) show that, when there is distribution shift and the joint diagonalizability assumption holds, LoRA leads to the model adapting the initial preconditioning weights to a target distribution over the principal $r$-dimensional eigenspace with the maximal gain. Though the analysis presents only an upper bound due to the complexity of arbitrary distribution shifts (e.g., arbitrary $\Sigma^{old}$ and $\Sigma^{new}$ matrices), it marks the first optimization-theoretic exploration of LoRA in ICL. Our empirical results in Figure 2(c) validate the tightness of the prediction of Eq. (14). Notably, Lemma 1 on low-rank attention provides a stronger theoretical guarantee on the landscape, namely, Eq. (13). This can be viewed as a special case of the LoRA setting where the weights of the initial model are set to zero. We will enhance the discussion of our LoRA results. We believe that extending our analysis of LoRA to broader settings is an interesting avenue for future work.
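
For reference, a direct transcription of the exact RAG risk formula above into code (variable names follow the text; the example values are illustrative only):

    # Direct transcription of the closed form W_* = c*I and L_* stated in the first bullet above.
    def rag_optimal_risk(d, n, alpha, sigma):
        num = alpha**2 * (d + 1) + 1
        den = (alpha**4 * n * (d + 2) * (d + 4)
               + alpha**2 * (1 - alpha**2) * (d + 2) * (d + 2 * n + 3)
               + (1 - alpha**2)**2 * (d + n + 1)
               + sigma**2 * (alpha**2 * (d + 1) + 1))
        c = num / den
        return d + sigma**2 - c * n * d * (alpha**2 * (d + 1) + 1)

    # alpha = 0 recovers the i.i.d. setting; per the formula, larger alpha gives a lower optimal risk.
    for a in [0.0, 0.1, 0.3, 0.5]:
        print(a, rag_optimal_risk(d=20, n=40, alpha=a, sigma=1.0))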

Final Decision

This paper shows that linear transformers can perform ICL for a linear regression setting. In particular, this paper considers SSM as well as the usual linear attention model. It is shown that SSM can also perform ICL even when the distribution of input and task has non-diagonal covariance.

I think the PGD part of this paper is not new; that has already been provided in previous papers. The novelty of this paper is to show the ICL ability of SSM. This is not trivial.
The writing is clear and precise. The authors provided detailed results in some concrete situations such as RAG and LoRA. This provides a better understanding of the authors' contribution.

Overall, this paper provides an interesting contribution to the literature, especially regarding the learning capability of SSM. The reviewers are all positive on this paper. Hence, I recommend acceptance of this paper.