PaperHub

Rating: 6.5/10 · Poster · 4 reviewers (min 4, max 8, std 1.5)
Individual ratings: 7, 8, 7, 4
Confidence: 3.3 · Correctness: 3.8 · Contribution: 2.8 · Presentation: 3.5
NeurIPS 2024

Universal In-Context Approximation By Prompting Fully Recurrent Models

OpenReview · PDF
Submitted: 2024-05-12 · Updated: 2024-11-06
TL;DR

We show that fully recurrent models, e.g., RNNs, LSTMs, GRUs and SSMs, can be universal in-context approximators

Abstract

Keywords

prompting, universal approximation, in-context learning, recurrent models, rnn, ssm

Reviews and Discussion

Official Review
Rating: 7

In-context learning has emerged as one of the puzzling properties of language models at scale. While a lot of work has been done to understand the mechanics supporting this behavior in attention-based models, much less has been done on recurrent models. Given the renewed interest in these architectures (e.g. Mamba or Griffin), a better understanding of in-context learning in these models is needed. The paper proposes to study these abilities through the notion of in-context approximation, that is whether a model can produce a certain function if it's given the right prompt. The authors show two universality results by introducing a low-level programming language that implements the canonical proof ideas for this kind of result within the neural dynamics.

Strengths

  • The paper studies in-context learning from an interesting perspective: in-context approximation theory.
  • Abstracting neural dynamics in a programming language is an elegant way to prove the approximation results of the paper.
  • The paper is overall very well written and easy to follow.

Weaknesses

  • The review of the different recurrent architectures is inaccurate. For example, Mamba and Hawk do not have an A matrix that is constant: it is x-dependent. On top of that, this matrix is diagonal with real values, which makes the implementation of the rotation algorithm presented in 4.2 impossible. More clarity on these questions, starting in Section 2 is necessary.
  • Some synthetic numerical experiments involving training networks to solve the task, e.g. comparing the solution found by gradient descent to the one from the construction or confirming the importance of gating in obtaining more compact solutions, would be an interesting add-on. I am aware that this is a theoretical paper and I am not expecting such experiments in order to recommend acceptance. Yet, they would still nicely complement the paper.
  • Section 6 does not bring much to the story.

Questions

  • Can the authors position their work compared to:
    • Zucchet et al. 2023: they study the link between gated RNNs (very similar to the one you consider) and attention, therefore connecting to the literature on how attention solves in-context learning. Additionally, they also highlight the importance of gating in practice.
    • Orvieto et al. 2023 and the references mentioned there on the universal approximation capabilities of linear RNNs. It would be valuable to compare with those results in more detail, cf. the next question.
  • I am wondering about how fundamentally different in-context approximation is compared to universal approximation. Is one a subset of the other? How are they related / what are their differences? Given that one of the contributions of the paper is the introduction of this notion, answers to these questions would be particularly valuable.
  • What is the interest of introducing such a low-level programming language? It feels to me that LSRL is a tiny abstraction on top of the neural equations that is not necessarily worth it. Why not directly include the primitives used in e.g. Figure 3 in this language?

Some more open questions:

  • The paper shows that deep linear RNNs can implement a variety of functions in context, somehow supporting the argument that linear recurrence + nonlinear instantaneous processing is quite powerful. Does the framework provided by the authors give hints about which kinds of functions are difficult to learn with such networks? The authors mention something along these lines in lines 165-167, but more detail would be appreciated.
  • Are there some links between the in-context learning approximation notion introduced here and the notion of controllability in control theory?

Limitations

The authors have correctly addressed the limitation of the applicability of their construction to more practical settings as well as highlighted that better understanding in-context learning is critical in a world where LLMs are available through APIs.

Author Response

The review of the different recurrent architectures is inaccurate. For example, Mamba and Hawk do not have an A matrix that is constant: it is x-dependent. On top of that, this matrix is diagonal with real values, which makes the implementation of the rotation algorithm presented in 4.2 impossible.

The reviewer is right that Mamba and Hawk do have an A matrix that is input-dependent and diagonal. However, the A matrix of the Linear RNN does not directly map to the A matrix of the Mamba/Hawk model. The Mamba/Hawk architecture has a number of channel-mixing operations which we can leverage to this end. This is the basis of our proof in Appendix E that any Gated Linear RNN can be expressed as a Hawk or a Griffin model. Our construction respects their diagonal transition matrix structure, as can be seen in Eq. 43.

Can the authors position their work compared to: Zucchet et al. 2023: they study the link between gated RNNs [...] and attention, therefore connecting to the literature on how attention solves in-context learning. [...]

Thank you for bringing the work of Zucchet et al. 2023 to our attention. The connection is quite interesting and possibly opens a different route for proving the same properties via transformers. While it has been shown that transformers with softmax attention are universal in-context approximators (Petrov et al., 2024), to the best of our knowledge this has not yet been proved for linear attention. Our intuition is that it is likely possible, and a proof of the universal in-context approximation properties of linear-attention transformers might go through our present work. Still, it is not clear what the prompt complexity would be in this case: whether it would match the more efficient regime for fully recurrent models from this paper or the less efficient regime for transformers from (Petrov et al., 2024).

I am wondering about how fundamentally different in-context approximation is compared to universal approximation. Is one a subset of the other? How are they related / what are their differences? [...]

Universal in-context approximation is a very different beast from the classical universal approximation. Classically, one can choose the model parameters conditional on the target function. With universal in-context approximation, however, the model parameters are fixed and independent of the target function. In other words, a single model needs to approximate all target functions (given a prompt that specifies the target function). In our experience, proving in-context properties has been more difficult because of the complex interactions between different prompt tokens. While it might appear that universal in-context approximation is a more demanding property than universal approximation and hence should be a subset of it, the two hypothesis classes could be very different and hence we could not formally define a notion of an inclusion or subset. For example, in-context approximation makes sense only for sequence-to-sequence models. Intuitively, however, it does feel that if an architecture can be a universal in-context approximator, then it should also be able to be a universal approximator.
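
To make the quantifier difference concrete, the two properties can be written schematically as follows (the norm, the function class F, and the prompt-concatenation convention are placeholders here, not the paper's exact formalism):

```latex
% Classical universal approximation: the parameters may depend on the target f.
\[
  \forall f \in \mathcal{F},\ \forall \epsilon > 0,\ \exists \theta:\quad
  \sup_{x} \bigl\| N_{\theta}(x) - f(x) \bigr\| \le \epsilon .
\]
% Universal in-context approximation: a single fixed \theta works for every f;
% only the prompt p may depend on f ([x, p] places the query before the prompt).
\[
  \exists \theta,\ \forall f \in \mathcal{F},\ \forall \epsilon > 0,\ \exists p_{f,\epsilon}:\quad
  \sup_{x} \bigl\| N_{\theta}([x,\, p_{f,\epsilon}]) - f(x) \bigr\| \le \epsilon .
\]
```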

What is the interest of introducing such a low-level programming language? It feels to me that LSRL is a tiny abstraction on top of the neural equations that is not necessarily worth it. Why not directly include the primitives used in e.g. Figure 3 in this language?

Great question! We found the bookkeeping required to work directly with the layer equations to be very messy, error-prone and not particularly intellectually stimulating. LSRL helped us streamline the proof process significantly. In a way, you could also think of it as a proof assistant of sorts, where one specifies the high-level strategy and LSRL fills in the formal details.

The paper shows that deep linear RNNs can implement a variety of functions in context, somehow supporting the argument that linear recurrence + nonlinear instantaneous processing is quite powerful. Does the framework provided by the authors give hints about which kinds of functions are difficult to learn with such networks? [...]

For any neural network architecture, one can always construct functions that are difficult for it to learn. Especially in the current "war" between Transformers and Linear RNNs, the literature is full of carefully constructed examples that trip up one or the other. Linear RNNs will always be bad at recall within large context sizes due to their fixed-size hidden state. No non-linear activations, gating, or any other fancy mechanism can fix this problem without breaking the full recurrence and bringing in some flavour of attention. That is why we put the query before the prompt, as that changes the problem into search rather than memorization and recall (see footnote 1). Simply flipping the order, i.e., putting the query after the prompt, would not be possible for arbitrary precision and a fixed hidden state.

Are there some links between the in-context learning approximation notion introduced here and the notion of controllability in control theory?

Another very good question! We did look into it but there is a key difference in the setup. With controllability, one seeks a control signal u that can bring a (linear) system to a target state x'. However, u depends on the target state x'. Instead, in the universal approximation setting, we are looking for a single control signal that is a map from all possible inputs to their respective target states. Controllability would be more closely related to the ability of a model to be prompted to produce any desirable output x'. Note that our universal in-context approximation results subsume this prompt controllability setting, as one can always define a constant function that produces x'.
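
Schematically, in standard discrete-time control notation (not taken from the paper):

```latex
% Discrete-time linear system with control input u_t:
\[
  x_{t+1} = A x_t + B u_t .
\]
% Controllability: for every target state x', there exists a control sequence
% u_0, ..., u_{T-1} (which may depend on x') steering the system to x_T = x'.
\[
  \forall x' \ \exists\, u_0(x'), \dots, u_{T-1}(x') :\quad x_T = x' .
\]
% In-context approximation instead requires a single prompt that realises the
% whole map x -> f(x), not just one particular target state.
```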

A. Petrov, P. H. S. Torr, and A. Bibi. Prompting a pretrained transformer can be a universal approximator. In ICML, 2024.

Comment

Thanks to the authors for their answers! I'll keep my score as it is.

[Minor point] In their rebuttal, the authors mention that the universality of linear RNNs was only proven in 2023. To the best of my understanding, Boyd and Chua already proved a universality result in 1985 (https://stanford.edu/~boyd/papers/pdf/fading_volterra.pdf).

Comment

Thank you for highlighting Boyd and Chua's work! The connection between what we now call Linear RNNs and the results of Boyd and Chua (due to Wang and Xue) was made only recently, but the fundamental results are indeed quite old. Thank you for the insight; we will add the Boyd and Chua reference to our updated manuscript for completeness.

Official Review
Rating: 8

This paper designs a new programming language LSRL that compiles to a fully recurrent structure and shows that multiple recurrent architectures including RNNs, LSTMs, and SSMs can serve as universal in-context approximators. They also show that multiplicative gating allows more numerically stable construction.

Strengths

  1. This paper reveals interesting properties of multiple recurrent structures and shows they can perform universal in-context approximation.
  2. The discovery of the numerical instabilities when gating is missing is also interesting and provides intuitions on the role of gating.
  3. The programming language created by this paper may be of independent interest.

Weaknesses

  1. In the introduction, the paper mentions that RNNs can be prompted to act as any token-to-token function over a finite token sequence. However, in the actual construction, the prompt specifying the function must be placed after the query. This ordering, while necessary for RNNs, is not very standard. Clarity would be improved if this caveat were mentioned earlier in the paper.

Questions

  1. If one considers a hybrid architecture that consists of both RNN and Transformer layers, can one use a combination of the LSRL and RASP languages to compile such a model?
  2. In the conclusion, it is mentioned that the compiled transition matrix is often diagonal. Are those matrices exactly diagonal here?

Limitations

Limitations are adequately addressed.

Author Response

We appreciate the reviewer's positive opinion of LSRL and our investigation into the numerical instabilities without gating.

In the introduction, the paper mentions that RNNs can be prompted to act as any token-to-token function over a finite token sequence. However, in the actual construction, the prompt specifying the function must be placed after the query. This ordering, while necessary for RNNs, is not very standard. Clarity would be improved if this caveat were mentioned earlier in the paper.

We have mentioned at the beginning of Section 2 (Line 74/footnote 1) why it is necessary to put the query before the prompt. However, we agree with the reviewer that mentioning this in the Introduction could further improve the readability of the paper.

If one considers a hybrid architecture that consists of both RNN and Transformer layers, can one use a combination of the LSRL and RASP languages to compile such a model?

This is indeed an interesting proposition. We do not see any immediate reasons why that would not be possible. As long as one is willing to specify different layers in different languages, the layers could be compiled with the respective compilers and then fused together. Some care would have to be exercised to align how variables are represented in order to ensure interoperability but that should not be a problem.

In the conclusion, it is mentioned that the compiled transition matrix is often diagonal. Are those matrices exactly diagonal here?

The transition matrix depends on the program that is to be implemented. For the two universal in-context approximation programs we have in the paper the transition matrices are not diagonal but highly sparse with many blocks being diagonal.

Comment

I thank the authors for the detailed response. I will keep my score.

Official Review
Rating: 7

The paper shows that sequence-to-sequence networks like RNNs, LSTMs, and Mamba are universal in-context approximators, which was defined as the ability to compute any function with an appropriate prompt. The authors propose a programming language called LSRL which can express any language expressible by a linear RNN (and vice-versa). The authors discuss numerical stability issues in their proposed language and discuss possible solutions. Overall, the paper streamlines a theoretical framework to understand expressivity of recurrent models and will prove to be an interesting contribution to the wider community.

Strengths

The strength of the paper lies in its clear exposition of its motivation, proposed framework, and the discussion of various frailties and solutions. Sequence-to-sequence models have proven to be competitive with transformers, but the gaps between the two architectures are still under exploration. Expressivity for transformers has been studied through RASP (Weiss et al. '21), and this paper introduces a similar programming language to understand linear RNNs. Through LSRL, the authors argue about the kind of languages linear RNNs can express and the impact of gating in RNNs. Overall, the paper takes an important step towards understanding sequence-to-sequence models.

Weaknesses

As such, I don't see a clear weakness with the work. However, I would like the authors to be clearer about their contributions. How is the theoretical framework different from the expressivity studies on feedforward networks and RNNs [1, 2]? That is, as far as I understand, the constructions in Sections 4.1 and 4.2 still construct an RNN that can do a dictionary lookup for each possible query, very similar to the ones constructed in previous works. If I understand correctly, the strength of the paper lies more in creating a systematic programming language to compile programs into RNNs. The universal expressivity happens to be a by-product of this framework and is very similar to previous constructions.

References:

[1] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

[2] A. M. Schäfer and H. G. Zimmermann. Recurrent Neural Networks Are Universal Approximators. In Artificial Neural Networks – ICANN 2006: 16th International Conference, Proceedings, Part I, pages 632–640. Springer, 2006.

Questions

Please check my questions above.

Limitations

The authors discuss the limitations of their work in section 7, and clearly indicate the future directions that the community can pursue starting from their work.

Author Response

We would like to thank the reviewer for recognizing our work's contributions towards understanding sequence-to-sequence models.

As such, I don't see a clear weakness with the work. However, I would like the authors to be clearer about their contributions. How is the theoretical framework different from the expressivity studies on feedforward networks and RNNs [1, 2]? That is, as far as I understand, the constructions in Sections 4.1 and 4.2 still construct an RNN that can do a dictionary lookup for each possible query, very similar to the ones constructed in previous works. If I understand correctly, the strength of the paper lies more in creating a systematic programming language to compile programs into RNNs. The universal expressivity happens to be a by-product of this framework and is very similar to previous constructions.

There are two key differences between the classical results the reviewer has shared and this work. First, removing the non-linearity from the state update (going from an RNN (Eq. 2) to a Linear RNN (Eq. 3)) is non-trivial and complicates the analysis significantly. In fact, it was only last year that Linear RNNs were shown to be universal approximators (see Wang and Xue, 2023). Second, and this is the key novelty of our work, universal in-context approximation is a very different beast from the classical universal approximation. Classically, one can choose the model parameters conditional on the target function. With universal in-context approximation, however, the model parameters are fixed and independent of the target function. In other words, a single model needs to approximate all target functions. That makes our proofs very different and, at least in our view, the results much more interesting, as they relate to how we often interact with large pretrained models today: via prompting rather than via training.

Shida Wang and Beichen Xue. State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory. NeurIPS 2023

Comment

I thank the authors for their response. After going through the response, I am maintaining my score.

Official Review
Rating: 4

The paper explores the potential of various recurrent neural network architectures (RNNs, LSTMs, GRUs, Linear RNNs, and gated architectures) to act as universal in-context approximators, crucial for zero-shot and in-context learning without fine-tuning. It introduces the LSRL programming language to facilitate this, highlighting the role of multiplicative gating in enhancing model stability for practical applications.

Strengths

The concept of universal approximation in the context of in-context learning is both intriguing and innovative. This research presents a unique angle that adds valuable insights to the field.

Weaknesses

  • The paper is very dense and hard to read and follow. The construction is rather tricky. Could you give a high-level idea of the proof?

  • I don't understand the practical value of proving that linear RNNs are in-context universal approximators. Linear RNNs are known to be inferior at in-context learning and retrieval (see references [1], [2], [3]). How does demonstrating that linear RNNs are in-context universal approximators mitigate this issue in practice? Isn't it somewhat contradictory to claim that RNNs are in-context universal approximators but cannot perform in-context learning well?

  • Many theoretical papers claim that non-linear RNNs are universal approximators or Turing-complete, often under impractical assumptions like infinite precision and exponential hidden state sizes. What about the assumptions in this paper? Do you think all assumptions are practical?

  • The definition of gated (linear) RNNs in this paper is very unusual. Typically, gating refers to the recurrence itself being gated by forget gates or data-dependent decays, rather than gating the output.

[1] In-Context Language Learning: Architectures and Algorithms https://arxiv.org/abs/2401.12973

[2] Simple linear attention language models balance the recall-throughput tradeoff https://arxiv.org/abs/2402.18668

[3] RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval https://arxiv.org/abs/2402.18510

Questions

See weaknesses above.

Limitations

N/A

Author Response

We thank the reviewer for finding our work to be intriguing and innovative and for adding valuable insights to the field. We would like to address their concerns.

The paper is very dense and hard to read and follow. The construction is rather tricky. Could you give a high-level idea of the proof?

The high-level ideas behind the two LSRL programs in Listing 1 and Listing 2 are rather succinct, although their implementation might indeed be a bit tricky.

For the continuous case (Listing 1, Section 4.1), the high-level idea is to discretize the domain into cells and to approximate the target function f with a piece-wise constant function g that is constant in each cell, with a value equal to f evaluated at the centre point of the cell. Then, evaluating this approximation at an input x in an iterative fashion requires one to go through the cells one by one, find the cell containing x (Lines 16-19) and return the value of f at the centre of that cell (Lines 21-23). This is a variant of the classic universal approximation results that used similar discretization constructions.
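
As a plain-Python analogue of this cell-search idea (a sketch only, not the LSRL program of Listing 1; the domain bounds, grid resolution and the sequential scan over cells are illustrative assumptions):

```python
import numpy as np

def piecewise_constant_eval(f, x, n_cells=10, lo=0.0, hi=1.0):
    """Evaluate a piece-wise constant approximation g of f at x by scanning
    the grid cells one by one (as a recurrent model would, consuming one cell
    per step), returning f at the centre of the cell that contains x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    d = x.shape[0]
    edges = np.linspace(lo, hi, n_cells + 1)
    for idx in np.ndindex(*([n_cells] * d)):      # sequential pass over cells
        lows = edges[list(idx)]
        highs = edges[[i + 1 for i in idx]]
        if np.all(x >= lows) and np.all(x < highs):
            centre = (lows + highs) / 2.0
            return f(centre)                      # value of g on this cell
    return f(np.clip(x, lo, hi))                  # boundary fallback (x == hi)

# Example: a coarse cell-centre approximation of sin on [0, 1].
print(piecewise_constant_eval(lambda z: float(np.sin(z).sum()), 0.37))  # ~sin(0.35)
```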

For the discrete case (Listing 2, Section 4.2), the high-level idea is to recognize that any discrete function can be represented as a map or a key-value dictionary. This dictionary can be provided in-context by listing its keys and values. Then, given a query, we can process the key-value pairs one by one until we find the key that matches the query (Lines 19-25). At that point, we know that the following value is the one we should output, so we copy it into a state variable and output it repeatedly (Lines 26-30).
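
The same behaviour can be mimicked with ordinary Python state variables standing in for the model's hidden state (a toy sketch, not the actual Listing 2 construction; the alternating key/value token layout is an illustrative assumption):

```python
def in_context_lookup(query, prompt_tokens):
    """Scan a prompt of alternating key/value tokens, remembering whether the
    previous key matched the query; once it does, latch the following value
    into a state variable and keep emitting it ("match, copy, then repeat")."""
    matched_prev = False   # state: did the last key token equal the query?
    answer = None          # state: latched output value
    outputs = []
    for t, token in enumerate(prompt_tokens):
        is_key = (t % 2 == 0)               # keys and values alternate
        if is_key:
            matched_prev = (token == query)
        elif matched_prev and answer is None:
            answer = token                  # copy the value following the match
        outputs.append(answer)              # emit the latched value (None until found)
    return outputs

# Example: the prompt encodes the dictionary {"a": 1, "b": 2, "c": 3}.
print(in_context_lookup("b", ["a", 1, "b", 2, "c", 3]))  # [None, None, None, 2, 2, 2]
```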

We understand that this is the trickiest part of the paper and will add the above explanations as a way to improve the presentation for the camera-ready version of the paper.

I don't understand the practical value of proving that linear RNNs are in-context universal approximators. Linear RNNs are known to be inferior at in-context learning and retrieval (see references [1], [2], [3]). How does demonstrating that linear RNNs are in-context universal approximators mitigate this issue in practice? Isn't it somewhat contradictory to claim that RNNs are in-context universal approximators but cannot perform in-context learning well?

The reviewer is raising some very interesting questions.

It is true that, especially in deep learning, there is often a tension between theory and practice. The fact that we provide a proof by construction that something is possible does not mean that these are the solutions that models will learn via gradient descent on real-world data. However, our universal in-context approximation results might imply that the low empirical performance is an issue with how we train, prompt, or evaluate these models, rather than some fundamental limitation of the in-context learning abilities of Linear RNNs. The very fact that models with good in-context learning abilities exist for these architectures (as shown in our results) indicates that we might be doing something suboptimal with the training and inference of these models.

Furthermore, there is no free lunch: for any given architecture, we can design problems that it won't perform well on. In our view, the question should not be to figure out if Linear RNNs or Transformers are the "better" architecture. They offer different trade-offs and are useful in different settings.

We would like to highlight that we have discussed the fundamental properties of such theoretical results in our Limitations section.

Many theoretical papers claim that non-linear RNNs are universal approximators or Turing-complete, often under impractical assumptions like infinite precision and exponential hidden state sizes. What about the assumptions in this paper? Do you think all assumptions are practical?

We make no assumptions about infinite precision or exponential hidden state sizes. As we use discretization anyway, infinite precision is neither necessary nor helpful for our constructions. The hidden state sizes of our constructions are also independent of the target precision ε (in the continuous setting) or the key-value dictionary size (in the discrete setting). The error bound on line 228 does assume Lipschitzness, but that is a standard condition. Therefore, we work with very few assumptions on the architecture and, hence, our assumptions are practical.
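
For reference, the generic bound behind such cell-centre discretisation arguments is sketched below (the paper's exact statement on line 228 may use different constants and norms):

```latex
% If f is L-Lipschitz and g takes, on each grid cell, the value of f at the
% cell centre c(x), then for a cell of diameter \delta containing x:
\[
  \| f(x) - g(x) \| = \| f(x) - f(c(x)) \|
  \le L\, \| x - c(x) \| \le \frac{L\,\delta}{2} .
\]
```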

The definition of gated (linear) RNNs in this paper is very unusual. Typically, gating refers to the recurrence itself being gated by forget gates or data-dependent decays, rather than gating the output.

This paper focuses on Gated Linear RNNs and Gated RNNs. By definition, the gating of Gated Linear RNNs cannot be on the recurrence itself, as the model would not be linear anymore. Similarly, the gating of Gated RNNs cannot be on the recurrence because the model would not be an RNN anymore (the standard RNN is a linear state update followed by an element-wise non-linearity). Perhaps the reviewer has LSTMs and GRUs in mind, where the gating is indeed on the recurrence. However, as we show in Appendices B, C and D, our models subsume LSTMs, GRUs and the Hawk/Griffin architecture, and hence our results apply to these architectures as well.
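
To make the distinction explicit, here is a generic sketch of the two flavours of gating (schematic notation, not the paper's Eq. 2/3; σ denotes a sigmoid and ⊙ an element-wise product):

```latex
% Output (multiplicative) gating: the state update remains a linear map and
% only the instantaneous readout is gated.
\[
  h_t = A h_{t-1} + B x_t , \qquad
  y_t = \bigl(C h_t\bigr) \odot \sigma\!\bigl(D x_t\bigr) .
\]
% Recurrence gating (data-dependent decay, as in LSTMs, GRUs or Mamba-style
% models): the state update itself is modulated by the input.
\[
  h_t = \sigma\!\bigl(W_g x_t\bigr) \odot h_{t-1} + B x_t .
\]
```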

Comment
  • How does the proof of universal in-context learning of linear RNNs guide the practical development of these models? While I am not opposed to theoretical papers, I'm not convinced that this proof will significantly benefit the development of the field. As a researcher who leans more towards empirical work, I cannot recommend acceptance for this paper unless it clearly guides practice or explains some successes or flaws in existing empirical phenomena. However, I would not oppose others in the theory community who may find this work valuable. I believe my perspective is typical among empirical researchers; theory should either guide empirical study or explain existing empirical phenomena.

  • Regarding gated RNNs, gating the recurrence remains linear if no activation is involved in the middle of the recurrence. For example, current linear models like Griffin, HGRN, Mamba, GLA, and RWKV6 all incorporate such gating in the recurrence and can leverage either parallel scan [1] or chunkwise form [2] for parallel training. Gated linear recurrence should definitely reference these linear RNN models with data-dependent forget gates or decays. The definition used in this work seems questionable to me.

[1] Parallelizing Linear Recurrent Neural Nets Over Sequence Length https://arxiv.org/abs/1709.04057

[2] Gated Linear Attention Transformers with Hardware-Efficient Training https://arxiv.org/abs/2312.06635

Comment

How does the proof of universal in-context learning of linear RNNs guide the practical development of these models? While I am not opposed to theoretical papers, I'm not convinced that this proof will significantly benefit the development of the field. As a researcher who leans more towards empirical work, I cannot recommend acceptance for this paper unless it clearly guides practice or explains some successes or flaws in existing empirical phenomena. However, I would not oppose others in the theory community who may find this work valuable. I believe my perspective is typical among empirical researchers; theory should either guide empirical study or explain existing empirical phenomena.

Our paper has the following contributions that are directly applicable to empirical research:

  1. LSRL is a tool with practical applications in itself. We develop LSRL, a new programming language that is isomorphic to a large class of recurrent models. Using this language and the compiler we have developed, one can directly program behaviours that can be then incorporated into RNNs, LSTMs, GRUs and Hawk/Griffin models. Due to this isomorphism, conversely, given a model with these architectures, we can “decompile” it as an LSRL program. This would be quite useful for people doing interpretability, explainability and safety research for models with these architectures.
  2. We explain the successes of architectures with multiplicative gating. Recurrent architectures with multiplicative gating such as Mamba and Hawk have outperformed architectures without it. With our experiments, we show that multiplicative gating is much more numerically stable when one wants the model to follow an algorithm in-context precisely, which is how a lot of the evaluation and benchmarks are set up. We also show that multiplicative gating enables a much more compact implementation of concrete algorithms, and hence can explain why smaller models with multiplicative gates are sometimes comparable in performance to larger models that do not have them. Therefore, our work explains one aspect that can drive the success of gating architectures and guides the practice to use multiplicative gating.

We believe that beyond serving as a basis for engineering and applied research, the foundation of science is a desire to understand the world around us for the joy of discovery and knowledge. The development of machine learning is not restricted to building new models, but also includes understanding, characterizing and organizing the properties of the methods, tools and mathematical models we already have. We are positive that a large part of the NeurIPS audience feels the same way, which is why "Theory" is one of the areas explicitly mentioned in this year's Call for Papers. In this way, our work fills key gaps in the characterization of the universal approximation properties of neural network models, an area with a long history, while also serving as a guide for design decisions for recurrent architectures and explaining some successes of multiplicative gating.

Regarding gated RNNs, gating the recurrence remains linear if no activation is involved in the middle of the recurrence. For example, current linear models like Griffin, HGRN, Mamba, GLA, and RWKV6 all incorporate such gating in the recurrence and can leverage either parallel scan [1] or chunkwise form [2] for parallel training. Gated linear recurrence should definitely reference these linear RNN models with data-dependent forget gates or decays. The definition used in this work seems questionable to me.

Perhaps we are using two different definitions of "linear recurrence". We mean it in the sense of a "linear time-invariant system" (LTI). As such, models like Mamba and Griffin would not satisfy this definition. They would be something like a "linear time-variant system" (LTV). However, as every time-invariant system is a trivial time-variant system, the class of LTV systems is a proper superset of the class of LTI systems. Therefore, if there is an instance of an LTI system with a certain property (in our case, universal in-context approximation), then it immediately follows that there is an instance of an LTV system with that property.
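
In symbols, the distinction drawn here is roughly the following (schematic forms, not the exact parameterisation of any particular model):

```latex
% Linear time-invariant (LTI) recurrence, the sense used in this work:
\[
  h_t = A\, h_{t-1} + B\, x_t , \qquad A,\ B \ \text{fixed}.
\]
% Linear time-variant (LTV) recurrence, as in Mamba/Griffin-style gating:
\[
  h_t = A(x_t)\, h_{t-1} + B(x_t)\, x_t , \qquad A(\cdot),\ B(\cdot)\ \text{input-dependent}.
\]
% Every LTI system is a (trivial) LTV system, so a property exhibited by some
% LTI instance is automatically exhibited by some LTV instance.
```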

In our work we show that the more restricted setting is already a universal in-context approximator, automatically making the more general setting you are referring to a universal in-context approximator as well. We show this connection explicitly and formally in Appendix D, where we show that the Hawk/Griffin architecture is also a universal in-context approximator. Hence, the definition we use is not limiting our results in any way whatsoever; it is just the most general (or, if one wants, most abstract) definition that serves the goal of our work.

Final Decision

This work explores the ability of recurrent neural nets to be universal approximators through prompting. A major question of interest is what potential benefits transformer-based architectures have over state space models like RNNs. One potential explanation is approximation-theoretic: perhaps transformers are simply better function approximators. This work shows this is not the case, since both are capable of being universal approximators in-context. The majority of the reviewers were in support of this work; I concur and recommend acceptance.