ICLR 2024
Quasi-Recurrent Gist Attention: Efficiently Modeling Long Context in Large Language Model
Abstract
Transformer-based Large Language Models (LLMs) have achieved state-of-the-art results on numerous Natural Language Processing tasks. However, LLMs typically come with a predetermined context window size. This limitation, combined with the quadratic complexity of self-attention, makes pretrained LLMs struggle with long sequences. In this work, we introduce a quasi-recurrent gist attention mechanism designed to effectively capture long contextual information within LLMs. The proposed approach employs quasi-recurrent context compression to iteratively integrate historical context into the gist representation. Quasi-recurrent gist attention reduces the computational complexity of full attention from $O(n^2)$ to $O(n)$ without changing the original Transformer architecture, which enables seamless fine-tuning from pretrained language models such as Llama \cite{touvron2023llama} and naturally extends the context window. Experimental results indicate that the proposed attention mechanism outperforms the full-attention approach on multiple public benchmarks while significantly reducing the latency of long-context modeling.
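To make the linear-cost idea concrete, below is a minimal PyTorch sketch of how a quasi-recurrent gist attention pass could be structured, based only on the abstract's description. It is not the authors' implementation: the fixed segment length, the number of gist tokens, the use of a single shared attention module for both the segment output and the gist update, and the omission of causal masking are all illustrative assumptions. Each segment attends only to the running gist tokens plus itself, and the gist tokens are then updated to fold in the new segment, so cost grows linearly with sequence length.

```python
import torch
import torch.nn as nn


class QuasiRecurrentGistAttention(nn.Module):
    """Illustrative sketch: segment-wise attention with a recurrently updated gist state."""

    def __init__(self, d_model: int, n_heads: int, n_gist: int, segment_len: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.segment_len = segment_len
        # Learned initial gist tokens; they carry compressed history across segments.
        self.init_gist = nn.Parameter(torch.randn(1, n_gist, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        gist = self.init_gist.expand(batch, -1, -1)  # running gist state
        outputs = []
        for start in range(0, seq_len, self.segment_len):
            seg = x[:, start:start + self.segment_len]
            # Keys/values are the compressed history plus the current segment,
            # so per-segment cost is O(segment_len * (n_gist + segment_len)).
            kv = torch.cat([gist, seg], dim=1)
            seg_out, _ = self.attn(seg, kv, kv)
            outputs.append(seg_out)
            # Quasi-recurrent update: gist tokens re-attend over the segment
            # to integrate its information into the compressed history.
            gist, _ = self.attn(gist, kv, kv)
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    layer = QuasiRecurrentGistAttention(d_model=64, n_heads=4, n_gist=8, segment_len=32)
    x = torch.randn(2, 128, 64)
    print(layer(x).shape)  # torch.Size([2, 128, 64])
```

Because the per-segment attention window has constant size (segment length plus a fixed number of gist tokens), total cost scales as $O(n)$ in the sequence length, in contrast to the $O(n^2)$ cost of full attention.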
Keywords
Large Language Models, gist attention, long contextual information