LongT5-Mulla: LongT5 With Multi-Level Local Attention for a Longer Sequence

TLDR

This paper proposes multi-level local attention (Mulla attention), a hierarchical local attention that acts simultaneously on the input sequence and on multiple pooling sequences of different granularity, enabling long-range modeling while maintaining linear or log-linear complexity.

Abstract

Efficient Transformer models typically employ local and global attention methods, or utilize hierarchical or recurrent architectures, to process long text inputs in natural language processing tasks. However, when extended to longer sequences, these models must sacrifice efficiency, accuracy, or compatibility. To maintain both the accuracy of global attention and the efficiency of local attention, while remaining compatible enough to be easily applied to an existing pre-trained model, we propose multi-level local attention (Mulla attention), a hierarchical local attention that acts simultaneously on the input sequence and on multiple pooling sequences of different granularity, enabling long-range modeling while maintaining linear or log-linear complexity. We apply Mulla attention to LongT5 to implement our LongT5-Mulla sequence-to-sequence model, without introducing new parameters except for positional embeddings. Experiments show that our model surpasses all baseline models, including the two original variants of LongT5, on the 8~16k-input long text summarization task on the Multi-News, arXiv and WCEP-10 datasets, with improvements of at least +0.22, +0.01 and +0.52 percentage points (pp) in averaged ROUGE scores, respectively. At the same time, it can effectively process longer sequences of 16~48k tokens with at least 52.6% lower memory consumption than LongT5-tglobal, and with averaged ROUGE scores +0.56~1.62 pp higher than LongT5-local. These results demonstrate that our proposed LongT5-Mulla model can effectively process long sequences and extend the maximum input length for long text tasks from 16k to 48k while maintaining accuracy and efficiency.
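
To make the idea of local attention over several pooling granularities concrete, the following is a minimal NumPy sketch. It is not the LongT5-Mulla implementation: the function names, the use of mean pooling, the fixed window radius, the pooling factor, and the omission of learned projections, multiple heads, and positional embeddings are all assumptions made for illustration. Each token attends to a small window of the input sequence and to a small window around its own position in each pooled sequence.

```python
import numpy as np


def mean_pool(x, factor):
    """Mean-pool rows of x (n, d) in non-overlapping blocks of `factor` rows.

    The last block may be shorter; NaN padding plus nanmean averages only
    over the real rows. (Mean pooling is an assumption of this sketch.)
    """
    n, d = x.shape
    n_blocks = -(-n // factor)  # ceiling division
    pad = n_blocks * factor - n
    padded = np.concatenate([x, np.full((pad, d), np.nan)], axis=0)
    return np.nanmean(padded.reshape(n_blocks, factor, d), axis=1)


def multi_level_local_attention(x, num_levels=3, radius=2, factor=2):
    """Illustrative hierarchical local attention (hypothetical helper).

    Level 0 is the input itself; level k is level k-1 pooled by `factor`.
    Token i attends to a window of `radius` positions on each side of
    position i // factor**k at every level k, with one softmax over all
    collected keys. Queries, keys and values are the raw vectors here.
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape

    # Build the pooled sequences, from fine to coarse.
    levels = [x]
    for _ in range(1, num_levels):
        levels.append(mean_pool(levels[-1], factor))

    out = np.zeros_like(x)
    for i in range(n):
        windows = []
        for k, seq in enumerate(levels):
            center = i // (factor ** k)  # position of token i at level k
            lo = max(0, center - radius)
            hi = min(len(seq), center + radius + 1)
            windows.append(seq[lo:hi])
        kv = np.concatenate(windows, axis=0)

        # Scaled dot-product attention over the collected local windows.
        scores = kv @ x[i] / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ kv
    return out
```

In this sketch each token sees at most `num_levels * (2 * radius + 1)` keys, so the cost grows linearly with the sequence length when `num_levels` is fixed, and log-linearly if the number of levels is chosen to grow with the logarithm of the length. Coarser levels cover exponentially wider spans of the input, which is how a purely local window can still reach distant context.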