First published at 07:49 UTC on August 2nd, 2020.
#ai #nlp #attention
The quadratic resource requirements of the attention mechanism are the main roadblock in scaling up transformers to long sequences. This paper replaces the full quadratic attention mechanism with a combination of random attention, window attention, and global attention. This not only allows the processing of much longer sequences, translating to state-of-the-art experimental results, but the paper also shows that BigBird comes with theoretical guarantees of universal approximation and Turing completeness.
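To make the three patterns concrete, here is a minimal NumPy sketch that builds a combined attention mask from window, random, and global connections. It is only an illustration of the idea discussed in the video, not the paper's blocked implementation; all parameter names and values (seq_len, window_size, num_random, num_global) are illustrative assumptions.

```python
import numpy as np

def bigbird_style_mask(seq_len=16, window_size=3, num_random=2, num_global=2, seed=0):
    # Illustrative sketch: True means query token i may attend to key token j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Window attention: each token attends to its local neighbourhood.
    for i in range(seq_len):
        lo = max(0, i - window_size // 2)
        hi = min(seq_len, i + window_size // 2 + 1)
        mask[i, lo:hi] = True

    # Random attention: each token attends to a few randomly chosen tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    # Global attention: a few tokens (e.g. a CLS-like token) attend to
    # everything and are attended to by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

# Each row has roughly (window + random + global) active entries instead of
# seq_len, so the number of attention scores grows linearly with sequence length.
print(bigbird_style_mask().sum(axis=1))
```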
OUTLINE:
0:00 - Intro & Overview
1:50 - Quadratic Memory in Full Attention
4:55 - Architecture Overview
6:35 - Random Attention
10:10 - Window Attention
13:45 - Global Attention
15:40 - Architecture Summary
17:10 - Theoretical Result
22:00 - Experimental Parameters
25:35 - Structured Block Computations
29:30 - Recap
31:50 - Experimental Results
34:05 - Conclusion
Paper: https://arxiv.org/abs/2007.14062
My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM
My Video on Longformer: https://youtu.be/_8KNb5iqblE
... and its memory requirements: https://youtu.be/gJR28onlqzs
Abstract:
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware.
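A back-of-the-envelope comparison of the quadratic vs. linear scaling claimed in the abstract; the numbers below are illustrative assumptions, not figures from the paper.

```python
# Full attention stores ~n^2 scores, while a sparse pattern with a fixed number
# of window + random + global connections per token stores ~n * k scores.
n = 4096        # assumed sequence length
k = 192         # assumed connections per token (window + random + global)
print(f"full attention scores:   {n * n:,}")   # 16,777,216
print(f"sparse attention scores: {n * k:,}")   #    786,432
```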