First published at 07:49 UTC on August 2nd, 2020.
#ai #nlp #attention
The quadratic resource requirements of the attention mechanism are the main roadblock in scaling up transformers to long sequences. This paper replaces the full quadratic attention mechanism with a combination of random attention, window attention, and global attention. This not only allows the processing of much longer sequences, translating to state-of-the-art experimental results, but the paper also shows that BigBird comes with theoretical guarantees of universal approximation and Turing completeness.
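To make the three patterns concrete, here is a minimal NumPy sketch that builds a combined attention mask from window, random, and global connections. It is only an illustration of the idea discussed in the video, not the paper's blocked implementation; all parameter names and values (seq_len, window_size, num_random, num_global) are illustrative assumptions.

```python
import numpy as np

def bigbird_style_mask(seq_len=16, window_size=3, num_random=2, num_global=2, seed=0):
    # Illustrative sketch: True means query token i may attend to key token j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Window attention: each token attends to its local neighbourhood.
    for i in range(seq_len):
        lo = max(0, i - window_size // 2)
        hi = min(seq_len, i + window_size // 2 + 1)
        mask[i, lo:hi] = True

    # Random attention: each token attends to a few randomly chosen tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    # Global attention: a few tokens (e.g. a CLS-like token) attend to
    # everything and are attended to by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

# Each row has roughly (window + random + global) active entries instead of
# seq_len, so the number of attention scores grows linearly with sequence length.
print(bigbird_style_mask().sum(axis=1))
```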
OUTLINE:
0:00 - Intro & Overview
1:50 - Quadratic Memory in Full Attention
4:55 - Architecture Overview
6:35 - Random Attention
10:10 - Window Attention
13:45 - Global Attention
15:40 - Architecture Summary
17:10 - Theoretical Result
22:00 - Experimental Parameters
25:35 - Structured Block Computations
29:30 - Recap
31:50 - Experimental Results
34:05 - Conclusion
Paper: https://arxiv.org/abs/2007.14062
My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM
My Video on Longformer: https://youtu.be/_8KNb5iqblE
... and its memory requirements: https://youtu.be/gJR28onlqzs
Abstract:
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware.
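A back-of-the-envelope comparison of the quadratic vs. linear scaling claimed in the abstract; the numbers below are illustrative assumptions, not figures from the paper.

```python
# Full attention stores ~n^2 scores, while a sparse pattern with a fixed number
# of window + random + global connections per token stores ~n * k scores.
n = 4096        # assumed sequence length
k = 192         # assumed connections per token (window + random + global)
print(f"full attention scores:   {n * n:,}")   # 16,777,216
print(f"sparse attention scores: {n * k:,}")   #    786,432
```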