Yannic Kilcher

#ai #neuroscience #rl

Reinforcement Learning is a powerful tool, but it lacks biological plausibility because it learns a fixed policy network. Animals use neuroplasticity to reconfigure their policies on the fly and quickly adapt to new situations. This paper uses Hebbian Learning, a biologically inspired technique, to let agents adapt randomly initialized networks into high-performing solutions as an episode progresses, leading to agents that can reconfigure themselves in response to new observations.
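
As a rough sketch of the kind of rule being evolved: every synapse gets its own coefficients of a generalized Hebbian (ABCD) update that depends only on local pre- and post-synaptic activity, and those coefficients are found by evolutionary search. The NumPy snippet below is illustrative, not the authors' code; names and shapes are mine.

import numpy as np

def hebbian_step(w, pre, post, A, B, C, D, eta):
    # Generalized Hebbian (ABCD) rule: each synapse updates from local
    # pre-/post-synaptic activity only, with no reward or error signal.
    # w, A, B, C, D, eta: (n_post, n_pre); pre: (n_pre,); post: (n_post,)
    return w + eta * (A * np.outer(post, pre)
                      + B * pre[None, :]
                      + C * post[:, None]
                      + D)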

OUTLINE:
0:00 - Intro & Overview
2:30 - Reinforcement Learning vs Hebbian Plasticity
9:00 - Episodes in Hebbian Learning
10:00 - Hebbian Plasticity Rules
18:10 - Quadruped Experiment Results
21:20 - Evolutionary Learning of Hebbian Plasticity
29:10 - More Experimental Results
34:50 - Conclusions
35:30 - Broader Impact Statement

Videos: https://twitter.com/risi1979/status/1280544779630186499
Paper: https://arxiv.org/abs/2007.02686

Abstract:
Lifelong learning and adaptability are two defining aspects of biological agents. Modern reinforcement learning (RL) approaches have shown significant progress in solving complex tasks, however once training is concluded, the found solutions are typically static and incapable of adapting to new information or perturbations. While it is still not completely understood how biological brains learn and adapt so efficiently from experience, it is believed that synaptic plasticity plays a prominent role in this process. Inspired by this biological mechanism, we propose a search method that, instead of optimizing the weight parameters of neural networks directly, only searches for synapse-specific Hebbian learning rules that allow the network to continuously self-organize its weights during the lifetime of the agent. We demonstrate our approach on several reinforcement learning tasks with different sensory modalities and more than 450K trainable plasticity parameters. We find that starting from completely random weights, the discovered Hebbian rules enable an agent to navigate a dynamical 2D-pixel environment; likewise they allow a simulated 3D quadrupedal robot to learn how to walk while adapting to different morphological damage in the absence of any explicit reward or error signal.

Authors: Elias Najarro, Sebastian Risi

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #transformer #attention

Hopfield Networks are one of the classic models of biological memory networks. This paper generalizes modern Hopfield Networks to continuous states and shows that the corresponding update rule is equal to the attention mechanism used in modern Transformers. It further analyzes a pre-trained BERT model through the lens of Hopfield Networks and uses a Hopfield Attention Layer to perform Immune Repertoire Classification.
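
To make the connection concrete, here is a minimal sketch of the continuous Hopfield update: it is exactly the softmax attention pattern with queries, keys, and values tied to the stored patterns (variable names are mine, not the paper's).

import numpy as np

def hopfield_retrieve(xi, X, beta=1.0):
    # One update step of the modern continuous Hopfield network.
    # xi: state/query vector (d,); X: stored patterns as columns (d, N).
    scores = beta * (X.T @ xi)            # similarities to stored patterns
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax over stored patterns
    return X @ p                          # retrieved (possibly averaged) pattern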

OUTLINE:
0:00 - Intro & Overview
1:35 - Binary Hopfield Networks
5:55 - Continuous Hopfield Networks
8:15 - Update Rules & Energy Functions
13:30 - Connection to Transformers
14:35 - Hopfield Attention Layers
26:45 - Theoretical Analysis
48:10 - Investigating BERT
1:02:30 - Immune Repertoire Classification

Paper: https://arxiv.org/abs/2008.02217
Code: https://github.com/ml-jku/hopfield-layers
Immune Repertoire Classification Paper: https://arxiv.org/abs/2007.13505

My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. We provide a new PyTorch layer called "Hopfield", which allows to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: this https URL

Authors: Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #tech #code

A whole bunch of humans are arguing whether 2+2=4 or 2+2=5. Pointless! Let the machines handle this!

Colab: https://colab.research.google.com/drive/1tDjFW7CFGQG8vHdUAVNpr2EG9z0JZGYC?usp=sharing

Disclaimer: This is a joke.

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Unlabeled data on the internet is abundant; images in particular are plentiful and can be collected with ease. This paper investigates a new method for incorporating unlabeled data into a supervised learning pipeline. First, a teacher model is trained in a supervised fashion. Then, that teacher is used to label the unlabeled data. Next, a larger student model is trained on the combination of all data and achieves better performance than the teacher by itself.
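
The training loop, written out as a sketch; make_model and train are assumed user-supplied callbacks and models are assumed to expose .predict, so this is not the official code.

def noisy_student(make_model, train, labeled, unlabeled, rounds=3):
    # make_model(size) -> fresh model, train(model, data, noise) -> fitted model.
    # labeled is a list of (image, label) pairs, unlabeled a list of images.
    teacher = train(make_model(size=0), labeled, noise=False)
    for r in range(1, rounds + 1):
        # Teacher pseudo-labels the unlabeled images (no noise at inference).
        pseudo = [(x, teacher.predict(x)) for x in unlabeled]
        # An equal-or-larger student is trained WITH noise (dropout, stochastic
        # depth, RandAugment) on labeled + pseudo-labeled data, then becomes
        # the teacher for the next round.
        teacher = train(make_model(size=r), list(labeled) + pseudo, noise=True)
    return teacher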

OUTLINE:
0:00 - Intro & Overview
1:05 - Semi-Supervised & Transfer Learning
5:45 - Self-Training & Knowledge Distillation
10:00 - Noisy Student Algorithm Overview
20:20 - Noise Methods
22:30 - Dataset Balancing
25:20 - Results
30:15 - Perturbation Robustness
34:35 - Ablation Studies
39:30 - Conclusion & Comments

Paper: https://arxiv.org/abs/1911.04252
Code: https://github.com/google-research/noisystudent
Models: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet

Abstract:
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1 accuracy from 61.0% to 83.7%, reduces ImageNet-C mean corruption error from 45.7 to 28.3, and reduces ImageNet-P mean flip rate from 27.8 to 12.2.
Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images. We iterate this process by putting back the student as the teacher. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment to the student so that the student generalizes better than the teacher. Models are available at this https URL. Code is available at this https URL.

Authors: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #research #machinelearning

Neural Architecture Search is typically very slow and resource-intensive. A meta-controller has to train many hundreds or thousands of different models to find a suitable building plan. This paper proposes to use statistics of the Jacobian around data points to estimate the performance of proposed architectures at initialization. This method does not require training and speeds up NAS by orders of magnitude.
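
The core scoring idea, sketched in PyTorch; this paraphrases the original version of the score, and the exact statistic used in the paper differs in details.

import torch

def jacobian_score(net, x):
    # Score an untrained architecture by how differently it linearizes
    # around the data points of a single minibatch: low correlation
    # between per-example input Jacobians suggests more modelling flexibility.
    x = x.clone().requires_grad_(True)
    net(x).sum().backward()
    J = x.grad.reshape(x.shape[0], -1)           # one flattened Jacobian per example
    C = torch.corrcoef(J)                        # correlations across examples
    eig = torch.linalg.eigvalsh(C).clamp_min(1e-5)
    return -(torch.log(eig) + 1.0 / eig).sum()   # higher is better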

OUTLINE:
0:00 - Intro & Overview
0:50 - Neural Architecture Search
4:15 - Controller-based NAS
7:35 - Architecture Search Without Training
9:30 - Linearization Around Datapoints
14:10 - Linearization Statistics
19:00 - NAS-201 Benchmark
20:15 - Experiments
34:15 - Conclusion & Comments

Paper: https://arxiv.org/abs/2006.04647
Code: https://github.com/BayesWatch/nas-without-training

Abstract:
The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be extremely slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be remedied if we could infer a network's trained accuracy from its initial state. In this work, we examine how the linear maps induced by data points correlate for untrained network architectures in the NAS-Bench-201 search space, and motivate how this can be used to give a measure of modelling flexibility which is highly indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU. Code to reproduce our experiments is available at this https URL.

Authors: Joseph Mellor, Jack Turner, Amos Storkey, Elliot J. Crowley

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #research #word2vec

Word vectors have been one of the most influential techniques in modern NLP to date. This paper describes Word2Vec, which is the most popular technique for obtaining word vectors. The paper introduces the negative sampling technique as an approximation to noise contrastive estimation and shows that this allows the training of word vectors from giant corpora on a single machine in a very short time.
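
For a single (center, context) pair, the negative-sampling objective looks roughly like this (a NumPy sketch, not the original C code):

import numpy as np

def sgns_loss(v_center, u_context, u_negatives):
    # Skip-gram with negative sampling: pull the true context word towards
    # the center word, push k sampled "noise" words away. In the paper the
    # negatives are drawn proportional to unigram_count ** 0.75.
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    pos = np.log(sigmoid(u_context @ v_center))
    neg = np.log(sigmoid(-u_negatives @ v_center)).sum()
    return -(pos + neg)   # negative log-likelihood to be minimized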

OUTLINE:
0:00 - Intro & Outline
1:50 - Distributed Word Representations
5:40 - Skip-Gram Model
12:00 - Hierarchical Softmax
14:55 - Negative Sampling
22:30 - Mysterious 3/4 Power
25:50 - Frequent Words Subsampling
28:15 - Empirical Results
29:45 - Conclusion & Comments

Paper: https://arxiv.org/abs/1310.4546
Code: https://code.google.com/archive/p/word2vec/

Abstract:
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #deeplearning #gan

GANs are one of the main models in modern deep learning. This is the paper that started it all! While the task of image classification was making progress, the task of image generation was still cumbersome and prone to artifacts. The main idea behind GANs is to pit two competing networks against each other, thereby creating a generative model that only ever has implicit access to the data through a second, discriminative, model. The paper combines architecture, experiments, and theoretical analysis beautifully.
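
One alternating training step, sketched in PyTorch with the non-saturating generator loss the paper recommends in practice. G, D (assumed to output probabilities in (0, 1)) and the optimizers are set up elsewhere; this is a sketch, not the reference implementation.

import torch

def gan_step(G, D, opt_g, opt_d, real, z):
    # Discriminator: tell real images from generated ones.
    fake = G(z)
    d_loss = -(torch.log(D(real)).mean() + torch.log(1 - D(fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))),
    # which gives stronger gradients early in training.
    g_loss = -torch.log(D(fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()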

OUTLINE:
0:00 - Intro & Overview
3:50 - Motivation
8:40 - Minimax Loss Function
13:20 - Intuition Behind the Loss
19:30 - GAN Algorithm
22:05 - Theoretical Analysis
27:00 - Experiments
33:10 - Advantages & Disadvantages
35:00 - Conclusion

Paper: https://arxiv.org/abs/1406.2661

Abstract:
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

Authors: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

AlexNet was the start of the deep learning revolution. Up until 2012, the best computer vision systems relied on hand-crafted features and highly specialized algorithms to perform object classification. This paper was the first to successfully train a deep convolutional neural network on not one, but two GPUs, and it outperformed the competition on ImageNet by a large margin.

OUTLINE:
0:00 - Intro & Overview
2:00 - The necessity of larger models
6:20 - Why CNNs?
11:05 - ImageNet
12:05 - Model Architecture Overview
14:35 - ReLU Nonlinearities
18:45 - Multi-GPU training
21:30 - Classification Results
24:30 - Local Response Normalization
28:05 - Overlapping Pooling
32:25 - Data Augmentation
38:30 - Dropout
40:30 - More Results
43:50 - Conclusion

Paper: http://www.cs.toronto.edu/~hinton/absps/imagenet.pdf

Abstract:
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #dqn #deepmind

After the initial success of deep neural networks, especially convolutional neural networks on supervised image processing tasks, this paper was the first to demonstrate their applicability to reinforcement learning. Deep Q Networks learn from pixel input to play seven different Atari games and outperform baselines that require hand-crafted features. This paper kicked off the entire field of deep reinforcement learning and positioned DeepMind as one of the leading AI companies in the world.
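
The learning signal boils down to a regression onto a bootstrapped Q-target; a PyTorch sketch of the loss (the later Nature follow-up additionally freezes a separate target copy of the network):

import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    # Transitions are sampled from an experience replay buffer;
    # done is a 0/1 float marking terminal states.
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return F.mse_loss(q, target)                               # TD error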

OUTLINE:
0:00 - Intro & Overview
2:50 - Arcade Learning Environment
4:25 - Deep Reinforcement Learning
9:20 - Deep Q-Learning
26:30 - Experience Replay
32:25 - Network Architecture
33:50 - Experiments
37:45 - Conclusion

Paper: https://arxiv.org/abs/1312.5602

Abstract:
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #research #gaming

Deep RL is usually used to solve games, but this paper turns the process on its head and applies RL to game level creation. Compared to traditional approaches, it frames level design as a sequential decision-making process and ends up with a fast and diverse level generator.

OUTLINE:
0:00 - Intro & Overview
1:30 - Level Design via Reinforcement Learning
3:00 - Reinforcement Learning
4:45 - Observation Space
5:40 - Action Space
15:40 - Change Percentage Limit
20:50 - Quantitative Results
22:10 - Conclusion & Outlook

Paper: https://arxiv.org/abs/2001.09212
Code: https://github.com/amidos2006/gym-pcgrl

Abstract:
We investigate how reinforcement learning can be used to train level-designing agents. This represents a new approach to procedural content generation in games, where level design is framed as a game, and the content generator itself is learned. By seeing the design problem as a sequential task, we can use reinforcement learning to learn how to take the next action so that the expected final level quality is maximized. This approach can be used when few or no examples exist to train from, and the trained generator is very fast. We investigate three different ways of transforming two-dimensional level design problems into Markov decision processes and apply these to three game environments.

Authors: Ahmed Khalifa, Philip Bontrager, Sam Earle, Julian Togelius

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #nlp #attention

The quadratic resource requirements of the attention mechanism are the main roadblock in scaling up transformers to long sequences. This paper replaces the full quadratic attention mechanism with a combination of random attention, window attention, and global attention. Not only does this allow the processing of longer sequences, translating to state-of-the-art experimental results, but the paper also shows that BigBird comes with theoretical guarantees of universal approximation and Turing completeness.
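
The sparsity pattern itself is easy to picture; here is a toy NumPy construction of the combined mask (the paper works with blocks and different sizes, so this is only illustrative):

import numpy as np

def bigbird_mask(n, window=3, n_random=2, n_global=2, seed=0):
    # True = query i may attend to key j. Window + random + global
    # connections give O(n) nonzero entries instead of O(n^2).
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window // 2), min(n, i + window // 2 + 1)
        mask[i, lo:hi] = True                                        # sliding window
        mask[i, rng.choice(n, size=n_random, replace=False)] = True  # random links
    mask[:n_global, :] = True    # global tokens attend to everything
    mask[:, :n_global] = True    # everything attends to global tokens
    return mask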

OUTLINE:
0:00 - Intro & Overview
1:50 - Quadratic Memory in Full Attention
4:55 - Architecture Overview
6:35 - Random Attention
10:10 - Window Attention
13:45 - Global Attention
15:40 - Architecture Summary
17:10 - Theoretical Result
22:00 - Experimental Parameters
25:35 - Structured Block Computations
29:30 - Recap
31:50 - Experimental Results
34:05 - Conclusion

Paper: https://arxiv.org/abs/2007.14062

My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM
My Video on Longformer: https://youtu.be/_8KNb5iqblE
... and its memory requirements: https://youtu.be/gJR28onlqzs

Abstract:
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.

Authors: Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

#ai #research #resnet

ResNets are one of the cornerstones of modern Computer Vision. Before their invention, people were not able to scale deep neural networks beyond 20 or so layers, but with this paper's invention of residual connections, all of a sudden networks could be arbitrarily deep. This led to a big spike in the performance of convolutional neural networks and rapid adoption in the community. To this day, ResNets are the backbone of most vision models and residual connections appear all throughout deep learning.
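
The central idea fits in a few lines; a minimal basic block in PyTorch (same channel count, stride 1; the paper also uses projection shortcuts and bottleneck blocks):

import torch.nn as nn

class ResidualBlock(nn.Module):
    # The convolutional stack learns a residual F(x); the identity shortcut
    # adds x back, so very deep stacks can default to the identity mapping.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # residual connection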

OUTLINE:
0:00 - Intro & Overview
1:45 - The Problem with Depth
3:15 - VGG-Style Networks
6:00 - Overfitting is Not the Problem
7:25 - Motivation for Residual Connections
10:25 - Residual Blocks
12:10 - From VGG to ResNet
18:50 - Experimental Results
23:30 - Bottleneck Blocks
24:40 - Deeper ResNets
28:15 - More Results
29:50 - Conclusion & Comments

Paper: https://arxiv.org/abs/1512.03385

Abstract:
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.
The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar (preferred to Patreon): https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Neural networks for implicit representations, such as SIRENs, have been very successful at modeling natural signals. However, in the classical approach, each data point requires its own neural network to be fit. This paper extends implicit representations to an entire dataset by introducing latent vectors of data points to SIRENs. Interestingly, the paper shows that such latent vectors can be obtained without an explicit encoder, simply by taking the negative gradient of the reconstruction loss with respect to a zero-initialized latent vector.
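
The encoder-free trick, sketched in PyTorch; f is an assumed implicit network taking coordinates and a latent vector, so this is a sketch of the idea rather than the authors' code.

import torch

def gon_forward(f, coords, targets, latent_dim):
    # Start from a zero latent, measure the reconstruction loss, and use the
    # negative gradient w.r.t. that zero vector as the latent code.
    z0 = torch.zeros(1, latent_dim, requires_grad=True)
    loss = ((f(coords, z0) - targets) ** 2).mean()
    (grad,) = torch.autograd.grad(loss, z0, create_graph=True)
    z = -grad                      # "encoded" latent, no encoder network needed
    return f(coords, z)            # second pass decodes from the inferred latent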

OUTLINE:
0:00 - Intro & Overview
2:10 - Implicit Generative Models
5:30 - Implicitly Represent a Dataset
11:00 - Gradient Origin Networks
23:55 - Relation to Gradient Descent
28:05 - Messing with their Code
37:40 - Implicit Encoders
38:50 - Using GONs as classifiers
40:55 - Experiments & Conclusion

Paper: https://arxiv.org/abs/2007.02798
Code: https://github.com/cwkx/GON
Project Page: https://cwkx.github.io/data/GON/

My Video on SIREN: https://youtu.be/Q5g3p9Zwjrk

Abstract:
This paper proposes a new type of implicit generative model that is able to quickly learn a latent representation without an explicit encoder. This is achieved with an implicit neural network that takes as inputs points in the coordinate space alongside a latent vector initialised with zeros. The gradients of the data fitting loss with respect to this zero vector are jointly optimised to act as latent points that capture the data manifold. The results show similar characteristics to autoencoders, but with fewer parameters and the advantages of implicit representation networks.

Authors: Sam Bond-Taylor, Chris G. Willcocks

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher

VAEs have been traditionally hard to train at high resolutions and unstable when going deep with many layers. In addition, VAE samples are often more blurry and less crisp than those from GANs. This paper details all the engineering choices necessary to successfully train a deep hierarchical VAE that exhibits global consistency and astounding sharpness at high resolutions.

OUTLINE:
0:00 - Intro & Overview
1:55 - Variational Autoencoders
8:25 - Hierarchical VAE Decoder
12:45 - Output Samples
15:00 - Hierarchical VAE Encoder
17:20 - Engineering Decisions
22:10 - KL from Deltas
26:40 - Experimental Results
28:40 - Appendix
33:00 - Conclusion

Paper: https://arxiv.org/abs/2007.03898

Abstract:
Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels.

Authors: Arash Vahdat, Jan Kautz

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher

I take a closer look at "Supermasks in Superposition" after I've already done a video on it. Specifically, I look at: 1. The intuition and theoretical justification behind the G objective, 2. Whether Supermasks and Superposition can be viewed as two distinct ideas and 3. The Paper's Broader Impact Statement.

OUTLINE:
0:00 - Intro & Overview
2:00 - SupSup Recap
4:00 - In-Depth Analysis of the G Objective
20:30 - Superposition without Supermasks
25:40 - Broader Impact Statement
36:40 - Conclusion
37:20 - Live Coding

Part 1 on SupSup: https://youtu.be/3jT1qJ8ETzk
My Code: https://colab.research.google.com/drive/1bEcppdN6qZRpEFplIiv41ZI3vDwDjcvC?usp=sharing
Paper: https://arxiv.org/abs/2006.14769

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher

#machinelearning #ai #google

The high-level architecture of CNNs has not really changed over the years. We tend to build high-resolution layers with few channels first, followed by ever coarser but channel-richer layers. This paper challenges this decades-old heuristic and uses neural architecture search to find an alternative, called SpineNet, that employs multiple rounds of re-scaling and long-range skip connections.

OUTLINE:
0:00 - Intro & Overview
1:00 - Problem Statement
2:30 - The Problem with Current Architectures
8:20 - Scale-Permuted Networks
11:40 - Neural Architecture Search
14:00 - Up- and Downsampling
19:10 - From ResNet to SpineNet
24:20 - Ablations
27:00 - My Idea: Attention Routing for CNNs
29:55 - More Experiments
34:45 - Conclusion & Comments

Papers: https://arxiv.org/abs/1912.05027
Code: https://github.com/tensorflow/tpu/tree/master/models/official/detection

Abstract:
Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). The encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone. We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search. Using similar building blocks, SpineNet models outperform ResNet-FPN models by ~3% AP at various scales while using 10-20% fewer FLOPs. In particular, SpineNet-190 achieves 52.5% AP with a MaskR-CNN detector and achieves 52.1% AP with a RetinaNet detector on COCO for a single model without test-time augmentation, significantly outperforms prior art of detectors. SpineNet can transfer to classification tasks, achieving 5% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset. Code is at: this https URL.

Authors: Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Golnaz Ghiasi, Mingxing Tan, Yin Cui, Quoc V. Le, Xiaodan Song

Thumbnail art by Lucas Ferreira

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Supermasks are binary masks of a randomly initialized neural network that result in the masked network performing well on a particular task. This paper considers the problem of (sequential) Lifelong Learning and trains one Supermask per Task, while keeping the randomly initialized base network constant. By minimizing the output entropy, the system can automatically derive the Task ID of a data point at inference time and distinguish up to 2500 tasks automatically.
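
The task-inference step when no task ID is given, sketched in PyTorch; net(x, mask=...) stands in for applying a supermask to the fixed random weights, so treat this as an illustration of the idea rather than the official SupSup code.

import torch

def infer_task(net, masks, x):
    # Mix all stored supermasks with coefficients alpha, then take the
    # gradient of the output entropy w.r.t. alpha; the mask whose coefficient
    # the gradient most wants to increase is the inferred task.
    alpha = torch.full((len(masks),), 1.0 / len(masks), requires_grad=True)
    mixed = sum(a * m for a, m in zip(alpha, masks))      # superposition of masks
    p = torch.softmax(net(x, mask=mixed), dim=-1)
    entropy = -(p * p.clamp_min(1e-9).log()).sum()
    (g,) = torch.autograd.grad(entropy, alpha)
    return int(torch.argmin(g))   # most negative entropy gradient -> task ID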

OUTLINE:
0:00 - Intro & Overview
1:20 - Catastrophic Forgetting
5:20 - Supermasks
9:35 - Lifelong Learning using Supermasks
11:15 - Inference Time Task Discrimination by Entropy
15:05 - Mask Superpositions
24:20 - Proof-of-Concept, Task Given at Inference
30:15 - Binary Maximum Entropy Search
32:00 - Task Not Given at Inference
37:15 - Task Not Given at Training
41:35 - Ablations
45:05 - Superfluous Neurons
51:10 - Task Selection by Detecting Outliers
57:40 - Encoding Masks in Hopfield Networks
59:40 - Conclusion

Paper: https://arxiv.org/abs/2006.14769
Code: https://github.com/RAIVNLab/supsup

My Video about Lottery Tickets: https://youtu.be/ZVVnvZdUMUk
My Video about Supermasks: https://youtu.be/jhCInVFE2sc

Abstract:
We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting. Our approach uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask) that achieves good performance. If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. In practice we find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. We also showcase two promising extensions. First, SupSup models can be trained entirely without task identity information, as they may detect when they are uncertain about new data and allocate an additional supermask for the new training distribution. Finally the entire, growing set of supermasks can be stored in a constant-sized reservoir by implicitly storing them as attractors in a fixed-sized Hopfield network.

Authors: Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, Ali Farhadi

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher

I share my progress of implementing a research idea from scratch. I attempt to build an ensemble model out of students of label-free self-distillation without any additional data or augmentation. Turns out, it actually works, and interestingly, the more students I employ, the better the accuracy. This leads to the hypothesis that the ensemble effect is not a process of extracting more information from labels.

OUTLINE:
0:00 - Introduction
2:10 - Research Idea
4:15 - Adjusting the Codebase
25:00 - Teacher and Student Models
52:30 - Shipping to the Server
1:03:40 - Results
1:14:50 - Conclusion

Code: https://github.com/yk/PyTorch_CIFAR10

References:
My Video on SimCLRv2: https://youtu.be/2lkUNDZld-4
Born-Again Neural Networks: https://arxiv.org/abs/1805.04770
Deep Ensembles: A Loss Landscape Perspective: https://arxiv.org/abs/1912.02757

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher

In this part, we look at the ARC challenge as a proposed test of machine intelligence. The dataset features 1000 tasks that test rapid generalization based on human core knowledge priors, such as object-ness, symmetry, and navigation.

OUTLINE:
0:00 - Intro
0:55 - What is ARC?
6:30 - The Goals of ARC
10:40 - Assumed Priors & Examples
21:50 - An Imagined Solution
28:15 - Consequences of a Solution
31:00 - Weaknesses
31:25 - My Comments & Ideas

Paper: https://arxiv.org/abs/1911.01547
ARC: https://github.com/fchollet/ARC

Abstract:
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

Authors: François Chollet

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

#ai #attention #transformer #deeplearning

Transformers are famous for two things: Their superior performance and their insane requirements of compute and memory. This paper reformulates the attention mechanism in terms of kernel functions and obtains a linear formulation, which reduces these requirements. Surprisingly, this formulation also surfaces an interesting connection between autoregressive transformers and RNNs.
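
The reformulation in a nutshell: with a kernel feature map phi, attention becomes phi(Q) (phi(K)^T V), which is linear in the sequence length. A non-causal PyTorch sketch using the paper's elu+1 feature map:

import torch
import torch.nn.functional as F

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K: (B, N, d); V: (B, N, e). Associativity of matrix products lets us
    # sum over the sequence once (K^T V) instead of forming the N x N matrix.
    phi = lambda x: F.elu(x) + 1
    Qp, Kp = phi(Q), phi(K)
    KV = torch.einsum('bnd,bne->bde', Kp, V)                      # (B, d, e)
    Z = 1.0 / (torch.einsum('bnd,bd->bn', Qp, Kp.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', Qp, KV, Z)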

OUTLINE:
0:00 - Intro & Overview
1:35 - Softmax Attention & Transformers
8:40 - Quadratic Complexity of Softmax Attention
9:40 - Generalized Attention Mechanism
13:45 - Kernels
20:40 - Linear Attention
25:20 - Experiments
28:30 - Intuition on Linear Attention
33:55 - Connecting Autoregressive Transformers and RNNs
41:30 - Caveats with the RNN connection
46:00 - More Results & Conclusion

Paper: https://arxiv.org/abs/2006.16236
Website: https://linear-transformers.com/
Code: https://github.com/idiap/fast-transformers

My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from O(N^2) to O(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.

Authors: Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Proteins are the workhorses of almost all cellular functions and a core component of life. But despite their versatility, all proteins are built as sequences of the same 20 amino acids. These sequences can be analyzed with tools from NLP. This paper investigates the attention mechanism of a BERT model that has been trained on protein sequence data and discovers that the language model has implicitly learned non-trivial higher-order biological properties of proteins.

OUTLINE:
0:00 - Intro & Overview
1:40 - From DNA to Proteins
5:20 - BERT for Amino Acid Sequences
8:50 - The Structure of Proteins
12:40 - Investigating Biological Properties by Inspecting BERT
17:45 - Amino Acid Substitution
24:55 - Contact Maps
30:15 - Binding Sites
33:45 - Linear Probes
35:25 - Conclusion & Comments

Paper: https://arxiv.org/abs/2006.15222
Code: https://github.com/salesforce/provis

My Video on BERT: https://youtu.be/-9evrZnBorM
My Video on Attention: https://youtu.be/iDulhoQ2pro

Abstract:
Transformer architectures have proven to learn useful representations for protein classification and generation tasks. However, these representations present challenges in interpretability. Through the lens of attention, we analyze the inner workings of the Transformer and explore how the model discerns structural and functional properties of proteins. We show that attention (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We also present a three-dimensional visualization of the interaction between attention and protein structure. Our findings align with known biological processes and provide a tool to aid discovery in protein engineering and synthetic biology. The code for visualization and analysis is available at this https URL.

Authors: Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Google builds a 600 billion parameter transformer to do massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing depth of the transformer, but from increasing width in the feedforward layers, combined with a hard routing to parallelize computations on up to 2048 TPUs. A very detailed engineering paper!
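
The core of the Mixture-of-Experts layer is a learned router that sends each token to only a few expert feed-forward networks. A simplified PyTorch sketch follows; real GShard additionally uses expert capacity limits, an auxiliary load-balancing loss, and random dispatch of the second expert, none of which are shown here.

import torch

def moe_layer(x, gate_w, experts, k=2):
    # x: (tokens, d); gate_w: (d, n_experts); experts: list of FFN modules.
    weights, idx = torch.topk(torch.softmax(x @ gate_w, dim=-1), k, dim=-1)
    out = torch.zeros_like(x)
    for j, expert in enumerate(experts):
        for slot in range(k):
            sel = idx[:, slot] == j           # tokens routed to expert j
            if sel.any():
                out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
    return out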

OUTLINE:
0:00 - Intro & Overview
4:10 - Main Results
5:10 - Mixture-of-Experts
16:00 - Difference to Scaling Classic Transformers
18:50 - Backpropagation in Mixture-of-Experts
20:05 - MoE Routing Algorithm in GShard
38:20 - GShard Einsum Examples
47:40 - Massively Multilingual Translation
56:00 - Results
1:11:30 - Conclusion & Comments

ERRATA:
I said the computation of MoE scales linearly, but actually, it's sub(!)-linear.

Paper: https://arxiv.org/abs/2006.16668

Abstract:
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Authors:
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Visual scenes are often composed of sets of independent objects. Yet, current vision models make no assumptions about the nature of the pictures they look at. By imposing an objectness prior, this paper proposes a module that is able to recognize permutation-invariant sets of objects from pixels in both supervised and unsupervised settings. It does so by introducing a slot attention module that combines an attention mechanism with dynamic routing.
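
The heart of the module is an iterative attention step that is normalized over the slots; a stripped-down sketch in PyTorch (the learned projections, GRU update, and LayerNorms of the real module are omitted):

import torch

def slot_attention(inputs, slots, iters=3, eps=1e-8):
    # inputs: (n_inputs, d) perceptual features; slots: (n_slots, d).
    d = slots.shape[-1]
    for _ in range(iters):
        # Softmax over SLOTS, so slots compete for each input feature.
        attn = torch.softmax(inputs @ slots.T / d ** 0.5, dim=-1)
        attn = attn / (attn.sum(dim=0, keepdim=True) + eps)   # weighted mean per slot
        slots = attn.T @ inputs
    return slots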

OUTLINE:
0:00 - Intro & Overview
1:40 - Problem Formulation
4:30 - Slot Attention Architecture
13:30 - Slot Attention Algorithm
21:30 - Iterative Routing Visualization
29:15 - Experiments
36:20 - Inference Time Flexibility
38:35 - Broader Impact Statement
42:05 - Conclusion & Comments

Paper: https://arxiv.org/abs/2006.15055

My Video on Facebook's DETR: https://youtu.be/T35ba_VXkMY
My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on Capsules: https://youtu.be/nXGHJTtFYRU

Abstract:
Learning object-centric representations of complex scenes is a promising step towards enabling efficient abstract reasoning from low-level perceptual features. Yet, most deep learning approaches learn distributed representations that do not capture the compositional properties of natural scenes. In this paper, we present the Slot Attention module, an architectural component that interfaces with perceptual representations such as the output of a convolutional neural network and produces a set of task-dependent abstract representations which we call slots. These slots are exchangeable and can bind to any object in the input by specializing through a competitive procedure over multiple rounds of attention. We empirically demonstrate that Slot Attention can extract object-centric representations that enable generalization to unseen compositions when trained on unsupervised object discovery and supervised property prediction tasks.

Authors: Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, Thomas Kipf

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

We've become very good at making generative models for images and classes of images, but not yet for sets of images, especially when the number of sets is unknown and new sets appear that have never been encountered during training. This paper builds a probabilistic framework and a practical implementation of a generative model for sets of images based on variational methods.

OUTLINE:
0:00 - Intro & Overview
1:25 - Problem Statement
8:05 - Architecture Overview
20:05 - Probabilistic Model
33:50 - Likelihood Function
40:30 - Model Architectures
44:20 - Loss Function & Optimization
47:30 - Results
58:45 - Conclusion

Paper: https://arxiv.org/abs/2006.10705

Abstract:
Images with shared characteristics naturally form sets. For example, in a face verification benchmark, images of the same identity form sets. For generative models, the standard way of dealing with sets is to represent each as a one hot vector, and learn a conditional generative model p(x|y). This representation assumes that the number of sets is limited and known, such that the distribution over sets reduces to a simple multinomial distribution. In contrast, we study a more generic problem where the number of sets is large and unknown. We introduce Set Distribution Networks (SDNs), a novel framework that learns to autoencode and freely generate sets. We achieve this by jointly learning a set encoder, set discriminator, set generator, and set prior. We show that SDNs are able to reconstruct image sets that preserve salient attributes of the inputs in our benchmark datasets, and are also able to generate novel objects/identities. We examine the sets generated by SDN with a pre-trained 3D reconstruction network and a face verification network, respectively, as a novel way to evaluate the quality of generated sets of images.

Authors: Shuangfei Zhai, Walter Talbott, Miguel Angel Bautista, Carlos Guestrin, Josh M. Susskind

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Backpropagation is one of the central components of modern deep learning. However, it's not biologically plausible, which limits the usefulness of deep learning for understanding how the human brain works. Direct Feedback Alignment is a biologically plausible alternative, and this paper shows that, contrary to previous research, it can be successfully applied to modern deep architectures and solve challenging tasks.
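
The mechanism, sketched in NumPy for a ReLU network: the output error is projected to every hidden layer through a fixed random matrix instead of being propagated backwards through the transposed weights. Names are mine, not the paper's.

import numpy as np

def dfa_deltas(error, hidden_states, feedback_mats):
    # error: (batch, n_out); hidden_states[l]: (batch, n_l);
    # feedback_mats[l]: fixed random (n_out, n_l) matrix for layer l.
    deltas = []
    for h, B in zip(hidden_states, feedback_mats):
        delta = (error @ B) * (h > 0)   # random projection * ReLU derivative
        deltas.append(delta)            # used like backprop's delta for layer l
    return deltas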

OUTLINE:
0:00 - Intro & Overview
1:40 - The Problem with Backpropagation
10:25 - Direct Feedback Alignment
21:00 - My Intuition why DFA works
31:20 - Experiments

Paper: https://arxiv.org/abs/2006.12878
Code: https://github.com/lightonai/dfa-scales-to-modern-deep-learning
Referenced Paper by Arild Nøkland: https://arxiv.org/abs/1609.01596

Abstract:
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective, and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing. In contrast with previous studies limited to computer vision tasks, our findings show that it successfully trains a large range of state-of-the-art deep learning architectures, with performance close to fine-tuned backpropagation. At variance with common beliefs, our work supports that challenging tasks can be tackled in the absence of weight transport.

Authors: Julien Launay, Iacopo Poli, François Boniface, Florent Krzakala

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
