Yannic Kilcher

Ever wanted to do a convolution on a Klein bottle? This paper defines CNNs over manifolds such that they are independent of which coordinate frame you choose. Amazingly, this then yields an efficient, practical method that achieves state-of-the-art results on several tasks!


The principle of equivariance to symmetry transformations enables a theoretically grounded approach to neural network architecture design. Equivariant networks have shown excellent performance and data efficiency on vision and medical imaging problems that exhibit symmetries. Here we show how this principle can be extended beyond global symmetries to local gauge transformations. This enables the development of a very general class of convolutional neural networks on manifolds that depend only on the intrinsic geometry, and which includes many popular methods from equivariant and geometric deep learning. We implement gauge equivariant CNNs for signals defined on the surface of the icosahedron, which provides a reasonable approximation of the sphere. By choosing to work with this very regular manifold, we are able to implement the gauge equivariant convolution using a single conv2d call, making it a highly scalable and practical alternative to Spherical CNNs. Using this method, we demonstrate substantial improvements over previous methods on the task of segmenting omnidirectional images and global climate patterns.

Authors: Taco S. Cohen, Maurice Weiler, Berkay Kicanaoglu, Max Welling
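
To make the "single conv2d call" concrete, here is a minimal planar sketch of the kernel-expansion idea (my own simplification under assumptions, not the authors' code): a 7-tap hexagonal filter is copied into its 6 rotations and stacked along the output channels, so one standard conv2d evaluates every orientation at once. The icosahedral atlas, the G-padding between charts, and the orientation-channel permutation needed for non-scalar features are all omitted here.

```python
import torch
import torch.nn.functional as F

# Hexagonal neighborhood embedded in a 3x3 square kernel (axial coordinates):
# the 6 outer taps in cyclic order; the two unused corners of the 3x3 grid
# stay zero, and the center tap is handled separately.
HEX_OFFSETS = [(1, 0), (0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1)]

def expand_hex_kernel(w_center, w_ring):
    """Stack the 6 rotated copies of a hexagonal filter into one conv kernel.

    w_center: (C_out, C_in) weight of the central tap (invariant to rotation)
    w_ring:   (C_out, C_in, 6) weights of the 6 outer taps
    returns:  (6*C_out, C_in, 3, 3) kernel usable in a single conv2d call
    """
    c_out, c_in = w_center.shape
    kernels = []
    for r in range(6):  # a 60-degree rotation cyclically shifts the ring taps
        k = torch.zeros(c_out, c_in, 3, 3)
        k[:, :, 1, 1] = w_center
        for i, (dy, dx) in enumerate(HEX_OFFSETS):
            k[:, :, 1 + dy, 1 + dx] = w_ring[:, :, (i - r) % 6]
        kernels.append(k)
    return torch.cat(kernels, dim=0)

x = torch.randn(1, 8, 32, 32)          # scalar features on one (unfolded) chart
kernel = expand_hex_kernel(torch.randn(16, 8), torch.randn(16, 8, 6))
y = F.conv2d(x, kernel, padding=1)     # one conv2d evaluates all 6 gauges
print(y.shape)                         # (1, 96, 32, 32): 6 orientations x 16 channels
```

Because the rotated copies share the same underlying weights, the layer has the parameter count of a single filter but produces features for all six local frames, which is what makes the approach scale like an ordinary planar CNN.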

Current CNNs have to downsample large images before processing them, which discards a lot of fine detail. This paper proposes attention sampling, which learns to selectively process parts of any large image in full resolution while discarding uninteresting bits. This leads to enormous gains in speed and memory consumption.


Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution view of the input. We refer to our method as attention sampling and it can process images of several megapixels with a standard single GPU setup. We show that sampling from the attention distribution results in an unbiased estimator of the full model with minimal variance, and we derive an unbiased estimator of the gradient that we use to train our model end-to-end with a normal SGD procedure. This new method is evaluated on three classification tasks, where we show that it allows us to reduce computation and the memory footprint by an order of magnitude for the same accuracy as classical architectures. We also show that the sampling is consistent and indeed focuses on informative parts of the input images.

Authors: Angelos Katharopoulos, François Fleuret
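
As a rough illustration of the sampling scheme, here is a toy PyTorch sketch (made-up network sizes and hyperparameters, not the authors' implementation): an attention map computed on a low-resolution view defines a distribution over locations, a handful of full-resolution patches are sampled from it, and their features are averaged as a Monte Carlo estimate of attending to every location. The paper's unbiased gradient estimator for the attention network is omitted for brevity; sampling via multinomial alone is not differentiable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSampling(nn.Module):
    def __init__(self, n_patches=10, patch=32, scale=8):
        super().__init__()
        self.n_patches, self.patch, self.scale = n_patches, patch, scale
        self.attention = nn.Conv2d(3, 1, 3, padding=1)      # toy attention network
        self.features = nn.Sequential(                      # toy patch feature network
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

    def forward(self, x_full):
        # Attention distribution over locations, computed on a cheap low-res view.
        x_low = F.interpolate(x_full, scale_factor=1 / self.scale)
        a = self.attention(x_low).flatten(1).softmax(dim=-1)          # (B, H'*W')
        w = x_low.shape[-1]
        idx = torch.multinomial(a, self.n_patches, replacement=True)  # sample locations
        outs = []
        for b in range(x_full.size(0)):
            feats = []
            for i in idx[b].tolist():
                # Map each sampled low-res location to a full-resolution patch.
                cy, cx = i // w * self.scale, i % w * self.scale
                p = x_full[b:b + 1, :, cy:cy + self.patch, cx:cx + self.patch]
                p = F.pad(p, (0, self.patch - p.shape[-1], 0, self.patch - p.shape[-2]))
                feats.append(self.features(p))
            # Monte Carlo average over sampled patches: an unbiased estimate of
            # the attention-weighted average over all locations.
            outs.append(torch.cat(feats).mean(0))
        return torch.stack(outs)

logits = AttentionSampling()(torch.randn(2, 3, 512, 512))
print(logits.shape)  # (2, 10) -- only 10 small patches of each 512x512 image were processed
```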

Standard neural networks suffer from problems such as un-smooth classification boundaries and overconfidence. Manifold Mixup is an easy regularization technique that rectifies these problems. It works by interpolating the hidden representations of different data points and then training the network to predict correspondingly interpolated labels.


Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We provide theory for why this flattening happens under ideal conditions, validate it in practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.

Authors: Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, Yoshua Bengio
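
Since the method really does fit in a few lines, here is a hedged sketch of what those lines might look like for a toy MLP (alpha=2.0 and the network sizes are my assumptions; the authors' reference implementation differs in details): pick a random layer each batch, interpolate hidden states of shuffled pairs, and interpolate the loss targets with the same coefficient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixupMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(784, 256), nn.Linear(256, 256), nn.Linear(256, 10)])

    def forward(self, x, mix_layer=None, lam=1.0, perm=None):
        h = x
        for k, layer in enumerate(self.layers):
            if k == mix_layer:  # interpolate hidden representations of pairs
                h = lam * h + (1 - lam) * h[perm]
            h = layer(h)
            if k < len(self.layers) - 1:
                h = F.relu(h)
        return h

def manifold_mixup_loss(model, x, y, alpha=2.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    k = torch.randint(len(model.layers), (1,)).item()  # random eligible layer (incl. input)
    logits = model(x, mix_layer=k, lam=lam, perm=perm)
    # Interpolating one-hot labels is equivalent to interpolating the losses.
    return lam * F.cross_entropy(logits, y) + (1 - lam) * F.cross_entropy(logits, y[perm])

x, y = torch.randn(64, 784), torch.randint(10, (64,))
loss = manifold_mixup_loss(MixupMLP(), x, y)
loss.backward()
```

Choosing k = 0 recovers ordinary input-space mixup, so the method strictly generalizes it.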

The goal of hierarchical reinforcement learning is to divide a task into different levels of coarseness with the top-level agent planning only over a high-level view of the world and each subsequent layer having a more detailed view. This paper proposes to learn a set of important states as well as their connections to each other as a high-level abstraction.


In many real-world scenarios, an autonomous agent often encounters various tasks within a single complex environment. We propose to build a graph abstraction over the environment structure to accelerate the learning of these tasks. Here, nodes are important points of interest (pivotal states) and edges represent feasible traversals between them. Our approach has two stages. First, we jointly train a latent pivotal state model and a curiosity-driven goal-conditioned policy in a task-agnostic manner. Second, provided with the information from the world graph, a high-level Manager quickly finds solutions to new tasks and expresses subgoals in reference to pivotal states to a low-level Worker. The Worker can then also leverage the graph to easily traverse to the pivotal states of interest, even across long distances, and explore non-locally. We perform a thorough ablation study to evaluate our approach on a suite of challenging maze tasks, demonstrating significant advantages in performance and efficiency over baselines that lack world graph knowledge.

Authors: Wenling Shang, Alex Trott, Stephan Zheng, Caiming Xiong, Richard Socher
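
Setting the learned components aside, the graph abstraction itself is a simple object. A toy sketch (illustrative data structures only; in the paper both the pivotal states and the traversals are learned, not hand-coded): pivotal states are nodes, feasible traversals are edges, and the Manager's faraway subgoal becomes a short path of pivotal states for the Worker to follow.

```python
from collections import deque

class WorldGraph:
    def __init__(self):
        self.edges = {}  # pivotal state -> set of directly reachable pivotal states

    def add_traversal(self, a, b):
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def plan(self, start, goal):
        """BFS over pivotal states: the Worker only ever needs to travel
        between adjacent pivotal states, never across the whole maze."""
        frontier, parent = deque([start]), {start: None}
        while frontier:
            s = frontier.popleft()
            if s == goal:
                path = []
                while s is not None:
                    path.append(s)
                    s = parent[s]
                return path[::-1]
            for n in self.edges.get(s, ()):
                if n not in parent:
                    parent[n] = s
                    frontier.append(n)
        return None  # goal not reachable through the graph

g = WorldGraph()
for a, b in [("door", "hall"), ("hall", "fork"), ("fork", "goal_room")]:
    g.add_traversal(a, b)
print(g.plan("door", "goal_room"))  # ['door', 'hall', 'fork', 'goal_room']
```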

It turns out that the classic view of generalization and overfitting is incomplete! If you add parameters beyond the number of points in your dataset, generalization performance might increase again due to the increased smoothness of overparameterized functions.

The question of generalization in machine learning---how algorithms are able to learn predictors from a training sample to make accurate predictions out-of-sample---is revisited in light of the recent breakthroughs in modern machine learning technology.
The classical approach to understanding generalization is based on bias-variance trade-offs, where model complexity is carefully calibrated so that the fit on the training sample reflects performance out-of-sample.
However, it is now common practice to fit highly complex models like deep neural networks to data with (nearly) zero training error, and yet these interpolating predictors are observed to have good out-of-sample accuracy even for noisy data.
How can the classical understanding of generalization be reconciled with these observations from modern machine learning practice?
In this paper, we bridge the two regimes by exhibiting a new "double descent" risk curve that extends the traditional U-shaped bias-variance curve beyond the point of interpolation.
Specifically, the curve shows that as soon as the model complexity is high enough to achieve interpolation on the training sample---a point that we call the "interpolation threshold"---the risk of suitably chosen interpolating predictors from these models can, in fact, be decreasing as the model complexity increases, often below the risk achieved using non-interpolating models.
The double descent risk curve is demonstrated for a broad range of models, including neural networks and random forests, and a mechanism for producing this behavior is posited.

Authors: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal
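
The double descent curve is easy to reproduce qualitatively. The following small numpy experiment is an illustrative setup of my own choosing (the paper studies related but not identical models): sweep the number of random Fourier features past the interpolation threshold (features ≈ n_train) and evaluate the minimum-norm solution; the test error typically peaks near the threshold and then falls again.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, noise = 40, 0.3
f = lambda x: np.sin(4 * x)                     # ground-truth function
x_tr = rng.uniform(-1, 1, n_train)
x_te = rng.uniform(-1, 1, 500)
y_tr = f(x_tr) + noise * rng.standard_normal(n_train)

def rff(x, n_feat, seed=1):
    """Random Fourier features: cos(w * x + b) with random frequencies and phases."""
    r = np.random.default_rng(seed)
    w = r.normal(0, 4, n_feat)
    b = r.uniform(0, 2 * np.pi, n_feat)
    return np.cos(np.outer(x, w) + b)

for n_feat in [5, 10, 20, 40, 80, 400, 2000]:
    Phi_tr, Phi_te = rff(x_tr, n_feat), rff(x_te, n_feat)
    # lstsq returns the minimum-norm solution when n_feat > n_train, i.e. the
    # "suitably chosen" interpolating predictor the abstract refers to.
    theta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    print(f"features={n_feat:5d}  test MSE={np.mean((Phi_te @ theta - f(x_te))**2):.3f}")
```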


Being interviewed by Connor Shorten of Henry AI Labs (https://www.youtube.com/channel/UCHB9VepY6kYvZjj0Bgxnpbw) on the topic of population-based methods and open-ended learning.

Tutorial: https://www.facebook.com/icml.imls/videos/481758745967365/
Book: https://www.amazon.com/dp/B00X57B4JG/

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.

Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
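
The core mechanical difference from BERT is the sampled factorization order. Below is a hedged sketch of the resulting attention mask (a simplification of my own; the real model additionally needs two-stream attention and Transformer-XL machinery): each position may attend only to positions that come earlier in a randomly sampled order, so bidirectional context is learned without corrupting the input with masks.

```python
import torch

def permutation_mask(seq_len):
    """Boolean mask M where M[i, j] = True means position i may attend to j."""
    order = torch.randperm(seq_len)        # random factorization order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)    # rank[i] = position of token i in the order
    # i may attend to j iff j comes strictly earlier in the sampled order.
    return rank.unsqueeze(1) > rank.unsqueeze(0)

mask = permutation_mask(6)
print(mask.int())
# Averaged over many sampled orders, every position eventually sees every other
# position as context, unlike a fixed left-to-right autoregressive mask.
```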


A short rant on sponsor companies at ICML and how to talk to them.

Short intro to the International Conference on Machine Learning in Long Beach, CA.
I'll be making some updates from the conference.

