Yannic Kilcher

Videos
About

Apr 25, 2024

Meta's Llama 3 is out. New model, new license, new opportunities.

Apr 25, 2024

73 18:00

Apr 18, 2024

Hugging Face got hacked

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
LinkedIn: https://www.linkedin.com/in/ykilcher

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Apr 18, 2024

65 9:54

Apr 16, 2024

[ML News] Microsoft to spend 100 BILLION DOLLARS on supercomputer (& more industry news)

Some updates from industry in the Machine Learning world

If you want to support me, the best thing to do is to share out the content :)

Apr 16, 2024

74 27:31

Apr 14, 2024

[ML News] Jamba, CMD-R+, and other new models (yes, I know this is like a week behind 🙃)

A flurry of new models continues to appear.

If you want to support me, the best thing to do is to share out the content :)

Apr 14, 2024

79 56:15

Apr 09, 2024

Flow Matching for Generative Modeling (Paper Explained)

Flow matching is a more general method than diffusion and serves as the basis for models like Stable Diffusion 3.

Paper: https://arxiv.org/abs/2210.02747

Abstract:
We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow Matching (FM), a simulation-free approach for training CNFs based on regressing vector fields of fixed conditional probability paths. Flow Matching is compatible with a general family of Gaussian probability paths for transforming between noise and data samples -- which subsumes existing diffusion paths as specific instances. Interestingly, we find that employing FM with diffusion paths results in a more robust and stable alternative for training diffusion models. Furthermore, Flow Matching opens the door to training CNFs with other, non-diffusion probability paths. An instance of particular interest is using Optimal Transport (OT) displacement interpolation to define the conditional probability paths. These paths are more efficient than diffusion paths, provide faster training and sampling, and result in better generalization. Training CNFs using Flow Matching on ImageNet leads to consistently better performance than alternative diffusion-based methods in terms of both likelihood and sample quality, and allows fast and reliable sample generation using off-the-shelf numerical ODE solvers.

Authors: Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
Subscrib..

Apr 09, 2024

83 44:04

Apr 08, 2024

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping (Searchformer)

Paper: https://arxiv.org/abs/2402.14083

Abstract:
While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A∗ search. Searchformer is an encoder-decoder Transformer model trained to predict the search dynamics of A∗. This model is then fine-tuned via expert iterations to perform fewer search steps than A∗ search while still generating an optimal plan. In our training method, A∗'s search dynamics are expressed as a token sequence outlining when task states are added and removed into the search tree during symbolic planning. In our ablation studies on maze navigation, we find that Searchformer significantly outperforms baselines that predict the optimal plan directly with a 5-10× smaller model size and a 10× smaller training dataset. We also demonstrate how Searchformer scales to larger and more complex decision making tasks like Sokoban with improved percentage of solved tasks and shortened search dynamics.

Authors: Lucas Lehnert, Sainbayar Sukhbaatar, Paul Mcvay, Michael Rabbat, Yuandong Tian

If you want to support me, the best thing to do is to share out the content :)

Apr 08, 2024

130 26:59

Mar 27, 2024

[ML News] Grok-1 open-sourced | Nvidia GTC | OpenAI leaks model names | AI Act

OUTLINE:
0:00 - Intro
0:15 - XAI releases Grok-1
2:00 - Nvidia GTC
4:45 - Comment of the Week
5:35 - Brute-forcing OpenAI model names
7:30 - Inflection AI gets eaten by Microsoft
9:25 - EU AI Act moving forward
11:45 - Advances in Robotics
14:00 - India retracts controversial advisory
14:30 - OpenSora
15:20 - Improved Gemma fine-tuning
16:20 - Decoding encrypted LLM traffic
17:45 - Varia

References:
https://x.ai/blog/grok-os
https://github.com/xai-org/grok-1
https://finance.yahoo.com/news/nvidia-debuts-next-generation-blackwell-ai-chip-at-gtc-2024-205825161.html?guccounter=1&guce_referrer=aHR0cHM6Ly9uZXdzLmdvb2dsZS5jb20v&guce_referrer_sig=AQAAAHYRVePPrDnH3HxPV8smDzUiia_ztWttteAmHKxy-x_Z75lqq2trR4Exwq2sFyjNQojO_95xWvqQFHkV3NI_IKmw9W8XZ7d52qBsdvqaDRkdNzBSzQhnskzUE_E-nDo6OFG0LmrM0ygvjqLgJyhMDnraaGHrUsb98kknjn7-83MJ
https://spectrum.ieee.org/nvidia-gr00t-ros
https://twitter.com/anshelsag/status/1769989302552031473?t=DYAFhri4cu55LMwJV4V99A&s=09
https://twitter.com/ibab_ml/status/1769770983924142475
https://twitter.com/arthurmensch/status/1769842867621581299?t=sYPy011kN9KxzdnA11M4yQ&s=09
https://twitter.com/arithmoquine/status/1770136393563378082?t=FgH3-TABR73QVUQuP5wq2g&s=09
https://files.catbox.moe/od9pyb.txt
https://techcrunch.com/2024/03/19/after-raising-1-3b-inflection-got-eaten-alive-by-its-biggest-investor-microsoft/
https://archive.ph/p4W1N#selection-2463.23-2463.114
https://www.instagram.com/reel/C4df3DZg1wj/?igsh=MWQ1ZGUxMzBkMA%3D%3D
https://techcrunch.com/2024/03/15/mercedes-begins-piloting-apptronik-humanoid-robots/
https://www.axios.com/2024/03/14/humanoid-robot-army-agility-digit-amazon-warehouse
https://techcrunch.com/2024/03/15/india-drops-plan-to-require-approval-for-ai-model-launches/
https://github.com/hpcaitech/Open-Sora
https://www.reddit.com/r/LocalLLaMA/comments/1bd18y8/gemma_finetuning_should_be_much_better_now/
https://twitter.com/felix_red_panda/status/1769363356094230837?t=JMMb3OldqfhhCH8X5e7ljA&s=09
https://twitter.co..

Mar 27, 2024

133 26:49

Mar 18, 2024

[ML News] Devin AI Software Engineer | GPT-4.5-Turbo LEAKED | US Gov't Report: Total Extinction

Your weekly dose of ML News

OUTLINE:
0:00 - Intro
0:15 - Devin: AI software engineer
5:50 - Mira Murati on Sora training data
6:50 - Inflection accused of copying Claude
9:00 - Tools & papers
16:30 - GPT-4.5-turbo mystery
17:30 - US government report: total extinction by AI
19:20 - Various other news

References:
https://www.cognition-labs.com/introducing-devin
https://twitter.com/cognition_labs/status/1767548763134964000?t=ZECIn-uqbguwHtY8X_Gvtw&s=09
https://news.google.com/stories/CAAqNggKIjBDQklTSGpvSmMzUnZjbmt0TXpZd1NoRUtEd2lWMUwyU0N4RnVWM3pSRWhWX01pZ0FQAQ?hl=en-US&gl=US&ceid=US%3Aen
https://www.bloomberg.com/news/articles/2024-03-12/cognition-ai-is-a-peter-thiel-backed-coding-assistant?embedded-checkout=true
https://www.bloomberg.com/authors/AQWHkoPod9g/ashlee-vance
https://www.bloomberg.com/news/articles/2024-03-12/cognition-ai-is-a-peter-thiel-backed-coding-assistant?srnd=undefined&embedded-checkout=true
https://www.bloomberg.com/news/newsletters/2024-03-12/cognition-ai-s-devin-assistant-can-build-websites-videos-from-a-prompt?srnd=undefined&embedded-checkout=true
https://archive.ph/5LZV9
https://github.com/opendevin/opendevin
https://twitter.com/MetaGPT_/status/1767965444579692832?t=dsYKmPfOBVGCFCwvPtZVWQ&s=09
https://docs.deepwisdom.ai/main/en/DataInterpreter/detail.html?id=AppleStockPriceAnalysisAndPrediction
https://docs.deepwisdom.ai/main/en/guide/use_cases/agent/interpreter/intro.html
https://github.com/geekan/MetaGPT/tree/main/examples/di
https://inflection.ai/inflection-2-5
https://twitter.com/seshubon/status/1765870717844050221
https://twitter.com/inflectionAI/status/1766173427441049684
https://www.mlxserver.com/
https://huggingface.co/spaces/mlabonne/AutoMerger
https://github.com/microsoft/aici
https://github.com/google-research/google-research/tree/master/fax
https://github.com/stanfordnlp/pyvene
https://arxiv.org/pdf/2403.06634.pdf
https://twitter.com/mattshumer_/status/1767606938538295757?t=1dYect5ylg9xrWSS4sL38Q&s=p;s=..

Mar 18, 2024

130 53:14

Mar 11, 2024

[ML News] Elon sues OpenAI | Mistral Large | More Gemini Drama

#mlnews #ainews #openai

OUTLINE:
0:00 - Intro
0:20 - Elon sues OpenAI
14:00 - Mistral Large
16:40 - ML Espionage
18:30 - More Gemini Drama
24:00 - Copilot generates spicy images
26:55 - Gemma bugs
28:45 - Varia

References: https://gist.github.com/yk/0c065cdc8e414738abfaae4f8e417e00

Thumbnail pictures: Wikipedia

If you want to support me, the best thing to do is to share out the content :)

Mar 11, 2024

131 0:59

Mar 09, 2024

On Claude 3

Mar 09, 2024

149 15:11

Mar 07, 2024

No, Anthropic's Claude 3 is NOT sentient

No, Anthropic's Claude 3 is not conscious or sentient or self-aware.

References:
https://www.anthropic.com/news/claude-3-family
https://twitter.com/_akhaliq/status/1764673955313459560?t=gkBx2uTXfrxLl-5_mL7Btg&s=09
https://twitter.com/idavidrein/status/1764675668175094169?t=pJfbN3LtKaxsU8egz83Mvg&s=09
https://twitter.com/TolgaBilge_/status/1764754012824314102?t=9bakXDnVMC1oAEyZFoKimA&s=09
https://twitter.com/karinanguyen_/status/1764670019743690757?t=gkBx2uTXfrxLl-5_mL7Btg&s=09
https://twitter.com/alexalbert__/status/1764722513014329620
https://www.lesswrong.com/posts/pc8uP4S9rDoNpwJDZ/claude-3-claims-its-conscious

If you want to support me, the best thing to do is to share out the content :)

Mar 07, 2024

143 42:33

Mar 03, 2024

[ML News] Groq, Gemma, Sora, Gemini, and Air Canada's chatbot troubles

Your dose of ML News!

OUTLINE:
0:00 - Intro
0:20 - Gemma & Gemini
3:40 - Groq
6:30 - Nvidia EOS Supercomputer
7:15 - Gpulist.ai
8:20 - Demis Hassabis on scale
10:10 - Hardware wars
12:05 - Sora
15:10 - Gemini 1.5 Pro & Long Context
18:45 - Air Canada must pay for chatbot mistake
23:30 - Giant Rat Balls
26:25 - Various News

Mar 03, 2024

168 17:35

Feb 26, 2024

Gemini has a Diversity Problem

Google turned the anti-bias dial up to 11 on their new Gemini Pro model.

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a..

Feb 26, 2024

152 50:02

Feb 21, 2024

V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video (Explained)

#vjepa #meta #unsupervisedlearning

V-JEPA is a method for unsupervised representation learning of video data by using only latent representation prediction as objective function.

Weights & Biases course on Structured LLM Outputs: https://wandb.me/course-yannic

OUTLINE:
0:00 - Intro
1:45 - Predictive Feature Principle
8:00 - Weights & Biases course on Structured LLM Outputs
9:45 - The original JEPA architecture
27:30 - V-JEPA Concept
33:15 - V-JEPA Architecture
44:30 - Experimental Results
46:30 - Qualitative Evaluation via Decoding

Blog: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Paper: https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/

Abstract:
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model’s parameters; e.g., using a frozen backbone, our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Authors: Adrien Bardes Quentin Garrido Xinlei Chen Michael Rabbat Yann LeCun Mido Assran Nicolas Ballas Jean Ponce

If you want..

Feb 21, 2024

145 1:23:58

Feb 21, 2024

What a day in AI! (Sora, Gemini 1.5, V-JEPA, and lots of news)

Your regularly irregular dose of Machine Learning News!

W&B Course on LLM Structured Outputs: https://wandb.me/course-yannic

OUTLINE:
0:00 - OpenAI Sora
3:25 - Gemini 1.5 with 1 Million Tokens context window
4:50 - V-JEPA
6:50 - Sam Altman raises 7 TRILLION dollars for AI chips
9:30 - Sponsor: Weights & Biases course on Structure Output from LLMs
11:30 - Bard becomes Gemini
13:55 - GOODY-2: The world's most responsible model
16:05 - miqu-1-70b leaked from Mistral
18:25 - Zuckerberg on Meta's open approach to AI models
21:40 - 1X advances robotics
23:30 - Questions around Bard's arena leaderboard position
27:00 - Various other news

References:
https://gist.github.com/yk/65fe3d582a43540a61718b9e4b0706d0
(they were too long for this description)

If you want to support me, the best thing to do is to share out the content :)

Feb 21, 2024

178 54:23

Feb 05, 2024

Lumiere: A Space-Time Diffusion Model for Video Generation (Paper Explained)

#lumiere #texttovideoai #google

LUMIERE by Google Research tackles globally consistent text-to-video generation by extending the U-Net downsampling concept to the temporal axis of videos.

OUTLINE:
0:00 - Introduction
8:20 - Problems with keyframes
16:55 - Space-Time U-Net (STUNet)
21:20 - Extending U-Nets to video
37:20 - Multidiffusion for SSR prediction fusing
44:00 - Stylized generation by swapping weights
49:15 - Training & Evaluation
53:20 - Societal Impact & Conclusion

Paper: https://arxiv.org/abs/2401.12945
Website: https://lumiere-video.github.io/

Abstract:
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.

Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Disc..

Feb 05, 2024

188 35:26

Jan 22, 2024

AlphaGeometry: Solving olympiad geometry without human demonstrations (Paper Explained)

#deepmind #alphageometry #llm

AlphaGeometry is a combination of a symbolic solver and a large language model by Google DeepMind that tackles IMO geometry questions without any human-generated trainind data.

OUTLINE:
0:00 - Introduction
1:30 - Problem Statement
7:30 - Core Contribution: Synthetic Data Generation
9:30 - Sampling Premises
13:00 - Symbolic Deduction
17:00 - Traceback
19:00 - Auxiliary Construction
25:20 - Experimental Results
32:00 - Problem Representation
34:30 - Final Comments

Paper: https://www.nature.com/articles/s41586-023-06747-5

Abstract:
Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning1,2,3,4, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges1,5, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 20..

Jan 22, 2024

182 34:31

Jan 14, 2024

Mixtral of Experts (Paper Explained)

#mixtral #mistral #chatgpt

OUTLINE:
0:00 - Introduction
3:00 - Mixture of Experts
6:00 - Classic Transformer Blocks
11:15 - Expert Routing
17:00 - Sparse Expert Routing
22:00 - Expert Parallelism
25:00 - Experimental Results
31:30 - Routing Analysis
33:20 - Conclusion

Paper: https://arxiv.org/abs/2401.04088

Abstract:
We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

Links:
Homepage: https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitt..

Jan 14, 2024

194 3:39

Jan 11, 2024

Until the Litter End

https://litter.ykilcher.com

If you want to support me, the best thing to do is to share out the content :)

Jan 11, 2024

182 31:45

Jan 08, 2024

LLaMA Pro: Progressive LLaMA with Block Expansion (Paper Explained)

Note: The H800 is a variant of the H100 for the Chinese market

OUTLINE:
0:00 - Introduction
5:30 - Adding new blocks to LLaMA
15:00 - Block expansion
27:40 - Experiments
30:40 - Conclusion

Paper: https://arxiv.org/abs/2401.02415
Other Paper: https://proceedings.mlr.press/v162/shen22f/shen22f.pdf

Abstract:
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forgetting. In this paper, we experiment on the corpus of code and math, yielding LLaMA Pro-8.3B, a versatile foundation model initialized from LLaMA2-7B, excelling in general tasks, programming, and mathematics. LLaMA Pro and its instruction-following counterpart (LLaMA Pro-Instruct) achieve advanced performance among various benchmarks, demonstrating superiority over existing open models in the LLaMA family and the immense potential of reasoning and addressing diverse tasks as an intelligent agent. Our findings provide valuable insights into integrating natural and programming languages, laying a solid foundation for developing advanced language agents that operate effectively in various environments.

Authors: Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar...

Jan 08, 2024

191 8:16

Jan 04, 2024

I created an AI-powered Social Network

#ai #chatgpt #socialmedia

I created a social network that operates entirely in the latent space.
Litter (aka Latent Twitter) will pull images and text through multiple modality conversions before it hits the network, so you can communicate just the essence of your message.

Website: https://litter.ykilcher.com
Code: https://github.com/yk/litter

OUTLINE:
0:00 - Introduction
1:10 - How does it work?
3:30 - Improving Yann LeCun's post
4:20 - Posting images
5:05 - Image examples
6:40 - Final words

If you want to support me, the best thing to do is to share out the content :)

Jan 04, 2024

209 40:40

Dec 29, 2023

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Paper Explained)

#mamba #s4 #ssm

OUTLINE:
0:00 - Introduction
0:45 - Transformers vs RNNs vs S4
6:10 - What are state space models?
12:30 - Selective State Space Models
17:55 - The Mamba architecture
22:20 - The SSM layer and forward propagation
31:15 - Utilizing GPU memory hierarchy
34:05 - Efficient computation via prefix sums / parallel scans
36:01 - Experimental results and comments
38:00 - A brief look at the code

Paper: https://arxiv.org/abs/2312.00752

Abstract:
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model ou..

Dec 29, 2023

190 33:22

Dec 29, 2023

Another Hit Piece on Open-Source AI

Stanford researchers find problematic content in LAION-5B.

Link: https://purl.stanford.edu/kh752sm9123

If you want to support me, the best thing to do is to share out the content :)

Dec 29, 2023

176 57:51

Dec 29, 2023

NeurIPS 2023 Poster Session 4 (Thursday Morning)

OUTLINE:
0:30 - Activity Grammars for Temporal Action Segmentation
8:50 - Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback
17:05 - On the Role of Noise in the Sample Complexity of Learning Recurrent Neural Networks: Exponential Gaps for Long Sequences
21:20 - Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming
27:10 - Equivariant Adaptation of Large Pretrained Models
33:10 - Multi-Head Adapter Routing for Cross-Task Generalization
39:25 - Geometry-Aware Adaptation for Pretrained Models
46:10 - Adversarial Learning for Feature Shift Detection and Correction

Papers:
Title: Activity Grammars for Temporal Action Segmentation
Link: https://arxiv.org/abs/2312.04266
Author:
Dayoung Gong, Joonseok Lee,
Deunsol Jung, Suha Kwak, Minsu Cho
--------

Title: Diffusion-TTA: Test-time Adaptation of Discriminative Models via Generative Feedback
Link: https://arxiv.org/abs/2311.16102
Author:
Mihir Prabhudesai, Tsung-Wei Ke,
Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki
--------

Title: On the Role of Noise in the Sample Complexity of Learning Recurrent Neural Networks: Exponential Gaps for Long Sequences
Link: https://arxiv.org/abs/2305.18423
Author:
Alireza Fathollah Pour, Hassan Ashtiani
--------

Title: Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming
Link: https://arxiv.org/abs/2310.19068
Author:
Gregory Dexter, Petros Drineas,
David P. Woodruff, Taisuke Yasuda
--------

Title: Equivariant Adaptation of Large Pretrained Models
Link: https://arxiv.org/pdf/2310.01647.pdf
Author:
Arnab Kumar Mondal, Siba Smarak Panigrahi,
Sékou-Oumar Kaba, Sai Rajeswar, Siamak Ravanbakhsh
--------

Title: Multi-Head Adapter Routing for Cross-Task Generalization
Link: https://arxiv.org/abs/2211.03831
Author:
Lucas Caccia, Edoardo Ponti, Zhan Su,
Matheus Pereira, Nicolas Le Roux, Alessandro Sordoni
--------

Title: Geometry-Aware Adaptation for Pretrained Models
Link: https://arxiv.org/abs/230..

Dec 29, 2023