-
Scaling Exponents Across Parameterizations and Optimizers
[arxiv]
[pdf]
K Everett, L Xiao, M Wortsman, AA Alemi, R Novak, PJ Liu, I Gur, J Sohl-Dickstein, LP Kaelbling, J Lee, J Pennington
2024-07
ICML 2024
Understanding parameterizations and how to scale them.
-
Training LLMs over Neurally Compressed Text
[arxiv]
[pdf]
B Lester, J Lee, AA Alemi, J Pennington, A Roberts, J Sohl-Dickstein, N Constant
2024-04
Trying to train transformers on top of transformers with arithmetic compression.
-
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
[arxiv]
[pdf]
PAGI
2023-12
TMLR
Squeezing more performance out of models by fine-tuning on filtered generated responses.
-
Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"
[arxiv]
[pdf]
PAGI
2023-11
It's easy to get models to perform arithmetic incorrectly if you just ask nicely.
-
Small-scale proxies for large-scale Transformer training instabilities
[arxiv]
[pdf]
M Wortsman & PAGI
2023-09
ICLR 2024
Studying the problems of large-scale models in the small.
-
Speed Limits for Deep Learning
[arxiv]
[pdf]
I Seroussi, AA Alemi, M Helias, Z Ringel
2023-07
Working out thermodynamic speed limits for learning.
-
Variational Prediction
[arxiv]
[pdf]
AA Alemi, B Poole
2023-05
AABI 2023
Targeting the predictive distribution directly with a variational method.
-
Weighted Ensemble Self-Supervised Learning
[arxiv]
[pdf]
[openreview]
Y Ruan, S Singh, WR Morningstar, AA Alemi, S Ioffe, I Fischer, JV Dillon
2022-11
ICLR 2023
Ensembling the heads of SSL methods gives nice gains.
-
Trajectory ensembling for fine tuning - performance gains without modifying training
[pdf]
[openreview]
[video]
L Anderson-Conway, V Birodkar, S Singh, H Mobahi, AA Alemi
2022-09
HITY Workshop NeurIPS 2022
Ensembling within a trajectory gives some simple gains.
-
Bayesian Imitation Learning for End-to-End Mobile Manipulation
[arxiv]
[pdf]
Y Du, D Ho, AA Alemi, E Jang, M Khansari
2022-02
ICML 2022
Using VIB to help robots open doors.
-
A Closer Look at the Adversarial Robustness of Information Bottleneck Models
[arxiv]
[pdf]
[openreview]
I Korshunova, D Stutz, AA Alemi, O Wiles, S Gowal
2021-06
ICML 2021 AML Workshop Poster
Looking more carefully, IB models aren't fully robust to adversarial examples.
-
Does Knowledge Distillation Really Work?
[arxiv]
[pdf]
S Stanton, P Izmailov, P Kirichenko, AA Alemi, AG Wilson
2021-06
NeurIPS 2021
Knowledge distillation doesn't seem to work as well as people assume it does.
-
VIB is Half Bayes
[arxiv]
[pdf]
[poster-talk]
[talk]
AA Alemi, WR Morningstar, B Poole, I Fischer, JV Dillon
2020-11
AABI 2021 Oral
VIB can be rederived as a half-Bayesian, half-maximum-likelihood method.
-
PACᵐ-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regime
[arxiv]
[pdf]
[video]
WR Morningstar, AA Alemi, JV Dillon
2020-10
AISTATS 2022
Multisample bound that does better than Bayes at prediction for misspecified models.
-
Density of States Estimation for Out-of-Distribution Detection
[arxiv]
[pdf]
WR Morningstar, C Ham, AG Gallagher, B Lakshminarayanan, AA Alemi, JV Dillon
2020-06
AISTATS 2021 Oral
A simple density-of-states-inspired approach to out-of-distribution detection.
-
The OpenKIM Processing Pipeline: A Cloud-Based Automatic Materials Property Computation Engine
[arxiv]
[pdf]
[openkim.org]
DS Karls, M Bierbaum, AA Alemi, RS Elliot, JP Sethna, EB Tadmor
2020-05
Journal of Chemical Physics
Cloud-based computation engine for the OpenKIM interatomic potential database.
-
Neural Tangents: Fast and Easy Infinite Neural Networks in Python
[arxiv]
[pdf]
[code]
R Novak, L Xiao, J Hron, J Lee, AA Alemi, J Sohl-Dickstein, SS Schoenholz
2019-12
ICLR
An easy-to-use Python package for training infinitely wide neural networks.
-
Variational Predictive Information Bottleneck
[arxiv]
[pdf]
AA Alemi
2019-10
AABI
Most modern inference procedures can be rederived as a simple variational bound on a predictive information bottleneck objective.
-
Information in Infinite Ensembles of Infinitely-Wide Networks
[arxiv]
[pdf]
R Shwartz-Ziv, AA Alemi
2019-10
AABI 2019 - PMLR
While they seem complex, infinite ensembles of infinitely-wide networks are simple enough to enable tractable calculations of many information theoretic quantities.
-
CEB Improves Model Robustness
[arxiv]
[pdf]
I Fischer, AA Alemi
2019-10
Entropy
A class-conditional version of VIB shows good robustness.
-
On Predictive Information in RNNs
[arxiv]
[pdf]
Z Dong, D Oktay, B Poole, AA Alemi
2019-10
Modern RNNs do not optimally capture predictive information in sequences.
-
Thermodynamic Computing
[arxiv]
[pdf]
T Conte, E DeBenedictis, N Ganesh, T Hylton, JP Strachan, RS Williams, AA Alemi, L Altenberg, G Crooks, J Crutchfield, L del Rio, J Deutsch, M DeWeese, K Douglas, M Esposito, M Frank, R Fry, P Harsha, M Hill, C Kello, J Krichmar, S Kumar, SC Liu, S Lloyd, M Marsili, I Nemenman, A Nugent, N Packard, D Randall, P Sadowski, N Santhanam, R Shaw, A Stieg, E Stopnitzky, C Teuscher, C Watkins, D Wolpert, J Yang, Y Yufik
2019-11
CCC
A position paper on the future of thermodynamic computing.
-
On Variational Bounds of Mutual Information
[arxiv]
[pdf]
B Poole, S Ozair, A van den Oord, AA Alemi, G Tucker
2019-05
ICML
Overview of recent advances in variationally bounding mutual information.
-
Dueling Decoders: Regularizing Variational Autoencoder Latent Spaces
[arxiv]
[pdf]
B Seybold, E Fertig, AA Alemi, I Fischer
2019-05
Sometimes a worse decoder gives better representations.
-
Variational Autoencoders with Tensorflow Probability Layers
[post]
I Fischer, AA Alemi, JV Dillon, TFP Team
2019-03
Tensorflow Blog
TFP makes VAEs easy.
-
On the Use of ArXiv as a Dataset
[arxiv]
[pdf]
[code]
CB Clement, M Bierbaum, KP O'Keeffe, AA Alemi
2019-05
ICLR workshop RLGM
More people should use the ArXiv as a dataset.
-
β-VAEs can retain label information even at high compression
[arxiv]
[pdf]
E Fertig, A Arbabi, AA Alemi
2018-12
NeurIPS BDL Workshop
Some rich decoder VAEs can magically focus on salient information.
-
Canonical Sectors and Evolution of Firms in the US Stock Markets
[arxiv]
[pdf]
LX Hayden, R Chachra, AA Alemi, PH Ginsparg, JP Sethna
2018-10
Quantitative Finance
Matrix factorization gives automatic and continuous sector assignments to stocks.
-
WAIC, but Why? Generative Ensembles for Robust Anomaly Detection
[arxiv]
[pdf]
H Choi, E Jang, AA Alemi
2018-10
Even though it shouldn't work, robust likelihoods can detect OOD data in practice.
-
TherML: Thermodynamics of Machine Learning
[arxiv]
[pdf]
[video]
AA Alemi, I Fischer
2018-07
ICML 2018 TFADGM Workshop
Modern variational latent variable modelling looks a lot like Thermodynamics.
-
Uncertainty in the Variational Information Bottleneck
[arxiv]
[pdf]
[slides]
AA Alemi, I Fischer, JV Dillon
2018-07
UAI UDL Workshop
VIB builds robust classifiers which are aware of what they don't know.
-
Watch your step: Learning node embeddings via graph attention
[arxiv]
[pdf]
S Abu-El-Haija, B Perozzi, R Al-Rfou, AA Alemi
2018-12
NeurIPS
Building better graph representations.
-
GILBO: one metric to measure them all
[arxiv]
[pdf]
AA Alemi, I Fischer
2018-12
NeurIPS
A variational lower bound on the mutual information in GANs highlights some of their problems.
-
Fixing a Broken ELBO
[arxiv]
[pdf]
[slides]
AA Alemi, B Poole, I Fischer, JV Dillon, RA Saurous, K Murphy
2018-05
ICML
Adopting a representational view of VAEs can help explain away some of their problems.
-
Tensorflow distributions
[arxiv]
[pdf]
[code]
JV Dillon, I Langmore, D Tran, E Brevdo, S Vasudevan, D Moore, B Patton, AA Alemi, M Hoffman, RA Saurous
2017-11
Paper accompanying the TensorFlow Distributions library.
-
Light microscopy at maximal precision
[arxiv]
[pdf]
M Bierbaum, BD Leahy, AA Alemi, I Cohen, JP Sethna
2017-02
Phys Rev X
More precise localization of colloids.
-
Jeffrey's prior sampling of deep sigmoidal networks
[arxiv]
[pdf]
LX Hayden, AA Alemi, PH Ginsparg, JP Sethna
2017-05
The Jeffreys prior doesn't really work for neural networks.
-
Motion prediction under multimodality with conditional stochastic networks
[arxiv]
[pdf]
K Fragkiadaki, J Huang, AA Alemi, S Vijayanarasimhan, S Ricco, R Sukthankar
2017-05
Pedestrian motion is stochastic, which creates challenges for prediction.
-
Inception-v4, Inception-ResNet and the impact of residual connections on learning
[arxiv]
[pdf]
C Szegedy, S Ioffe, V Vanhoucke, AA Alemi
2017-02
AAAI
Residual connections improve the inception family of classifiers.
-
Deep Variational Information Bottleneck
[arxiv]
[pdf]
AA Alemi, I Fischer, JV Dillon, K Murphy
2017-03
ICLR
A modern formulation of the Information Bottleneck which is friendly towards neural networks.
-
Improved generator objectives for GANs
[arxiv]
[pdf]
B Poole, AA Alemi, J Sohl-Dickstein, A Angelova
2016-12
NeurIPS Adversarial Workshop
You can target separate divergences for the generator and discriminator of a GAN.
-
Tree-Structured Variational Autoencoder
[pdf]
R Shin, AA Alemi, G Irving, O Vinyals
2016-11
Attempting to learn tree-structured representations.
-
Improving Inception and image classification in TensorFlow
[post]
AA Alemi
2016-06
Google Research Blog
Blog post accompanying the open-source release of Inception-ResNet-v2.
-
DeepMath: deep sequence models for premise selection
[arxiv]
[pdf]
G Irving, C Szegedy, AA Alemi, N Eén, F Chollet, J Urban
2016-06
NeurIPS
Using neural networks to improve automatic theorem proving.
-
SPARTA: Fast global planning of collision-avoiding robot trajectories
[pdf]
CJM Mathy, F Gonda, D Schmidt, N Derbinsky, AA Alemi, J Bento, FM Delle Fave, JS Yedidia
2015-12
Using ADMM to do fast trajectory planning.
-
You can run, you can hide: The epidemiology and statistical mechanics of zombies
[arxiv]
[pdf]
AA Alemi, M Bierbaum, CR Myers, JP Sethna
2015-11
Phys Rev E
A fun pedagogical introduction to epidemiology and statistical mechanics.
-
Zombies Reading Segmented Graphene Articles On The Arxiv
[pdf]
AA Alemi
2015-08
Thesis
A collection of four of my graduate student projects.
-
Clustering via Content-Augmented Stochastic Blockmodels
[arxiv]
[pdf]
JM Cashore, X Zhao, AA Alemi, Y Liu, PI Frazier
2015-05
Better clustering through content conditioning.
-
Text segmentation based on semantic word embeddings
[arxiv]
[pdf]
AA Alemi, P Ginsparg
2015-03
Using word2vec vectors to do automatic text segmentation.
-
Mechanical properties of growing melanocytic nevi and the progression to melanoma
[arxiv]
[pdf]
A Taloni, AA Alemi, E Ciusani, JP Sethna, S Zapperi, CAM La Porta
2014-04
PloS One
Elastic models of skin cancer.
-
Ensuring reliability, reproducibility and transferability in atomistic simulations: The knowledgebase of interatomic models
[pdf]
[openkim.org]
E Tadmor, R Elliott, D Karls, A Ludvik, J Sethna, M Bierbaum, AA Alemi, T Wennblom
2014-10
-
Knowledgebase of Interatomic Models application programming interface as a standard for molecular simulations
[pdf]
[openkim.org]
R Elliott, E Tadmor, D Karls, A Ludvik, J Sethna, M Bierbaum, AA Alemi, T Wennblom
2014-10
Building a website to collect interatomic potentials and score them.
-
Imaging atomic rearrangements in two-dimensional silica glass: watching silica's dance
[pdf]
PY Huang, S Kurasch, JS Alden, A Shekhawat, AA Alemi, PL McEuen, JP Sethna, U Kaiser, DA Muller
2013-10
Science
Applying elastic theory to the atomic scale.
-
Growth and form of melanoma cell colonies
[arxiv]
[pdf]
MM Baraldi, AA Alemi, JP Sethna, S Caracciolo, CAM La Porta, S Zapperi
2013-08
JSM
Simple models of skin cancer growth.
-
Near-field radiative heat transfer between macroscopic planar surfaces
[arxiv]
[pdf]
RS Ottens, V Quetschke, S Wise, AA Alemi, R Lundock, G Mueller, DH Reitze, DB Tanner, BF Whiting
2011-03
Phys Rev Lett
Exploration of photon tunnelling as a mechanism for cooling next-generation LIGO detectors.
-
Laplace-Runge-Lenz Vector
[pdf]
AA Alemi
2009-06
Undergraduate project on the history of the Laplace-Runge-Lenz vector.
-
NEMS Coupling
[pdf]
AA Alemi
2008-09
Undergraduate research project on synchronization in nano-cantilevers.
-
Why Venus has no moon
[pdf]
AA Alemi, DJ Stevenson
2006-09
AAS Oral
Undergraduate research investigating whether two collisions in opposite directions could explain Venus's lack of a moon and slow rotation.