<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet type="text/xsl" href="assets/stylesheets/rss.xsl"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"><channel><title>AlexAlemi.com</title><link>https://alexalemi.com/rss.xml</link><description>Follow my publications and talks.</description><atom:link href="https://alexalemi.com/rss.xml" rel="self"/><docs>http://www.rssboard.org/rss-specification</docs><generator>python-feedgen</generator><image><url>http://alexalemi.com/favicon.ico</url><title>AlexAlemi.com</title><link>https://alexalemi.com/rss.xml</link></image><language>en</language><lastBuildDate>Wed, 06 May 2026 02:30:04 +0000</lastBuildDate><item><title>Why Venus has no moon</title><link>https://alexalemi.com/publications/venus.pdf</link><description>Undergraduate research investigating whether two collisions in the opposite direction could explain Venus' lack of moon and slow rotation. / AA Alemi, DJ Stevenson /  / AAS Oral</description><guid isPermaLink="true">https://alexalemi.com/publications/venus.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/venus.pdf" length="1219090" type="application/pdf"/><pubDate>Fri, 01 Sep 2006 00:00:00 -0400</pubDate></item><item><title>NEMS Coupling</title><link>https://alexalemi.com/publications/nems.pdf</link><description>Undergraduate research project on synchronization in nano cantilevers. / AA Alemi /  / </description><guid isPermaLink="true">https://alexalemi.com/publications/nems.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/nems.pdf" length="296346" type="application/pdf"/><pubDate>Mon, 01 Sep 2008 00:00:00 -0400</pubDate></item><item><title>Laplace-Runge-Lenz Vector</title><link>https://alexalemi.com/publications/laplace.pdf</link><description>Undergraduate project on the history of the Runge Vector. 
/ AA Alemi /  / </description><guid isPermaLink="true">https://alexalemi.com/publications/laplace.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/laplace.pdf" length="254680" type="application/pdf"/><pubDate>Mon, 01 Jun 2009 00:00:00 -0400</pubDate></item><item><title>I was born on Wednesday</title><link>https://thephysicsvirtuosi.com/posts/old/i-was-born-on-wednesday/</link><description>A classic logic puzzle explained.</description><guid isPermaLink="true">https://thephysicsvirtuosi.com/posts/old/i-was-born-on-wednesday/</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Wed, 26 May 2010 00:00:00 -0400</pubDate></item><item><title>How Long Can you Balance A (Quantum) Pencil</title><link>https://thephysicsvirtuosi.com/posts/old/how-long-can-you-balance-a-quantum-pencil/</link><description>Simple and probably wrong calculation for the ultimate length of time a pencil can balance.</description><guid isPermaLink="true">https://thephysicsvirtuosi.com/posts/old/how-long-can-you-balance-a-quantum-pencil/</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Wed, 16 Jun 2010 00:00:00 -0400</pubDate></item><item><title>Near-field radiative heat transfer between macroscopic planar surfaces</title><link>https://arxiv.org/abs/1103.2389</link><description>Exploration of quantum tunnelling as a mechanism for cooling the next generation LIGO detectors. 
/ RS Ottens, Volker Quetschke, Stacy Wise, AA Alemi, Ramsey Lundock, Guido Mueller, David H Reitze, David B Tanner, Bernard F Whiting / 1103.2389 / Phys Rev Lett</description><guid isPermaLink="true">https://alexalemi.com/publications/heat.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/heat.pdf" length="510656" type="application/pdf"/><pubDate>Tue, 01 Mar 2011 00:00:00 -0500</pubDate></item><item><title>A tweet is worth at least 140 words</title><link>https://thephysicsvirtuosi.com/posts/old/a-tweet-is-worth-at-least-140-words/</link><description>Greedy twitter compression scheme.</description><guid isPermaLink="true">https://thephysicsvirtuosi.com/posts/old/a-tweet-is-worth-at-least-140-words/</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Tue, 30 Aug 2011 00:00:00 -0400</pubDate></item><item><title>The Linear Theory of Battleship</title><link>https://thephysicsvirtuosi.com/posts/old/the-linear-theory-of-battleship/</link><description>Winning at battleship with a dirt simple model.</description><guid isPermaLink="true">https://thephysicsvirtuosi.com/posts/old/the-linear-theory-of-battleship/</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Mon, 03 Oct 2011 00:00:00 -0400</pubDate></item><item><title>Growth and form of melanoma cell colonies</title><link>https://arxiv.org/abs/1308.6037</link><description>Simple models of skin cancer growth. 
/ MM Baraldi, AA Alemi, JP Sethna, S Caracciolo, CAM La Porta, S Zapperi / 1308.6037 / JSM</description><guid isPermaLink="true">https://alexalemi.com/publications/melanoma.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/melanoma.pdf" length="306883" type="application/pdf"/><pubDate>Thu, 01 Aug 2013 00:00:00 -0400</pubDate></item><item><title>Imaging atomic rearrangements in two-dimensional silica glass: watching silica's dance</title><link>https://alexalemi.com/publications/glass.pdf</link><description>Applying elastic theory to the atomic scale. / PY Huang, S Kurasch, JS Alden, A Shekhawat, AA Alemi, PL McEuen, JP Sethna, U Kaiser, DA Muller /  / Science</description><guid isPermaLink="true">https://alexalemi.com/publications/glass.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/glass.pdf" length="931242" type="application/pdf"/><pubDate>Tue, 01 Oct 2013 00:00:00 -0400</pubDate></item><item><title>Mechanical properties of growing melanocytic nevi and the progression to melanoma</title><link>https://arxiv.org/abs/1404.4116</link><description>Elastic models of skin cancer. 
/ A Taloni, AA Alemi, E Ciusani, JP Sethna, S Zapperi, CAM La Porta / 1404.4116 / PloS One</description><guid isPermaLink="true">https://alexalemi.com/publications/cancer.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/cancer.pdf" length="6007805" type="application/pdf"/><pubDate>Tue, 01 Apr 2014 00:00:00 -0400</pubDate></item><item><title>Can I compute the mass of a coin based on the sound of its fall?</title><link>https://physics.stackexchange.com/questions/121879/can-i-compute-the-mass-of-a-coin-based-on-the-sound-of-its-fall/121932#121932</link><description>Using the sound of coins dropping to predict their values.</description><guid isPermaLink="true">https://physics.stackexchange.com/questions/121879/can-i-compute-the-mass-of-a-coin-based-on-the-sound-of-its-fall/121932#121932</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Thu, 26 Jun 2014 00:00:00 -0400</pubDate></item><item><title>How effective is speeding?</title><link>https://physics.stackexchange.com/questions/123753/how-effective-is-speeding/123760#123760</link><description>A simple model looking at how effective speeding is at saving time and money.</description><guid isPermaLink="true">https://physics.stackexchange.com/questions/123753/how-effective-is-speeding/123760#123760</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Wed, 09 Jul 2014 00:00:00 -0400</pubDate></item><item><title>Physics of the weird boing sound on racquetball courts.</title><link>https://physics.stackexchange.com/questions/127282/physics-of-weird-boing-sound-in-racquetball-courts/127447#127447</link><description>A model that recreates the boing sound.</description><guid isPermaLink="true">https://physics.stackexchange.com/questions/127282/physics-of-weird-boing-sound-in-racquetball-courts/127447#127447</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Mon, 21 Jul 2014 
00:00:00 -0400</pubDate></item><item><title>Knowledgebase of Interatomic Models application programming interface as a standard for molecular simulations</title><link>https://alexalemi.com/publications/openkim2.pdf</link><description>Building a website to collect interatomic potentials and score them. / R Elliott, E Tadmor, D Karls, A Ludvik, J Sethna, M Bierbaum, AA Alemi, T Wennblom /  / </description><guid isPermaLink="true">https://alexalemi.com/publications/openkim2.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/openkim2.pdf" length="67751" type="application/pdf"/><pubDate>Wed, 01 Oct 2014 00:00:00 -0400</pubDate></item><item><title>Ensuring reliability, reproducibility and transferability in atomistic simulations: The knowledgebase of interatomic models (https://openkim.org)</title><link>https://alexalemi.com/publications/openkim-abs.pdf</link><description> / E Tadmor, R Elliott, D Karls, A Ludvik, J Sethna, M Bierbaum, AA Alemi, T Wennblom /  / </description><guid isPermaLink="true">https://alexalemi.com/publications/openkim-abs.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/openkim-abs.pdf" length="65562" type="application/pdf"/><pubDate>Wed, 01 Oct 2014 00:00:00 -0400</pubDate></item><item><title>Text segmentation based on semantic word embeddings</title><link>https://arxiv.org/abs/1503.05543</link><description>Using word2vec vectors to do automatic text segmentation. 
/ AA Alemi, P Ginsparg / 1503.05543 / </description><guid isPermaLink="true">https://alexalemi.com/publications/segmentation.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/segmentation.pdf" length="1386447" type="application/pdf"/><pubDate>Sun, 01 Mar 2015 00:00:00 -0500</pubDate></item><item><title>Clustering via Content-Augmented Stochastic Blockmodels</title><link>https://arxiv.org/abs/1505.06538</link><description>Better clustering through content conditioning. / JM Cashore, X Zhao, AA Alemi, Y Liu, PI Frazier / 1505.06538 / </description><guid isPermaLink="true">https://alexalemi.com/publications/blockmodels.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/blockmodels.pdf" length="475812" type="application/pdf"/><pubDate>Fri, 01 May 2015 00:00:00 -0400</pubDate></item><item><title>Zombies Reading Segmented Graphene Articles On The Arxiv</title><link>https://alexalemi.com/publications/thesis.pdf</link><description>A collection of four of my graduate student projects. / AA Alemi /  / Thesis</description><guid isPermaLink="true">https://alexalemi.com/publications/thesis.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/thesis.pdf" length="26422151" type="application/pdf"/><pubDate>Sat, 01 Aug 2015 00:00:00 -0400</pubDate></item><item><title>You can run, you can hide: The epidemiology and statistical mechanics of zombies</title><link>https://arxiv.org/abs/1503.01104</link><description>A fun pedagogical introduction to epidemiology and statistical mechanics. 
/ AA Alemi, M Bierbaum, CR Myers, JP Sethna / 1503.01104 / Phys Rev E</description><guid isPermaLink="true">https://alexalemi.com/publications/zombies.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/zombies.pdf" length="3061863" type="application/pdf"/><pubDate>Sun, 01 Nov 2015 00:00:00 -0400</pubDate></item><item><title>SPARTA: Fast global planning of collision-avoiding robot trajectories</title><link>https://alexalemi.com/publications/sparta.pdf</link><description>Using ADMM to do fast trajectory planning. / CJM Mathy, F Gonda, D Schmidt, N Derbinsky, AA Alemi, J Bento, FM Delle Fave, JS Yedidia /  / </description><guid isPermaLink="true">https://alexalemi.com/publications/sparta.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/sparta.pdf" length="2454531" type="application/pdf"/><pubDate>Tue, 01 Dec 2015 00:00:00 -0500</pubDate></item><item><title>DeepMath-deep sequence models for premise selection</title><link>https://arxiv.org/abs/1606.04442</link><description>Using neural networks to improve automatic theorem proving. / G Irving, C Szegedy, AA Alemi, N Eén, F Chollet, J Urban / 1606.04442 / NeurIPS</description><guid isPermaLink="true">https://alexalemi.com/publications/deep_math.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/deep_math.pdf" length="1233094" type="application/pdf"/><pubDate>Wed, 01 Jun 2016 00:00:00 -0400</pubDate></item><item><title>Improving inception and image classification in tensorflow</title><link>https://ai.googleblog.com/2016/08/improving-inception-and-image.html</link><description>Blogpost accompanying open source release of Inception Resnet V2. 
/ AA Alemi /  / Google Research Blog</description><guid isPermaLink="true">https://alexalemi.com/publications/inceptionblog.html</guid><category domain="../publications/">publications</category><pubDate>Wed, 01 Jun 2016 00:00:00 -0400</pubDate></item><item><title>Tree-Structured Variational Autoencoder</title><link>https://alexalemi.com/publications/tree_vae.pdf</link><description>Attempting to learn tree-structured representations. / R Shin, AA Alemi, G Irving, O Vinyals /  / </description><guid isPermaLink="true">https://alexalemi.com/publications/tree_vae.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/tree_vae.pdf" length="345022" type="application/pdf"/><pubDate>Tue, 01 Nov 2016 00:00:00 -0400</pubDate></item><item><title>Improved generator objectives for gans</title><link>https://arxiv.org/abs/1612.02780</link><description>You can target separate divergences for the generator and discriminator of a GAN. / B Poole, AA Alemi, J Sohl-Dickstein, A Angelova / 1612.02780 / NeurIPS Adversarial Workshop</description><guid isPermaLink="true">https://alexalemi.com/publications/improved_gan.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/improved_gan.pdf" length="2339810" type="application/pdf"/><pubDate>Thu, 01 Dec 2016 00:00:00 -0500</pubDate></item><item><title>Inception-v4, inception-resnet and the impact of residual connections on learning</title><link>https://arxiv.org/abs/1602.07261</link><description>Residual connections improve the inception family of classifiers. 
/ C Szegedy, S Ioffe, V Vanhoucke, AA Alemi / 1602.07261 / AAAI</description><guid isPermaLink="true">https://alexalemi.com/publications/inceptionv4.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/inceptionv4.pdf" length="957744" type="application/pdf"/><pubDate>Wed, 01 Feb 2017 00:00:00 -0500</pubDate></item><item><title>Light microscopy at maximal precision</title><link>https://arxiv.org/abs/1702.07336</link><description>Better featuring of colloids. / M Bierbaum, BD Leahy, AA Alemi, I Cohen, JP Sethna / 1702.07336 / Phys Rev X</description><guid isPermaLink="true">https://alexalemi.com/publications/peri.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/peri.pdf" length="8004477" type="application/pdf"/><pubDate>Wed, 01 Feb 2017 00:00:00 -0500</pubDate></item><item><title>Deep Variational Information Bottleneck</title><link>https://arxiv.org/abs/1612.00410</link><description>A modern formulation of the Information Bottleneck which is friendly towards neural networks. / AA Alemi, I Fischer, JV Dillon, K Murphy / 1612.00410 / ICLR</description><guid isPermaLink="true">https://alexalemi.com/publications/vib.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/vib.pdf" length="8864456" type="application/pdf"/><pubDate>Wed, 01 Mar 2017 00:00:00 -0500</pubDate></item><item><title>Motion prediction under multimodality with conditional stochastic networks</title><link>https://arxiv.org/abs/1705.02082</link><description>Pedestrian motion is stochastic which creates certain challenges. 
/ K Fragkiadaki, J Huang, AA Alemi, S Vijayanarasimhan, S Ricco, R Sukthankar / 1705.02082 / </description><guid isPermaLink="true">https://alexalemi.com/publications/motion.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/motion.pdf" length="6880472" type="application/pdf"/><pubDate>Mon, 01 May 2017 00:00:00 -0400</pubDate></item><item><title>Jeffrey's prior sampling of deep sigmoidal networks</title><link>https://arxiv.org/abs/1705.10589</link><description>Jeffrey's prior doesn't really work for neural networks. / LX Hayden, AA Alemi, PH Ginsparg, JP Sethna / 1705.10589 / </description><guid isPermaLink="true">https://alexalemi.com/publications/jeffrey.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/jeffrey.pdf" length="20756658" type="application/pdf"/><pubDate>Mon, 01 May 2017 00:00:00 -0400</pubDate></item><item><title>Tensorflow distributions</title><link>https://arxiv.org/abs/1711.10604</link><description>Paper accompanying library. / JV Dillon, I Langmore, D Tran, E Brevdo, S Vasudevan, D Moore, B Patton, AA Alemi, M Hoffman, RA Saurous / 1711.10604 / </description><guid isPermaLink="true">https://alexalemi.com/publications/tfd.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/tfd.pdf" length="294053" type="application/pdf"/><pubDate>Wed, 01 Nov 2017 00:00:00 -0400</pubDate></item><item><title>Fixing a Broken ELBO</title><link>https://arxiv.org/abs/1711.00464</link><description>Adopting a representational view of VAEs can help explain away some of their problems. 
/ AA Alemi, B Poole, I Fischer, JV Dillon, RA Saurous, K Murphy / 1711.00464 / ICML</description><guid isPermaLink="true">https://alexalemi.com/publications/broken_elbo.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/broken_elbo.pdf" length="3105649" type="application/pdf"/><pubDate>Tue, 01 May 2018 00:00:00 -0400</pubDate></item><item><title>Fixing a Broken ELBO</title><link>https://docs.google.com/presentation/d/11ToIFlOLrcP3GTl8u6Lv-jPntifO0iohIp2VIlTsqj8/present</link><description>A representational reinterpretation of VAEs that helps clarify issues such as posterior collapse. / ICML2018</description><guid isPermaLink="true">https://alexalemi.com/talks/fixing-broken-elbo.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sun, 01 Jul 2018 00:00:00 -0400</pubDate></item><item><title>Uncertainty in the Variational Information Bottleneck</title><link>https://arxiv.org/abs/1807.00906</link><description>VIB builds robust classifiers which are aware of what they don't know. / AA Alemi, I Fischer, JV Dillon / 1807.00906 / UAI UDL Workshop</description><guid isPermaLink="true">https://alexalemi.com/publications/uncert_vib.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/uncert_vib.pdf" length="9909442" type="application/pdf"/><pubDate>Sun, 01 Jul 2018 00:00:00 -0400</pubDate></item><item><title>TherML: Thermodynamics of Machine Learning</title><link>https://arxiv.org/abs/1807.04162</link><description>Modern variational latent variable modelling looks a lot like Thermodynamics. 
/ AA Alemi, I Fischer / 1807.04162 / ICML2018 TFADGM Workshop</description><guid isPermaLink="true">https://alexalemi.com/publications/therml.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/therml.pdf" length="851285" type="application/pdf"/><pubDate>Sun, 01 Jul 2018 00:00:00 -0400</pubDate></item><item><title>Uncertainty in VIB</title><link>https://docs.google.com/presentation/d/1PjEaRIeDOwVYKEmyLBIBiKS8bYPYwwKcXki_Gb8i42c/present</link><description>VIB classifiers capture uncertainty effectively. / UAI UDL Workshop 2018</description><guid isPermaLink="true">https://alexalemi.com/talks/uaivib.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Wed, 01 Aug 2018 00:00:00 -0400</pubDate></item><item><title>WAIC, but Why? Generative Ensembles for Robust Anomaly Detection</title><link>https://arxiv.org/abs/1810.01392</link><description>Even though it shouldn't work, robust likelihoods can detect OOD data in practice. / H Choi, E Jang, AA Alemi / 1810.01392 / </description><guid isPermaLink="true">https://alexalemi.com/publications/waic.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/waic.pdf" length="1166737" type="application/pdf"/><pubDate>Mon, 01 Oct 2018 00:00:00 -0400</pubDate></item><item><title>Canonical Sectors and Evolution of Firms in the US Stock Markets</title><link>https://arxiv.org/abs/1503.06205</link><description>Matrix factorization gives automatic and continuous sector assignments to stocks. 
/ LX Hayden, R Chachra, AA Alemi, PH Ginsparg, JP Sethna / 1503.06205 / Quantitative Finance</description><guid isPermaLink="true">https://alexalemi.com/publications/stocks.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/stocks.pdf" length="3375925" type="application/pdf"/><pubDate>Mon, 01 Oct 2018 00:00:00 -0400</pubDate></item><item><title>Thermodynamics and Machine Learning</title><link>https://docs.google.com/presentation/d/1B2xbdhFRByzIJOdehPVGm5xrCpbUXkijnFRPEwVMKtk/present</link><description>An earlier talk relating thermodynamics and machine learning for a physics audience. / Cornell Physics Colloquium</description><guid isPermaLink="true">https://alexalemi.com/talks/thermodynamics-and-ml.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Thu, 01 Nov 2018 00:00:00 -0400</pubDate></item><item><title>Focusing on the Representation</title><link>https://docs.google.com/presentation/d/1Zd_-R6vVWkPegm_oEXTnlTvFbPTcdb71AtrHYk4-JpM/present</link><description>An overview of my work, which often amounts to reinterpreting existing techniques in a representational light. / Cornell AI Seminar</description><guid isPermaLink="true">https://alexalemi.com/talks/focusing-on-the-representation.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Thu, 01 Nov 2018 00:00:00 -0400</pubDate></item><item><title>GILBO: one metric to measure them all</title><link>https://arxiv.org/abs/1802.04874</link><description>A variational lower bound on the mutual information in GANs highlights some of their problems. 
/ AA Alemi, I Fischer / 1802.04874 / NeurIPS</description><guid isPermaLink="true">https://alexalemi.com/publications/gilbo.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/gilbo.pdf" length="9947982" type="application/pdf"/><pubDate>Sat, 01 Dec 2018 00:00:00 -0500</pubDate></item><item><title>Watch your step: Learning node embeddings via graph attention</title><link>https://arxiv.org/abs/1710.09599</link><description>Building better graph representations. / S Abu-El-Haija, B Perozzi, R Al-Rfou, AA Alemi / 1710.09599 / NeurIPS</description><guid isPermaLink="true">https://alexalemi.com/publications/watch_step.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/watch_step.pdf" length="562082" type="application/pdf"/><pubDate>Sat, 01 Dec 2018 00:00:00 -0500</pubDate></item><item><title>β-VAEs can retain label information even at high compression</title><link>https://arxiv.org/abs/1812.02682</link><description>Some rich decoder VAEs can magically focus on salient information. 
/ E Fertig, A Arbabi, AA Alemi / 1812.02682 / NeurIPS BDL Workshop</description><guid isPermaLink="true">https://alexalemi.com/publications/beta_retain.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/beta_retain.pdf" length="1429763" type="application/pdf"/><pubDate>Sat, 01 Dec 2018 00:00:00 -0500</pubDate></item><item><title>TherML</title><link>https://docs.google.com/presentation/d/1Uhr4oJwTm2yI7FAvkjMbdK6s_HNwD9T61j6Ccz_eBmc/present</link><description>Drawing an analogy between Thermodynamics and modern deep variational latent variable generative modelling / Aspen: Machine Learning and Physics</description><guid isPermaLink="true">https://alexalemi.com/talks/therml.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Tue, 01 Jan 2019 00:00:00 -0500</pubDate></item><item><title>Variational Autoencoders with Tensorflow Probability Layers</title><link>https://medium.com/tensorflow/variational-autoencoders-with-tensorflow-probability-layers-d06c658931b7</link><description>TFP makes VAEs easy. / I Fischer, AA Alemi, JV Dillon, TFP Team /  / Tensorflow Blog</description><guid isPermaLink="true">https://alexalemi.com/publications/vaetfp.html</guid><category domain="../publications/">publications</category><pubDate>Fri, 01 Mar 2019 00:00:00 -0500</pubDate></item><item><title>On the Use of ArXiv as a Dataset</title><link>https://arxiv.org/abs/1905.0075</link><description>More people should use the ArXiv as a dataset. 
/ CB Clement, M Bierbaum, KP O'Keeffe, AA Alemi / 1905.0075 / ICLR workshop RLGM</description><guid isPermaLink="true">https://alexalemi.com/publications/arxiv.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/arxiv.pdf" length="136260" type="application/pdf"/><pubDate>Wed, 01 May 2019 00:00:00 -0400</pubDate></item><item><title>Dueling Decoders: Regularizing Variational Autoencoder Latent Spaces</title><link>https://arxiv.org/abs/1905.07478</link><description>Sometimes a worse decoder gives better representations. / B Seybold, E Fertig, AA Alemi, I Fischer / 1905.07478 / </description><guid isPermaLink="true">https://alexalemi.com/publications/dueling.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/dueling.pdf" length="5602175" type="application/pdf"/><pubDate>Wed, 01 May 2019 00:00:00 -0400</pubDate></item><item><title>On Variational Bounds of Mutual Information</title><link>https://arxiv.org/abs/1905.06922</link><description>Overview of recent advances in variationally bounding mutual information. / B Poole, S Ozair, A van den Oord, AA Alemi, G Tucker / 1905.06922 / ICML</description><guid isPermaLink="true">https://alexalemi.com/publications/vmibounds.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/vmibounds.pdf" length="2867680" type="application/pdf"/><pubDate>Wed, 01 May 2019 00:00:00 -0400</pubDate></item><item><title>On Predictive Information in RNNs</title><link>https://arxiv.org/abs/1910.09578</link><description>Modern RNNs do not optimally capture predictive information in sequences. 
/ Z Dong, D Oktay, B Poole, AA Alemi / 1910.09578 / </description><guid isPermaLink="true">https://alexalemi.com/publications/salamander.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/salamander.pdf" length="2845465" type="application/pdf"/><pubDate>Tue, 01 Oct 2019 00:00:00 -0400</pubDate></item><item><title>CEB Improves Model Robustness</title><link>https://arxiv.org/abs/2002.05380</link><description>A class conditional version of VIB shows good robustness. / I Fischer, AA Alemi / 2002.05380 / Entropy</description><guid isPermaLink="true">https://alexalemi.com/publications/cebrobust.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/cebrobust.pdf" length="1291415" type="application/pdf"/><pubDate>Tue, 01 Oct 2019 00:00:00 -0400</pubDate></item><item><title>Information in Infinite Ensembles of Infinitely-Wide Networks</title><link>https://arxiv.org/abs/1911.09189</link><description>While they seem complex, infinite ensembles of infinitely-wide networks are simple enough to enable tractable calculations of many information theoretic quantities. / R Shwartz-Ziv, AA Alemi / 1911.09189 / AABI 2019 - PMLR</description><guid isPermaLink="true">https://alexalemi.com/publications/infiniteinfo.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/infiniteinfo.pdf" length="1068596" type="application/pdf"/><pubDate>Tue, 01 Oct 2019 00:00:00 -0400</pubDate></item><item><title>Variational Predictive Information Bottleneck</title><link>https://arxiv.org/abs/1910.10831</link><description>Most modern inference procedures can be rederived as a simple variational bound on a predictive information bottleneck objective. 
/ AA Alemi / 1910.10831 / AABI</description><guid isPermaLink="true">https://alexalemi.com/publications/pib.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/pib.pdf" length="409985" type="application/pdf"/><pubDate>Tue, 01 Oct 2019 00:00:00 -0400</pubDate></item><item><title>Thermodynamic Computing</title><link>https://arxiv.org/abs/1911.01968</link><description>A position paper on the future of thermodynamic computing. / T Conte, E DeBenedictis, N Ganesh, T Hylton, JP Strachan, RS Williams, AA Alemi, L Altenberg, G Crooks, J Crutchfield, L del Rio, J Deutsch, M DeWeese, K Douglas, M Esposito, M Frank, R Fry, P Harsha, M Hill, C Kello, J Krichmar, S Kumar, SC Liu, S Lloyd, M Marsili, I Nemenman, A Nugent, N Packard, D Randall, P Sadowski, N Santhanam, R Shaw, A Stieg, E Stopnitzky, C Teuscher, C Watkins, D Wolpert, J Yang, Y Yufik / 1911.01968 / CCC</description><guid isPermaLink="true">https://alexalemi.com/publications/thermodynamic.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/thermodynamic.pdf" length="2809426" type="application/pdf"/><pubDate>Fri, 01 Nov 2019 00:00:00 -0400</pubDate></item><item><title>A Case for Compression</title><link>https://docs.google.com/presentation/d/1rAZToLv1dfCXfzlzgTiYXBxf563qv0esx_i7y9vYt5c/present</link><description>I offer arguments both for and against learning compressed representations in the form of a generalized information bottleneck. 
/ NeurIPS 2019 Workshop on Information Theory and Machine Learning</description><guid isPermaLink="true">https://alexalemi.com/talks/case-for-compression.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sun, 01 Dec 2019 00:00:00 -0500</pubDate></item><item><title>Neural Tangents: Fast and Easy Infinite Neural Networks in Python</title><link>https://arxiv.org/abs/1912.02803</link><description>Simple to use python package for training infinitely wide neural networks. / R Novak, L Xiao, J Hron, J Lee, AA Alemi, J Sohl-Dickstein, SS Schoenholz / 1912.02803 / ICLR</description><guid isPermaLink="true">https://alexalemi.com/publications/neural_tangents.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/neural_tangents.pdf" length="4405349" type="application/pdf"/><pubDate>Sun, 01 Dec 2019 00:00:00 -0500</pubDate></item><item><title>Variational Predictive Information Bottleneck</title><link>https://docs.google.com/presentation/d/1wlQzWYr2cHu081NWPL9Cfp6z1cC4wIxe_qfFUhAfWcg/present</link><description>I attempt to show that most modern forms of inference can be viewed as optimizing a variational bound on a predictive information bottleneck objective. 
/ Information Theory and Applications Workshop</description><guid isPermaLink="true">https://alexalemi.com/talks/ita-pib.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sat, 01 Feb 2020 00:00:00 -0500</pubDate></item><item><title>'Live' Logistic Coronavirus Death Counter</title><link>https://observablehq.com/@alemi/live-corona-death-counter</link><description>An approximate 'live' corona death counter.</description><guid isPermaLink="true">https://observablehq.com/@alemi/live-corona-death-counter</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Fri, 27 Mar 2020 00:00:00 -0400</pubDate></item><item><title>Coronavirus Logistic Growth Plots</title><link>https://observablehq.com/@alemi/logistic-growth-plots</link><description>A distinct way to view Coronavirus growth.</description><guid isPermaLink="true">https://observablehq.com/@alemi/logistic-growth-plots</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Mon, 13 Apr 2020 00:00:00 -0400</pubDate></item><item><title>The OpenKIM Processing Pipeline: A Cloud-Based Automatic Materials Property Computation Engine</title><link>https://arxiv.org/abs/2005.09062</link><description>Database for Interatomic Potentials. / DS Karls, M Bierbaum, AA Alemi, RS Elliot, JP Sethna, EB Tadmor / 2005.09062 / Journal of Chemical Physics</description><guid isPermaLink="true">https://alexalemi.com/publications/openkim.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/openkim.pdf" length="1325099" type="application/pdf"/><pubDate>Fri, 01 May 2020 00:00:00 -0400</pubDate></item><item><title>TherML</title><link>https://docs.google.com/presentation/d/1LiovZcyZfh-P6mluB9fnyGFkOz4FJ7Y6V_hOuUx_j0A/present</link><description>Another version of my TherML talk. 
/ American Physical Society Topical Group on Data Science</description><guid isPermaLink="true">https://alexalemi.com/talks/therml-aps.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Mon, 01 Jun 2020 00:00:00 -0400</pubDate></item><item><title>Machine Learning and Thermodynamics</title><link>https://docs.google.com/presentation/d/1zG0pU33e6SnIhyYR926Y6JNhBdP4kqM3-vMz1vNdTQk/present</link><description>Thermodynamics from a Probabilistic perspective and machine learning from a thermodynamic perspective. / University of Maryland - Informal Statistical Physics Seminar</description><guid isPermaLink="true">https://alexalemi.com/talks/ml-and-thermo.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Mon, 01 Jun 2020 00:00:00 -0400</pubDate></item><item><title>Density of States Estimation for Out-of-Distribution Detection</title><link>https://arxiv.org/abs/2006.09273</link><description>Simple density-of-states inspired out of distribution detection. / WR Morningstar, C Ham, AG Gallagher, B Lakshminarayanan, AA Alemi, JV Dillon / 2006.09273 / AISTATS 2021 Oral</description><guid isPermaLink="true">https://alexalemi.com/publications/dose.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/dose.pdf" length="4168346" type="application/pdf"/><pubDate>Mon, 01 Jun 2020 00:00:00 -0400</pubDate></item><item><title>Why KL?</title><description>Why is the KL divergence so special?</description><content:encoded>&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;Kullback-Leibler
divergence&lt;/a&gt;,
or KL divergence, or relative entropy, or relative information, or information
gain, or expected weight of evidence, or information divergence
(it goes by a lot of different names) is unique
among the ways to measure the difference between two probability
distributions.  It holds a special and privileged place, being used to define
all of the core concepts in information theory, such as mutual information.&lt;/p&gt;
&lt;p&gt;Why is the relative information so special and where does it come from?
How should you interpret it? What is a nat anyway?  In this
note, I'll try to give a better understanding and set of intuitions about
what KL is, why it's interesting, where it comes from and what it's good for.&lt;/p&gt;
&lt;h2&gt;Information Gain&lt;/h2&gt;
&lt;p&gt;Let's see if we can motivate the form of the KL axiomatically.&lt;/p&gt;
&lt;p&gt;Imagine we have some prior set of beliefs summarized as a probability distribution $q$.
In light of some kind of evidence, we update our beliefs to a new distribution $p$.
How &lt;em&gt;much&lt;/em&gt; did we update our beliefs?  How do we quantify
the &lt;em&gt;magnitude&lt;/em&gt; of that update?  What are some properties we might want this
hypothetical function to have?  Let $I[p; q]$ denote the function that measures
how much we moved beliefs when we switch from beliefs $q$ to beliefs $p$.  We'll
call this amount of update the &lt;em&gt;information gain&lt;/em&gt; when we move from $q$ to $p$.
&lt;sup&gt;&lt;a href="#hobson"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="hobson"&gt;1&lt;/sup&gt;
  What follows is my own reconstruction of the fabulous paper:
  &lt;a href="https://link.springer.com/article/10.1007/BF01106578"&gt;
  &lt;b&gt;A New Theorem of Information Theory&lt;/b&gt; by Arthur Hobson
  &lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;We want our information function to satisfy the following properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It's &lt;strong&gt;continuous&lt;/strong&gt;.  A small change in the distributions makes a small change in the amount of information in the move.&lt;/li&gt;
&lt;li&gt;It's permutation or &lt;strong&gt;reparameterization independent&lt;/strong&gt;.  It doesn't matter if we change the units we've specified our distributions in or if we relabel the sides of our dice; the answer shouldn't change.&lt;/li&gt;
&lt;li&gt;We want it to be &lt;strong&gt;non-negative&lt;/strong&gt; and have the value $I = 0$ if and only if $p = q$.  If $p=q$ we haven't updated our beliefs and so have no information gain.&lt;/li&gt;
&lt;li&gt;We want it to be &lt;strong&gt;monotonic&lt;/strong&gt; in a natural sense.  If we, for instance, start with some uniform distribution over the 24 people in a game of &lt;a href="https://en.wikipedia.org/wiki/Guess_Who%3F"&gt;Guess Who?&lt;/a&gt; and then update to only 5 remaining suspects, $I$ should be larger than if there were still 12 remaining suspects.&lt;/li&gt;
&lt;li&gt;Finally, we want our information function to &lt;strong&gt;decompose&lt;/strong&gt; in a natural and &lt;strong&gt;linear&lt;/strong&gt; way.&lt;sup&gt;&lt;a href="#renyi"&gt;2&lt;/a&gt;&lt;/sup&gt; In particular, we want to be able to relate the information between two joint distributions in terms of the information between their marginal and conditional distributions.&lt;/li&gt;
&lt;/ol&gt;
&lt;aside&gt; &lt;sup id="renyi"&gt;2&lt;/sup&gt;
  If one relaxes the requirement for linear decomposition and instead just requires that our information
  function decompose in a convex way, you get the generalized set of
  &lt;a href="https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy#R%C3%A9nyi_divergence"&gt;Rényi divergences&lt;/a&gt;.
  See: &lt;a href="https://projecteuclid.org/euclid.bsmsp/1200512181"&gt;
  &lt;i&gt;On Measures of Entropy and Information&lt;/i&gt; by Alfréd Rényi.&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;These are all very natural properties for our information function to have.  That last point about decomposition needs to be elaborated.
The point is that there are alternative ways we might express a probability distribution.  Apropos of nothing, imagine we
are concerned that we might have been exposed to a disease and are thinking about getting a test done.  There are two random variables
under consideration: $\mathcal{D}$ for whether we actually had the disease or not,
and $\mathcal{T}$ for whether
the test result is positive.  Each of these random variables can take on two possible states, which we'll denote as
$\mathcal{D} \in \{ D, \overline D \}, \mathcal{T} \in \{ T, \overline T \}$.
$D$ represents the state of our having-had-the-disease random variable $\mathcal{D}$ being positive, meaning we actually
did have the disease.  $\overline D$ denotes we actually didn't.
With two binary random variables, there are 4 possible outcomes $(\{ DT, D\overline T, \overline D T, \overline D \overline T\})$
and fully specifying our set of beliefs requires 3 independent probabilities.&lt;/p&gt;
&lt;aside&gt; &lt;sup id="kent"&gt;3&lt;/sup&gt;
  An &amp;ldquo;&lt;i&gt;Almost Certainly Not&lt;/i&gt;&amp;rdquo; is 7% on
  the &lt;a href="https://en.wikipedia.org/wiki/Words_of_estimative_probability"&gt;Kent's words of Estimative Probability&lt;/a&gt; list.
&lt;/aside&gt;
&lt;aside&gt; &lt;sup id="covid"&gt;4&lt;/sup&gt;
  See for instance the RDT Cellex Inc. &lt;a href="https://www.centerforhealthsecurity.org/resources/COVID-19/serology/Serology-based-tests-for-COVID-19.html"&gt;SARS-COV-2 Test&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;What are our prior beliefs?
Let's imagine that, while we are concerned we might have had the disease, if we are being honest
we almost certainly didn't,&lt;sup&gt;&lt;a href="#kent"&gt;3&lt;/a&gt;&lt;/sup&gt;
so we'll put our prior belief in having had the disease at 7% $(q(D) = 0.07)$.
How do we expect the antibody test to go if we have it done?
You do a bit of research and discover
that if you had had the disease, the sensitivity or &lt;em&gt;true positive rate&lt;/em&gt; of the
test you're about to take is 93.8% $(q(T|D) = 0.938)$.
The specificity or &lt;em&gt;true negative rate&lt;/em&gt; of that
same test is 95.6% $(q(\overline T | \overline D) = 0.956)$. &lt;sup&gt;&lt;a href="#covid"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure id="#conditional" class="right"&gt;
  &lt;center&gt;
  &lt;img width="45%" src="figures/KLdiagram2.svg"
    alt="Conditional characterization of distribution."&gt;
  &lt;img width="45%" src="figures/KLdiagram.svg"
    alt="Joint characterization of distribution."&gt;
  &lt;figcaption&gt;
  Figure 1. Two equivalent ways to express the joint distribution $q(\mathcal{D}\mathcal{T})$.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
We've just specified our prior beliefs with 3 numbers, imagining our process as having two steps:
first, we either had the disease or not $(q(\mathcal{D}))$, and then, conditioned on that,
we get the result of our test $(q(\mathcal{T}|\mathcal{D}))$.
Equivalently, we could have just given the joint probability distribution, as shown in Figure 1.
&lt;p&gt;The point now is that if we were to update our beliefs, in the diagram on the right there is just a single
distribution $q(\mathcal{D},\mathcal{T})$, in the one on the left there are essentially three different distributions
$(q(\mathcal{D}), q(\mathcal{T}|D), q(\mathcal{T}| \overline D))$ and we want
some sort of &lt;em&gt;structural&lt;/em&gt; consistency between the two sides:
$$
I[p(\mathcal{D},\mathcal{T}); q(\mathcal{D},\mathcal{T})] \quad \textrm{versus} \quad
I[p(\mathcal{D}); q(\mathcal{D})], I[p(\mathcal{T}|D); q(\mathcal{T}|D)],
I[p(\mathcal{T}|\overline D); q(\mathcal{T}|\overline D)] .
$$&lt;/p&gt;
&lt;p&gt;The consistency we will require is that our information measure decomposes linearly between
these two different descriptions. The information between the joints should be a weighted
linear combination of the informations of the three constituent distributions.
In this particular case we will require:
$$ I[p(\mathcal{D},\mathcal{T}); q(\mathcal{D},\mathcal{T})] =  I[p(\mathcal{D}); q(\mathcal{D})] + p(D) I[p(\mathcal{T}|D); q(\mathcal{T}|D)] + p(\overline D) I[p(\mathcal{T}|\overline D); q(\mathcal{T}|\overline D)] .
$$
In words: The information in the full joint update is the information update for
your belief in whether or not you had the disease $(q(\mathcal D))$ &lt;em&gt;plus&lt;/em&gt; the informations
in the two conditional distributions, but weighted by how often we find ourselves in each of those
branches, as measured by our updated beliefs $(p(\mathcal{D}))$.&lt;/p&gt;
&lt;p&gt;More generally we are requiring that our information function satisfies a natural &lt;em&gt;chain rule&lt;/em&gt;:
$$ I[ p(X,Y); q(X,Y) ] = I[ p(X); q(X) ] + \mathbb{E}_{p(X)} \left[ I[ p(Y|X); q(Y|X) ] \right] $$&lt;/p&gt;
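&lt;p&gt;To make the chain rule concrete, here is a minimal numerical sketch for the disease example.  It assumes the logarithmic form of $I$ that we only arrive at further below, uses the prior $q$ from the text, and checks that the two ways of computing the information gain agree.  The updated beliefs $p$ are made-up numbers chosen purely for illustration, and the helper names &lt;code&gt;kl&lt;/code&gt; and &lt;code&gt;joint&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import math


def kl(p, q):
    """Discrete information gain I[p; q] in nats; outcomes with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi != 0)


def joint(marginal, conditional):
    """Flatten marginal[i] * conditional[i][j] into a joint over (D, T)."""
    return [marginal[i] * conditional[i][j] for i in range(2) for j in range(2)]


# Prior beliefs q from the text: q(D) = 0.07, q(T|D) = 0.938, q(not T | not D) = 0.956.
q_d = [0.07, 0.93]
q_t_given_d = [[0.938, 0.062], [0.044, 0.956]]

# Hypothetical updated beliefs p (illustrative values only, not from the text).
p_d = [0.30, 0.70]
p_t_given_d = [[0.90, 0.10], [0.10, 0.90]]

# Left side: information gain between the two joint distributions.
lhs = kl(joint(p_d, p_t_given_d), joint(q_d, q_t_given_d))

# Right side: marginal term plus the conditional terms, weighted by p(D).
rhs = kl(p_d, q_d) + sum(
    p_d[i] * kl(p_t_given_d[i], q_t_given_d[i]) for i in range(2)
)

print(lhs, rhs)  # the two numbers agree up to floating point error
```

&lt;p&gt;Weighting the conditional terms by $q(\mathcal{D})$ instead of $p(\mathcal{D})$ breaks the equality, which is one way to see where the asymmetry enters.&lt;/p&gt;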
&lt;p&gt;Notice that it is here, in this sort of structural consistency, that we make
our information function manifestly asymmetric.  Here our $p$ distribution
becomes distinguished over our $q$ as it is the one we use to weight the child
contributions.  This makes sense if we imagine that $p$ is the actual
distribution that events are drawn from, for it means that this will correspond
to the information we would observe in expectation.&lt;/p&gt;
&lt;p&gt;The interesting thing is that if you want your information function to satisfy
all of these seemingly reasonable properties, that is enough to determine it
&lt;em&gt;uniquely&lt;/em&gt;.  The only function satisfying all of these properties is the
relative entropy, or KL divergence we all know and love:
$$
I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)}
$$&lt;/p&gt;
&lt;p&gt;See &lt;a href="https://link.springer.com/article/10.1007/BF01106578"&gt;
&lt;b&gt;A New Theorem of Information Theory&lt;/b&gt; by Arthur Hobson
&lt;/a&gt; for a complete proof,
but here I'll offer a more colloquial argument like the one
given by Ariel Caticha.&lt;sup&gt;&lt;a href="#caticha"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="caticha"&gt;5&lt;/sup&gt;
  &lt;i&gt;Lectures on Probability, Entropy and Statistical Physics&lt;/i&gt; by
  Ariel Caticha. &lt;a href="https://arxiv.org/abs/0808.0012"&gt;arXiv:0808.0012&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;We will start with and focus on the continuous setting, where we have two probability
distributions $p$ and $q$.  We seek a functional that takes our two distributions
and gives back our information gain and we seek one that is &lt;em&gt;local&lt;/em&gt; in the physics sense,
meaning that our &lt;em&gt;functional&lt;/em&gt; can be written as the integral of a &lt;em&gt;function&lt;/em&gt; depending
only on the values the probability densities take at each point:
$$ I[p;q] = \int \mathrm dx\, \mathcal{A}(x, p(x), q(x)). $$&lt;/p&gt;
&lt;p&gt;Our requirement that our information gain be
&lt;em&gt;reparameterization independent&lt;/em&gt; means it has to
be invariant to any remapping of our coordinates, or in other words,
it has to be dimensionless.  Imagine $x$ has units of a length; then our integral
measure $\mathrm dx$ has units of a length, and the densities $p(x), q(x)$
have units of an inverse length.  The only dimensionless combinations available are
$p(x)\,\mathrm dx$ and the ratio $p(x)/q(x)$, so in order to be dimensionally consistent
our functional must take the form:&lt;sup&gt;&lt;a href="#caveat"&gt;6&lt;/a&gt;&lt;/sup&gt;
$$ I[p;q] = \int \mathrm dx\, p(x) f\left( \frac{p(x)}{q(x)} \right). $$&lt;/p&gt;
&lt;aside&gt; &lt;sup id="caveat"&gt;6&lt;/sup&gt;
  We could have just as well written it as $I[p;q] = \int \mathrm dx\, q(x) g\left( \frac{p(x)}{q(x)} \right)$ (that is, the form
  of an &lt;a href="https://en.wikipedia.org/wiki/F-divergence"&gt;f-divergence&lt;/a&gt;), but
  this is equivalent to the way we wrote it with $g(\mathcal{X}) = \mathcal{X} f(\mathcal X)$.
  Putting the $p(x)$ as the integral measure better aligns with what we are about to do next.
&lt;/aside&gt;
&lt;p&gt;Finally, our decomposability requirement above when written out in terms of
continuous densities takes the form:
$$ I[ p(x,y); q(x,y) ] = I[ p(x); q(x) ] + \int \mathrm dx\, p(x) I[p(y|x) ; q(y|x)] $$&lt;/p&gt;
&lt;p&gt;Combining this linear decomposition requirement with the functional form
we just derived and pushing some equations around gives us:
$$
\begin{align}
I[ p(x,y); q(x,y) ] &amp;amp;= I[p(x); q(x)] + \int \mathrm dx\, p(x) I[p(y|x); q(y|x)] \\
\int \mathrm dx\, \mathrm dy\, p(x,y) f\left(\frac{p(x,y)}{q(x,y)} \right)&amp;amp;= \int \mathrm dx\, p(x) f\left(\frac{p(x)}{q(x)} \right) + \int \mathrm dx\, p(x) \int \mathrm dy\, p(y|x) f\left(\frac{p(y|x)}{q(y|x)} \right) \\
\int \mathrm dx\, \mathrm dy\, p(x) p(y|x) f\left(\frac{p(x)p(y|x)}{q(x)q(y|x)} \right)&amp;amp;= \int \mathrm dx\, \mathrm dy\, p(x) p(y|x) \left[ f\left(\frac{p(x)}{q(x)} \right) + f\left(\frac{p(y|x)}{q(y|x)} \right)\right] .
\end{align}
$$
Notice that this demonstrates that our function $f$ must satisfy the property:
$$ f(ab) = f(a) + f(b). $$
This well known functional equation has a unique (up to a multiplicative constant) &lt;em&gt;continuous&lt;/em&gt; solution:
$$ f(x) = c \log x. $$
We can roll the choice of multiplicative constant into our choice of base for the logarithm and arrive at our final form
for our information gain:
$$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)}. $$&lt;/p&gt;
&lt;div id="#nonnegative"&gt;As for the non-negativity, our final form satisfies that property.  Because we have that $\log x \leq x -1$:
$$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)} = -\int \mathrm dx \, p(x) \log \frac{q(x)}{p(x)} \geq
-\int \mathrm dx\, p(x) \left( \frac{q(x)}{p(x)} - 1 \right) = 0. $$
&lt;aside&gt;
  &lt;img width="100%" src="figures/logbound.svg"
    alt="Visual demonstration of log x ≤ x - 1."&gt;
&lt;/aside&gt;
&lt;/div&gt;
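&lt;p&gt;As a quick sanity check on the non-negativity argument, the sketch below draws random pairs of discrete distributions and confirms that the information gain never dips below zero, and that it is exactly zero when the two distributions coincide.  The discrete setting, the helper names and the choice of 1000 random trials over 6 outcomes are assumptions made for illustration.&lt;/p&gt;

```python
import math
import random


def kl(p, q):
    """Discrete information gain I[p; q] in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi != 0)


def random_distribution(n, rng):
    """A random probability vector with n strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)  # fixed seed so the run is repeatable
gains = []
for _ in range(1000):
    p = random_distribution(6, rng)
    q = random_distribution(6, rng)
    gains.append(kl(p, q))

smallest_gain = min(gains)  # stays positive for distinct p and q
self_gain = kl(p, p)        # no update, so no information gain

print(smallest_gain, self_gain)
```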
&lt;h2&gt;Bayes Rule&lt;/h2&gt;
&lt;aside&gt; &lt;sup id="caticha2"&gt;52&lt;/sup&gt;
  I first saw this form of motivation for Bayes Rule in
  &lt;i&gt;Lectures on Probability, Entropy and Statistical Physics&lt;/i&gt; by
  Ariel Caticha. &lt;a href="https://arxiv.org/abs/0808.0012"&gt;arXiv:0808.0012&lt;/a&gt;
&lt;/aside&gt;
Having identified the right way to measure how much information is gained when we update a distribution
from $q$ to $p$, why don't we put this to practical use and try to figure out how we
&lt;i&gt;ought&lt;/i&gt; to update
our beliefs in light of evidence or observations.&lt;sup&gt;&lt;a href="#caticha2"&gt;52&lt;/a&gt;&lt;/sup&gt;
&lt;p&gt;Returning to our disease testing example, let's say you get the test done and receive a
positive result $(\mathcal T = T)$.
What should your new distribution of beliefs be? Well, first off if we've observed the results of the test
we should probably have our updated beliefs reflect the observation we made, making it consistent with our
observation, setting $p(T) = 1$, but this doesn't fully specify $p$; we need two more numbers.  How should
we set those?&lt;/p&gt;
&lt;p&gt;Why don't we aim to be conservative and try to find a new set of beliefs
that are as close as possible to our prior beliefs while still being consistent with the
observation that we've made?
Namely, let's look now for a joint distribution $p(\mathcal T, \mathcal D)$
that is as close as possible to $q(\mathcal T, \mathcal D)$ but for which we have that $p(T)=1$.
$$ \DeclareMathOperator{\argmin}{arg\,min} $$
$$ \argmin_{p(\mathcal D, \mathcal T)} I[p(\mathcal D, \mathcal T); q(\mathcal D, \mathcal T)] \quad \text{ s.t. }\quad p(T) = 1 $$
Now that we know
how to measure how much information is gained in updating our beliefs, we will
find the $p$ that minimizes this update while still being true to the observation we made.
Writing $p(\mathcal D,\mathcal T) = p(\mathcal T)p(\mathcal D|\mathcal T)$
and using our linear decomposition rule from above (the other way around), where the $\overline T$ branch drops out because its weight is $p(\overline T) = 0$, we have:
$$ I[p(\mathcal D,\mathcal T); q(\mathcal D,\mathcal T)] = I[p(\mathcal T);q(\mathcal T)] + I[p(\mathcal D|T);q(\mathcal D|T)]. $$
Because we've decided to fix $p(T)=1$ in order to be consistent with our
observation, the way to minimize the information between the joints is to set $p(\mathcal D|T)=q(\mathcal D|T)$ so
that our second term vanishes. In this particular case this means:
$$ p(T)=1 $$
$$ p(D|T) = q(D|T) = \frac{q(T|D)q(D)}{q(T|D)q(D) + q(T|\overline D)q(\overline D)} = 0.616 $$&lt;/p&gt;
&lt;p&gt;Furthermore, the marginal distribution of our updated beliefs about our disease status is:
$$ p(D) = p(D|T)p(T) = q(D|T) = 0.616$$
In this particular case our updated belief is only about 3 to 2 in favor of
our actually having had the disease, despite our positive test result. In Figure 2
we show both our prior in this factorization as well as our new beliefs.&lt;/p&gt;
&lt;figure id="#posterior" class="right"&gt;
  &lt;center&gt;
  &lt;img width="35%" src="figures/KLdiagram2q.svg"
    alt="Prior distribution of beliefs."&gt;
  &lt;img width="35%" src="figures/KLdiagram2p.svg"
    alt="Posterior distribution of beliefs."&gt;
  &lt;figcaption&gt;
  Figure 2. Our prior (left, blue, notice that we've swapped the order of the conditioning) and updated (right, orange) beliefs after observing that the test was positive.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
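&lt;p&gt;The arithmetic behind that update is short enough to write out.  This is a minimal sketch using only the three prior numbers quoted above; the variable names are my own, and the last line uses the fact that with $p(T) = 1$ the leftover information gain $I[p(\mathcal T); q(\mathcal T)]$ reduces to $\log \frac{1}{q(T)}$.&lt;/p&gt;

```python
import math

q_d = 0.07           # prior belief in having had the disease, q(D)
sensitivity = 0.938  # true positive rate, q(T|D)
specificity = 0.956  # true negative rate, q(not T | not D)

q_t_given_d = sensitivity
q_t_given_not_d = 1.0 - specificity  # false positive rate

# Prior probability of a positive test, q(T), summing over both branches.
q_t = q_t_given_d * q_d + q_t_given_not_d * (1.0 - q_d)

# Minimal-update posterior: p(D) = q(D|T).
posterior = q_t_given_d * q_d / q_t
odds = posterior / (1.0 - posterior)

# Information gained by the update, in nats: I[p(T); q(T)] with p(T) = 1.
info_gain_nats = math.log(1.0 / q_t)

print(round(posterior, 3))  # 0.616
```

&lt;p&gt;The odds come out at roughly 1.6 to 1, which is where the "about 3 to 2" reading comes from.&lt;/p&gt;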
&lt;p&gt;Notice what just happened.  If we look for a new distribution that is as close as possible
to our previous distribution of beliefs (as measured by $I[p;q]$) which is also consistent
with our observations, we end up with an updated, or &lt;em&gt;posterior&lt;/em&gt; set of beliefs given
by Bayes' Rule.  Imagine we had some observable $x$ and some parameters $\theta$.  Our
prior set of beliefs is described by the joint distribution $q(\theta,x) = q(x|\theta)q(\theta)$:
a &lt;em&gt;likelihood&lt;/em&gt; $q(x|\theta)$ of how we expect the data to be distributed given
the parameter values and some &lt;em&gt;prior&lt;/em&gt; $q(\theta)$ set of beliefs about what values
those parameters can take.  If we make an observation and see some value for our observable $x=X$,
what ought our new beliefs be?  If we search for the joint distribution $p(x,\theta)$ that is
as close as possible to our previous beliefs $q(x,\theta)$ but that no longer has any
uncertainty about the value the observable will take $(p(x) = \delta(x-X))$ we see
that minimizing the information gain:
$$ I[p;q] = I[p(x);q(x)] + \int \mathrm dx\, p(x) \, I[p(\theta|x); q(\theta|x)], $$
is accomplished if we set $p(\theta|x) = q(\theta|x)$, yielding the updated joint:
$$ p(x,\theta) = p(x)p(\theta|x) = \delta(x-X) q(\theta|x) $$
and the marginal beliefs about the parameters to be:
$$ p(\theta) = \int \mathrm dx\, p(x,\theta) = \int \mathrm dx\, \delta(x-X) q(\theta|x) = q(\theta|X), $$
or precisely what you probably thought it should have been anyway if you've heard
of Bayesian inference.&lt;/p&gt;
&lt;p&gt;Although, if you stop to think about it, even though many of us know of and have
used &lt;a href="https://en.wikipedia.org/wiki/Bayes%27_theorem"&gt;Bayes Theorem&lt;/a&gt;
for a long time, the way it's normally presented, it is just a trivial statement
about how joint distributions factor.
$$ q(\theta, D) = q(\theta) q(D|\theta) = q(D) q(\theta|D)  \implies
q(\theta|D) = \frac{q(D|\theta) q(\theta)}{q(D)}. $$
But, this is just a statement about distribution
$q$, our prior beliefs.  It tells us nothing about how we should update those
beliefs in light of observations.  However, the previous argument demonstrates
that if you want to set your updated beliefs such that they are as close
as possible to your prior beliefs while being consistent with your
observations, you should set your updated beliefs according to
Bayes' rule run on the prior beliefs.&lt;/p&gt;
&lt;p&gt;&lt;span id="expected-weight-of-evidence"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h2&gt;Expected Weight of Evidence&lt;/h2&gt;
&lt;p&gt;Traditionally, KL is interpreted from a coding perspective, a view I've included in an appendix below,
but here I offer a different perspective from the viewpoint of model selection.&lt;a href="#woe"&gt;&lt;sup&gt;8&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="woe"&gt;8&lt;/sup&gt;
I think weight of evidence is one of the most underappreciated concepts.  For a nice overview see: &lt;i&gt;Weight of Evidence: A Brief Survey&lt;/i&gt; by I.J. Good.
&lt;a href="https://www.cs.tufts.edu/~nr/cs257/archive/jack-good/weight-of-evidence.pdf"&gt;[pdf]&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;Above we saw that we can motivate Bayesian inference as choosing a posterior belief distribution
that has the minimal information gain over our prior distribution of beliefs while being consistent
with our observations.  This guides us towards forming better belief distributions, but what if we
just have two different belief distributions and wish to decide between them?&lt;/p&gt;
&lt;p&gt;Really what we want to know is what is the probability that our beliefs are correct in light of evidence?
Symbolically you might write this as $p(P|E)$ where $P$ is some belief distribution and $E$ is some
evidence, data, or observations. If we run Bayes Theorem we can see that:
$$ p(P|E) = \frac{p(E|P) p(P)}{p(E)}. $$
We can update our belief in our beliefs being correct by setting our updated
weight in the belief $p(P|E)$ to be proportional to our initial weight $p(P)$ times
the &lt;em&gt;likelihood&lt;/em&gt; that the evidence we observed would have been generated if our belief was true $(p(E|P))$.  The probability of the evidence given the belief $P$ is just the likelihood $P(E)$.
We say proportional because we would also need to know how likely the evidence $p(E)$ would be amongst all possible
beliefs. This last part, the &lt;a href="https://en.wikipedia.org/wiki/Marginal_likelihood"&gt;&lt;i&gt;marginal likelihood&lt;/i&gt;&lt;/a&gt;,
is notoriously difficult to compute. In principle, it is asking us to evaluate how likely
the evidence would be from all possible models.&lt;/p&gt;
&lt;p&gt;However, we can make further progress if we content ourselves with not necessarily knowing the
absolute probability our model or beliefs are correct, but instead just its probability relative
to some other model.  If we consider the ratio of two different models $P$ and $Q$ we have:
$$ \frac{p(P|E)}{p(Q|E)} = \frac{p(E|P)}{p(E|Q)} \frac{p(P)}{p(Q)}. $$
Notice that the marginal likelihoods cancel out.  This is saying that whatever our prior relative odds for the two models
being correct, if we compute the &lt;a href="https://en.wikipedia.org/wiki/Bayes_factor"&gt;&lt;i&gt;Bayes factor&lt;/i&gt;&lt;/a&gt;
$\left( \frac{p(E|P)}{p(E|Q)} \right)$, it tells us how the relative probabilities of the two beliefs should update
in light of the evidence. Taking a log on both sides:
$$ \log \frac{p(P|E)}{p(Q|E)} = \log \frac{p(E|P)}{p(E|Q)} + \log \frac{p(P)}{p(Q)},$$
turns this multiplicative factor into an additive one.&lt;/p&gt;
&lt;p&gt;If what we are deciding between is two different probability distributions, you may recognize that this additive &lt;i&gt;weight of evidence&lt;/i&gt;
for $p$ over $q$ when we observe $x$ is precisely the integrand in our information gain:
$$ w[x; p,q] =  \log \frac{p(x)}{q(x)}. $$
The log ratio of two probability distributions measures by how much you should update your prior log odds between the two distributions being
correct.  The KL divergence is just then the expected weight of evidence if we draw samples from $p(x)$ itself:
$$ I[p;q] = \mathbb{E}_p\left[ \log \frac{p(x)}{q(x)} \right] = \mathbb{E}_p \left[ w[x; p,q] \right]$$&lt;/p&gt;
&lt;p&gt;So, one way to interpret the relative entropy is that if our data was actually coming from the distribution $p$ and we had some other
hypothesis $q$, then $I[p;q]$ measures how much, on average, each observation should shift our log odds towards $p$ over $q$.  In order to make that
statement more precise, we need a better language to talk about the magnitudes of these quantities.&lt;/p&gt;
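&lt;p&gt;Here is a small sketch of that reading for a three-outcome example.  The two distributions are hypothetical numbers picked for illustration, and the function name &lt;code&gt;weight_of_evidence&lt;/code&gt; is my own.  It computes the weight of evidence for each possible observation and then averages those weights under $p$.&lt;/p&gt;

```python
import math

p = [0.5, 0.3, 0.2]  # hypothetical distribution the data actually comes from
q = [0.2, 0.3, 0.5]  # hypothetical competing hypothesis


def weight_of_evidence(x, p, q):
    """Log odds shift (in nats) towards p over q upon observing outcome x."""
    return math.log(p[x] / q[x])


weights = [weight_of_evidence(x, p, q) for x in range(len(p))]

# Averaging the per-observation weights under p gives the information gain I[p; q].
expected_weight = sum(p[x] * weights[x] for x in range(len(p)))

print(weights)
print(expected_weight)
```

&lt;p&gt;Notice that a single observation can count against $p$ (the third outcome has a negative weight here), even though the expected weight is positive.&lt;/p&gt;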
&lt;h2&gt;How loud is the Evidence?&lt;/h2&gt;
&lt;p&gt;Our measurement of the amount of information was only unique up to a choice of multiplicative constant.  This is equivalent to
our choice of base for the logarithm.  We can think of this as the &lt;em&gt;units&lt;/em&gt; we use to measure our information.  The traditional choices
would be to use the base-2 logarithm and measure the information in &lt;em&gt;bits&lt;/em&gt;,&lt;sup&gt;&lt;a href="#bit"&gt;9&lt;/a&gt;&lt;/sup&gt;
or to use the more mathematically convenient natural
logarithm and measure the information in &lt;em&gt;nats&lt;/em&gt;.  Another option is to measure the information in
&lt;a href="https://en.wikipedia.org/wiki/Hartley_(unit)"&gt;&lt;em&gt;decibans&lt;/em&gt;&lt;/a&gt; or &lt;em&gt;decibels&lt;/em&gt; or &lt;em&gt;Hartley's&lt;/em&gt;, wherein
we use ten times the base-10 logarithm.&lt;/p&gt;
&lt;aside&gt; &lt;sup id="bit"&gt;9&lt;/sup&gt;
 &lt;i&gt;bit&lt;/i&gt; being short for &lt;i&gt;binary digit&lt;/i&gt;.
 &lt;i&gt;nat&lt;/i&gt; is then short for &lt;i&gt;natural digit&lt;/i&gt;.
 People sometimes suggest &lt;i&gt;dit&lt;/i&gt; for the base-10 &lt;i&gt;decimal digit&lt;/i&gt;.
 Turing suggested &lt;i&gt;ban&lt;/i&gt; as shorthand for the amount of evidence deduced about the setting
 of the Enigma machine using the Banburismus method, itself named after the town of Banbury where
 the team got their large card sheets used in the method.
 For more discussion about the history and etymology of these and related units see section 4.8.1 of
 &lt;a href="https://books.google.com/books/about/Probability_Theory.html?id=tTN4HuUNXjgC&amp;source=kp_book_description"&gt;&lt;i&gt;Probability Theory: The Logic of Science&lt;/i&gt; by E.T. Jaynes&lt;/a&gt;.
&lt;/aside&gt;
$$ I[p;q] = 10 \int \mathrm dx\, p(x) \log_{10} \frac{p(x)}{q(x)}\, \textrm{dB} $$
&lt;p&gt;The nice thing about measuring information in decibans or &lt;a href="https://en.wikipedia.org/wiki/Decibel"&gt;decibels&lt;/a&gt;
is that people already have some familiarity with the unit, such as for measuring the &lt;em&gt;loudness&lt;/em&gt; of sounds.
It's always a comparative measurement: for sound, $10 \log_{10} \frac{P}{P_0}$ compares the power $P$
to some reference or baseline power $P_0$.  In the same way, besides just measuring the KL between two distributions, we could
measure the comparative difference between any two probabilities on the log scale:
$$ 10 \log_{10} \frac{p(x)}{q(x)} \textrm{ dB}. $$&lt;/p&gt;
&lt;p&gt;In particular, we could get some feeling for these quantities by comparing the probability something happens to the
probability it doesn't.  Consider a simple binary outcome and take $q=1-p$.  In this case, the weight of evidence
for the thing happening versus not happening, upon observing it happen once, is:
$$ 10 \log_{10} \frac{p}{1-p} \text{ dB}. $$
This essentially gives us a new scale to measure probabilities on.
Instead of expressing probabilities as a number between 0 and 1,
here we are computing the log &lt;em&gt;odds&lt;/em&gt; of an event happening on the decibel scale.&lt;/p&gt;
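&lt;p&gt;Converting between the two scales is a one-liner in each direction.  The sketch below is a minimal illustration with function names of my own choosing; it prints the decibel value, the odds, and the probability as a percentage for 0 through 9 dB.&lt;/p&gt;

```python
import math


def probability_to_decibans(p):
    """Log odds of an event with probability p, in decibans (dB)."""
    return 10.0 * math.log10(p / (1.0 - p))


def decibans_to_probability(db):
    """Invert the map above: decibans to odds, then odds to probability."""
    odds = 10.0 ** (db / 10.0)
    return odds / (1.0 + odds)


for db in range(0, 10):
    odds = 10.0 ** (db / 10.0)
    percent = 100.0 * decibans_to_probability(db)
    print(db, round(odds, 2), round(percent, 1))
```

&lt;p&gt;Even odds sit at 0 dB, and every additional 10 dB multiplies the odds by a factor of ten.&lt;/p&gt;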
&lt;p&gt;Below in Table 1 is a summary of the correspondence between decibans and odds or probabilities, and
in Figure 3 is a large visual representation you can play with.&lt;/p&gt;
&lt;figure&gt;
&lt;center&gt;
&lt;table&gt;
  &lt;thead&gt;&lt;th&gt;dB&lt;th&gt;odds&lt;th&gt;~odds&lt;th&gt;probability&lt;th&gt;spinner
  &lt;tr&gt;&lt;td&gt;0&lt;td&gt;1.00&lt;td&gt;1:1&lt;td&gt; 50%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-0" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;1&lt;td&gt;1.26&lt;td&gt;5:4&lt;td&gt;56%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-1" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;2&lt;td&gt;1.58&lt;td&gt;π:2&lt;td&gt;61%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-2" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;3&lt;td&gt;2.00&lt;td&gt;2:1&lt;td&gt;67%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-3" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;4&lt;td&gt;2.51&lt;td&gt;5:2&lt;td&gt;71.5%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-4" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;5&lt;td&gt;3.16&lt;td&gt;π:1&lt;td&gt;76%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-5" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;6&lt;td&gt;3.98&lt;td&gt;4:1&lt;td&gt;80%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-6" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;7&lt;td&gt;5.01&lt;td&gt;5:1&lt;td&gt;83%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-7" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;8&lt;td&gt;6.31&lt;td&gt;2π:1&lt;td&gt;86%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-8" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;9&lt;td&gt;7.94&lt;td&gt;8:1&lt;td&gt;89%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-9" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;10&lt;td&gt;10&lt;td&gt;10:1&lt;td&gt;91%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-10" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;11&lt;td&gt;12.6&lt;td&gt;4π:1&lt;td&gt;92.6%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-11" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;12&lt;td&gt;15.8&lt;td&gt;16:1&lt;td&gt;94%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-12" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
  &lt;tr&gt;&lt;td&gt;13&lt;td&gt;20&lt;td&gt;20:1&lt;td&gt;95%
    &lt;td&gt;&lt;svg height="30" width="30" viewBox="0 0 20 20"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-13" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /&gt;&lt;/svg&gt;
&lt;/table&gt;
&lt;/center&gt;
  &lt;figcaption&gt;
    Table 1: A table of the correspondence between decibans/decibels and odds or probabilities.
  &lt;/figcaption&gt;
&lt;/figure&gt;
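&lt;p&gt;The rows of Table 1 follow from inverting the log-odds formula: odds are $10^{\text{dB}/10}$ and the probability is odds divided by one plus odds.  Below is a minimal Python sketch of that conversion; the function names are my own choices for illustration and are not part of the page's script.&lt;/p&gt;

```python
import math

def db_to_odds(db):
    """Convert decibels of evidence to odds, using odds = 10 ** (dB / 10)."""
    return 10.0 ** (db / 10.0)

def db_to_probability(db):
    """Convert decibels of evidence to a probability p = odds / (1 + odds)."""
    odds = db_to_odds(db)
    return odds / (1.0 + odds)

def probability_to_db(p):
    """Inverse map: 10 * log10(p / (1 - p))."""
    return 10.0 * math.log10(p / (1.0 - p))

# Print one line per row of the table, from 0 dB to 13 dB.
for db in range(0, 14):
    odds = db_to_odds(db)
    prob = 100 * db_to_probability(db)
    print(f"{db:2d} dB  odds {odds:6.2f}  probability {prob:5.1f}%")
```

&lt;p&gt;Negative decibel values work the same way and give probabilities below 50%.&lt;/p&gt;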
&lt;figure id="bigspin"&gt;
  &lt;center&gt;
   &lt;svg height="300" width="300" viewBox="-2 -2 25 25"&gt; &lt;circle r="10" cx="10" cy="10" fill="white" stroke="black" stroke-width=0.2 /&gt; &lt;circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-100" stroke="#1f77b4" stroke-width="9.9" stroke-dasharray="3.141 3.141" /&gt;&lt;/svg&gt;
   &lt;br /&gt;
   &lt;input value=0 type='number' style="width: 4em" id="percent" onchange="updatePercent();"&gt;
   &lt;label for="percent"&gt;dB&lt;/label&gt;
   &lt;br/&gt;
   &lt;input id="slider" style="width: 65%;" type="range" min="-23" step="0.1" max="23" value="0" class="slider"
   oninput="updateSlider();" &gt;
  &lt;figcaption&gt;Figure 3: A larger visual representation of decibels as a probability that you can play with. Here the set value
  of decibels measures the weight of evidence for the spinner giving a blue rather than a white outcome.&lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Another nice property of measuring evidence and probabilities in
decibels is that 1 dB seems to roughly correspond to the smallest change in an underlying distribution
that people notice, being the difference between &lt;i&gt;even chance&lt;/i&gt;
and 5 to 4 odds, &lt;i&gt;moderate probability&lt;/i&gt; or &lt;i&gt;better than even chance&lt;/i&gt;.&lt;/p&gt;
&lt;aside id="quantifying"&gt;&lt;sup&gt;10&lt;/sup&gt;
  &lt;a href="https://projecteuclid.org/euclid.ss/1177012242"&gt;&lt;i&gt;Quantifying Probabilistic Expressions&lt;/i&gt; by
  Frederick Mosteller and Cleo Youtz&lt;/a&gt;.
&lt;/aside&gt;
Additionally, $10 \textrm{ dB}$ corresponds to 10 to 1 odds, or 91% probability, which people associate
with events being &lt;i&gt;almost certain&lt;/i&gt; or happening &lt;i&gt;almost always&lt;/i&gt;.&lt;sup&gt;&lt;a href="#quantifying"&gt;10&lt;/a&gt;&lt;/sup&gt;
&lt;p&gt;The traditional statistical threshold for reported results is a &lt;a href="https://en.wikipedia.org/wiki/P-value"&gt;p-value&lt;/a&gt;
of 0.05, which is often &lt;a href="https://en.wikipedia.org/wiki/Misuse_of_p-values"&gt;misinterpreted&lt;/a&gt;
to mean that the probability that the null hypothesis is true is less than
5%.  While this isn't what the p-value measures, if we obtain more than 13 dB of evidence against some
null hypothesis, this does mean that the relative odds that it is correct have decreased by a factor of 20,
taking us below 20 to 1 against if we started with even odds.&lt;/p&gt;
&lt;p&gt;We have the conversions:
$$ 1 \textrm{ nat} = \frac{10}{\log 10} \textrm{ dB} = 4.34 \textrm{ dB} $$
$$ 1 \textrm{ bit} = \frac{10}{\log_2 10} \textrm{ dB} = 3.01 \textrm{ dB} $$&lt;/p&gt;
&lt;h2&gt;Examples and Magnitudes&lt;/h2&gt;
&lt;h3&gt;Double-headed Coin&lt;/h3&gt;
&lt;p&gt;Let's say I have two coins in my pocket, the first is an ordinary unbiased coin, and the second is double-headed.
I give you one of them and you start flipping the coin.  You get a heads, then another heads, then another.  How many
heads would you need to see in a row until you're sure you've been given the double-headed coin?  Let's
work out the relative entropy between these two distributions.  On the one hand we have $p(H)=1, p(\overline H) =0$,
and the other $q(H) = q(\overline H)= 0.5$.&lt;/p&gt;
&lt;p&gt;$$ I[p;q] = 10 \sum_i p_i \log_{10} \frac{p_i}{q_i} = 10 \log_{10} 2 = 3.01 \text{ dB} $$&lt;/p&gt;
&lt;p&gt;The relative entropy of a sure thing and a coin flip is 3 decibels.  This means that if we want to be more sure than 20 to 1
that we have the double-headed coin we'd need to observe 5 heads in a row, giving us 15 dB of evidence.&lt;/p&gt;
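&lt;p&gt;The count of heads can be checked with a few lines of Python.  This is only a sketch of the arithmetic, assuming we start from even odds between the two coins; the variable names are mine.&lt;/p&gt;

```python
import math

# Evidence per observed head for "double-headed" over "fair":
# 10 * log10(P(H | double) / P(H | fair)) = 10 * log10(1 / 0.5).
db_per_head = 10.0 * math.log10(1.0 / 0.5)

# 20-to-1 odds expressed in decibels (about 13 dB).
threshold_db = 10.0 * math.log10(20.0)

# Smallest whole number of consecutive heads whose evidence reaches the threshold.
heads_needed = math.ceil(threshold_db / db_per_head)

print(db_per_head, threshold_db, heads_needed, heads_needed * db_per_head)
```

&lt;p&gt;Four heads fall just short of the threshold, at about 12 dB or 16 to 1, which is why the fifth is needed.&lt;/p&gt;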
&lt;h3&gt;Births&lt;/h3&gt;
&lt;p&gt;Perhaps the first hypothesis test to be resolved with modern statistics was the question of whether more male or female
babies are born.  Using data from 1745 to 1770, Laplace found that in those 26 years, 251,527 boys and 241,945 girls were born.
This gives a fraction of male births of $\sim 51\%$.
Is this just a statistical fluke, or are boys more common than girls at birth?  What Laplace did was to analytically
work out the Bayesian posterior distribution for the probability that a male baby was born using a uniform prior, obtaining
a $\operatorname{Beta}(251528, 241946)$ distribution, for which the probability that the probability a male is born
is less than or equal to $1/2$ is
$$ \int_0^{1/2} \mathrm dx \, \operatorname{Beta}(x; 251528, 241946) \sim 10^{-42}$$
enough for Laplace to declare that he was &lt;em&gt;morally certain&lt;/em&gt; that males
are born more frequently than females.&lt;/p&gt;
&lt;p&gt;Let's work out the weight of evidence in this case.  Say we are comparing two hypotheses: the first
that males are born 51% of the time, and the second that they are born 50% of the time.  With Laplace's data, the
total weight of evidence in this case is:&lt;/p&gt;
&lt;p&gt;$$ 2515270 \log_{10} \frac{0.51}{0.50} + 2419450 \log_{10} \frac{0.49}{0.50} = 404 \text{ dB} $$
a whopping 400 decibels of evidence for males being born 51% of the time rather than 50%.
At the same time, I'm not sure most people are aware that males are born with a higher proportion and it doesn't
seem to affect most people's lives.  Why is that?  Well, let's evaluate the relative entropy between
a 51% Bernoulli and a 50% Bernoulli:
$$ I = 5.1 \log_{10}\frac{0.51}{0.50} + 4.9 \log_{10} \frac{0.49}{0.50} = 8.7 \times 10^{-4} \text{ dB}. $$
Notice that the relative entropy is quite small.  On average, if the true distribution was 51%, the evidence
we accumulate on each observed birth is less than a &lt;em&gt;millidecibel&lt;/em&gt; (about 87 microbels).  This means that, on average, in order to be reasonably
sure that the 51% hypothesis is true, we'd have to observe $\sim \frac{13}{8.7 \times 10^{-4}} \sim 15,000$ births.
This makes clear how with enough data we could both be very sure that males are born with a higher frequency
than females, but at the same time, this could have very little impact on our individual lives.&lt;/p&gt;
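&lt;p&gt;The three numbers in this example (total evidence, evidence per birth, and births needed) can be recomputed with a short Python sketch.  It assumes the same two point hypotheses, 51% versus 50%, and the counts quoted above; the helper name &lt;code&gt;db&lt;/code&gt; is my own.&lt;/p&gt;

```python
import math

boys, girls = 251_527, 241_945  # Laplace's counts quoted above

def db(ratio):
    """Weight of evidence, in decibels, carried by a likelihood ratio."""
    return 10.0 * math.log10(ratio)

# Total evidence for "51% boys" over "50% boys" from the whole data set.
total_db = boys * db(0.51 / 0.50) + girls * db(0.49 / 0.50)

# Expected evidence per birth if the 51% hypothesis is true (relative entropy).
per_birth_db = 0.51 * db(0.51 / 0.50) + 0.49 * db(0.49 / 0.50)

# Births needed, on average, to accumulate 13 dB (roughly 20-to-1 odds).
births_for_13_db = 13.0 / per_birth_db

print(round(total_db), per_birth_db, round(births_for_13_db))
```

&lt;p&gt;Each boy contributes a small positive amount and each girl a slightly larger negative amount, so the per-birth average is a small difference between two nearly equal terms.&lt;/p&gt;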
&lt;h3&gt;Likelihoods and Learning&lt;/h3&gt;
&lt;p&gt;What we would really like to do is learn a model of some real life distribution.  If the true distribution of data is $p(x)$,
and we have some kind of parametric model $q(x;\theta)$, we would like to set our model parameters $\theta$ so that
we get as close as possible to the true distribution.  In other words, we want to minimize the relative entropy from
the &lt;em&gt;real world&lt;/em&gt; to our &lt;em&gt;model&lt;/em&gt;:
$$\min_\theta I[p;q] = \min_\theta \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x;\theta)}. $$
The biggest complication is that we don't actually know what the true distribution of the data is. We can, however, sample data.  Luckily for us, as far as using this as an objective for $\theta$ goes, we can treat the entropy of $p(x)$ as
a constant.  This motivates the traditional maximum likelihood objective:
$$ \max_\theta \int \mathrm dx \, p(x) \log q(x;\theta). $$&lt;/p&gt;
&lt;aside id="gpt3"&gt;&lt;sup&gt;11&lt;/sup&gt;
  For instance, the latest &lt;a href="https://arxiv.org/abs/2005.14165"&gt;GPT-3&lt;/a&gt; model trained by OpenAI,
  was trained on less than half of the training set. (See Table 2.2 in the paper.)
&lt;/aside&gt;
If we had an infinite dataset, maximum likelihood would be the same as minimizing the relative entropy between the real world and
our model.  Unfortunately, we don't often have infinite datasets.&lt;sup&gt;&lt;a href="#gpt3"&gt;11&lt;/a&gt;&lt;/sup&gt;
On finite datasets, maximum likelihood can still be interpreted as minimizing a KL divergence, but now
the KL divergence between the &lt;em&gt;empirical distribution&lt;/em&gt; $\hat p(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - x_i)$
and our model $q(x;\theta)$.
&lt;p&gt;Unfortunately, the cross entropy is no longer reparameterization invariant, a
point I elaborate on in an appendix below, and so is difficult to interpret
directly, but if we take the difference of any two cross entropies, we can
still interpret that as the weight of evidence for one model with respect to
the other.  Because of the lack of reparameterization independence, care must
be taken to ensure that the likelihoods of the two models are evaluated using
the same measure, but provided they are:&lt;/p&gt;
&lt;p&gt;$$ L_1 - L_2 = \mathbb{E}\left[ \log q_1(x) \right] - \mathbb{E}\left[ \log q_2(x) \right] = \mathbb{E}\left[ \log \frac{q_1(x)}{q_2(x)} \right] $$&lt;/p&gt;
&lt;aside id="mnist"&gt;&lt;sup&gt;12&lt;/sup&gt;
  The entirety of which can fit in a &lt;a href="https://twitter.com/alemi/status/1042658244609499137"&gt;tweet&lt;/a&gt;.
&lt;/aside&gt;
Given the size of test sets we have for modern image datasets, this means that very small changes in likelihood can be
interpreted as high confidence in the superiority of one model.  Take for instance something as simple as binary static MNIST.&lt;sup&gt;&lt;a href="#mnist"&gt;12&lt;/a&gt;&lt;/sup&gt;  Here, with 10,000 test set images, an average per-image difference in log likelihood of 0.0013 dB or 0.0003 nats corresponds to 13 dB of evidence for one model over the other.
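&lt;p&gt;To make the arithmetic explicit, here is a small Python sketch that spreads 13 dB of total evidence evenly over the test set and converts the per-image gap to nats using the conversion given earlier.  The variable names are illustrative only.&lt;/p&gt;

```python
import math

DB_PER_NAT = 10.0 / math.log(10.0)  # 1 nat is about 4.34 dB

n_test = 10_000      # test-set size quoted above
target_db = 13.0     # roughly 20-to-1 odds

# Average per-image log-likelihood gap that adds up to the target
# when summed over the whole test set.
gap_db = target_db / n_test
gap_nats = gap_db / DB_PER_NAT

print(gap_db, gap_nats)
```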
&lt;h2&gt;Appendix A: Whither Continuous Entropy&lt;/h2&gt;
&lt;p&gt;&lt;span id="appendix-a"&gt;The&lt;/span&gt; relative entropy really is the proper way to define entropy.  For all
of the things that Shannon got right, he flubbed a bit when he defined the
entropy of a distribution as:
$$ H(P) = -\sum_i p_i \log p_i $$&lt;/p&gt;
&lt;p&gt;Why do I say he flubbed?  Because this notion of entropy doesn't generalize
to continuous distributions.  The continuous analog:
$$ H(P) = -\int \mathrm dx\, p(x) \log p(x) $$
isn't &lt;em&gt;reparameterization independent&lt;/em&gt;.  Consider for instance the distribution
of adult human heights: &lt;sup&gt;&lt;a href="#bimodal"&gt;13&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure&gt;
  &lt;center&gt;
  &lt;img src="figures/adult_heights.svg"
    alt="Distribution of adult heights."&gt;
  &lt;figcaption&gt;Figure 1. Distribution of adult heights. &lt;sup&gt;&lt;a href="#ourworld"&gt;14&lt;/a&gt;&lt;/sup&gt;&lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;aside&gt; &lt;sup id="bimodal"&gt;13&lt;/sup&gt;
  Note that you may have heard that
  &lt;a href="https://www.johndcook.com/blog/2008/07/20/why-heights-are-normally-distributed/"&gt;heights are normally distributed&lt;/a&gt;.
  Adult male (or female) heights are normally distributed, but differ in their means and variances, making the
  &lt;a href="https://www.johndcook.com/blog/2008/11/25/distribution-of-adult-heights/"&gt;distribution of adult heights a mixture distribution&lt;/a&gt;.
&lt;/aside&gt;
&lt;aside&gt; &lt;sup id="ourworld"&gt;14&lt;/sup&gt;
  Data taken from
  &lt;a href="https://ourworldindata.org/human-height"&gt;ourworldindata.org&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;If you measure the continuous entropy of this distribution
in centimeters you get 5.4 bits.  If you instead measure the entropy
of the same distribution in feet you get 0.43 bits.  If you instead
were to measure heights in meters it would be -1.3 bits! &lt;sup&gt;&lt;a href="#negative"&gt;15&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="negative"&gt;15&lt;/sup&gt;
  It seems strange to have a negative entropy, but in this case, it is basically
  reflecting the fact that in terms of meters, the human height distribution doesn't
  span a whole meter in breadth, so it actually takes fewer &lt;i&gt;relative&lt;/i&gt; bits
  to specify a human height in meters than it would take to specify any
  quantity in meters, because its uncertainty is less than a whole meter.
&lt;/aside&gt;
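&lt;p&gt;The unit dependence is easy to see in a case with a closed form.  The sketch below uses a single normal distribution as a stand-in, not the actual mixture of heights, with a hypothetical standard deviation of 10 cm; its differential entropy is $\frac{1}{2} \log_2 (2 \pi e \sigma^2)$, so changing units shifts the answer by the log of the conversion factor.&lt;/p&gt;

```python
import math

def gaussian_entropy_bits(sigma):
    """Differential entropy of a normal distribution, in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)

sigma_cm = 10.0  # hypothetical spread of heights, in centimeters

for unit, per_cm in [("cm", 1.0), ("feet", 1.0 / 30.48), ("m", 1.0 / 100.0)]:
    # Rescaling x by a factor c shifts the entropy by log2(c).
    print(unit, round(gaussian_entropy_bits(sigma_cm * per_cm), 2))
```

&lt;p&gt;The gap between the centimeter and meter values is exactly $\log_2 100$ bits, whatever the shape of the distribution.&lt;/p&gt;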
&lt;h2&gt;Appendix B: Coding Interpretation&lt;/h2&gt;
&lt;p&gt;The traditional interpretation offered for the KL is from the coding
perspective.
Imagine we have a simple 4-letter
alphabet that we want to communicate over the wire.
If the four letters occurred with different probabilities:
$p(A)=1/2, p(B)=1/4, p(C)=p(D)=1/8$, with an optimally designed &lt;a
href="https://en.wikipedia.org/wiki/Huffman_coding"&gt;Huffman Code&lt;/a&gt; we could
encode our letters with a variable length code: $A:0, B:10, C:110, D:111$, and
on average we'd only be spending $1/2 + 2/4 + 3/8 + 3/8 = 7/4$ bits per letter.&lt;/p&gt;
&lt;figure&gt;
  &lt;center&gt;
  &lt;table&gt;
    &lt;thead&gt;&lt;th&gt;&lt;th&gt;A&lt;th&gt;B&lt;th&gt;C&lt;th&gt;D
    &lt;tr&gt;&lt;td&gt;$p$&lt;td&gt;1/2&lt;td&gt;1/4&lt;td&gt;1/8&lt;td&gt;1/8
    &lt;tr&gt;&lt;td&gt;p-code&lt;td&gt;0&lt;td&gt;10&lt;td&gt;110&lt;td&gt;111
    &lt;tr&gt;&lt;td&gt;$q$&lt;td&gt;1/4&lt;td&gt;1/4&lt;td&gt;1/4&lt;td&gt;1/4
    &lt;tr&gt;&lt;td&gt;q-code&lt;td&gt;00&lt;td&gt;01&lt;td&gt;10&lt;td&gt;11
  &lt;/table&gt;
  &lt;figcaption&gt;
    Table 2: A simple example of two different distributions over a 4 letter alphabet.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Imagine however we didn't know what the true distribution of letters was and instead
designed an optimal code using a different distribution $q$.  If we believed
each of the 4 letters were equally likely $(q(A)=q(B)=q(C)=q(D)=1/4)$, the optimal way to
encode messages would just assign a two bit code to each letter $(A : 00, B:01, C:10,
D:11)$.  If we used this suboptimal code to send messages that were actually distributed
as $p$ it would cost $2/2 + 2/4 + 2/8 + 2/8 = 2$ bits per letter.  Our incorrect
belief leads to a $2 - 7/4 = 1/4$ of a bit inefficiency.  For these two distributions,
it shouldn't come as a surprise that the information gain is precisely 1/4 bits:
$$ I[p;q] = \sum_i p_i \log_2 \frac{p_i}{q_i} = 1/4 \textrm{ bits}. $$&lt;/p&gt;
&lt;p&gt;For an optimally designed code, the code lengths go as $-\log p(x)$ for any symbol $x$.
Our information gain can be interpreted as a difference in expected code lengths under $p$:
$$ I[p;q] = \mathbb{E}_p[ -\log q ] - \mathbb{E}_p[-\log p ]. $$
The information gain $I[p;q]$ measures the &lt;em&gt;excess encoding cost&lt;/em&gt; for trying to encode messages
from $p$ using a code designed for $q$.&lt;/p&gt;
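&lt;p&gt;The numbers in Table 2 can be verified directly.  The following Python sketch computes the expected code length under $p$ for both codes and compares the difference with the relative entropy in bits; the dictionary layout and function name are my own choices.&lt;/p&gt;

```python
import math
from fractions import Fraction

p = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 8), "D": Fraction(1, 8)}
q = {letter: Fraction(1, 4) for letter in p}

p_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
q_code = {"A": "00", "B": "01", "C": "10", "D": "11"}

def expected_length(dist, code):
    """Average number of bits per letter when letters follow `dist`."""
    return sum(dist[s] * len(code[s]) for s in dist)

cost_p = expected_length(p, p_code)   # messages from p, code built for p
cost_q = expected_length(p, q_code)   # messages from p, code built for q

# Relative entropy in bits, computed directly from the definition.
kl_bits = sum(float(p[s]) * math.log2(float(p[s] / q[s])) for s in p)

print(cost_p, cost_q, cost_q - cost_p, kl_bits)
```

&lt;p&gt;The match is exact here because every probability is a power of two, so the optimal code lengths $-\log_2 p(x)$ are whole numbers of bits.&lt;/p&gt;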
&lt;script type='text/javascript'&gt;
const SEGMENTS = 5;
const RADIUS = 5;
const CIRCUMFERENCE = 2 * Math.PI * RADIUS;

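// Set the spinner 'progress-' + i to show the probability implied by db decibels.
function fraction(i, db) {
  const progress = document.getElementById('progress-' + i);
  // Decibels to odds, then odds to probability.
  let odds = Math.pow(10.0, db / 10.0);
  let p = odds / (1+odds);
  // The dash pattern repeats SEGMENTS times around the circle; each repeat is
  // split into a drawn part (share p) and a gap (share 1 - p).
  let fill = CIRCUMFERENCE / SEGMENTS * p;
  let space = CIRCUMFERENCE / SEGMENTS * (1-p);
  let val = fill + " " + space;
  progress.style.strokeDasharray = val;
}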

for (let i = 0; i &lt;= 13; i++) {
  fraction(i,i);
}

function updateSlider() {
  let value = document.getElementById("slider").value;
  fraction(100, value);
  document.getElementById("percent").value = value;
}

function updatePercent() {
  let value = document.getElementById("percent").value;
  document.getElementById("slider").value = value;
  fraction(100, value);
}
&lt;/script&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/kl.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Fri, 07 Aug 2020 00:00:00 -0400</pubDate></item><item><title>PACᵐ-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regime</title><link>https://arxiv.org/abs/2010.09629</link><description>Multisample bound that does better than Bayes at prediction for misspecified models. / WR Morningstar, AA Alemi, JV Dillon / 2010.09629 / AISTATS2022</description><guid isPermaLink="true">https://alexalemi.com/publications/pacm.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/pacm.pdf" length="2493761" type="application/pdf"/><pubDate>Thu, 01 Oct 2020 00:00:00 -0400</pubDate></item><item><title>VIB is Half Bayes</title><link>https://arxiv.org/abs/2011.08711</link><description>VIB can be rederived as a half-Bayesian half-Maximum likelihood method. / AA Alemi, WR Morningstar, B Poole, I Fischer, JV Dillon / 2011.08711 / AABI 2021 Oral</description><guid isPermaLink="true">https://alexalemi.com/publications/pacvib.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/pacvib.pdf" length="1347687" type="application/pdf"/><pubDate>Sun, 01 Nov 2020 00:00:00 -0400</pubDate></item><item><title>Basics of Information Theory</title><link>https://drive.google.com/file/d/1jPUhUS5T2rgyx5Dlt74g4_MwX8vrBi2I/view?usp=sharing</link><description>An overview of information theory for machine learning. 
/ Google Course</description><guid isPermaLink="true">https://alexalemi.com/talks/basics-of-infotheory.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Fri, 01 Jan 2021 00:00:00 -0500</pubDate></item><item><title>VIB is Half Bayes</title><link>https://youtu.be/JGDAZ4joUX8</link><description>The Variational Information Bottleneck can be viewed as a sort of half-Bayesian approach. / Advances in Approximate Bayesian Inference Symposium 2021</description><guid isPermaLink="true">https://alexalemi.com/talks/vib-is-half-bayes.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Mon, 01 Feb 2021 00:00:00 -0500</pubDate></item><item><title>Does Knowledge Distillation Really Work?</title><link>https://arxiv.org/abs/2106.05945</link><description>Knowledge distillation doesn't seem to work as well as people assume it does. / S Stanton, P Izmailov, P Kirichenko, AA Alemi, AG Wilson / 2106.05945 / NeurIPS2021</description><guid isPermaLink="true">https://alexalemi.com/publications/distillation.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/distillation.pdf" length="2795100" type="application/pdf"/><pubDate>Tue, 01 Jun 2021 00:00:00 -0400</pubDate></item><item><title>A Closer Look at the Adversarial Robustness of Information Bottleneck Models</title><link>https://arxiv.org/abs/2107.05712</link><description>Looking more carefully, IB models aren't fully robust to adversarial examples. 
/ I Korshunova, D Stutz, AA Alemi, O Wiles, S Gowal / 2107.05712 / ICML 2021 AML Workshop Poster</description><guid isPermaLink="true">https://alexalemi.com/publications/robustness.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/robustness.pdf" length="2760069" type="application/pdf"/><pubDate>Tue, 01 Jun 2021 00:00:00 -0400</pubDate></item><item><title>Machine Learning and Thermodynamics</title><link>https://docs.google.com/presentation/d/1tIGTRRE0gKjBIySrQOUO-qlIDlj1nXhDybamG0YP3YI/present</link><description>Another version of the relationship between thermodynamics and machine learning. / Scientific Machine Learning Mini-Course (SciML) @ CMU</description><guid isPermaLink="true">https://alexalemi.com/talks/sciml-thermo.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Thu, 01 Jul 2021 00:00:00 -0400</pubDate></item><item><title>Order of Magnitude Physics</title><link>https://drive.google.com/file/d/1kU2UAAG3rLSOEW9gXtpINuj2AZyNGwHV/view?usp=sharing</link><description>Dimensional analysis and basic order of magnitude physics. / Google Course</description><guid isPermaLink="true">https://alexalemi.com/talks/order-of-magnitude-physics.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sun, 01 Aug 2021 00:00:00 -0400</pubDate></item><item><title>PACm Bayes - Your Model is Wrong Workshop</title><link>https://youtu.be/HHu7fclYlVg</link><description>Bayesian inference doesn't optimize for prediction in mispecified models. 
/ Your Model is Wrong Workshop - NeurIPS 2021</description><guid isPermaLink="true">https://alexalemi.com/talks/pacm-talk.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Mon, 01 Nov 2021 00:00:00 -0400</pubDate></item><item><title>Bayesian Imitation Learning for End-to-End Mobile Manipulation</title><link>https://arxiv.org/abs/2202.07600</link><description>Using VIB to help robots open doors. / Y Du, D Ho, AA Alemi, E Jang, M Khansari / 2202.07600 / ICML 2022</description><guid isPermaLink="true">https://alexalemi.com/publications/endtoend.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/endtoend.pdf" length="5462864" type="application/pdf"/><pubDate>Tue, 01 Feb 2022 00:00:00 -0500</pubDate></item><item><title>Probabilistic Machine Learning: An Introduction</title><link>https://github.com/probml/pml2-book/releases/latest</link><description>Co-wrote the Information Theory Chapter for the book.</description><guid isPermaLink="true">https://probml.github.io/pml-book/book2.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Tue, 08 Feb 2022 00:00:00 -0500</pubDate></item><item><title>Simple Population Geiger Counter</title><link>https://observablehq.com/@alemi/simple-live-population-counter</link><description>More 'realistic' live population counter.</description><guid isPermaLink="true">https://observablehq.com/@alemi/simple-live-population-counter</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Wed, 22 Jun 2022 00:00:00 -0400</pubDate></item><item><title>Trajectory ensembling for fine tuning - performance gains without modifying training</title><link>https://alexalemi.com/publications/traj-ensemble.pdf</link><description>Ensembling within a trajectory gives some simple gains. 
/ L Anderson-Conway, V Birodkar, S Singh, H Mobahi, AA Alemi /  / HITY Workshop NeurIPS 2022</description><guid isPermaLink="true">https://alexalemi.com/publications/traj-ensemble.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/traj-ensemble.pdf" length="357056" type="application/pdf"/><pubDate>Thu, 01 Sep 2022 00:00:00 -0400</pubDate></item><item><title>Simple Diffusion Colab</title><link>https://github.com/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb</link><description>A simple self-contained Colab introducing latent diffusion.</description><guid isPermaLink="true">https://colab.sandbox.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Thu, 15 Sep 2022 00:00:00 -0400</pubDate></item><item><title>A Path to the Variational Diffusion Loss</title><description>Deriving the (Variational) Diffusion and VAE losses from the non-negativity of KL.</description><content:encoded>&lt;p&gt;Diffusion models have made quite a splash, especially after the open-source release of &lt;a href="https://huggingface.co/spaces/stabilityai/stable-diffusion"&gt;Stable Diffusion&lt;/a&gt;.  What are diffusion models, where does the loss come from and what does a simple example look like? I've recently helped open-source a simple, pedagogical, self-contained
&lt;a href="https://colab.research.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb"&gt;example colab&lt;/a&gt;
of a diffusion model trained on EMNIST, which you can find as part of the &lt;a href="https://arxiv.org/abs/2107.00630"&gt;Variational Diffusion Models (VDM)&lt;/a&gt; &lt;a href="https://github.com/google-research/vdm"&gt;github page&lt;/a&gt;. In this post, I wanted to give some more background and a simple way to motivate where the loss function comes from.&lt;/p&gt;
&lt;h2&gt;Non-negativity of KL&lt;/h2&gt;
&lt;aside&gt;&lt;sup id="#p-and-q"&gt;1&lt;/sup&gt;
 I tend to reverse the use of $p$ and $q$ with respect to the rest of the world.  Most people use $p$ for the generative model and $q$ for the approximate posterior.  They do this because, for most people, the generative model is the star of the show and the approximate posterior is playing second fiddle.  My reversal of the letters is deliberate.  To me, the &lt;i&gt;forward process&lt;/i&gt; $p(x,z)=p(x)p(z|x)$ composed of the &lt;i&gt;true image distribution&lt;/i&gt; $p(x)$ and the &lt;i&gt;encoder&lt;/i&gt; $p(z|x)$ is the star of the show.  $p$ is the joint distribution that exists in the real world, $q$ is our approximation to it. 
&lt;/aside&gt;
&lt;p&gt;Let's say we want to build a latent-variable model $q(x, z)$ that assigns a high marginal likelihood $\log q(x)$ to data drawn from the true distribution $p(x)$. Unfortunately, computing $\log q(x)$ involves an intractable integral over the latent variable, $z$.&lt;sup&gt;&lt;a href="#p-and-q"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#kl"&gt;2&lt;/sup&gt;
 &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;Kullback-Leibler Divergence&lt;/a&gt;, for more background on KL see my &lt;a href="kl.html"&gt;other post&lt;/a&gt;.
&lt;/aside&gt;
&lt;aside&gt;&lt;sup id="#brakets"&gt;3&lt;/sup&gt;
	I use &lt;a href="https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation"&gt;brakets&lt;/a&gt; to show expectations and unless noted, always with respect to the full $p$ distribution.
	$$ \left\langle \cdot \right\rangle_p = \mathbb{E}_p \left[ \cdot \right] = \int dx\, p(x) [\cdot] $$
&lt;p&gt;If I don't denote the distribution the expectation is with respect to on the brakets, it's always the full joint $p(x,\cdots)$. Notice that this works even if there
are fewer variables or conditioning variables left inside the terms in the brakets, as any excess variables will just marginalize out without issue in the expectation and any variables being conditioned on will be evaluated in expectation as desired.&lt;/p&gt;
&lt;/aside&gt;
&lt;p&gt;We can derive the tractable objective used to train these models using the observation that the KL&lt;sup&gt;&lt;a href="#kl"&gt;2&lt;/a&gt;&lt;/sup&gt; divergence is non-negative and monotonic. The &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;Kullback-Leibler (KL) divergence&lt;/a&gt; between any two distributions is non-negative:&lt;sup&gt;&lt;a href="#brakets"&gt;3&lt;/a&gt;&lt;/sup&gt;
$$ \left\langle \log \frac{p(x)}{q(x)} \right\rangle_p \geq 0. $$&lt;/p&gt;
&lt;p&gt;If we marginalize out some subset of random variables, the KL divergence between the marginal distributions can be no larger than the KL divergence between the joints. For any two random variables:
$$ \begin{align}
\left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle &amp;amp;= \left\langle \log \frac{p(x)p(z|x)}{q(x)q(z|x)} \right\rangle  \\
&amp;amp;= \left\langle \log \frac{p(x)}{q(x)} \right\rangle + \left\langle \log \frac{p(z|x)}{q(z|x)} \right\rangle \\
&amp;amp;\geq \left\langle \log \frac{p(x)}{q(x)}\right\rangle \geq 0
\end{align} $$
Intuitively, if we think about KL divergence as a "distance" between probability distributions, two joint distributions always have to be at least as far apart as their marginals.  As we just saw, the KL of the joint is the sum of the KL between the two marginals and the expected KL of the conditional distributions (which has to be non-negative, as all KLs are).&lt;/p&gt;
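&lt;p&gt;A quick numerical check of this decomposition, as a Python sketch on two small discrete joint distributions.  The probabilities are made up purely for illustration, and the helper names are mine.&lt;/p&gt;

```python
import math

# Hypothetical joint distributions over x in {0, 1} and z in {0, 1},
# stored as {(x, z): probability}. The numbers are made up for illustration.
p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
q = {(0, 0): 0.40, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.10}

def marginal_x(joint):
    """Sum out z to get the marginal distribution over x."""
    out = {}
    for (x, _z), prob in joint.items():
        out[x] = out.get(x, 0.0) + prob
    return out

def kl(a, b):
    """KL divergence sum_i a_i log(a_i / b_i), in nats."""
    return sum(a[k] * math.log(a[k] / b[k]) for k in a)

px, qx = marginal_x(p), marginal_x(q)

joint_kl = kl(p, q)
marginal_kl = kl(px, qx)

# Expected KL between the conditionals p(z|x) and q(z|x), averaged over p(x, z).
conditional_kl = sum(
    prob * math.log((prob / px[x]) / (q[(x, z)] / qx[x]))
    for (x, z), prob in p.items()
)

print(joint_kl, marginal_kl, conditional_kl)
```

&lt;p&gt;The joint KL equals the sum of the other two terms up to floating-point error, and since the conditional term is non-negative the joint KL is an upper bound on the marginal KL.&lt;/p&gt;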
&lt;h2&gt;VAEs&lt;/h2&gt;
&lt;p&gt;Imagine designing these joint distributions to have different flavors.  Think of $p(x,z)$ as a &lt;em&gt;forward&lt;/em&gt; process $p(x) p(z|x)$ that takes an image from some natural image distribution $p(x)$ and then encodes it into some representation $z$ with an encoder $p(z|x)$.  This is a joint distribution over the two variables. Running the forward process would give us $(x,z)$ pairs, pairs of natural images and their encodings.
Next, imagine a different joint distribution, a &lt;em&gt;reverse process&lt;/em&gt; $q(x,z)$ that takes some sample from a &lt;em&gt;prior&lt;/em&gt; $q(z)$ and then runs it through a &lt;em&gt;decoder&lt;/em&gt; $q(x|z)$ to generate a synthetic image.  This is a generative model of the kind we might be used to building.  This is also a fully-fledged joint distribution that we could sample from, in order to generate $(x,z)$ pairs. At initialization, these two distributions are very different. The goal of generative modeling is to bring these two joint distributions into alignment.&lt;/p&gt;
&lt;p&gt;Based on the properties of the KL divergence, these two joint distributions must have a non-negative KL divergence that is monotonic under marginalizing out one of the variables:
$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle =  \left\langle \log \frac{p(x) p(z|x)}{q(z) q(x|z)} \right\rangle \geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle \geq 0 $$
Notice what this is saying. The KL divergence between the joint distributions here is the expected log density ratio of the forward to the reverse model's likelihood, where the expectation -- the samples -- are taken with respect to the &lt;em&gt;forward&lt;/em&gt; process $p(x,z)$.  This joint KL is itself an upper bound for the KL divergence between the marginal distributions $p(x)$ and $q(x)$.  $p(x)$ was our original image distribution, while $q(x)$ is the distribution of synthetic images drawn from the generative model that is our reverse process:
$$ q(x) = \int dz\, q(x|z) q(z) $$&lt;/p&gt;
&lt;p&gt;So, by minimizing the KL between our forward and reverse process -- by aligning the two joint distributions -- we can ensure that we make progress towards learning a good generative model of our images $q(x)$.  We can ensure that we are aligning the marginals $q(x)$ and $p(x)$.&lt;/p&gt;
&lt;p&gt;The tightness of this bound is controlled by how close together the remaining conditional distributions are:&lt;/p&gt;
&lt;p&gt;$$ \left\langle  \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log \frac{p(x)}{q(x)} \right\rangle + \left\langle \log \frac{p(z|x)}{q(z|x)} \right\rangle $$
In other words: the degree to which our encoding distribution ($p(z|x)$) matches the Bayesian posterior of our generative model ($q(z|x)$) determines the tightness of our bound.&lt;/p&gt;
&lt;p&gt;So, again, all we started with is the idea of two different processes, the &lt;em&gt;forward&lt;/em&gt; process that takes images and encodes them and a &lt;em&gt;reverse&lt;/em&gt; process that samples some latents from a known distribution and decodes them.  If we try to minimize the KL divergence between these two processes, forward to reverse, we can ensure that this is a valid bound on the marginal KL between the true image distribution $p(x)$ and the marginal of our generative model $q(x)$.  That is, by learning to make the two joint processes look alike we are also as a consequence learning a good generative model of images.&lt;/p&gt;
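&lt;p&gt;As a quick sanity check, here is a minimal numerical sketch of the two statements above: the joint KL upper-bounds the marginal KL, and the gap is the expected KL between $p(z|x)$ and $q(z|x)$. The small discrete tables and the helper names (&lt;code&gt;random_joint&lt;/code&gt;, &lt;code&gt;kl&lt;/code&gt;) are made up for illustration; nothing here is specific to images.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def random_joint(shape, rng):
    """Return a strictly positive, normalized joint table over (x, z)."""
    table = rng.random(shape) + 0.1
    return table / table.sum()

def kl(a, b):
    """KL divergence sum a * log(a / b) for positive arrays of equal shape."""
    return float(np.sum(a * np.log(a / b)))

# Two unrelated joints over 4 "images" x (rows) and 3 "codes" z (columns).
p_xz = random_joint((4, 3), rng)  # stands in for the forward process p(x) p(z|x)
q_xz = random_joint((4, 3), rng)  # stands in for the reverse process q(z) q(x|z)

# Marginals over x, obtained by summing out z.
p_x = p_xz.sum(axis=1)
q_x = q_xz.sum(axis=1)

# Conditionals p(z|x) and q(z|x); the latter is the model's Bayesian posterior.
p_z_given_x = p_xz / p_x[:, None]
q_z_given_x = q_xz / q_x[:, None]

joint_kl = kl(p_xz, q_xz)
marginal_kl = kl(p_x, q_x)
# Expected conditional KL, averaged over the forward process p(x, z).
gap = float(np.sum(p_xz * np.log(p_z_given_x / q_z_given_x)))

print(joint_kl, marginal_kl, gap)
```

&lt;p&gt;The first printed number should equal the sum of the other two, which is just the decomposition of the joint KL written out for a finite table.&lt;/p&gt;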
&lt;aside&gt;&lt;sup id="#ELBO"&gt;4&lt;/sup&gt;
	&lt;a href="https://en.wikipedia.org/wiki/Evidence_lower_bound"&gt;Evidence Lower BOund&lt;/a&gt;
&lt;/aside&gt;
We've just derived the ordinary ELBO:&lt;sup&gt;&lt;a href="#ELBO"&gt;4&lt;/a&gt;&lt;/sup&gt;
$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log p(x) -\log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle, $$
up to a constant outside our control, the entropy of the true image distribution $p(x)$.  Notice that this term cancels out on both sides if we wish to target
the cross-entropy from our true $p(x)$ to our model's $q(x)$ rather than the KL.
&lt;p&gt;$$\begin{align}
\left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log p(x) - \log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle &amp;amp;\geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle \\
\left\langle -\log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle &amp;amp;\geq \left\langle -\log q(x) \right\rangle \\
\left\langle \log q(x) \right\rangle &amp;amp;\geq \left\langle \log q(x|z) - \log \frac{p(z|x)}{q(z)} \right\rangle
\end{align}$$&lt;/p&gt;
&lt;p&gt;At the end of the day, the hope and the dream we seem to have in doing latent variable modeling is that maybe we will somehow be more successful in learning a reverse $q(z)q(x|z)$ process to match some forward $p(x)p(z|x)$ than we would have been able to just model the density $q(x)$ directly.  We are hoping that by expanding the problem, and making it a harder or larger modeling task, it'll become easier for us to optimize or learn.&lt;/p&gt;
&lt;h2&gt;Diffusion&lt;/h2&gt;
&lt;p&gt;For diffusion models, honestly, there isn't much to add except they add many more steps.
The only difference is that instead of a two-step forward process, in diffusion we imagine a many-stepped (or potentially continuous) forward and reverse process.&lt;/p&gt;
&lt;p&gt;In particular, in most diffusion models we fix the forward process to be a Markov chain:
$$ p(x, z_0, z_1, z_2, \cdots, z_{T-1}, z_T) = p(x) p(z_0|x) p(z_1|z_0) \cdots p(z_T|z_{T-1}), $$
which starts with a sample from a natural image distribution $p(x)$ and then adds $T$ steps of additive Gaussian noise $p(z_t| z_{t-1})  \sim \mathcal N(\alpha_{t} z_{t-1}, \sigma_{t}^2) $.&lt;/p&gt;
&lt;figure id="#diffusion-forward"&gt;
 &lt;img src="figures/diffusion-forward.svg"
      alt="Graphical model showing the forward process for diffusion."&gt;
 &lt;figcaption&gt;
   Figure 1. The graphical model for the forward process in diffusion.
 &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside&gt;&lt;sup id="#variance-preserving"&gt;5&lt;/sup&gt;
In a lot of the diffusion work, the process is taken to be &lt;em&gt;variance preserving&lt;/em&gt; by setting:
$$ \alpha^2 = 1 - \sigma^2 $$
&lt;/aside&gt;
&lt;p&gt;This takes an ordinary image and then adds more and more noise to it until it looks more or less indistinguishable from just isotropic Gaussian noise.&lt;sup&gt;&lt;a href="#variance-preserving"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;figure id="#forward-diffusion"&gt;
  &lt;center&gt;
  &lt;img src="figures/forward-diffusion.png"
    alt="Illustration of standard forward diffusion process as additive Gaussian noise."&gt;
  &lt;figcaption&gt;
  Figure 2. A demonstration of the typical forward process in diffusion models.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;One particularly nice thing about using Gaussians for every step of the forward process here is that the composition of a bunch of conditional Gaussians is itself Gaussian so we will have a closed form for the marginal distribution at any intermediate time:
$$ p(z_t|x) = \mathcal N(\tilde \alpha_t x, \tilde \sigma_t^2 I ).$$&lt;/p&gt;
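&lt;p&gt;To make the closed form concrete, here is a small sketch that accumulates per-step coefficients into the marginal ones and compares them against actually running the chain. It assumes each step has the form $z_t = \alpha_t z_{t-1} + \sigma_t \epsilon_t$ starting from $x$, so that $\tilde \alpha_t = \prod_{s \leq t} \alpha_s$ and $\tilde \sigma_t^2 = \alpha_t^2 \tilde \sigma_{t-1}^2 + \sigma_t^2$. The noise schedule and the function name &lt;code&gt;compose_marginal&lt;/code&gt; are hypothetical choices for illustration.&lt;/p&gt;

```python
import numpy as np

def compose_marginal(alphas, sigmas):
    """Accumulate per-step (alpha_t, sigma_t) into marginal (alpha_tilde_t, sigma_tilde_t)."""
    alpha_tilde = np.cumprod(alphas)
    sigma_tilde_sq = np.zeros_like(alphas)
    running = 0.0
    for t in range(len(alphas)):
        # Old noise is scaled by alpha_t, then fresh noise of variance sigma_t^2 is added.
        running = alphas[t] ** 2 * running + sigmas[t] ** 2
        sigma_tilde_sq[t] = running
    return alpha_tilde, np.sqrt(sigma_tilde_sq)

# Hypothetical schedule: 100 variance-preserving steps with growing noise.
sigmas = np.sqrt(np.linspace(1e-4, 0.05, 100))
alphas = np.sqrt(1.0 - sigmas ** 2)
alpha_tilde, sigma_tilde = compose_marginal(alphas, sigmas)

# Compare against running the chain step by step on a scalar "image" x.
rng = np.random.default_rng(0)
x = 2.0
z = np.full(200_000, x)
for a, s in zip(alphas, sigmas):
    z = a * z + s * rng.standard_normal(z.shape)

print(alpha_tilde[-1] * x, z.mean())  # predicted vs. empirical mean
print(sigma_tilde[-1], z.std())       # predicted vs. empirical standard deviation
```

&lt;p&gt;In practice only &lt;code&gt;compose_marginal&lt;/code&gt; is needed: a single draw from the marginal replaces the whole loop.&lt;/p&gt;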
&lt;p&gt;With a forward process defined, we parameterize or learn the reverse process, a Markov chain that operates in the opposite direction:
$$ q(x,z_0,z_1,\cdots,z_T) = q(z_T) q(z_{T-1}|z_T) \cdots q(z_1|z_2)q(z_0|z_1)q(x|z_0) $$&lt;/p&gt;
&lt;figure id="#diffusion-reverse"&gt;
 &lt;img src="figures/diffusion-backward.svg"
      alt="Graphical model showing the reverse process for diffusion."&gt;
 &lt;figcaption&gt;
   Figure 3. The graphical model for the reverse process in diffusion.
 &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;aside&gt;&lt;sup id="#extra-entropy"&gt;6&lt;/sup&gt;
	Aside, again, from the constant entropy of the data outside our control which we can ignore for purposes of optimization.	
&lt;/aside&gt;
&lt;p&gt;The VDM loss is&lt;sup&gt;&lt;a href="#extra-entropy"&gt;6&lt;/a&gt;&lt;/sup&gt; simply the KL between these two joints, which serves as an upper bound on the KL of the image marginals:
$$ \left\langle \log \frac{p(x,z_0,z_1,\cdots,z_T)}{q(x,z_0,z_1,\cdots,z_T)} \right\rangle \geq \left\langle \log \frac{p(x)}{q(x)}\right\rangle $$&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#deep-unsupervised"&gt;7&lt;/sup&gt;
See &lt;a href="https://arxiv.org/abs/1503.03585"&gt;Deep Unsupervised Learning Using Nonequilibrium Thermodynamics&lt;/a&gt; by Sohl-Dickstein et al.
&lt;/aside&gt;
&lt;p&gt;Just as in the case of a VAE, here, the hope is that it might actually be easier to model the larger joint distribution than it was to try to model the density directly.  In the case of simple diffusion models, the forward process is fixed additive Gaussian noise. If we make enough steps in the forward process we believe we ought to be able to learn the reverse process exactly.&lt;sup&gt;&lt;a href="#deep-unsupervised"&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3&gt;Various Sundry Tricks&lt;/h3&gt;
&lt;p&gt;The joint KL is equivalent to the VDM loss. However, in practice, to make this loss efficient to train, diffusion models leverage a lot of the known structure
of the forward process to power a very clever parameterization of the reverse process. This requires some tricky rearranging of terms and some stochastic approximation to make the whole thing efficient.&lt;br /&gt;
To see the code, please check out the &lt;a href="https://colab.research.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb"&gt;example colab&lt;/a&gt; and its accompanying text, which walks through some of these steps in more detail.&lt;/p&gt;
&lt;p&gt;To utilize our knowledge of the forward process, we're actually going to rewrite the forward process not as a sequence of conditional Gaussian steps (a &lt;em&gt;bottom-up&lt;/em&gt; forward process):
$$ p(x,z_0,z_1,z_2,\cdots,z_T) = p(x) p(z_0|x) p(z_1|z_0) p(z_2|z_1) \cdots p(z_T|z_{T-1}) $$
but instead we'll rearrange this to be a product of a bunch of conditional reverse steps (as a &lt;em&gt;top-down&lt;/em&gt; forward process):
$$ 
\begin{align}
p(x, z_0, z_1, z_2,\cdots, z_T) &amp;amp;= p(z_0,z_1,z_2,\cdots, z_T|x) p(x) \\
&amp;amp;= p(z_0|z_1,\cdots,z_T,x)p(z_1|z_2,\cdots,z_T,x)\cdots p(z_T|x)p(x) \\
&amp;amp;= p(z_0|z_1,x)p(z_1|z_2,x)\cdots p(z_{T-1}|z_{T},x)p(z_T|x)p(x)
\end{align}$$
For the Gaussian diffusion, we can analytically figure out what these conditional reverse steps should be for the forward process $p(z_{t-1}|z_t,x)$. These distributions compute the probability of seeing a particular noisy image from the previous step if we get to observe both the noisy image as well as the original image.&lt;/p&gt;
&lt;figure id="#diffusion-forward-reverse"&gt;
 &lt;img src="figures/diffusion-forward-backward.svg"
      alt="Graphical model showing the top-down generative process for diffusion."&gt;
 &lt;figcaption&gt;
   Figure 4. The graphical model for the top-down forward process in diffusion.
 &lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p&gt;We'll then parameterize our reverse process $q(z_{t-1}|z_t)$ to have this same &lt;em&gt;functional form&lt;/em&gt;:
$$ q(z_{t-1}|z_t) \leftarrow p(z_{t-1}|z_t, \hat x(z_t, t)). $$
We'll model the reverse process as if it were the exact reversed conditional forward process, but of course, for the true reverse process we don't get to observe the true original image.  Still, we'll use the same functional form, it's just we'll spend our modeling budget on trying to impute the original clean image $\hat x$ after observing the noisy image $z_t$ and which step we are on $t$.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#epshat"&gt;8&lt;/sup&gt;
	The two are affinely related:
	$$ \hat x_t = (z_t - \tilde \sigma_t \hat \epsilon_t) / \tilde \alpha_t $$
&lt;/aside&gt;
&lt;p&gt;The actual parametric model in a diffusion model is this bit, $\hat x(z_t, t)$. It is a neural network that takes as input the noisy image $z_t$ and the step we are on in the diffusion process $t$ and has the job of trying to predict what the corresponding clean image was that generated the noisy image.  In most diffusion models this is implemented as a &lt;a href="https://en.wikipedia.org/wiki/U-Net"&gt;U-Net&lt;/a&gt; style architecture. In practice, it's been found that if instead of predicting the clean image $\hat x$, you predict the noise $\hat \epsilon$ from the noisy image, you get better-looking samples.&lt;sup&gt;&lt;a href="#epshat"&gt;8&lt;/a&gt;&lt;/sup&gt; The full reverse generative model then consists of many steps of looking at a noisy image and trying to infer the clean one; rinse and repeat.&lt;/p&gt;
&lt;p&gt;With these choices in place, we can now look at the full joint KL and organize terms.&lt;/p&gt;
&lt;p&gt;$$ \left\langle \log p(x) - \log q(x|z_0) + \log \frac{p(z_T|x)}{q(z_T)}  + \sum_{i=0}^{T-1} \log \frac{p(z_i|z_{i+1},x)}{q(z_i|z_{i+1})} \right\rangle_p $$&lt;/p&gt;
&lt;p&gt;The last trick we're going to use is that we're going to avoid computing all of the terms in our sum by simply not computing all of the terms in our sum.  We'll approximate the sum with Monte Carlo: we'll simply randomly choose one of the terms and upweight it appropriately.
At that point, we have the loss function used to train VDM models.  A very nice thing about the VDM loss is that it is clear that we are optimizing a bound on the marginal likelihood of our generative model.  As you can learn in the &lt;a href="https://arxiv.org/abs/2107.00630"&gt;VDM Paper&lt;/a&gt;, many of the diffusion models you've heard about correspond to a &lt;em&gt;weighted&lt;/em&gt; form of this same objective, where different terms in the sum get different weights.&lt;/p&gt;
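&lt;p&gt;The "pick one term and upweight it" trick can be sketched in a few lines. The per-step terms below are made-up numbers standing in for the per-step KLs; the only point is that sampling a single index uniformly and multiplying by $T$ gives an unbiased estimate of the full sum.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-step loss terms of roughly similar size.
T = 1000
terms = np.linspace(0.5, 1.5, T)
exact = terms.sum()

# Pick one index uniformly at random and multiply by T,
# so that the expectation of the estimate equals the full sum.
draws = rng.integers(0, T, size=100_000)
single_term_estimates = T * terms[draws]

print(exact, single_term_estimates.mean())
```

&lt;p&gt;Each individual estimate touches one term rather than all $T$ of them; averaging over training steps plays the role of the mean taken here.&lt;/p&gt;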
&lt;p&gt;After going through all of the fancy math, the analytic KL divergences involved in the diffusion loss simplify quite nicely:
$$ \left\langle \log p(x) - \log q(x|z_0) + \log \frac{p(z_T|x)}{q(z_T)} + \frac 1 2 \sum_{t=0}^{T-1} \beta_t \left\lVert \epsilon - \hat \epsilon(z_t,t) \right\rVert^2  \right\rangle $$
For variational diffusion the weight terms $\beta_t$ depend on your choice of &lt;em&gt;noise schedule&lt;/em&gt;.  For most other diffusion models in the wild, these $\beta_t$ weights are conventionally set to 1.&lt;/p&gt;
&lt;h2&gt;Closing Thoughts&lt;/h2&gt;
&lt;p&gt;So, why are diffusion models so interesting?  Well, first and foremost, the reason they are drawing so much attention is that they have shown tremendous performance.  It feels like for the first time we have models that are able to generate very high resolution, very high fidelity natural images. Projects like &lt;a href="https://openai.com/dall-e-2/"&gt;DALL-E2&lt;/a&gt;, &lt;a href="https://imagen.research.google/"&gt;Imagen&lt;/a&gt;, and &lt;a href="https://github.com/CompVis/stable-diffusion#stable-diffusion-v1"&gt;Stable Diffusion&lt;/a&gt; show really impressive results.  What is the magic driving these models?&lt;/p&gt;
&lt;p&gt;At a high level, I think we can say that diffusion models start to realize the dream of latent variable models.  Sometimes, when you are faced with a problem that is too difficult, you can crack it if you consider an even harder, related problem.  As I tried to demonstrate here, even for simple latent variable models like VAEs and especially for diffusion models, one reason we can point to for their success is that instead of directly modeling the distribution over images, they model a much larger joint distribution.  That larger joint distribution is strictly speaking a bigger thing to attempt to model, but here we get to design the forward process in such a way that even if there are many pieces to the forward process, those pieces individually are easier to tackle.&lt;/p&gt;
&lt;p&gt;However, if that were the case, shouldn't we have expected deep hierarchical models to perform similarly awesomely?  Probably, though here I think there is another real trick that diffusion has up its sleeve.  For a general deep hierarchical generative model, even if by splitting the problem up into smaller pieces you might have split it up into easier-to-model tasks, to evaluate the joint KL you still need to evaluate all of those terms.  That is, as your model becomes richer and more expressive with depth, the cost of training it grows as well, since you have to evaluate all of the layers at each step in the training process.&lt;/p&gt;
&lt;p&gt;Diffusion models avoid this by structuring their forward process in such a way that all of the steps share a great deal of structural similarity.  This allows diffusion to approximate a sum of a potentially large number of steps by a single randomly chosen step.  If each step looks more or less the same, you can get a good estimate for the whole sum by looking at an individual, random, term.&lt;/p&gt;
&lt;p&gt;The last trick up its sleeve is, even if you managed to design a deep hierarchical generative model with this structural homogeneity property, if you wanted to get to some intermediate position in the hierarchy you'd still have to run roughly half of the full forward process.  That would still be expensive in general.  Here, diffusion avoids that entirely.&lt;br /&gt;
As boring as a sequence of conditional Gaussians is as a forward process, it is also beautiful: it enables exact analytic marginalization to intermediate steps.  You can very quickly mimic the result of adding hundreds of steps of additive Gaussian noise by simply adding a moderate amount of Gaussian noise in a single shot.&lt;/p&gt;
&lt;p&gt;So, ultimately, what do I think is one of the main reasons diffusion models do so well? I think it's because they &lt;em&gt;can&lt;/em&gt; do so well! I think it's because they are very powerful, expressive, generative models.  Sampling from them is generally rather expensive.  Drawing a sample means running the full reverse process, which might mean calling the central score net a thousand or so times.  That is a very powerful and very expressive generative model, but magically, we can train that generative model's likelihood without ever having to actually instantiate the full generative process at training time due to our set of sundry tricks.&lt;/p&gt;
&lt;p&gt;I'm excited to see where this all goes and hope this post and the &lt;a href="https://colab.sandbox.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb"&gt;colab&lt;/a&gt; help to introduce these magical models to a wider audience.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Special thanks to &lt;a href="https://twitter.com/poolio"&gt;Ben Poole&lt;/a&gt;, &lt;a href="https://twitter.com/pavel_izmailov"&gt;Pavel Izmailov&lt;/a&gt;, &lt;a href="https://twitter.com/def_chris_suter"&gt;Christopher Suter&lt;/a&gt;, Sergey Ioffe, and &lt;a href="https://twitter.com/itfische"&gt;Ian Fischer&lt;/a&gt; for helpful feedback on this post.&lt;/small&gt;&lt;/p&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/diffusion.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Thu, 15 Sep 2022 00:00:00 -0400</pubDate></item><item><title>Non-equilibrium Thermodynamics Results Seemingly from Nothing</title><description>Deriving some classic results in non-equilibrium thermodynamics from seemingly nothing.</description><content:encoded>&lt;p&gt;Let's see if we can very quickly prove the Jarzynski Equality and related non-equilibrium statistical mechanics results.  Much like the mathematical underpinnings of thermodynamics are pretty mathematically simple, e.g. the existence of a convex surface on which mixed partial derivatives commute, I believe most of the results in non-equilibrium statistical mechanics are similarly due to a rhetorical reinterpretation of a simple mathematical manipulation.&lt;/p&gt;
&lt;p&gt;This post will assume some familiarity with physics.&lt;/p&gt;
&lt;h2&gt;Basic Facts&lt;/h2&gt;
&lt;p&gt;The underlying math in our case rests on two facts. The first is that probability distributions are normalized:
$$ \int dx\, p(x) = 1. $$&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#concave"&gt;1&lt;/sup&gt;
The proof of which is straightforward given that $\log$ is concave and Jensen's inequality, see &lt;a href="kl.html#non-negative-proof"&gt;my other blog post&lt;/a&gt; for a proof.
&lt;/aside&gt;
and second, that KL divergence is non-negative:&lt;sup&gt;&lt;a href="#concave"&gt;1&lt;/a&gt;&lt;/sup&gt;
$$ \int dx\, p(x) \log \frac{p(x)}{q(x)} \geq 0. $$
&lt;h2&gt;Density Ratios&lt;/h2&gt;
&lt;p&gt;To generate the classic non-equilibrium statistical mechanics results we start by considering a simple ratio of two joint probability distributions:
$$ \frac{q(x_0, x_1)}{p(x_0, x_1)} $$
Clearly we have a tremendous freedom here in our choices for the distributions $p$ and $q$. Mathematically it's uninteresting but we can start to build some rhetorical weight by factoring our two distributions in two distinct ways:
$$ \frac{q(x_1) q(x_0|x_1)}{p(x_0)p(x_1|x_0)} $$
Despite still not having done anything, we can start to build an interpretation here. Imagine $x_0$ and $x_1$ as being two configurations of a system, with $x_1$ happening &lt;em&gt;after&lt;/em&gt; $x_0$.  Now, though we're allowed by the chain rule to factor distributions any way we wish, here we've chosen to factor $p$ to be suggestive of some kind of &lt;em&gt;forward process&lt;/em&gt; wherein we first sample some $x_0$ from a distribution $p(x_0)$ and then evolve it according to some potentially stochastic process to generate our next state $x_1$ conditioned on the first: $p(x_1|x_0)$.  At the same time, we've factored $q$ the other way, evocative of a &lt;em&gt;reverse process&lt;/em&gt; that starts at $x_1$ and then evolves backward to $x_0$.&lt;/p&gt;
&lt;p&gt;To make further progress, let's specialize a bit.  Let's imagine that $x_0$ and $x_1$ are configurations of a physical system evolving according to Hamiltonian dynamics, with a Hamiltonian governed by some kind of control parameter $\lambda$.  Let's further &lt;em&gt;imagine&lt;/em&gt; that at the beginning of either our forward or reverse process our system is in thermodynamic equilibrium at the same temperature, and in particular in a &lt;a href="https://en.wikipedia.org/wiki/Canonical_ensemble"&gt;canonical ensemble&lt;/a&gt;:&lt;sup&gt;&lt;a href="#beta"&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#beta"&gt;2&lt;/sup&gt;
$\beta$ is the &lt;a href="https://en.wikipedia.org/wiki/Thermodynamic_beta"&gt;inverse temperature&lt;/a&gt; $1/(k_B T)$
&lt;/aside&gt;
$$ 
\begin{align}
    p(x_0) &amp;= \frac{1}{Z(\beta,\lambda_0)} e^{-\beta H(x_0, \lambda_0)} \\
    q(x_1) &amp;= \frac{1}{Z(\beta, \lambda_1)} e^{-\beta H(x_1, \lambda_1)}.
\end{align}
$$
&lt;p&gt;Simply substituting these expressions into our density ratio we find:&lt;/p&gt;
&lt;p&gt;$$ \frac{q(x_0,x_1)}{p(x_0,x_1)} = \frac{Z(\beta,\lambda_0)}{Z(\beta, \lambda_1)} e^{-\beta \left( H(x_1,\lambda_1) - H(x_0, \lambda_0) \right)} \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$&lt;/p&gt;
&lt;p&gt;We can tidy this up a bit and give it a cleaner physical interpretation.  Let's identify the change in the Hamiltonian with the work:
$$ W \equiv H(x_1,\lambda_1) - H(x_0, \lambda_0). $$
And let's use the standard definition of the free energy:
$$ \beta F = -\log Z, $$
to rewrite the ratio of partition functions as a difference in free energies, $\Delta F \equiv F(\beta,\lambda_1) - F(\beta,\lambda_0)$:
$$ e^{\beta \Delta F} = e^{\log Z(\beta,\lambda_0) -\log Z(\beta,\lambda_1)} = \frac{Z(\beta,\lambda_0)}{Z(\beta,\lambda_1)}. $$
Combining these results gives:
$$ \frac{q(x_0,x_1)}{p(x_0,x_1)} = e^{-\beta (W - \Delta F)} \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$
I'm going to anticipate some of the things we're going to talk about below and define the &lt;em&gt;heat&lt;/em&gt; through the log of the forward over the reverse transition probabilities:
$$ \beta Q \equiv -\log \frac{p(x_1|x_0)}{q(x_0|x_1)}. $$
With this final identification we end up with the general statement:
$$ \frac{q_R}{p_F} = e^{-\beta (W - Q - \Delta F)}. $$
The density ratio of the reverse process (shortened here as $q_R$) to the forward process $p_F$ is given by the exponential of $-\beta$ times the work, minus the heat, minus the change in free energy.&lt;/p&gt;
&lt;h2&gt;Hamiltonian Dynamics&lt;/h2&gt;
&lt;p&gt;First, if we assume that our dynamics is Hamiltonian, and thus deterministic and reversible, we know that the probability that we start at $x_0$ and end up at $x_1$ if we evolve forward in time is the same as the probability that we start at $x_1$ and end up at $x_0$ if we reverse our time evolution, ($q(x_0|x_1) = p(x_1|x_0)$)&lt;sup&gt;&lt;a href="#heat-caveat"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#heat-caveat"&gt;3&lt;/sup&gt;
    Alternatively, if you trust our identification of heat, you could imagine an isolated system where the heat flow is zero.
&lt;/aside&gt;
so the ratio of conditional probabilities actually cancels and we generate &lt;a href="https://en.wikipedia.org/wiki/Crooks_fluctuation_theorem"&gt;Crooks' Fluctuation Theorem&lt;/a&gt;:
$$ \frac{q_R}{p_F} =  e^{-\beta (W - \Delta F)}. $$
The ratio of the reverse process probability to the forward probability for a given initial and final point is given by the exponential $e^{-\beta (W - \Delta F)}$.  If we now take the integral of this with respect to the forward process, we generate the &lt;a href="https://en.wikipedia.org/wiki/Jarzynski_equality"&gt;Jarzynski equality&lt;/a&gt;:&lt;sup&gt;&lt;a href="#langle"&gt;4&lt;/a&gt;&lt;/sup&gt;
&lt;aside&gt;&lt;sup id="#langle"&gt;4&lt;/sup&gt;
We've also introduced the $\langle \cdot \rangle$ notation for expectations to clean up the notation a bit.
&lt;/aside&gt;
$$ \int dx_0\, dx_1\, p(x_0,x_1) \frac{q(x_0,x_1)}{p(x_0,x_1)} = 1 = \left\langle e^{-\beta (W - \Delta F)} \right\rangle_p, $$
&lt;aside&gt;&lt;sup id="#free-energy"&gt;5&lt;/sup&gt;
The free energy only depends on the partition function $Z$ which is a constant so can be taken outside the expectation.
&lt;/aside&gt;
which simplifies to&lt;sup&gt;&lt;a href="#free-energy"&gt;5&lt;/a&gt;&lt;/sup&gt;:
$$ \left\langle e^{-\beta W}\right\rangle_p = e^{-\beta \Delta F}. $$
So, recapping, what have we just done?  
Since we can take density ratios of arbitrary probability distributions, we could choose those two densities to mean something we care about.  Consider $p$ the forward, Hamiltonian evolution of a system from $x_0$ to $x_1$ and $q$ the reverse process.  If we imagine that both the forward and reverse processes start in a state of canonical equilibrium, we can generate both Crooks' Fluctuation Theorem as well as the Jarzynski equality. 
&lt;p&gt;The power of this result is that it allows us to relate an expectation computed with respect to non-equilibrium processes (the exponential of the beta weighted stochastic work needed for a bunch of non-equilibrium realizations of our trajectory) to a pure equilibrium quantity (a difference of equilibrium free energies).
In the context of the physical sciences, this lets us perform non-equilibrium simulations or experiments, and provided we measure the work performed over many such runs, even with the system driven far from equilibrium, we can estimate equilibrium free energy differences.&lt;/p&gt;
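&lt;p&gt;Here is a minimal numerical sketch of the equality $\left\langle e^{-\beta W}\right\rangle_p = e^{-\beta \Delta F}$ in about the simplest setting I can think of. It assumes a one-dimensional potential $H(x,\lambda) = \lambda x^2/2$ and an instantaneous switch of the control parameter, so the deterministic, reversible "dynamics" is just $x_1 = x_0$; the parameter values and names are hypothetical. For this potential $Z(\beta,\lambda) = \sqrt{2\pi/(\beta\lambda)}$, so the free energy difference is known exactly.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
beta, lam0, lam1 = 1.0, 1.0, 4.0

def hamiltonian(x, lam):
    """Hypothetical one-dimensional potential H(x, lambda) = lambda * x^2 / 2."""
    return 0.5 * lam * x ** 2

# Forward process: sample x_0 from the canonical ensemble at lambda_0
# (a Gaussian with variance 1 / (beta * lambda_0)), then switch lambda
# instantly so that the configuration does not move: x_1 = x_0.
x0 = rng.normal(0.0, 1.0 / np.sqrt(beta * lam0), size=200_000)
x1 = x0
work = hamiltonian(x1, lam1) - hamiltonian(x0, lam0)

# Exact answer from beta * F = -log Z with Z = sqrt(2 pi / (beta lambda)).
delta_f = 0.5 * np.log(lam1 / lam0) / beta

print(np.mean(np.exp(-beta * work)), np.exp(-beta * delta_f))  # Jarzynski
print(work.mean(), delta_f)  # average work vs. free energy change
```

&lt;p&gt;The switch drives the system far from equilibrium, so the average work overshoots the free energy change, yet the exponential average still recovers it.&lt;/p&gt;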
&lt;h2&gt;Stochastic Dynamics&lt;/h2&gt;
&lt;p&gt;But let's say you don't like the assumption that the dynamics are Hamiltonian.  We can imagine something else: stochastic dynamics, discretized in time.  We still need to make some kind of assumption; in this case, we'll imagine that our process consists of $N$ steps, each of which is governed by a Markov transition kernel.  Finally, we'll assume that each transition kernel has a stationary distribution and satisfies detailed balance.&lt;/p&gt;
&lt;p&gt;What this means is that we'll imagine that our forward process now takes the form:
$$ 
\begin{align}
p_F &amp;amp;= p(x_0) p(x_1|x_0) p(x_2|x_1) \cdots p(x_N|x_{N-1}) \\
&amp;amp;= p(x_0) T_1(x_1|x_0) T_2(x_2|x_1) \cdots T_N(x_N|x_{N-1}) 
\end{align}
$$
Here we've denoted the intermediate conditional distributions as being governed by our transition kernels, each labeled by the index of its stationary distribution.  Saying that our kernels have a stationary distribution that they respect according to detailed balance means that:
$$ T_k(x'|x) \sigma_k(x) = T_{k}(x|x') \sigma_k(x'), $$
for the stationary distribution $\sigma_k$.&lt;/p&gt;
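&lt;p&gt;For a concrete kernel of this kind, here is a small sketch using a Metropolis rule with uniform proposals on four discrete states. The energies and the function name &lt;code&gt;metropolis_kernel&lt;/code&gt; are illustrative assumptions; the derivation only needs &lt;em&gt;some&lt;/em&gt; kernel that satisfies detailed balance, and Metropolis is one standard way to build one.&lt;/p&gt;

```python
import numpy as np

def metropolis_kernel(energies, beta):
    """Transition matrix T[x_new, x_old] for a uniform-proposal Metropolis chain."""
    n = len(energies)
    # Acceptance probability min(1, exp(-beta * (E_new - E_old))).
    delta = energies[:, None] - energies[None, :]
    accept = np.minimum(1.0, np.exp(-beta * delta))
    kernel = accept / n                      # propose each state with probability 1/n
    np.fill_diagonal(kernel, 0.0)
    # Whatever probability is left over stays put, so each column sums to one.
    kernel[np.diag_indices(n)] = 1.0 - kernel.sum(axis=0)
    return kernel

beta = 1.0
energies = np.array([0.0, 0.5, 2.0, 3.0])   # hypothetical energies E_k(x)
sigma = np.exp(-beta * energies)
sigma /= sigma.sum()                         # canonical stationary distribution

T_k = metropolis_kernel(energies, beta)
flux = T_k * sigma[None, :]                  # flux[x_new, x_old] = T(x_new|x_old) sigma(x_old)

print(np.allclose(flux, flux.T))             # detailed balance
print(np.allclose(T_k @ sigma, sigma))       # stationarity follows from it
```

&lt;p&gt;Detailed balance is the statement that the flux matrix is symmetric; summing it over one index then gives stationarity for free.&lt;/p&gt;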
&lt;p&gt;We've defined our forward process, now we need to define our reverse process.  We'll imagine that the reverse process is governed by the same transition kernels but running in reverse:&lt;sup&gt;&lt;a href="#reverse"&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#reverse"&gt;6&lt;/sup&gt;
Notice that by &lt;em&gt;reverse&lt;/em&gt; here we mean that the kernels are actually designed to be the ones targeting the stationary distribution for the step we're on, rather than the one we are heading to.
&lt;/aside&gt;
$$
\begin{align}
    q_R &amp;= q(x_N) q(x_{N-1}|x_N) \cdots q(x_1|x_2) q(x_0|x_1) \\
    &amp;= q(x_N) T_{N}(x_{N-1}|x_N) \cdots T_2(x_1|x_2) T_1(x_0|x_1).
\end{align}
$$
&lt;p&gt;Now if we look at the ratio of our reverse to our forward process, things simplify a bit:
$$
\begin{align}
\frac{q_R}{p_F} &amp;amp;= \frac{q(x_N)T_N(x_{N-1}|x_N)\cdots T_2(x_1|x_2)T_1(x_0|x_1)}{p(x_0)T_1(x_1|x_0)T_2(x_2|x_1)\cdots T_N(x_N|x_{N-1})} \\
&amp;amp;= \frac{q(x_N)}{p(x_0)} \frac{T_1(x_0|x_1)}{T_1(x_1|x_0)} \frac{T_2(x_1|x_2)}{T_2(x_2|x_1)} \cdots \frac{T_N(x_{N-1}|x_N)}{T_N(x_N|x_{N-1})} \\
&amp;amp;= \frac{q(x_N)}{p(x_0)} \frac{\sigma_1(x_0)}{\sigma_1(x_1)} \frac{\sigma_2(x_1)}{\sigma_2(x_2)} \cdots \frac{\sigma_N(x_{N-1})}{\sigma_N(x_N)} .
\end{align}
$$&lt;/p&gt;
&lt;p&gt;Finally, as we did above, let's imagine that all of these marginal distributions take the form of a canonical distribution.&lt;sup&gt;&lt;a href="#stationary"&gt;7&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#stationary"&gt;7&lt;/sup&gt;
Notice that this isn't the same as assuming that our process is always in equilibrium. We are still describing a potentially non-equilibrium process; the only assumption here is that the dynamics is Markov and &lt;em&gt;stationary&lt;/em&gt; with some stationary distribution that we can characterize.
&lt;/aside&gt;
$$
\begin{align}
    q(x_N) &amp;\equiv \frac{1}{Z_N} e^{-\beta H_N} \\
    p(x_0) &amp;\equiv \frac{1}{Z_0} e^{-\beta H_0} \\
    \sigma_k(x_j) &amp;\equiv \frac{1}{Z_k} e^{-\beta E_k(x_j)}.
\end{align}
$$
Notice that the nice simplification that happens here is that since we imagined our reverse process as being the reverse of the forward process, in all but one of these fractions, the partition function of the intermediate stationary processes will cancel out.  Putting this all together we obtain the general result:
$$ \frac{q_R}{p_F} = e^{-\beta(W - Q - \Delta F)}, $$
if we identify $W$ with the total energy change of the system ($H_N-H_0$), $\Delta F$ with the change in the partition functions (using $\beta F = -\log Z$, $\beta \Delta F = \log Z_0/Z_N$) and now identify the &lt;i&gt;heat&lt;/i&gt; as additional energy changes in each of the intermediate processes:&lt;sup&gt;&lt;a href="#serious"&gt;8&lt;/a&gt;&lt;/sup&gt;
&lt;aside&gt;&lt;sup id="#serious"&gt;8&lt;/sup&gt;
I don't think we should take this identification with the heat too seriously, some of the literature just calls this the total work for the trajectory.
&lt;/aside&gt;
$$ Q \equiv \sum_{k=1}^{N} Q_k \qquad Q_k = \Delta E_k = E_k(x_k) - E_k(x_{k-1}) . $$
And I believe we've done it.  Taking the expectation of this quantity with respect to the forward process will give us the Jarzynski equality again&lt;sup&gt;&lt;a href="#ais"&gt;9&lt;/a&gt;&lt;/sup&gt;:
$$ \left\langle e^{-\beta(W - Q)} \right\rangle = e^{-\beta \Delta F}. $$
&lt;aside&gt;&lt;sup id="#ais"&gt;9&lt;/sup&gt;
    We've also just reinvented &lt;a href="https://arxiv.org/abs/physics/9803008"&gt;Annealed Importance Sampling (AIS)&lt;/a&gt;. For more details of how these non-equilibrium results relate to AIS see &lt;a href="https://papers.nips.cc/paper/2017/hash/4da04049a062f5adfe81b67dd755cecc-Abstract.html"&gt;&lt;i&gt;Model Evidence from nonequilibrium simulations&lt;/i&gt;&lt;/a&gt; by Habeck, NeurIPS2017.
&lt;/aside&gt;
&lt;p&gt;Taking the logarithm of the ratio and then the expectation gives the negative of the KL divergence between the forward and reverse processes, and the KL itself must be non-negative:
$$ D(p_F; q_R) = \left\langle \log \frac{p_F}{q_R} \right\rangle_F = \beta \left\langle W - Q \right\rangle - \beta \Delta F \geq 0 $$
which naturally generates the inequality (a version of the second law):
$$ \left\langle W - Q \right\rangle \geq \Delta F. $$
As a reminder, in this case our initial distributions were still canonical, but we generalized our dynamics to any sequence of Markovian transition kernels, provided only that those kernels have a stationary distribution with respect to which they satisfy detailed balance.&lt;/p&gt;
&lt;h2&gt;Generalized Landauer Bound&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://youtu.be/r33Wj8FF_EQ?t=356"&gt;Wolpert says&lt;/a&gt; that, from stochastic thermodynamics we know:&lt;/p&gt;
&lt;p&gt;\begin{equation}
-\Delta Q = \Delta \Sigma + S(p_0) - S(p_1)
\end{equation}&lt;/p&gt;
&lt;p&gt;Which, with $\Delta \Sigma \geq 0$ gives us the &lt;em&gt;generalized Landauer bound&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;\begin{equation}
-\Delta Q \geq S(p_0) - S(p_1)
\end{equation}&lt;/p&gt;
&lt;p&gt;For the classic case of bit erasure the change in entropy is $\log 2$ and we get Landauer's bound:&lt;/p&gt;
&lt;p&gt;\begin{equation}
-\Delta Q \geq kT \log 2
\end{equation}&lt;/p&gt;
&lt;p&gt;So, where does this come from?  Honestly, it doesn't seem like there is much to it.  Imagine two joint distributions $p(x_0, x_1)$ and $q(x_0, x_1)$ describing a &lt;em&gt;forward&lt;/em&gt; and &lt;em&gt;reverse&lt;/em&gt; process that moves between two states.  The KL divergence between these two is non-negative and &lt;em&gt;monotonic&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;\begin{equation}
\left\langle \log \frac{p(x_0,x_1)}{q(x_0,x_1)} \right\rangle_p \geq \left\langle \log \frac{p(x_1)}{q(x_1)} \right\rangle \geq 0
\end{equation}&lt;/p&gt;
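&lt;p&gt;As a sanity check on the monotonicity claim, here is a small numerical sketch.  The two random joint distributions, the state counts and the helper names are all hypothetical; the only thing being tested is that the KL divergence between the joints is at least as large as the KL divergence between the $x_1$ marginals, and that the gap matches the conditional form of the log ratio.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

rng = np.random.default_rng(0)

def random_joint(shape):
    """Draw a random strictly positive joint distribution over a grid."""
    weights = rng.random(shape) + 1e-3
    return weights / weights.sum()

def kl(p, q):
    """KL divergence D(p; q) in nats for strictly positive arrays."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical forward and reverse joints: x0 has 4 states, x1 has 5 states.
p_joint = random_joint((4, 5))
q_joint = random_joint((4, 5))

# Marginals: sum over x1 (axis 1) for x0, sum over x0 (axis 0) for x1.
p_x0 = p_joint.sum(axis=1)
p_x1 = p_joint.sum(axis=0)
q_x1 = q_joint.sum(axis=0)

joint_kl = kl(p_joint, q_joint)
marginal_kl = kl(p_x1, q_x1)

# The gap between the two, written with the forward and reverse conditionals.
p_x1_given_x0 = p_joint / p_x0[:, None]
q_x0_given_x1 = q_joint / q_x1[None, :]
log_ratio = np.log(p_x1_given_x0 * p_x0[:, None] / (q_x0_given_x1 * p_x1[None, :]))
entropy_production = float(np.sum(p_joint * log_ratio))

print(joint_kl, marginal_kl, entropy_production)
```
&lt;/pre&gt;
&lt;p&gt;Changing the seed or the shapes should never produce a negative gap, since the gap is itself an expected KL divergence between conditionals.&lt;/p&gt;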
&lt;p&gt;Subtracting $\langle \log p(x_1)/q(x_1) \rangle_p$ from both sides and rearranging terms, we first find the entropy production:
\begin{equation}
\Delta\Sigma \equiv \left\langle \log \frac{p(x_1|x_0)p(x_0)}{q(x_0|x_1)p(x_1)} \right\rangle_p \geq 0
\end{equation}&lt;/p&gt;
&lt;p&gt;and we can establish the identity:
\begin{equation}
\left\langle \log \frac{p(x_1|x_0)p(x_0)}{q(x_0|x_1)p(x_1)} \right\rangle_p = \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle_p + \left\langle \log \frac{p(x_0)}{p(x_1)} \right\rangle_p
\end{equation}&lt;/p&gt;
&lt;p&gt;If we simply identify terms, we recover the Wolpert form:&lt;/p&gt;
&lt;p&gt;\begin{equation}
\Delta \Sigma = -\Delta Q + S(p_1)-S(p_0)
\end{equation}&lt;/p&gt;
&lt;p&gt;To make these identifications, we can see that:
\begin{equation}
S(p_0) = -\left\langle \log p(x_0) \right\rangle \qquad S(p_1) = -\left\langle \log p(x_1) \right\rangle
\end{equation}&lt;/p&gt;
&lt;p&gt;And for the &lt;em&gt;entropy rate&lt;/em&gt;:
\begin{equation}
-\Delta Q \equiv \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle
\end{equation}
which appears to be the likelihood ratio of our forward and reverse conditional processes, i.e. some characterization of the irreversibility of our system.&lt;/p&gt;
&lt;p&gt;If we happen to be in a system that satisfies local detailed balance, we know that there should be some kind of steady state distribution for which:
\begin{equation}
p(x_1|x_0) \pi(x_0) = q(x_0|x_1) \pi(x_1)
\end{equation}
so that:
\begin{equation}
\log \frac{p(x_1|x_0)}{q(x_0|x_1)} = \log \frac{\pi(x_1)}{\pi(x_0)}
\end{equation}
and if we further imagine that the steady state distribution is Boltzmann-like and the system is in contact with some kind of heat bath, we see that:
\begin{equation}
\log \frac{\pi(x_1)}{\pi(x_0)} = \log \frac{\frac{1}{Z_1}e^{-\beta H_1}}{\frac{1}{Z_0} e^{-\beta H_0}} = \log \frac{Z_0}{Z_1} - \beta (H_1 - H_0)  = \beta \Delta F - \beta \Delta U = -\Delta Q
\end{equation}
so we can identify the log ratio of the forward to reverse transition probabilities with the heat exchanged with the bath.&lt;/p&gt;
&lt;h2&gt;Variational Autoencoder&lt;/h2&gt;
&lt;p&gt;To show some of the generality of what we're doing here, let's do it again but for a completely different kind of system, this time a &lt;a href="https://en.wikipedia.org/wiki/Variational_autoencoder"&gt;Variational Autoencoder&lt;/a&gt;.  In a variational autoencoder there are two joint distributions at play, one a &lt;em&gt;representational model&lt;/em&gt; $p(x,z) = p(x) p(z|x)$ which starts with a draw from some &lt;em&gt;true&lt;/em&gt; data distribution $p(x)$ and then uses an &lt;em&gt;encoder&lt;/em&gt; to map that datum to some kind of representative code, or summary, or representation $z$: $p(z|x)$.  The other joint distribution consists of a &lt;em&gt;generative model&lt;/em&gt; $q(x,z) = q(z)q(x|z)$ that imagines a joint distribution over the same space but works in &lt;em&gt;reverse&lt;/em&gt;.  First, we generate a &lt;em&gt;latent variable&lt;/em&gt; $z$ from some &lt;em&gt;prior distribution&lt;/em&gt; $q(z)$ and then we use a &lt;em&gt;decoder&lt;/em&gt; to stochastically turn that latent variable into a generated datum $x$: $q(x|z)$.&lt;/p&gt;
&lt;p&gt;We can easily imagine the ratio of these two densities:
$$ \frac{q(x,z)}{p(x,z)} = \frac{q(z)q(x|z)}{p(x)p(z|x)}. $$&lt;/p&gt;
&lt;p&gt;As we saw above, the way to generate an inequality here is to turn this into a KL divergence:
$$
\begin{align}
D( p(x,z) ; q(x,z) ) &amp;amp;= \left\langle \log \frac{p(x) p(z|x)}{q(z) q(x|z)} \right\rangle_p  \\
&amp;amp;= -\left\langle -\log p(x) \right\rangle_p + \left\langle -\log q(x|z) \right\rangle_p  + \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p \\
&amp;amp;\equiv -H + D + R  \geq 0
\end{align}
$$
Here, just as above we've only rearranged terms, but this time organized them into three contributions, the &lt;em&gt;entropy&lt;/em&gt; of the true data generating process:
$$ H \equiv \left\langle -\log p(x) \right\rangle_p, $$
the &lt;em&gt;distortion&lt;/em&gt;, a measure of how likely we are to encode and then decode an image back to the one we started with:
$$ D \equiv \left\langle - \log q(x|z) \right\rangle_p = -\int dx\, p(x) \int dz\, p(z|x) \log q(x|z), $$
and the &lt;em&gt;rate&lt;/em&gt;, a measure of the excess cost required to communicate this message $z$ over a wire designed to be optimal for the prior $q(z)$:
$$ R \equiv \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p = \left\langle D(p(z|x); q(z)) \right\rangle_{p(x)}. $$
We've just rederived the &lt;em&gt;ELBO&lt;/em&gt;&lt;sup&gt;&lt;a href="#elbo"&gt;10&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="#elbo"&gt;10&lt;/sup&gt;
For Evidence Lower BOund.
&lt;/aside&gt;
rendered in the form presented in &lt;i&gt;Fixing a Broken ELBO&lt;/i&gt;&lt;sup&gt;&lt;a href="#brokenelbo"&gt;11&lt;/a&gt;&lt;/sup&gt;
&lt;aside&gt;&lt;sup id="#brokenelbo"&gt;11&lt;/sup&gt;
&lt;i&gt;Fixing a Broken ELBO&lt;/i&gt; by AA Alemi, B Poole, I Fischer, JV Dillon, RA Saurous and K Murphy, ICML 2018. &lt;a href="https://arxiv.org/abs/1711.00464"&gt;1711.00464&lt;/a&gt;
&lt;/aside&gt;
$$ \textsf{ELBO} \equiv D + R \geq H. $$
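&lt;p&gt;Since the decomposition is just a rearrangement of terms, it can be checked numerically on a toy discrete problem.  The sketch below is only an illustration under assumed sizes and randomly drawn distributions (none of which come from the paper); it verifies that the joint KL divergence equals $-H + D + R$.&lt;/p&gt;
&lt;pre&gt;
```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(weights, axis=None):
    """Normalize positive weights into probabilities along an axis."""
    return weights / weights.sum(axis=axis, keepdims=True)

n_x, n_z = 6, 3  # hypothetical numbers of data values and latent codes

# Representational model p(x, z) = p(x) p(z|x); rows index x, columns index z.
p_x = normalize(rng.random(n_x) + 0.1)
p_z_given_x = normalize(rng.random((n_x, n_z)) + 0.1, axis=1)
p_joint = p_x[:, None] * p_z_given_x

# Generative model q(x, z) = q(z) q(x|z), stored with the same layout.
q_z = normalize(rng.random(n_z) + 0.1)
q_x_given_z = normalize(rng.random((n_x, n_z)) + 0.1, axis=0)
q_joint = q_z[None, :] * q_x_given_z

entropy = float(-np.sum(p_x * np.log(p_x)))                          # H
distortion = float(-np.sum(p_joint * np.log(q_x_given_z)))           # D
rate = float(np.sum(p_joint * np.log(p_z_given_x / q_z[None, :])))   # R
joint_kl = float(np.sum(p_joint * np.log(p_joint / q_joint)))

print(joint_kl, -entropy + distortion + rate)
```
&lt;/pre&gt;
&lt;p&gt;Because the joint KL divergence is non-negative, the same numbers also confirm that $D + R$ is never smaller than $H$ for this toy model.&lt;/p&gt;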
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;We've managed to derive several non-equilibrium statistical mechanical equalities and inequalities seemingly from nothing.  All of these results were powered by the facts we opened with: that probability distributions integrate to one and that KL divergences are non-negative.  The only challenge here was one of semantics.  To get power out of such trivial mathematical manipulations required us to make judicious choices in how we interpreted them.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Special thanks to Sam Schoenholz, Srinivas Vasudevan, Yasaman Bahri and Jim Sethna for helpful feedback on this post.&lt;/small&gt;&lt;/p&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/non-equilibrium.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Fri, 16 Sep 2022 00:00:00 -0400</pubDate></item><item><title>Weighted Ensemble Self-Supervised Learning</title><link>https://arxiv.org/abs/2211.09981</link><description>Ensembling the heads of SSL methods gives nice gains. / Y Ruan, S Singh, WR Morningstar, AA Alemi, S Ioffe, I Fischer, JV Dillon / 2211.09981 / ICLR 2023</description><guid isPermaLink="true">https://alexalemi.com/publications/weighted-ssl.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/weighted-ssl.pdf" length="668671" type="application/pdf"/><pubDate>Tue, 01 Nov 2022 00:00:00 -0400</pubDate></item><item><title>Inferential Engines</title><link>https://docs.google.com/presentation/d/1WjyaZxYD6jf_bkK4QIhgM73MuO8nXiLZqt0P8s726VQ/present?usp=share_link&amp;resourcekey=0-IuSicQOQrSCn-kQsu3NyQQ</link><description>Viewing VAEs as four stroke engines. / Theoretical Physics for Machine Learning - Aspen</description><guid isPermaLink="true">https://alexalemi.com/talks/inferential-engines.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Wed, 01 Feb 2023 00:00:00 -0500</pubDate></item><item><title>Introduction to Statistics through Randomization</title><link>https://drive.google.com/file/d/1SS_xBteNnw8T_O7CSgLxY2rEgmARAatv/view?usp=sharing</link><description>Introduction to Statistical Thinking through Randomization. 
/ Google Course</description><guid isPermaLink="true">https://alexalemi.com/talks/intro-to-statistics.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Wed, 01 Mar 2023 00:00:00 -0500</pubDate></item><item><title>A Tale of Two Worlds: The Variational Approach to Machine Learning</title><link>https://docs.google.com/presentation/d/1cuBk28v94wwCVTjIRIuP_3Dgvf3L2jCcu2Xiaxxj8nU/present?usp=sharing</link><description>The variational approach to machine learning. / UCF CRCV</description><guid isPermaLink="true">https://alexalemi.com/talks/tale-of-two-worlds.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Mon, 01 May 2023 00:00:00 -0400</pubDate></item><item><title>Variational Prediction</title><link>https://arxiv.org/abs/2307.07568</link><description>Targetting the predictive distribution directly with a variational method. / AA Alemi, B Poole / 2307.07568 / AABI2023</description><guid isPermaLink="true">https://alexalemi.com/publications/variational-prediction.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/variational-prediction.pdf" length="649037" type="application/pdf"/><pubDate>Mon, 01 May 2023 00:00:00 -0400</pubDate></item><item><title>Variational Prediction</title><link>https://alexalemi.com/talks/vp-poster.pdf</link><description>A variational way to directly target the posterior predictive. / AABI2023</description><guid isPermaLink="true">https://alexalemi.com/talks/variational-prediction.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sat, 01 Jul 2023 00:00:00 -0400</pubDate></item><item><title>Speed Limits for Deep Learning</title><link>https://arxiv.org/abs/2307.14653</link><description>Working out thermodynamic speed limits for learning. 
/ I Seroussi, AA Alemi, M Helias, Z Ringel / 2307.14653 / </description><guid isPermaLink="true">https://alexalemi.com/publications/speedlimits.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/speedlimits.pdf" length="1977861" type="application/pdf"/><pubDate>Sat, 01 Jul 2023 00:00:00 -0400</pubDate></item><item><title>Small-scale proxies for large-scale Transformer training instabilities</title><link>https://arxiv.org/abs/2309.14322</link><description>Studying problems of large scale models in the small. / M Wortsman &amp; PAGI / 2309.14322 / ICLR 2024</description><guid isPermaLink="true">https://alexalemi.com/publications/instabilities.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/instabilities.pdf" length="18185892" type="application/pdf"/><pubDate>Fri, 01 Sep 2023 00:00:00 -0400</pubDate></item><item><title>How to Think About AI</title><link>https://docs.google.com/presentation/d/1VhJm6Kcq0I5d-2VtpFxm0QoVwn64pqEEwTuWIyiSIrg/present?usp=sharing</link><description>A popular overview of LLMs and how to think about AI. / Osceola Neovates</description><guid isPermaLink="true">https://alexalemi.com/talks/how-to-think-about-ai.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sun, 01 Oct 2023 00:00:00 -0400</pubDate></item><item><title>Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"</title><link>https://arxiv.org/abs/2311.07587</link><description>It's easy to get models to perform arithmetic incorrectly, if you just ask nicely. 
/ PAGI / 2311.07587 / </description><guid isPermaLink="true">https://alexalemi.com/publications/adversarial-arithmetic.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/adversarial-arithmetic.pdf" length="624164" type="application/pdf"/><pubDate>Wed, 01 Nov 2023 00:00:00 -0400</pubDate></item><item><title>The Method of Imaginary Results</title><description>Don't think about your prior, think about a hypothetical posterior.</description><content:encoded>&lt;p&gt;Performing Bayesian inference requires a full joint distribution over both our
data and parameters $p(D,\theta)$.  In the usual way of doing things, we
specify that joint distribution by providing two pieces: a likelihood
$p(D|\theta)$ that specifies how we believe the data would be
generated if we happened to know the exact parameter values and some prior
$p(\theta)$ over parameters that represents our state of belief about what the
parameters are before we look at any data.&lt;/p&gt;
&lt;p&gt;Most people don't have any deep philosophical issues with specifying a
likelihood $p(D|\theta)$. We're aware that our likelihoods might not be
perfect, that they are some approximation of what is happening in the real
world. Still, we have opinions about them, we feel as though we can reason about
whether a given likelihood is good or bad for some situation.&lt;/p&gt;
&lt;p&gt;I believe I can model a series of $D$ heads in $N$ coin flips with a &lt;a href="https://en.wikipedia.org/wiki/Binomial_distribution"&gt;Binomial
likelihood&lt;/a&gt; for instance,
and I don't really have any qualms about that.
I might decide to model the heights of my pea plants with a &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution"&gt;Normal
Distribution&lt;/a&gt; or perform a
&lt;a href="https://en.wikipedia.org/wiki/Linear_regression"&gt;linear fit&lt;/a&gt; to some data, or
do image classification with some &lt;a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"&gt;convolutional neural
network&lt;/a&gt; or
&lt;a href="https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)"&gt;transformer&lt;/a&gt;. In any case, I often have a good idea of what I should use as a likelihood $p(D|\theta)$.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="yesterday"&gt;1&lt;/sup&gt;
This quote is often attributed to Lindley (1970), but I see no such occurrence in that work. Looking around, the original quote appears to be "Today's posterior is tomorrow's prior," from &lt;a href="https://people.umass.edu/stanek/pubhlth892d/Lindley-The-Statist-2000.pdf"&gt;&lt;i&gt;The Philosophy of Statistics&lt;/i&gt;&lt;/a&gt; by Dennis V Lindley, The Statistician 2000.  There do not appear to be any earlier occurrences of "today's prior" used in the sense of Bayesian inference that appear on Google Scholar.
&lt;/aside&gt;
&lt;p&gt;Choosing the prior $p(\theta)$ is what all the fuss is about.  This is the part that raises various philosophical issues.  This is the part that, if we are being honest, is much harder.  What do I believe the bias of a coin is before I ever flip the coin?  I'm not really sure to be honest.  In many contexts I might have previously done some experiments, in which case I could use yesterday's posterior as today's prior.&lt;sup&gt;&lt;a href="#yesterday"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="indifference"&gt;2&lt;/sup&gt;
Originally coined the &lt;i&gt;principle of insufficient reason&lt;/i&gt; by Johannes von Kries, John Maynard Keynes renamed it the &lt;i&gt;principle of indifference&lt;/i&gt;. &lt;a href="https://en.wikipedia.org/wiki/Principle_of_indifference#History"&gt;[wiki]&lt;/a&gt;
&lt;/aside&gt;
&lt;aside&gt;&lt;sup id="jaynes-priors"&gt;3&lt;/sup&gt;
    &lt;a href="https://bayes.wustl.edu/etj/articles/prior.pdf"&gt;&lt;i&gt;Prior Probabilities&lt;/i&gt;&lt;/a&gt; by Edwin T. Jaynes. 1968
&lt;/aside&gt;
&lt;aside&gt;&lt;sup id="reference"&gt;4&lt;/sup&gt;
See for instance &lt;a href="https://arxiv.org/pdf/0904.0156.pdf"&gt;&lt;i&gt;A Formal Definition of Reference Priors&lt;/i&gt;&lt;/a&gt; by Berger, Bernardo and Sun.  For a more modern take that I really like, see &lt;a href="https://arxiv.org/abs/1705.01166"&gt;&lt;i&gt;Maximizing the information learned from finite data selects a simple model&lt;/i&gt;&lt;/a&gt; by Mattingly et al. and &lt;a href="https://arxiv.org/abs/2205.03343"&gt;&lt;i&gt;Far from Asymptopia&lt;/i&gt;&lt;/a&gt; by Abbott and Machta.
&lt;/aside&gt;
&lt;p&gt;However, lacking previous experiments, I often feel at a loss. There are many frameworks for designing priors that people have proposed.  Laplace originally motivated a flat prior for the Bernoulli likelihood by appealing to the &lt;em&gt;principle of indifference&lt;/em&gt;.&lt;sup&gt;&lt;a href="#indifference"&gt;2&lt;/a&gt;&lt;/sup&gt;
&lt;a href="https://en.wikipedia.org/wiki/Jeffreys_prior"&gt;Jeffreys&lt;/a&gt; taught us how to build priors that were reparameterization-independent.
Jaynes would argue for choosing priors by appealing to symmetries.&lt;sup&gt;&lt;a href="#jaynes-priors"&gt;3&lt;/a&gt;&lt;/sup&gt; Bernardo suggested choosing priors to maximize the information you extract from data, so-called &lt;i&gt;reference priors&lt;/i&gt;.&lt;sup&gt;&lt;a href="#reference"&gt;4&lt;/a&gt;&lt;/sup&gt; Gelman and friends tout &lt;i&gt;weakly informative priors&lt;/i&gt;.
There are even whole &lt;a href="https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations"&gt;lists of common recommendations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What if we didn't have to choose a prior directly?&lt;/p&gt;
&lt;h2&gt;The Method of Imaginary Results&lt;/h2&gt;
&lt;aside&gt;&lt;sup id="gelman-speed"&gt;5&lt;/sup&gt;
See the first page of the nice little paper &lt;a href="http://www.stat.columbia.edu/~gelman/research/published/GelmanSpeed.pdf"&gt;&lt;i&gt;Characterizing a Joint Probability Distribution By Conditionals&lt;/i&gt;&lt;/a&gt; by Gelman and Speed, 1991.
&lt;/aside&gt;
&lt;p&gt;Enter &lt;em&gt;the method of imaginary results&lt;/em&gt;.  It turns out&lt;sup&gt;&lt;a href="#gelman-speed"&gt;5&lt;/a&gt;&lt;/sup&gt; that we can uniquely characterize a joint distribution in a different way.  Specifying a likelihood $L(D|\theta)$ and a prior $\pi(\theta)$ uniquely characterizes the joint $p(D,\theta) = L(D|\theta)\pi(\theta)$.  You know what else uniquely characterizes the joint? Specifying a likelihood $L(D|\theta)$ and some &lt;em&gt;hypothetical&lt;/em&gt; posterior $q(\theta|D_0)$. The corresponding unique joint $p(\theta,D)$ is given by:&lt;/p&gt;
&lt;p&gt;$$ p(\theta, D) = \frac{ L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} }{\int d\theta'\, \frac{q(\theta'|D_0)}{L(D_0|\theta')}} \propto L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)}.  $$&lt;/p&gt;
&lt;p&gt;Which naturally satisfies the two inputs we provided:
$$ p(D|\theta) = L(D|\theta) \qquad p(\theta|D_0) = q(\theta|D_0). $$&lt;/p&gt;
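&lt;p&gt;As a quick check of the second condition, using only the definitions above: conditioning the joint on $D = D_0$ cancels the likelihood factor,
$$ p(\theta|D_0) \propto L(D_0|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} = q(\theta|D_0), $$
and since $q(\theta|D_0)$ is already normalized, the proportionality is an equality.  Read another way, the implied prior is $\pi(\theta) \propto q(\theta|D_0) / L(D_0|\theta)$, assuming that ratio is normalizable.&lt;/p&gt;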
&lt;p&gt;This flips the problem on its head.  We no longer have to specify a &lt;em&gt;prior&lt;/em&gt;.  Instead we can specify a &lt;em&gt;hypothetical posterior&lt;/em&gt;.  We can say what we would believe if, hypothetically, we had observed some dataset $D_0$.&lt;/p&gt;
&lt;p&gt;I think that this is an easier task to do.  It is easier for me to reason about what beliefs I should hypothetically hold in light of some data than it is for me to reason about what I believe independent of any data.&lt;/p&gt;
&lt;h2&gt;Coin Example&lt;/h2&gt;
&lt;p&gt;Let's work the simple example of some coin flips.  I believe I can model a coin as being a simple Bernoulli process. There is some probability $\theta$ that the coin will land heads and each flip is independent and identically distributed.  Therefore, I can model observing $H$ heads out of a sequence of $N$ flips with a &lt;a href="https://en.wikipedia.org/wiki/Binomial_distribution"&gt;Binomial Likelihood&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;$$ L({H,N}|\theta) = { N \choose H} \theta^H (1- \theta)^{N-H} $$&lt;/p&gt;
&lt;p&gt;Now, imagine I actually observe some sequence of coin flips, let's say 6 out of 10 flips were heads. What should I believe about the bias of my coin?  To answer this I need to specify a prior belief I have about the bias of the coin.  In most textbook examples, that prior is taken to be uniform $p(\theta) = 1$, saying that our prior belief is that the bias is equally likely to fall in any interval $[\theta, \theta + \delta \theta]$ of a given width, i.e. this prior says it's just as likely that the bias of the coin is between 0.1 and 0.2 as it is that it is between 0.5 and 0.6.&lt;/p&gt;
&lt;p&gt;Alternatively, I could take Jeffreys' advice and adopt a non-informative prior that is reparameterization independent, or I could try to adopt Gelman's advice and start with a weakly informative prior concentrated near fairness. Below is a representation of these three standard choices where the prior is shown in blue and the posterior from 6 heads out of 10 flips is shown in orange.&lt;/p&gt;
&lt;figure id="#standard-priors" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/standard-coin-priors.svg"
    alt="A visualization of some standard prior and posterior pairs, a uniform prior, the Jeffreys prior and a weakly informative prior."&gt;
  &lt;figcaption&gt;
  Figure 1. Some standard textbook priors and the resulting posterior for 6 heads out of 10 coin flips.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;These are convenient mathematically and make for easy problems to solve for a homework exercise, but they aren't realistic.  If we are being honest, we tend to expect that coins we encounter in the real world are very nearly fair.&lt;sup&gt;&lt;a href="#fairness"&gt;6&lt;/a&gt;&lt;/sup&gt;  We could therefore start with a prior that is concentrated near fair, but how do we assign a meaningful width to that distribution?  And if we're being honest, I've encountered trick coins in my days, double-headed and double-tailed coins, and if some weirdo walks up to me and asks me to start predicting a whole sequence of coin flips I shouldn't discount the possibility they are trying to play me for a fool.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="fairness"&gt;6&lt;/sup&gt;
Despite there being &lt;a href="https://statweb.stanford.edu/~cgates/PERSI/papers/dyn_coin_07.pdf"&gt;arguments&lt;/a&gt; and &lt;a href="https://arxiv.org/abs/2310.04153"&gt;strong evidence&lt;/a&gt; that there is a &lt;i&gt;dynamical&lt;/i&gt; bias, meaning that coins flipped by humans tend to land on the same side they started on.
&lt;/aside&gt;
&lt;p&gt;At this stage, trying to adjust the parameters of our &lt;em&gt;prior&lt;/em&gt; without any evidence or data is difficult.  I have a hard time talking to my gut to decide what I should set my prior beliefs to apropos of &lt;em&gt;nothing&lt;/em&gt;.  Instead, let's try to invoke the method of imaginary results and imagine some hypothetical dataset and probe our beliefs.  Imagine we've just observed 10 coin flips, and all 10 of them were heads!  What do you believe now?  Now that I've hypothesized a dataset I have an easier time talking to my gut.&lt;/p&gt;
&lt;p&gt;In this scenario, I feel as though I would place a reasonable probability on the coin being double-headed, let's say 50%.  At the same time, I think I would still place a reasonable probability on the coin being &lt;em&gt;exactly&lt;/em&gt; fair, let's say 25%.  The remaining 25% probability I would want to spread around but biased towards heads; for that let's use a $\operatorname{Beta}(11,1)$ distribution or $11\, \theta^{10}$.  I've attempted to visualize this distribution below.&lt;a href="#deltas"&gt;&lt;sup&gt;7&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;figure id="#imaginary-result-coin" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/imaginary-result.svg"
    alt="A mixture of 25\% mass on 0.5, 50\% mass on 1.0 and 25\% mass on a Beta(11, 1)."&gt;
  &lt;figcaption&gt;
  Figure 2. My attempt at eliciting an imaginary result of a posterior I'm comfortable with if I were to observe 10 heads in a row from a coin.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;aside&gt;&lt;sup id="deltas"&gt;7&lt;/sup&gt;
I don't know a good way to visualize probability distributions with delta components. I like the idea of putting vertical lines with balls at the top, but I don't know how to show how much mass the deltas have relative to the rest of the density; there doesn't seem to be a good choice of scale.  Here I've done something I'm not proud of but that I feel is at least honest: I've discretized the distribution into 100 bins, so you can see both the width and the corresponding height. I chose a fairly coarse discretization so that you can still see the bit of the Beta underneath the spikes.
&lt;/aside&gt;
&lt;p&gt;Or in equation form:&lt;/p&gt;
&lt;p&gt;$$ q(\theta|D_0) = \frac 12 \delta(\theta -1 ) + \frac 14 \delta\left(\theta - \frac 12 \right) + \frac {11} 4 \theta^{10} $$&lt;/p&gt;
&lt;p&gt;Once we've specified this imaginary result, we have everything we need to form a posterior for our original problem with 6 heads out of 10 flips.&lt;/p&gt;
&lt;p&gt;$$\begin{align}
p(\theta|D) &amp;amp;\propto L(D|\theta) \frac{q(\theta|D_0)}{L(D_0|\theta)} \\
&amp;amp;\propto 210 \theta^6 (1-\theta)^4 \frac{\frac 14 \delta\left(\theta - \frac 12 \right) + \frac 12 \delta(\theta - 1) + \frac{11}{4} \theta^{10}}{\theta^{10}} \\
&amp;amp;= \frac{210}{211} \delta\left(\theta -\frac 12 \right) + \frac{1}{211} \left( 2310 \theta^{6} (1-\theta)^4 \right)
\end{align} $$&lt;/p&gt;
&lt;figure id="#imaginary-result-posterior-coin" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/imaginary-result-posterior.svg"
    alt="A mixture of 99.5\% mass on 0.5, 0.50\% mass on a Beta(7, 5)."&gt;
  &lt;figcaption&gt;
  Figure 3. The posterior I get from my elicited imaginary posterior if I actually observe 6 heads and 4 tails. The blue curve is the true posterior, the dashed orange is a blown up version of the small residual component.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;The posterior we find is 99.5% probability on the coin being exactly fair, and 0.5% probability assigned to a $\operatorname{Beta}(7,5)$ type posterior, which is buried in the true form above, but I've blown up in the dashed line so you can see its shape.  This posterior has a very heavy weight on the coin being exactly fair, which I think is reflective of my actual beliefs but I would have had difficulty specifying in terms of a prior.  Instead, if I imagine the coin coming up heads 10 times in a row, the fact that I wanted to still give the coin a 25% chance of being fair is obviously mathematically equivalent to me having a 98.7% prior belief the coin is fair, but I feel as though I have a much higher sensitivity to the right number when I express this as a hypothetical posterior.&lt;/p&gt;
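&lt;p&gt;The weights quoted above can be reproduced with exact rational arithmetic.  The following is a minimal sketch, not the code used to make the figures; the helper &lt;code&gt;beta_integral&lt;/code&gt; and all variable names are my own, and it relies on the standard integral of $\theta^a (1-\theta)^b$ over the unit interval being $a!\,b!/(a+b+1)!$ for non-negative integers.&lt;/p&gt;
&lt;pre&gt;
```python
from fractions import Fraction
from math import comb, factorial

heads, flips = 6, 10       # the actual data D
heads0 = 10                # the imagined data D_0: ten heads in ten flips

def beta_integral(a, b):
    """Integral of theta**a * (1 - theta)**b over [0, 1] for integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

half = Fraction(1, 2)

# Implied prior: each component of q divided by L(D_0|theta) = theta**10.
prior_fair = Fraction(1, 4) / half**heads0             # delta at theta = 1/2
prior_double = Fraction(1, 2)                          # delta at theta = 1
prior_smooth = Fraction(11, 4) * beta_integral(0, 0)   # 11/4 times a flat density
prior_total = prior_fair + prior_double + prior_smooth

# Posterior: multiply each prior component by L(D|theta) and integrate.
n_choose = comb(flips, heads)
post_fair = prior_fair * n_choose * half**flips
post_double = Fraction(0)  # (1 - theta)**4 vanishes at theta = 1
post_smooth = Fraction(11, 4) * n_choose * beta_integral(heads, flips - heads)
post_total = post_fair + post_double + post_smooth

print(prior_fair / prior_total)  # implied prior probability of an exactly fair coin
print(post_fair / post_total)    # posterior probability of an exactly fair coin
```
&lt;/pre&gt;
&lt;p&gt;Working in fractions avoids any floating point doubt about the 210/211 weight, and makes it easy to try other imagined posteriors by editing the three mixture weights.&lt;/p&gt;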
&lt;p&gt;The method of imaginary results lets us ask ourselves what we would believe in light of some data, rather than asking us to express what we believe apropos of &lt;em&gt;nothing&lt;/em&gt;.  I think this helps resolve some of the philosophical issues people have with prior selection in Bayesian inference.&lt;/p&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/imaginary-results.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Thu, 30 Nov 2023 00:00:00 -0500</pubDate></item><item><title>What's Missing? A Speculative Sketch of the Future of Machine Learning and Science</title><link>https://docs.google.com/presentation/d/1kemUtTS_qjQEuLXiO7j68_csERqkb0B_qP6xrGkfkFw/present?usp=sharing</link><description>Thinking about the future of science and machine learning. / ML and the Physical Sciences Workshop @ NeurIPS2023</description><guid isPermaLink="true">https://alexalemi.com/talks/whats-missing.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Fri, 01 Dec 2023 00:00:00 -0500</pubDate></item><item><title>Information Theory for Representation Learning</title><link>https://docs.google.com/presentation/d/1YwgRzjWATHVX60Me6qOEOxQIiFjxdqO0_9jdAWXFx74/present?usp=sharing&amp;resourcekey=0-T4ume8tMl__GoYZnKgHMEg</link><description>Everything is KL divergence minimization. / InfoCog Workshop @ NeurIPS2023</description><guid isPermaLink="true">https://alexalemi.com/talks/information-theory-for-representation-learning.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Fri, 01 Dec 2023 00:00:00 -0500</pubDate></item><item><title>Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models</title><link>https://arxiv.org/abs/2312.06585</link><description>Squeezing more performance out of models by fine-tuning on filtered generated responses. 
/ PAGI / 2312.06585 / TMLR</description><guid isPermaLink="true">https://alexalemi.com/publications/self-training.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/self-training.pdf" length="645177" type="application/pdf"/><pubDate>Fri, 01 Dec 2023 00:00:00 -0500</pubDate></item><item><title>KL is All You Need</title><description>It's all KL under the hood.</description><content:encoded>&lt;p&gt;Modern machine learning is a sea of initialisms: VAE, VIB, VDM, BBB, VB, etc.
But, the more time I spend working in this field the more I come to appreciate
that the core of essentially all modern machine learning methods is a single
universal objective:
&lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;Kullback-Leibler (KL) divergence&lt;/a&gt;
minimization.  Even better, there is a very
simple &lt;em&gt;universal recipe&lt;/em&gt; you can follow to rederive most of the named
objectives out there.  Understand KL, understand the recipe, and you'll
understand all of these methods and be well on your way to deriving your own.&lt;/p&gt;
&lt;!--
The more I work on machine learning, the more I'm embarrassed to admit that most of what I'm worked on is all so blindingly simple.  At the core of essentially all of modern machine learning is a single
objective: KL minimization, and there is a very simple recipe to follow to re-derive
most of the named objectives out there.  --&gt;
&lt;p&gt;In the past I've discussed some of the
&lt;a href="kl.html"&gt;special properties of KL divergence&lt;/a&gt;, and how you can derive
&lt;a href="diffusion.html"&gt;VAEs or Diffusion Models&lt;/a&gt; by means of a simple KL objective.
What follows is an extension of those ideas, essentially a written version of a recent
&lt;a href="https://alexalemi.com/talks/information-theory-for-representation-learning.html"&gt;talk&lt;/a&gt;
&lt;a href="https://docs.google.com/presentation/d/1YwgRzjWATHVX60Me6qOEOxQIiFjxdqO0_9jdAWXFx74/present?usp=sharing&amp;resourcekey=0-T4ume8tMl__GoYZnKgHMEg"&gt;[slides]&lt;/a&gt;
I gave at the InfoCog Workshop at NeurIPS 2023.
&lt;a href="#multivariate-ib"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="figattribution"&gt;1&lt;/sup&gt;
This is also essentially my own retelling of &lt;a href="https://www.cs.huji.ac.il/labs/learning/Theses/Slonim_PhD.pdf"&gt;Noam Slonim's thesis on the Multivariate Information Bottleneck&lt;/a&gt;.
&lt;/aside&gt;
&lt;figure id="#conditional" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/kl-elephant.png"
    alt="A cartoon depicting several blindfolded scientists analyzing different parts of an elephant, the different scientists think they are looking at VIB, or Diffusion or BNNs or VAEs or Semi-supervised Learning or Bayesian inference, but really its all just KL Divergence."&gt;
  &lt;figcaption&gt;
  Figure 1. The elephant in the room is KL divergence or the relative entropy.&lt;a href="#figattribution"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;aside&gt; &lt;sup id="figattribution"&gt;2&lt;/sup&gt;
Cartoon modified from Kevan C. Herold and Jeffrey A. Bluestone, Type 1 Diabetes Immunotherapy: Is the Glass Half Empty or Half Full? Sci. Transl. Med. 3, 95fs1 (2011).
&lt;a href="https://doi.org/10.1126/scitranslmed.3002981"&gt;DOI:10.1126/scitranslmed.3002981&lt;/a&gt;.
&lt;/aside&gt;
&lt;h2&gt;KL Divergence as Expected Weight of Evidence&lt;/h2&gt;
&lt;p&gt;Before we get into it, we need to make sure we're all starting on the same
page.  Because KL divergence is so fundamental and special (as I've written
about &lt;a href="kl.html"&gt;before&lt;/a&gt;) it has many different interpretations. For
our purposes, the most useful interpretation is as &lt;a
href="kl.html#expected-weight-of-evidence"&gt;an expected weight of evidence&lt;/a&gt;.&lt;a href="#woe"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;
I'll briefly build that up here.&lt;/p&gt;
&lt;aside&gt; &lt;sup id="woe"&gt;3&lt;/sup&gt;
I think weight of evidence is one of the most underappreciated concepts.  For a nice overview see: &lt;i&gt;Weight of Evidence: A Brief Survey&lt;/i&gt; by I.J. Good. &lt;a href="https://link.springer.com/article/10.1007/BF01106578"&gt;[pdf]&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;Imagine we have two hypotheses $P$ and $Q$ and we're trying to decide which of these two is a better model of the world.  We go out and collect some data $D$ and would like to use that data to help us discriminate between the two models.  Being good probabilistic thinkers with a penchant for gambling, what we're interested in is:&lt;/p&gt;
&lt;p&gt;$$ \frac{\Pr(P|D)}{\Pr(Q|D)}, $$&lt;/p&gt;
&lt;p&gt;the &lt;a href="https://en.wikipedia.org/wiki/Odds"&gt;&lt;i&gt;odds&lt;/i&gt;&lt;/a&gt; of $P$ versus $Q$, given the data $D$. Using &lt;a href="https://en.wikipedia.org/wiki/Bayes%27_theorem"&gt;Bayes' rule&lt;/a&gt; we can express this as:&lt;/p&gt;
&lt;p&gt;$$ \frac{\Pr(P|D)}{\Pr(Q|D)} = \frac{\Pr(D|P)}{\Pr(D|Q)} \frac{\Pr(P)}{\Pr(Q)}, $$&lt;/p&gt;
&lt;p&gt;the product of the &lt;a href="https://en.wikipedia.org/wiki/Likelihood_function"&gt;&lt;i&gt;likelihood&lt;/i&gt;&lt;/a&gt; ratio, which compares how probable the observed data are under model $P$ versus model $Q$, and the &lt;i&gt;prior odds&lt;/i&gt; of the two models.  Taking a logarithm of both sides turns the product into an easier-to-work-with sum:&lt;/p&gt;
&lt;p&gt;$$ \log \frac{\Pr(P|D)}{\Pr(Q|D)} = \log \frac{\Pr(D|P)}{\Pr(D|Q)} + \log \frac{\Pr(P)}{\Pr(Q)}. $$&lt;/p&gt;
&lt;p&gt;Now, the &lt;i&gt;posterior log odds&lt;/i&gt; is expressed as the sum of the &lt;i&gt;weight of evidence&lt;/i&gt; plus the &lt;i&gt;prior log odds&lt;/i&gt; of the two hypotheses.&lt;/p&gt;
&lt;figure id="#belief-of-meter" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/belief-o-meter.png"
    alt="A cartoon representation of a Belief-O-Meter as an old school linear analog meter."&gt;
  &lt;figcaption&gt;
  Figure 2. Belief-O-Meter.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;This &lt;em&gt;weight of evidence&lt;/em&gt; tells us how much to update our beliefs in light of evidence.  If you picture a sort of Belief-O-Meter™ for your own beliefs, each bit of independent evidence gives you an additive update for the meter, pushing your beliefs either toward $P$ or toward $Q$.  For simple hypotheses taking the form of probability distributions, this weight of evidence is just the log density ratio of the data under the two models:&lt;/p&gt;
&lt;p&gt;$$ \log \frac{\Pr(D|P)}{\Pr(D|Q)} \text{ becomes } \log \frac{p(D)}{q(D)}. $$&lt;/p&gt;
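&lt;p&gt;As a minimal sketch of this bookkeeping, here is the Belief-O-Meter for a handful of coin flips.  The two coin hypotheses, the flips, the even prior, and the helper name &lt;code&gt;weight_of_evidence_bits&lt;/code&gt; are all made up for illustration; the only thing taken from the text is that independent observations add their log likelihood ratios to the prior log odds.&lt;/p&gt;

```python
import math

def weight_of_evidence_bits(data, p, q):
    """Sum of log2 likelihood ratios for independent observations.

    `p` and `q` are dicts mapping each outcome to its probability under the
    two hypotheses (hypothetical names, chosen for this sketch).
    """
    return sum(math.log2(p[x] / q[x]) for x in data)

# Two illustrative hypotheses about a coin: P is fair, Q favors heads.
P = {"H": 0.5, "T": 0.5}
Q = {"H": 0.75, "T": 0.25}

prior_log_odds = 0.0                 # even prior odds: log2(1) = 0
flips = ["H", "T", "T", "H", "T"]    # assumed observations

woe = weight_of_evidence_bits(flips, P, Q)
posterior_log_odds = woe + prior_log_odds   # additive update of the meter
posterior_odds = 2.0 ** posterior_log_odds  # back to ordinary odds of P vs Q
```

&lt;p&gt;Each tails moves the meter one bit toward $P$ and each heads moves it a bit more than half a bit toward $Q$, so the order of the flips doesn't matter, only the counts.&lt;/p&gt;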
&lt;!-- TODO Explain the change in notation here. --&gt;
&lt;!-- What then is &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;the Kullback-Leibler (KL) divergence&lt;/a&gt; ? --&gt;
&lt;p&gt;OK, so what does this have to do with the &lt;a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence"&gt;KL divergence&lt;/a&gt;?
Imagine if one of our two hypotheses is actually true.  If $P$ were the probability distribution governing the actual world, the &lt;i&gt;expected weight of evidence&lt;/i&gt; we would accumulate from observing some data would be the KL divergence:&lt;a href="#brakets"&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;$$ I[p;q] \equiv \int dx\, p(x) \log \frac{p(x)}{q(x)} \equiv \left\langle \log \frac{p(x)}{q(x)} \right\rangle_{p(x)} . $$&lt;/p&gt;
&lt;aside&gt; &lt;sup id="brakets"&gt;4&lt;/sup&gt;
To clean up the notation, I like using brakets: $ \langle \cdot \rangle_p \equiv \mathbb{E}_{p}[\cdot] \equiv \int dx\, p(x) [\cdot]$, and to clean things up further (or because I'm lazy) I'll often leave off the subscript saying which distribution the brakets are to be taken with respect to.  In any of those cases you can assume it's a full joint $p$ distribution (over any variables that are otherwise unbound).
&lt;/aside&gt;
&lt;p&gt;Therefore, we can interpret the KL divergence as a measure of how quickly we would be able to discern between hypotheses $P$ and $Q$ if $P$ were true.  Similarly, the &lt;i&gt;reverse KL&lt;/i&gt; is:&lt;/p&gt;
&lt;p&gt;$$ I[q;p] \equiv \int dx\, q(x) \log \frac{q(x)}{p(x)} \equiv \left\langle \log \frac{q(x)}{p(x)} \right\rangle_{q(x)}, $$&lt;/p&gt;
&lt;p&gt;a measure of how quickly we'd be able to discern between $P$ and $Q$ if $Q$ were true.  Suddenly, the asymmetry of the KL divergence, an issue that often causes consternation, is no longer a mystery.  We should expect the expected weight of evidence to be asymmetric.   As an extreme example, imagine we were trying to decide between two hypotheses regarding some coin flips we are about to observe.  $P$ is the hypothesis that the coin is fair while $Q$ is the hypothesis that the coin is a cheating, double-headed coin.  In this case, if we actually had a fair coin, we expect to be able to perfectly discern the two hypotheses (infinite KL) because we will eventually observe a tails, an impossible situation under the alternative ($Q$) hypothesis.  Meanwhile, if the coin is actually a cheat, we'll be able to collect, on average, 1 bit of evidence per flip in favor of the hypothesis that the coin is a cheat, but we will only ever observe heads and so never be able to perfectly rule out the possibility that the coin is fair and we've simply observed some miracle.&lt;a href="#million"&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="million"&gt;5&lt;/sup&gt;
As a true aside, people often say "one in a million" but lack a good mental model of just how rare that is.  20 heads in a row for a fair coin is a one in a million event. This fun fact and others can be found &lt;a href="https://www.stat.berkeley.edu/~aldous/Real-World/million.html"&gt;here, at David Aldous's Home Page&lt;/a&gt;.
&lt;/aside&gt;
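&lt;p&gt;The coin example is small enough to check by hand, but here is a sketch that does the arithmetic.  The helper &lt;code&gt;kl_bits&lt;/code&gt; is a hypothetical name for a plain discrete KL divergence in bits; it treats an outcome that the true world allows but the alternative forbids as infinitely strong evidence.&lt;/p&gt;

```python
import math

def kl_bits(p, q):
    """KL divergence I[p;q] in bits between two discrete distributions.

    Outcomes with p(x) = 0 contribute nothing; an outcome that p allows but
    q forbids makes the divergence infinite.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

fair = {"H": 0.5, "T": 0.5}
double_headed = {"H": 1.0, "T": 0.0}

kl_fair_vs_cheat = kl_bits(fair, double_headed)   # the world is fair: infinite
kl_cheat_vs_fair = kl_bits(double_headed, fair)   # the world is the cheat: 1 bit
```

&lt;p&gt;The two directions give infinity and exactly one bit per flip, matching the story above.&lt;/p&gt;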
&lt;h2&gt;Mathematical Properties&lt;/h2&gt;
&lt;p&gt;In what follows, we'll need to use two mathematical properties of the KL divergence. The first is that the KL divergence is non-negative, i.e. the lowest it can be is zero:&lt;/p&gt;
&lt;p&gt;$$ I[p;q] \equiv \int dx\, p(x) \log \frac{p(x)}{q(x)} \geq 0, $$&lt;/p&gt;
&lt;p&gt;which I'll leave as an exercise to the reader, or you can see a proof in the &lt;a href="kl.html#nonnegative"&gt;previous post&lt;/a&gt;. In the context of our interpretation of KL divergence as an expected weight of evidence, the non-negativity of KL divergence means, essentially, that the world can't lie to us.  If we are trying to decide between two hypotheses, and one of them happens to be correct, we have to, we must, on average, be pushed in the direction of the correct hypothesis.  Even the Devil can't construct a $q \neq p$ that we would be led to believe after seeing enough samples from $p$.&lt;/p&gt;
&lt;p&gt;The other property we'll use is the &lt;em&gt;monotonicity&lt;/em&gt; of the KL divergence.  This is a generalized version of the &lt;a href="https://en.wikipedia.org/wiki/Data_processing_inequality"&gt;data processing inequality&lt;/a&gt;.  If we perform some kind of processing on our random variables, it should only make it harder to discern between two hypotheses, not easier.  In particular, the version we'll need today concerns &lt;em&gt;marginalization&lt;/em&gt;: if I have two joint distributions defined on two random variables, the KL divergence between their marginals must be less than or equal to the joint KL:
$$ \int dx\, dy\, p(x,y) \log \frac{p(x,y)}{q(x,y)} \geq \int dx\, p(x) \log \frac{p(x)}{q(x)}, $$
which is easy to show if you decompose $p(x,y) = p(x) p(y|x)$ and use the fact that all KL divergences (including the conditional $I[p(y|x);q(y|x)] \geq 0$) are non-negative.&lt;/p&gt;
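&lt;p&gt;For readers who want the skipped step, here is a sketch of that decomposition, writing $q(x,y) = q(x) q(y|x)$ in the same way:
$$ \left\langle \log \frac{p(x,y)}{q(x,y)} \right\rangle_{p(x,y)} = \left\langle \log \frac{p(x)}{q(x)} \right\rangle_{p(x)} + \left\langle \log \frac{p(y|x)}{q(y|x)} \right\rangle_{p(x,y)}. $$
The last term is an average over $p(x)$ of the conditional KL divergences $I[p(y|x);q(y|x)]$, each of which is non-negative, so dropping it can only lower the total.  What remains is the marginal KL on the right-hand side of the inequality.&lt;/p&gt;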
&lt;p&gt;Again, in terms of our current interpretation, this makes sense. If I have some beliefs defined over several variables, if I only get to observe some subset of them, it should be harder for me to discern the beliefs.  The less I look at, the less I see.&lt;/p&gt;
&lt;h2&gt;Universal Recipe&lt;/h2&gt;
&lt;p&gt;With the prerequisites out of the way, we're ready to see the "universal recipe" for generating objectives.&lt;/p&gt;
&lt;p&gt;In machine learning, broadly, we build neural networks and need some guidance on how to set their parameters.  An &lt;em&gt;objective&lt;/em&gt; acts like a score that ranks each possible setting and guides our search in the space of parameters for a &lt;em&gt;good&lt;/em&gt; one.  How &lt;em&gt;ought&lt;/em&gt; we value, or judge, each possible solution?&lt;/p&gt;
&lt;p&gt;Fundamentally, there are two things in conflict.  There is the &lt;em&gt;real world&lt;/em&gt; with all of its causal dependencies and structure, a great deal of which we have no influence on.  Data comes from some data generating process wholly outside of our control.  On top of this data we are often interested in building machines to process the data, which may exist in the real world but have a billion or more knobs we need guidance on how to set.  In contrast to the real world, there is the &lt;em&gt;dream world&lt;/em&gt;, the world of our desires, the world as we wish it were.  There's a simple story we &lt;em&gt;wish&lt;/em&gt; were true that we could tell about the data and its causal structure. When doing Bayesian inference this is the &lt;em&gt;generative&lt;/em&gt; model you use to describe the data.  If we're being honest with ourselves, it isn't that the data we observe actually comes from our generative model, we only wish that were the case.  So, we have two different stories we could try to tell about the world: the accurate real world description and the wishful dream world one.&lt;/p&gt;
&lt;p&gt;The goal is to make the real world look more like our dreams.  Given that KL divergence is the &lt;em&gt;proper&lt;/em&gt; way to measure how similar two distributions are, we need only minimize the KL divergence between the real world -- the world we can sample from -- and the world as we wish it were.  The smaller that KL can become, the harder it becomes for us or anyone else to distinguish between our dreams and reality.  In steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Draw a causal graphical model corresponding to the world as it is, the true world $P$.&lt;/li&gt;
&lt;li&gt;Augment the real world with any components you wish to add.&lt;/li&gt;
&lt;li&gt;Draw the world of your desires, what success would look like, what you are targeting, the dream world $Q$.&lt;/li&gt;
&lt;li&gt;Minimize $I[P;Q]$.&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;li&gt;Profit!&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;As simple as it sounds, in retrospect a lot of machine learning is simply following this recipe.  Let's repeat this ad nauseam.&lt;/p&gt;
&lt;h2&gt;Density Estimation&lt;/h2&gt;
&lt;p&gt;We'll start with the problem of density estimation.  Let's say we have some black box that generates samples, say images.  This is the real world $P$, outside of our control.  Despite not knowing how $p(x)$ is structured, we can push the button on the black box to generate samples.  What do we wish for? We wish we instead had a nice description of those same images.  We wish that those images instead came from a box of our own design, some parametric model or probability distribution with knobs that we can adjust to bring it into alignment with the real world, our dream world $q_\theta(x)$ with parameters $\theta$.&lt;/p&gt;
&lt;figure id="#density-estimation" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/density-estimation.png"
    alt="A graphical version of density estimation, the left shows the distribution $p(x)$ where $X$ is a random variable denoting images. The right shows the same, but labeled as $q_\theta(x)$."&gt;
  &lt;figcaption&gt;
  Figure 3. Density Estimation. &lt;a href="#imgkey"&gt;&lt;sup&gt;6&lt;/sup&gt;&lt;/a&gt;
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;aside&gt;&lt;sup id="imgkey"&gt;6&lt;/sup&gt;
These figures are meant to be graphical models, where the plates represent repeated samples (here $N$), and each circle represents a random variable.  The arrows denote causal relationships between the variables.  I try to keep the colors somewhat consistent, but also label the random variables in each.  The turtle emoji (🐢) is meant to denote images, "turtle" marks a label, the little plot will represent a &lt;i&gt;representation&lt;/i&gt;, $D$ is data, $\theta$ are parameters and $\phi$ is the state of the universe.
&lt;/aside&gt;
&lt;p&gt;Following the recipe, our objective then is to minimize the KL divergence between the real world and our ideal one:&lt;/p&gt;
&lt;p&gt;$$ I[p; q] = \left\langle \log \frac{p(x)}{q_\theta(x)} \right\rangle_p $$&lt;/p&gt;
&lt;p&gt;To belabor the point, in terms of our interpretation of KL divergence, this makes sense. $I[p;q]$ measures how easy it is for us to distinguish between $p$ and $q$ using samples from $p$. We have samples from $p$, while $q_\theta(x)$ is a whole set of worlds we can index with our parameters $\theta$.  We seek a setting of those parameters which make it as difficult as possible for us or anyone else to tell the difference between the real world $P$ and our imaginary one $Q$.  Minimizing the KL divergence does exactly that.&lt;/p&gt;
&lt;p&gt;Unfortunately, naively, this objective requires that we be able to evaluate $\log p(x)$, the density the real world assigns to the samples it generates.  This is out of reach, we don't know what the real world is doing, but here is where the KL divergence helps us out yet again.  It decomposes into two terms:&lt;/p&gt;
&lt;p&gt;$$ \underbrace{\left\langle \log \frac{p(x)}{q_\theta(x)} \right\rangle}_{I[p;q]} = \underbrace{\left\langle \log p(x) \right\rangle \vphantom{\left\langle \frac p q \right\rangle} }_{-H[p]} + \underbrace{\left\langle -\log q_\theta(x) \right\rangle \vphantom{\left\langle \frac p q \right\rangle}}_{H[p;q]},  $$&lt;/p&gt;
&lt;p&gt;the (negative) &lt;em&gt;entropy&lt;/em&gt; of the true data generating process ($H[p]$), and the &lt;em&gt;cross-entropy&lt;/em&gt; between $p$ and $q$ ($H[p;q]$), aka the (negative log) &lt;em&gt;likelihood&lt;/em&gt; of the data samples from $p$ under $q$.  The entropy of the true data generating process isn't something that we control; as far as we're concerned it's a constant and we don't need to worry about it. Just like that, we see that minimizing the KL divergence between the real world and the world of our desires, in this simple single random variable setup, recovers ordinary minimum &lt;a href="https://en.wikipedia.org/wiki/Cross-entropy"&gt;cross-entropy&lt;/a&gt; learning, aka maximum likelihood learning, but with a different and hopefully well-motivated origin.  We adjust the parameters of our model $q_\theta(x)$ so as to maximize the log-likelihood of the data $\log q_\theta(x)$. Why? So that we and anyone else would struggle as much as possible to distinguish between the real world and our model.  With this same motivation, lots of other machine learning objectives will fall into place.&lt;/p&gt;
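&lt;p&gt;Here is a toy sketch of that equivalence.  It assumes, purely for illustration, that the real world is a Bernoulli distribution with a known rate, so that we can evaluate both objectives exactly instead of from samples; the names &lt;code&gt;P_TRUE&lt;/code&gt;, &lt;code&gt;kl&lt;/code&gt; and &lt;code&gt;cross_entropy&lt;/code&gt; are hypothetical.  Scanning a grid of candidate dream worlds $q_\theta$, the KL divergence and the cross-entropy pick out the same $\theta$, because they differ only by the constant $H[p]$.&lt;/p&gt;

```python
import math

P_TRUE = 0.3  # assumed "real world": a Bernoulli with p(x=1) = 0.3

def kl(p1, theta):
    """I[p;q_theta] in nats for Bernoulli p (rate p1) and q_theta (rate theta)."""
    return (p1 * math.log(p1 / theta)
            + (1.0 - p1) * math.log((1.0 - p1) / (1.0 - theta)))

def cross_entropy(p1, theta):
    """H[p;q_theta] in nats: the expected negative log-likelihood under p."""
    return -(p1 * math.log(theta) + (1.0 - p1) * math.log(1.0 - theta))

# Candidate dream worlds q_theta on a grid strictly inside (0, 1).
thetas = [i / 100 for i in range(1, 100)]

best_kl = min(thetas, key=lambda t: kl(P_TRUE, t))
best_ce = min(thetas, key=lambda t: cross_entropy(P_TRUE, t))

# The two objectives differ by the same constant, H[p], at every theta.
entropy_p = cross_entropy(P_TRUE, P_TRUE)
gaps = [cross_entropy(P_TRUE, t) - kl(P_TRUE, t) for t in thetas]
```

&lt;p&gt;In practice we can't evaluate &lt;code&gt;kl&lt;/code&gt; because it needs $p(x)$, but the cross-entropy only needs samples from $p$, which is the whole point of the decomposition.&lt;/p&gt;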
&lt;p&gt;There are two caveats worth discussing but I've pushed them to appendices.  &lt;a href="#appdim"&gt;The first&lt;/a&gt; is that it bugs me that splitting the log density ratio is awkward in terms of dimensional analysis, and &lt;a href="#appemp"&gt;the second&lt;/a&gt; is that while this gives us a meaningful objective, it requires that we be able to take expectations with respect to the true distribution.  If we have only finite samples in the form of a training set, that introduces complications.  While I want to acknowledge that reusing a fixed dataset is a problem that has to be dealt with, I also want to highlight that it isn't a problem with the &lt;em&gt;objective&lt;/em&gt;.  Our KL divergence objective is telling us the right thing to do; we need to work out real world issues about how to best implement that objective.  Those complications are outside the scope of this discussion.&lt;/p&gt;
&lt;h2&gt;Supervised Learning&lt;/h2&gt;
&lt;p&gt;Let's complicate things slightly.  Instead of imagining that we have a single random variable in the real world, imagine instead we have a pair of variables, $X$ and $Y$.  For concreteness, imagine the $X$ are images and the $Y$ are their associated labels in some dataset.&lt;/p&gt;
&lt;figure id="#supervised-learning" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/supervised-learning.png"
    alt="A graphical version of supervised learning, the left shows the distribution $p(y,x)$ where $X$ is a random variable denoting images and $Y$ are some labels. The right shows the world of our dreams, where we draw the images $X$ from the same process as in the real world $p(x)$ but we have a machine that applies the labels: $q(y|x)$."&gt;
  &lt;figcaption&gt;
  Figure 4. Supervised Learning.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;What are we after? What does success look like? Let's imagine that what we desire is the ability to assign labels to data.  What we wish were the case was that we used the same process to draw the images $q(x) = p(x)$, but instead of using the real world process to assign labels, ideally the labels would instead come from a device under our control: $q_\theta(y|x)$. &lt;a href="#theta"&gt;&lt;sup&gt;7&lt;/sup&gt;&lt;/a&gt;  Just as before, we simply minimize the KL divergence between these two joints and we obtain an objective:&lt;/p&gt;
&lt;aside&gt; &lt;sup id="theta"&gt;7&lt;/sup&gt;
I'm going to start dropping the subscript $\theta$ for the parameters.
&lt;/aside&gt;
&lt;p&gt;$$ \left\langle \log \frac{p(x,y)}{p(x)q(y|x)} \right\rangle. $$&lt;/p&gt;
&lt;p&gt;Just as above, when we drop constants outside of our control, we end up with the usual maximum likelihood objective we are used to:&lt;/p&gt;
&lt;p&gt;$$ \left\langle \log \frac{p(x)p(y|x)}{p(x)q(y|x)} \right\rangle = \left\langle \log \frac{p(y|x)}{q(y|x)} \right\rangle. $$
The same caveats about proper handling of dimensions and issues stemming from using a fixed set of finite samples apply.&lt;/p&gt;
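&lt;p&gt;Spelled out, the split mirrors the density estimation case (a sketch in the same braket notation, with expectations over $p(x,y)$):
$$ \left\langle \log \frac{p(y|x)}{q(y|x)} \right\rangle = \left\langle \log p(y|x) \right\rangle + \left\langle -\log q(y|x) \right\rangle. $$
The first term is the negative conditional entropy of the true labeling process, which none of our parameters touch.  The second is the conditional cross-entropy, the familiar negative log-likelihood loss used to train classifiers.&lt;/p&gt;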
&lt;p&gt;This conditional likelihood optimization objective is truly the workhorse of modern machine learning.  However, I feel as though it's a bit dishonest.  In practice we rarely care too much about the actual predictive task we are mimicking with our parametric conditional density.  Very few people actually care about assigning &lt;a href="https://en.wikipedia.org/wiki/ImageNet"&gt;ImageNet&lt;/a&gt; labels to images.  Instead, the explosion in deep learning is mostly due to a happy little accident.  When we train very large, very expressive conditional distributions to minimize the conditional KL for something like ImageNet labeling with large datasets, we've discovered that the &lt;em&gt;representations&lt;/em&gt; formed by some intermediate (usually penultimate) layer in that neural network are useful for a wide array of different image tasks. This didn't have to be the case, but we got a bit lucky.&lt;/p&gt;
&lt;p&gt;What if we wanted to learn a useful representation? What would true representation learning look like?&lt;/p&gt;
&lt;h2&gt;Variational Autoencoders&lt;/h2&gt;
&lt;p&gt;So far we've only ever represented the world as it &lt;em&gt;is&lt;/em&gt; and haven't yet taken the step of &lt;em&gt;augmenting&lt;/em&gt; the &lt;em&gt;real world&lt;/em&gt; with something new.  If we want to learn a representation, that's something that lives in the real world. That's a new
random variable.&lt;/p&gt;
&lt;p&gt;Let's start with an unsupervised case.  We have images and we want to form a representation of those images.  In our real world, we have the images $X$ drawn from some distribution outside our control ($p(x)$).  Now we'll &lt;em&gt;augment&lt;/em&gt; the
real world with a new random variable $Z$; our &lt;em&gt;representation&lt;/em&gt;.  We'll parameterize this with a neural network $p(z|x)$ that defines a tractable distribution for our stochastic representation $Z$. This is our &lt;em&gt;encoder&lt;/em&gt;, which maps an image $X$ to a distribution for its representation.
We want to consider a whole slew of possible &lt;em&gt;real worlds&lt;/em&gt;, each world consisting of a different setting of the parameters of our encoder, and thus each world consisting of a different joint distribution $p(x,z)$.  Now our parameters $\theta$ essentially index one of
a wide array of possible joint distributions $p(x,z)$.  How do we decide amongst these?  What does success look like?  We are seeking a world in which we can &lt;em&gt;encode&lt;/em&gt; images into a useful representation $p(z|x)$.  One way to define success would be if those learned representations were really like &lt;em&gt;latents&lt;/em&gt; for the images themselves.  Wouldn't it be swell if instead the world worked by looking at our own learned representation and used that to formulate the images themselves?  Wouldn't it be grand if that joint distribution factorized in the opposite direction: $q(x,z) = q(z)q(x|z)$?  This is the usual generative model story, where we first draw a latent variable $z$ from some prior distribution and then &lt;em&gt;decode&lt;/em&gt; it through a stochastic map $q(x|z)$ to formulate our image.  Such a latent would be demonstrably useful for generating images.&lt;/p&gt;
&lt;figure id="vae" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/vae.png"
    alt="A graphical version of a VAE."&gt;
  &lt;figcaption&gt;
  Figure 5. Variational Autoencoders.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;Having defined both the real worlds under consideration $p(x,z)$ and the definition of success $q(x,z)$, our objective is the universal one of minimizing the KL divergence betwixt the two, from $p$ to $q$.  We try to make it as hard as possible for us or anyone else to distinguish between the real world in which we send images forward through an encoder to form a representation and some hypothetical world in which those representations were drawn from some prior and acted as a latent for a decoder that generated images.  We've just recreated the ELBO, or Evidence Lower Bound, objective (up to a sign and the constant entropy of $p(x)$):&lt;/p&gt;
&lt;p&gt;$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle_p = \left\langle \log \frac{p(x)p(z|x)}{q(x|z)q(z)} \right\rangle_p \geq 0. $$&lt;/p&gt;
&lt;p&gt;Since this is a joint KL and all KLs are nonnegative, this objective is non-negative.  Furthermore, because of the monotonicity of KL, we know this is a bound on something we might care about, the marginal KL of our generative or reverse path:
$$ \left\langle \log \frac{p(x)p(z|x)}{q(x|z)q(z)} \right\rangle_p \geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle_p  \geq 0. $$
So, as a bonus, if we push down on this joint KL objective, since this bounds the marginal KL on $X$, we can be assured that this machine composed of three parts, the encoder $p(z|x)$, decoder $q(x|z)$ and marginal (or prior) $q(z)$ will, as we adjust their tunable parameters, additionally make progress on the generative path: $z \sim q(z), x \sim q(x|z)$ itself being as indistinguishable as possible from the original image generating process $p(x)$.  Building and training the representation learning objective, as a side effect, ensures we also manage to build a good generative model.&lt;/p&gt;
&lt;p&gt;We can split this objective up and name the various terms:
$$ \underbrace{\left\langle -\log q(x|z) \vphantom{\left\langle \frac p q \right\rangle} \right\rangle_p}_{D} + \underbrace{\left\langle \log \frac{p(z|x)}{q(z)}\right\rangle_p}_{R} \geq \underbrace{\left\langle -\log q(x) \vphantom{\left\langle \frac p q \right\rangle} \right\rangle_p}_{L} \geq \underbrace{\left\langle -\log p(x) \vphantom{\left\langle \frac p q \right\rangle} \right\rangle_p}_{H}, $$
or in short:
$$ D + R \geq L \geq H, $$&lt;/p&gt;
&lt;aside&gt; &lt;sup id="brokenelbo"&gt;8&lt;/sup&gt;
&lt;i&gt;Fixing a Broken ELBO&lt;/i&gt;. AA Alemi, B Poole, I Fischer, JV Dillon, RA Saurous, K Murphy. ICML 2018. arXiv: &lt;a href="https://arxiv.org/abs/1711.00464"&gt;1711.00464&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;a geometric story we tell in more detail in prior work.&lt;a href="#brokenelbo"&gt;&lt;sup&gt;8&lt;/sup&gt;&lt;/a&gt;  The first term, the &lt;em&gt;distortion&lt;/em&gt;, measures how well we are able to recover the original image after encoding it with the encoder $z \sim p(z|x)$ and then trying to decode back to the original image $q(x|z)$.  The second term in the objective is the &lt;em&gt;rate&lt;/em&gt;, which measures the information theoretic cost of the encoding itself.  If Alice and Bob were attempting to communicate the encoding $z$, the KL between the encoding distribution and the prior measures the excess cost of communicating the encoding.&lt;/p&gt;
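&lt;p&gt;To make the chain $D + R \geq L \geq H$ concrete, here is a sketch that evaluates all four quantities on a tiny discrete example with binary $X$ and $Z$.  Every number in it (the data distribution, encoder, prior and decoder tables) is an arbitrary choice for illustration, not something taken from a trained model.&lt;/p&gt;

```python
import math

# Toy discrete setup; every number below is an arbitrary illustrative choice.
p_x = [0.4, 0.6]                      # real-world p(x)
encoder = [[0.9, 0.1], [0.2, 0.8]]    # encoder[x][z] = p(z|x)
prior = [0.5, 0.5]                    # q(z)
decoder = [[0.8, 0.2], [0.3, 0.7]]    # decoder[z][x] = q(x|z)

xs, zs = range(2), range(2)

# Distortion D and rate R are expectations over the joint p(x) p(z|x).
D = -sum(p_x[x] * encoder[x][z] * math.log(decoder[z][x])
         for x in xs for z in zs)
R = sum(p_x[x] * encoder[x][z] * math.log(encoder[x][z] / prior[z])
        for x in xs for z in zs)

# Marginal of the generative path: q(x) = sum_z q(z) q(x|z).
q_x = [sum(prior[z] * decoder[z][x] for z in zs) for x in xs]
L = -sum(p_x[x] * math.log(q_x[x]) for x in xs)   # cross-entropy H[p;q]
H = -sum(p_x[x] * math.log(p_x[x]) for x in xs)   # entropy of the data

gap_joint = D + R - L      # expected KL between p(z|x) and the posterior q(z|x)
gap_marginal = L - H       # marginal KL I[p(x);q(x)]
```

&lt;p&gt;Both gaps come out positive here, as they must: the first is the price of using our encoder instead of the decoder's true posterior, and the second is how far the generative path still is from the data.&lt;/p&gt;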
&lt;p&gt;If we are careful to split up the objective into its various reparameterization-independent components, we can also explore some trade-offs between the different terms in the objective by adding Lagrange multipliers, obtaining the $\beta$-VAE&lt;a href="#betavae"&gt;&lt;sup&gt;9&lt;/sup&gt;&lt;/a&gt;:
$$ \left\langle -\log q(x|z) \right\rangle_p + \beta \left\langle \log \frac{p(z|x)}{q(z)}\right\rangle_p. $$&lt;/p&gt;
&lt;aside&gt; &lt;sup id="betavae"&gt;9&lt;/sup&gt;
&lt;i&gt;beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework&lt;/i&gt;.
I Higgins et al. ICLR 2017. &lt;a href="https://openreview.net/forum?id=Sy2fzU9gl"&gt;[OpenReview]&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;All told, the universal recipe has given us a proper &lt;em&gt;representation learning&lt;/em&gt; objective, albeit unsupervised.  We have defined what it could mean for a representation to be a good one and we are able to search now in the space of all possible representations.  Unfortunately, a bit is a bit and unless we bring some kind of auxiliary information to the table, the success and utility of this objective is often left to inductive biases in our particular choices of variational families.&lt;/p&gt;
&lt;h2&gt;Variational Information Bottleneck&lt;/h2&gt;
&lt;p&gt;If we want to be a bit more explicit in our representation learning objectives, we could &lt;em&gt;color the bits&lt;/em&gt; by bringing an auxiliary variable to the table.  Imagine our real world distribution consists of pairs, $(x,y)$ drawn from some joint distribution $p(x,y)$ outside of our control.  Imagine images $X$ and labels $Y$.  As before, we can augment this world with a new random variable $Z$, a &lt;em&gt;representation&lt;/em&gt;, which, in this example, we are interested in depending only on the image part, $p(z|x)$.  We do this because we'd like to be able to compute the representation of some downstream image without having access to its label.  As before, we've now defined a whole slew of possible worlds, consisting of all possible encoding distributions paired with our joint input distribution $p(x,y,z) =p(x,y)p(z|x)$.  How do we decide amongst these? What does success look like?  Let's define success as being able to use our learned representation $Z$, not to recreate the image, but only to predict the auxiliary information $Y$.  This gives us a set of diagrams as in Figure 6 below.&lt;/p&gt;
&lt;figure id="vib" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/vib.png"
    alt="A graphical version of VIB."&gt;
  &lt;figcaption&gt;
  Figure 6. Variational Information Bottleneck.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;aside&gt; &lt;sup id="vibpaper"&gt;10&lt;/sup&gt;
&lt;i&gt;Deep Variational Information Bottleneck&lt;/i&gt;. AA Alemi, I Fischer, JV Dillon, K Murphy. ICLR 2017. arXiv: &lt;a href="https://arxiv.org/abs/1612.00410"&gt;1612.00410&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;Following the universal recipe and taking the KL divergence between these two joints lets us reinvent the Variational Information Bottleneck:&lt;a href="#vibpaper"&gt;&lt;sup&gt;10&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;$$ \left\langle \log \frac{p(y|x) p(z|x)}{q(y|z) q(z)} \right\rangle_p \geq \left\langle \log \frac{p(y|x)}{q(y|x)}\right\rangle_p \geq 0. $$
Because KL is monotonic, this joint objective bounds the marginal conditional likelihood and we can rest assured that our predictive engine is still trying to mimic the labeling distribution.  This objective learns a representation that specifically aims to retain only the information that is relevant to predicting the auxiliary information contained in $Y$.  Because the objective is representation centric, we also learn a stochastic representation that can truly compress the inputs.&lt;/p&gt;
&lt;!-- TODO: more info and some background of how VIB behaves. --&gt;
&lt;h2&gt;Semi-Supervised Learning&lt;/h2&gt;
&lt;p&gt;We saw that VAEs came from asking that the learned representation be able to recreate the images, and that VIB was motivated by asking that the learned representation be able to predict an auxiliary variable.  What if we instead wanted to do both?&lt;/p&gt;
&lt;figure id="semi-supervised" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/semi-supervised.png"
    alt="A graphical version of a Semi-supervised VAE."&gt;
  &lt;figcaption&gt;
  Figure 7. Semi-Supervised Variational Autoencoder.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;We then obtain a type of semi-supervised VAE:&lt;/p&gt;
&lt;p&gt;$$ \left\langle -\beta \log q(x|z) - \gamma \log q(y|z) + \log \frac{p(z|x)}{q(z)} \right\rangle_p. $$
Here $\beta$ and $\gamma$ have been inserted to let us play with the trade-offs between how much emphasis we place on the reconstruction and the auxiliary variable, respectively.&lt;/p&gt;
&lt;h2&gt;Diffusion&lt;/h2&gt;
&lt;p&gt;As I outline in more detail in an &lt;a href="diffusion.html"&gt;earlier post&lt;/a&gt;, modern diffusion models can also be cast in this universal objective form.  We imagine a simple fixed forward process that iteratively adds Gaussian noise to an image, and try to learn a reverse process parameterized in a clever way.&lt;/p&gt;
&lt;figure id="diffusion" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/diffusion.png"
    alt="A graphical version of Diffusion."&gt;
  &lt;figcaption&gt;
  Figure 8. Variational Diffusion.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;The variational interpretation of diffusion models makes clear that they are little more than deep hierarchical VAEs, though with some tricks that make training them much more tractable than a general hierarchical VAE.&lt;/p&gt;
&lt;h2&gt;Bayesian Inference&lt;/h2&gt;
&lt;p&gt;So far we've focused on &lt;em&gt;local representation learning&lt;/em&gt;, wherein we want to form a representation of each example or image.  Let's now think a bit about &lt;em&gt;global representation learning&lt;/em&gt;.  We are going to observe an entire dataset and want to somehow summarize what we've learned.  Now we imagine a forward process in which we sample a whole set of data, $D$, and need to form some kind of summary statistic or description of the data: $p(\theta|D)$. What would success look like here?  Well, even if we aren't willing to assume much, we still might be willing to assume our data is &lt;em&gt;exchangeable&lt;/em&gt;, that is, that the order the data was generated in doesn't matter.  &lt;a href="https://en.wikipedia.org/wiki/Bruno_de_Finetti"&gt;De Finetti&lt;/a&gt; tells us this is equivalent to being able to describe the data as being &lt;em&gt;conditionally i.i.d.&lt;/em&gt; (independent and identically distributed).  That is, we will describe success as taking the form of a sort of generative story:
$$ q(\theta) q(D|\theta), $$
where we draw the summary $\theta$ from some &lt;em&gt;prior&lt;/em&gt; and use it to generate the data with some &lt;em&gt;likelihood&lt;/em&gt; which we can take to decompose: $q(D|\theta) = \prod_i q(x_i|\theta)$.&lt;/p&gt;
&lt;figure id="bayes" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/bayes.png"
    alt="A graphical version of Bayesian Inference."&gt;
  &lt;figcaption&gt;
  Figure 9. (Variational) Bayesian Inference.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;It's the same story we've told several times now: our universal recipe gives us an objective, the KL divergence between these two joints, which aims to make them as indistinguishable as possible:
$$ \left\langle \log \frac{p(D)p(\theta|D)}{q(\theta)q(D|\theta)} \right\rangle_p . $$
If we drop the constant terms outside of our control, separate the remaining terms into pieces, and insert a trade-off parameter, we've reinvented a generalized form of variational Bayesian inference:
$$ \left\langle -\beta \log q(D|\theta) + \log \frac{p(\theta|D)}{q(\theta)} \right\rangle_p. $$
If we set $\beta=1$ and make our $p(\theta|D)$ expressive enough to cover the space of all possible distributions, minimizing this objective recovers the Bayesian posterior.  If we simply restrict our attention to some kind of parametric family of distributions $p(\theta|D)$ this is the ELBO used in variational Bayes.  Lots of names for the same idea: try to form a global representation of data that is as indistinguishable as possible from the data being exchangeable.&lt;/p&gt;
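&lt;p&gt;As a minimal sketch of the $\beta=1$ case, assume a toy model that is not from the discussion above: a standard normal prior $q(\theta) = \mathcal{N}(0, 1)$, a unit-variance Gaussian likelihood $q(x_i|\theta) = \mathcal{N}(\theta, 1)$, and a Gaussian family $p(\theta|D) = \mathcal{N}(\mu, s^2)$.  Every term of the objective then has a closed form, and a crude grid search should land next to the exact conjugate posterior.  The function names and the data are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def objective(mu, s, data, beta=1.0):
    """Variational objective for p(theta|D) = N(mu, s^2).

    Assumes a prior q(theta) = N(0, 1) and a likelihood q(x|theta) = N(theta, 1).
    """
    # Expected negative log-likelihood under theta ~ N(mu, s^2), in closed form.
    nll = sum(0.5 * math.log(2 * math.pi) + 0.5 * ((x - mu) ** 2 + s ** 2)
              for x in data)
    # KL[N(mu, s^2) || N(0, 1)]: the expected log ratio of p(theta|D) to q(theta).
    kl = 0.5 * (mu ** 2 + s ** 2 - 1.0) - math.log(s)
    return beta * nll + kl

def exact_posterior(data):
    """Conjugate posterior mean and standard deviation for the same toy model."""
    n = len(data)
    return sum(data) / (n + 1), math.sqrt(1.0 / (n + 1))

data = [0.8, 1.9, 1.3, 0.4, 1.6]  # made-up observations
# Coarse grid over the variational parameters (mu, s).
grid = [(m / 100, s / 100) for m in range(0, 201) for s in range(5, 151)]
best_mu, best_s = min(grid, key=lambda ms: objective(ms[0], ms[1], data))
post_mu, post_s = exact_posterior(data)
print(f"grid minimum:    mu={best_mu:.2f}, s={best_s:.2f}")
print(f"exact posterior: mu={post_mu:.2f}, s={post_s:.2f}")
```
&lt;/pre&gt;
&lt;p&gt;Passing a different &lt;code&gt;beta&lt;/code&gt; to &lt;code&gt;objective&lt;/code&gt; reweights the likelihood term against the KL term, so the minimizer moves away from the ordinary posterior.&lt;/p&gt;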
&lt;h2&gt;Bayesian Neural Network&lt;/h2&gt;
&lt;p&gt;We don't have to stop now. Let's imagine we want to generate a global summary of data in the form of the best settings of the parameters of a neural network to make some supervised predictions.  We can do that too: we simply follow the universal recipe.  We draw the real world and the world of our desires.&lt;/p&gt;
&lt;figure id="bnn" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/bnn.png"
    alt="A graphical version of Bayesian Neural Networks."&gt;
  &lt;figcaption&gt;
  Figure 10. Bayesian Neural Networks.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;And take the KL betwixt them:
$$ \left\langle -\beta \log q(y|x,\theta) + \log \frac{p(\theta|D)}{q(\theta)} \right\rangle_p, $$
and we've reinvented Bayes By Backprop.&lt;a href="#bbb"&gt;&lt;sup&gt;11&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="bbb"&gt;11&lt;/sup&gt;
&lt;i&gt;Weight Uncertainty in Neural Networks&lt;/i&gt; Blundell et al. ICML 2015. arXiv: &lt;a href="https://arxiv.org/abs/1505.05424"&gt;1505.05424&lt;/a&gt;
&lt;/aside&gt;
&lt;h2&gt;TherML&lt;/h2&gt;
&lt;p&gt;From here you might be wondering what it would look like if we tried to be as honest as possible about the sort of standard practice in machine learning today.  In our &lt;a href="https://arxiv.org/abs/1807.04162"&gt;earlier work&lt;/a&gt;&lt;a href="#therml"&gt;&lt;sup&gt;12&lt;/sup&gt;&lt;/a&gt; we did exactly that and came up with the following diagram:&lt;/p&gt;
&lt;aside&gt; &lt;sup id="therml"&gt;12&lt;/sup&gt;
&lt;i&gt;TherML: Thermodynamics of Machine Learning&lt;/i&gt;
AA Alemi, I Fisher. ICML2018 TFADGM Workshop. arXiv:&lt;a href="https://arxiv.org/abs/1807.04162"&gt;1807.04162&lt;/a&gt;
&lt;/aside&gt;
&lt;figure id="therml" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/therml.png"
    alt="A graphical version of TherML."&gt;
  &lt;figcaption&gt;
  Figure 10. TherML.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;This gave us an objective that seemed to include all of the previous things discussed as special cases and left open the door for interesting behavior on the spots in between.&lt;/p&gt;
&lt;p&gt;Rearranging the objective into terms:
$$ \left\langle \gamma \underbrace{\left(-\log q(y|z)\right) \vphantom{\log \frac{p(x)}{q(x)}}}_{C} + \delta \underbrace{\left(-\log q(x|z)\right) \vphantom{\log \frac{p(x)}{q(x)}}}_{D} + \sigma \underbrace{\log \frac{p(\theta|D)}{q(\theta)}}_{S} + \underbrace{\log \frac{p(z|x,\theta)}{q(z)}}_{R} \right\rangle_p \geq 0, $$
as we discuss in the paper, we get an objective that lets us trade off between the ability of our representation to do reconstruction ($D$ term) and to predict auxiliary variables ($C$ term), all the while being honest about the information our learning algorithm extracts from the dataset ($S$ term) and how expensive our learned representation is ($R$ term).  Inserting tradeoff parameters ($\gamma,\delta,\sigma$) would let you explore an entire three-dimensional frontier of optimal solutions that explore all tradeoffs between these different criteria.&lt;/p&gt;
&lt;h2&gt;Variational Prediction&lt;/h2&gt;
&lt;p&gt;While most of the previous diagrams were all retellings of essentially the same story, more recently we've begun to wonder what it might look like if we try some more extreme rewirings of these kinds of diagrams.  What if we wanted to try to be so brazen as to invent something that might be an alternative to Bayesian inference, as a different sort of diagram that could provide a global representation learning objective.  One candidate would be the following:&lt;/p&gt;
&lt;figure id="vp" class="right"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/kl-is-all-you-need/vp.png"
    alt="A graphical version of Variational Prediction."&gt;
  &lt;figcaption&gt;
  Figure 11. Variational Prediction.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt; 
&lt;p&gt;Which we explore in some detail in &lt;a href="https://arxiv.org/abs/2307.07568"&gt;our recent work&lt;/a&gt;.&lt;a href="#vp"&gt;&lt;sup&gt;13&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="vp"&gt;13&lt;/sup&gt;
&lt;i&gt;Variational Prediction&lt;/i&gt;.
AA Alemi, B Poole. AABI2023. arXiv:&lt;a href="https://arxiv.org/abs/2307.07568"&gt;2307.07568&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;I'm not sure this is better, but it's certainly different.&lt;/p&gt;
&lt;h2&gt;Closing&lt;/h2&gt;
&lt;p&gt;This post got fairly repetitive, but honestly that was the point.  A whole slew of existing and not yet invented machine learning objectives all seem to follow a very simple &lt;em&gt;universal recipe&lt;/em&gt;.  Simply draw an accurate causal model of the world, then augment it with anything you wish and finally draw a second diagram in the same random variables that corresponds to your marker of success.  Take the KL between the two and you've got yourself a reasonable objective.  I hope this helps you understand some of these and potentially invent new ones of your own.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Special thanks to Mark Kurzeja, John Stout and Mallory Alemi for helpful feedback on this post.&lt;/small&gt;&lt;/p&gt;
&lt;span id="appdim"&gt;
&lt;h2&gt;Appendix A - Dimensional Consistency&lt;/h2&gt;
&lt;/span&gt;
&lt;p&gt;There is one caveat, I'm &lt;a href="kl.html#appendix-a"&gt;a particular stickler&lt;/a&gt; for decomposing KL divergences in this way. I don't think it makes any dimensional sense.  I can't take the logarithm of a dimensional quantity, let alone a density.  To fix the glitch, let's instead try to explicitly choose some tractable base measure $m(x)$ and insert it into our original objective:&lt;/p&gt;
&lt;p&gt;$$ \left\langle \log \frac{p(x)}{q_\theta(x)} \right\rangle_p = \left\langle \log \frac{p(x) m(x)}{q_\theta(x)m(x)} \right\rangle_p = \left\langle \log \frac{p(x)}{m(x)} \right\rangle_p + \left\langle \log \frac{m(x)}{q_\theta(x)} \right\rangle_p . $$&lt;/p&gt;
&lt;p&gt;Now, we've decomposed the KL divergence between $P$ and $Q$ into two terms, the first is the KL divergence between $P$ and $M$, our base density. Just as before, this is some constant outside our control. As long as we fix $m(x)$, given that $p(x)$ is fixed, their KL divergence is fixed and no changes we make to $\theta$ have any effect, so we can drop this (now appropriately reparameterization-independent) term from our objective.  We're left with the weight of evidence samples from $p$ provide in favor of $m$ against $q$. If we try to adjust the parameters of $q_\theta(x)$ to make it as easy as possible to distinguish it from some base measure $m(x)$, under samples from $p$, we ensure that we drive $q$ &lt;em&gt;towards&lt;/em&gt; $p$.  If we use ordinary path gradients the choice of $m(x)$ here won't actually affect the optimization trajectory. It will, however, help us sleep at night, ensuring that our objective is a truly reparameterization-invariant quantity. &lt;a href="#controlvariates"&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="controlvariates"&gt;5&lt;/sup&gt;
If I'm being honest, this is something that bothers me that I don't fully understand.  Using a baseline model here functionally takes the same form as control variates like the baselines used in REINFORCE, but here they don't help at all (we aren't taking an expectation with respect to $q$ here). Regardless, it really feels like an appropriate base measure &lt;i&gt;ought&lt;/i&gt; to help.  I can't help but think that it signals a problem with the gradients we take in machine learning today. Things like &lt;a href="https://arxiv.org/abs/2206.07137"&gt;RHO-Loss&lt;/a&gt; reinforce this idea.
&lt;/aside&gt;
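&lt;p&gt;The claim that the base measure leaves the optimization untouched can be checked with a small sketch.  Everything here is an assumption made for illustration: $q_\theta$ is a unit-variance Gaussian with mean $\theta$, the base measure $m$ is a wide zero-mean Gaussian, and the samples are stand-ins for draws from $p$.  The two losses should differ by a constant that does not depend on $\theta$, so their gradients agree.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def log_normal(x, mean, std):
    """Log density of N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

def loss_plain(theta, samples):
    """Average of -log q_theta(x): the usual cross-entropy term."""
    return sum(-log_normal(x, theta, 1.0) for x in samples) / len(samples)

def loss_with_base(theta, samples, base_std=3.0):
    """Average of log m(x) - log q_theta(x), with base measure m = N(0, base_std^2)."""
    return sum(log_normal(x, 0.0, base_std) - log_normal(x, theta, 1.0)
               for x in samples) / len(samples)

def finite_diff(f, theta, eps=1e-5):
    """Central finite-difference derivative of f at theta."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

samples = [-0.3, 0.5, 1.2, 2.0, 0.1]  # stand-ins for draws from p
theta = 0.25
grad_plain = finite_diff(lambda t: loss_plain(t, samples), theta)
grad_base = finite_diff(lambda t: loss_with_base(t, samples), theta)
offset = loss_with_base(theta, samples) - loss_plain(theta, samples)
print("gradient without base measure:", grad_plain)
print("gradient with base measure:   ", grad_base)
print("constant offset between the two losses:", offset)
```
&lt;/pre&gt;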
&lt;span id="appemp"&gt;
&lt;h2&gt;Appendix B - Finite Samples and the Empirical Distribution&lt;/h2&gt;
&lt;/span&gt;
&lt;p&gt;We motivated that a useful objective for learning a parametric distribution is to minimize the KL divergence between the true distribution and our parametric distribution, i.e. we should adjust the parameters of our distribution to maximize the likelihood of samples from the true distribution.  In practice, however, we typically only have access to a &lt;em&gt;finite&lt;/em&gt; number of samples from the true distribution, and this introduces a difficulty.  If we wanted to, we could generate an unbiased estimate of the expected negative log-likelihood of our model using a finite number of samples from the true distribution:
$$ -\left\langle \log q(x|\theta) \right\rangle_p \approx -\frac{1}{N} \sum_{i=1}^{N} \log q(x_i|\theta). $$
Nothing wrong here.  There is similarly nothing wrong with taking the gradient of this Monte Carlo estimate to generate an unbiased estimate of the gradient of the true expected negative log-likelihood:
$$ -\nabla_\theta \left\langle \log q(x|\theta) \right\rangle_p \approx -\frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log q(x_i|\theta). $$
The problem only occurs if we start to &lt;em&gt;reuse&lt;/em&gt; the same samples.  These Monte Carlo estimates are only &lt;em&gt;unbiased&lt;/em&gt; estimates of the true expectation if the samples are independent.  If we start to take multiple gradient steps with overlapping samples we start to introduce some bias.  Taken to the extreme, if we simply maximize the &lt;em&gt;empirical&lt;/em&gt; likelihood on a fixed set of finite samples:
$$ \sum_{i=1}^N \log q(x_i|\theta), $$
we are no longer minimizing the KL divergence between the &lt;em&gt;true&lt;/em&gt; distribution $p(x)$ and our parametric distribution $q(x|\theta)$; instead we are minimizing the KL divergence between the &lt;em&gt;empirical&lt;/em&gt; distribution $\hat p$ and our parametric distribution $q(x|\theta)$:
$$ \hat p \equiv \frac 1 N \sum_{i=1}^N \delta(x - x_i). $$
If we had a very large number of samples, this empirical estimate would be pretty close to the true distribution, $\hat p \sim p$, but with finite samples it is always a distinct distribution from the true one.  If we minimize the empirical risk, or maximize the empirical likelihood, what we are really doing is getting our parametric distribution to be as indistinguishable as possible from the empirical distribution. This is equivalent to saying we should match sampling with replacement from our training set.  This is really where all of the issues of over-fitting come from.  The degree to which matching the empirical distribution rather than the true distribution is a problem depends on how little data we have (relative to its extent or coverage) and how flexible our parametric model is (the degree to which it can memorize the data we show it and nothing else).  In the context of classical machine learning this is where &lt;em&gt;regularization&lt;/em&gt; comes to bear: we typically add some additional terms to our objective beyond just the empirical likelihood to attempt to get our learned model to better approximate the true distribution rather than the empirical.&lt;/p&gt;
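&lt;p&gt;A small sketch of the gap between the empirical and the true objective, under assumptions chosen only for illustration: the true distribution is a standard normal, the model is a Gaussian with free mean and standard deviation, and the training set has ten points.  For this model both the empirical maximum-likelihood fit and the true expected negative log-likelihood have closed forms.&lt;/p&gt;
&lt;pre&gt;
```python
import math
import random

def avg_nll(samples, mean, std):
    """Empirical average of -log q(x|theta) for a Gaussian q, theta = (mean, std)."""
    return sum(0.5 * math.log(2 * math.pi * std ** 2)
               + 0.5 * ((x - mean) / std) ** 2 for x in samples) / len(samples)

def true_nll(mean, std):
    """Exact expectation of -log q(x|theta) when the true p is N(0, 1)."""
    return 0.5 * math.log(2 * math.pi * std ** 2) + (1.0 + mean ** 2) / (2 * std ** 2)

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(10)]  # a small, fixed training set

# Maximizing the empirical likelihood of a Gaussian has a closed form.
mle_mean = sum(train) / len(train)
mle_std = math.sqrt(sum((x - mle_mean) ** 2 for x in train) / len(train))

print("empirical NLL at the fitted parameters:", avg_nll(train, mle_mean, mle_std))
print("empirical NLL at the true parameters:  ", avg_nll(train, 0.0, 1.0))
print("true NLL at the fitted parameters:     ", true_nll(mle_mean, mle_std))
print("true NLL at the true parameters:       ", true_nll(0.0, 1.0))
```
&lt;/pre&gt;
&lt;p&gt;By construction the fitted parameters can only look at least as good as the true ones on the training set, and can only look at least as bad in expectation under $p$.  The distance between those two rows is the over-fitting described above.&lt;/p&gt;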
&lt;p&gt;I want to acknowledge that this is a problem, but in the context of the current discussion I want to point out that this &lt;em&gt;isn't&lt;/em&gt; a problem with our &lt;em&gt;objective&lt;/em&gt;.  It is a good idea to try to minimize the KL divergence between the true distribution and our parametric model.  After we decide on this objective, unfortunately, there are practical issues we have to consider about how to target this objective tractably and accurately.&lt;/p&gt;
&lt;!-- TODO: bigger thread is that once we see that everything is just drawing diagrams, we can ask the meta question of how we *ought* to draw diagrams --&gt;
&lt;!--
## Appendix - Pointwise Bounds
TODO:

 * Reinforcement Learning
 * Learning from human preferences ala. DPO and a density estimation perspective on learning from human feedback.
 * Other Semi-supervised or Contrastive learning.
--&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/kl-is-all-you-need.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Mon, 08 Jan 2024 00:00:00 -0500</pubDate></item><item><title>Leap Day</title><description>Going overboard to prove the local newspaper wrong.</description><guid isPermaLink="true">https://blog.alexalemi.com/ob/nbs/leap-day.html</guid><category domain="https://alexalemi.com/posts/">obtudes</category><pubDate>Fri, 29 Mar 2024 00:00:00 -0400</pubDate></item><item><title>Training LLMs over Neurally Compressed Text</title><link>https://arxiv.org/abs/2404.03626</link><description>Trying to train transformers on top of transformers with arithmetic compression. / B Lester, J Lee, AA Alemi, J Pennington, A Roberts, J Sohl-Dickstein, N Constant / 2404.03626 / TMLR, ICLR2025</description><guid isPermaLink="true">https://alexalemi.com/publications/llm-compression.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/llm-compression.pdf" length="1002504" type="application/pdf"/><pubDate>Mon, 01 Apr 2024 00:00:00 -0400</pubDate></item><item><title>Scaling Exponents Across Parameterizations and Optimizers</title><link>https://arxiv.org/abs/2407.05872</link><description>Understanding parameterizations and how to scale them. 
/ K Everett, L Xiao, M Wortsman, AA Alemi, R Novak, PJ Liu, I Gur, J Sohl-Dickstein, LP Kaelbling, J Lee, J Pennington / 2407.05872 / ICML 2024</description><guid isPermaLink="true">https://alexalemi.com/publications/scaling-parameterizations.pdf</guid><category domain="../publications/">publications</category><enclosure url="https://alexalemi.com/publications/scaling-parameterizations.pdf" length="5652437" type="application/pdf"/><pubDate>Mon, 01 Jul 2024 00:00:00 -0400</pubDate></item><item><title>A Degree of Certainty</title><description>Let's measure probability in degrees.</description><content:encoded>&lt;link href="https://fonts.googleapis.com/css2?family=Merriweather:ital,wght@0,300;0,400;1,300&amp;amp;display=swap" rel="stylesheet"&gt;
&lt;style&gt;
    svg { font-family: Merriweather; }
    pre { white-space: pre-wrap; }
&lt;/style&gt;
&lt;script type="module" src="./assets/Meter.js"&gt;&lt;/script&gt;
&lt;p&gt;With the upcoming election, I found myself thinking about the old &lt;a href="https://www.nytimes.com/2024/03/05/us/elections/super-tuesday-needle.html"&gt;NYTimes Needle&lt;/a&gt; and, more generally, about how to best represent and communicate probabilities.&lt;sup&gt;&lt;a href="#kaytalk"&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;&lt;i&gt;Note: if you want to see how this looks in the context of the 2024 Presidential election, see &lt;a href="https://www.alexalemi.com/random/election/"&gt;here&lt;/a&gt;.&lt;/i&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="kaytalk"&gt;1&lt;/sup&gt; 
 For a fantastic overview, see Matthew Kay's talk: &lt;a href="https://youtu.be/E1kSnWvqCw0?si=8oi5U6eAmjROWXdx"&gt;A biased tour of the uncertainty visualization zoo&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;We already have many ways to discuss degrees of belief: &lt;a href="https://en.wikipedia.org/wiki/Probability"&gt;probabilities&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Percentage"&gt;percents&lt;/a&gt;,&lt;sup&gt;&lt;a href="#percent"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;a href="https://en.wikipedia.org/wiki/Odds"&gt;odds&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Logit"&gt;log-odds&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Nat_(unit)"&gt;nats&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Bit"&gt;bits&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Hartley_(unit)"&gt;decibans&lt;/a&gt;, etc.
Why don't we add another to the mix?  What if we measure degrees of belief in... degrees.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="percent"&gt;2&lt;/sup&gt; 
 Not to mention &lt;a href="https://en.wikipedia.org/wiki/Per_mille"&gt;per mille (‰)&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Basis_point#Permyriad"&gt;permyriad (‱)&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Per_cent_mille"&gt;per cent mille (pcm)&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Parts-per_notation"&gt;parts per million (ppm)&lt;/a&gt;, parts per billion (ppb), etc...
&lt;/aside&gt;
&lt;p&gt;Specifically, let's use the following transformation:
$$ \theta = \arccos \sqrt p,  \qquad p = \cos^2 \theta .$$&lt;/p&gt;
&lt;figure id="scale" class="right"&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/degree-scale-font.svg"
    alt="A visual representation of the degree scale."&gt;
  &lt;figcaption&gt;
  Figure 1. A visual representation of the mapping.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;This mapping has a beautiful mathematical justification, gives rise to beautiful visualizations,
beautifully aligns with our existing intuitions and has a beautifully simple approximation.
What more could you want?&lt;/p&gt;
&lt;h2&gt;Mathematical Justification&lt;/h2&gt;
&lt;p&gt;What gives? Where does this mapping come from?  Why do we need another way to describe probabilities?&lt;/p&gt;
&lt;p&gt;None of the common ways to measure probabilities are statistically &lt;em&gt;uniform&lt;/em&gt;.  What do I mean by this? Not all 1% changes in probability mean the same thing.  Going from 98% to 99% certainty is a much bigger deal than going from 50% to 51%.  It requires more evidence.  99% is more &lt;em&gt;distinguishable&lt;/em&gt; from 98% than 51% is from 50%.  We intuitively know this: no one says they are 61% certain about something, but people will say they are 99% or 95% certain and expect these to mean different things.&lt;/p&gt;
&lt;p&gt;To measure this mathematically, we need to look at the most distinguished mathematical measure of distinguishability: the &lt;a href="kl.html"&gt;KL divergence&lt;/a&gt;.
For two Bernoulli distributions with probabilities $p$ and $p + \delta$, the KL divergence is:&lt;/p&gt;
&lt;p&gt;$$ D[p; p+\delta] \equiv p \log \frac{p}{p+\delta} + (1-p) \log \frac{1-p}{1-p-\delta} \approx \frac{\delta^2}{2 p (1-p)} + \cdots. $$&lt;/p&gt;
&lt;p&gt;To leading order, this is quadratic in the change $\delta$ and depends inversely on the probability $p$ and its complement $1-p$.  If we interpret this as a kind of squared distance, the square root of this gives us the usual &lt;a href="https://en.wikipedia.org/wiki/Jeffreys_prior"&gt;Jeffreys prior&lt;/a&gt; for the Bernoulli problem:&lt;/p&gt;
&lt;p&gt;$$ p(p) = \frac{1}{\pi \sqrt{p (1-p)} }. $$&lt;/p&gt;
&lt;figure id="jeffreys"&gt;
  &lt;center&gt;
  &lt;img src="figures/KLsmallchange_standard.png"
    alt="Jeffrey's prior for the Bernoulli problem."&gt;
  &lt;figcaption&gt;
  Figure 2. Unit infinitesimal changes in the probability have different statistical effects.  The effect is fairly extreme at the extremes.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Here we can clearly see that as we move towards 0 or 1, the statistical distance blows up.  Going from $0.99$ to $0.991$ is 26 times larger a change in terms of KL than going from $0.50$ to $0.501$.  Clearly, probabilities measured in percentages are very non-uniform.&lt;/p&gt;
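&lt;p&gt;A quick numerical sketch of this non-uniformity, comparing the exact Bernoulli KL divergence with its leading-order term, $\delta^2 / (2 p (1-p))$, which is positive since a KL divergence cannot be negative.  The helper names and the step size are my own choices for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def bernoulli_kl(p, q):
    """KL divergence D[p; q] between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def quadratic_approx(p, delta):
    """Leading-order term of D[p; p + delta]: delta^2 / (2 p (1 - p))."""
    return delta ** 2 / (2 * p * (1 - p))

delta = 0.001
for p in (0.50, 0.90, 0.99):
    exact = bernoulli_kl(p, p + delta)
    approx = quadratic_approx(p, delta)
    print(f"p={p:.2f}  exact={exact:.3e}  leading order={approx:.3e}")
```
&lt;/pre&gt;
&lt;p&gt;The same step in $p$ costs more and more as $p$ approaches the edges, and the quadratic term itself becomes a rougher approximation there because the higher-order corrections grow.&lt;/p&gt;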
&lt;p&gt;If we took as our prior the distribution $1/(\pi\sqrt{p(1-p)})$ we would be weighing the probabilities proportional to this statistical distance. That is, we would be putting equal weight on equally &lt;em&gt;distinguishable&lt;/em&gt; probabilities.  This is what motivated &lt;a href="https://en.wikipedia.org/wiki/Jeffreys_prior"&gt;Jeffreys&lt;/a&gt; to make his prior. He wanted a truly &lt;em&gt;non-informative&lt;/em&gt; prior.  Naively, Laplace suggested a &lt;em&gt;uniform&lt;/em&gt; prior as being non-informative.  But what does &lt;em&gt;uniform&lt;/em&gt; mean?  If you start with a uniform prior on percentages, it's very non-uniform when transformed into log-odds.  Uniform in log-odds is very non-uniform in terms of percentages.  If you start with a uniform prior in percents, you'll get a different posterior than if you start with a uniform prior in log-odds.  Clearly, your choice of parameterization is influencing your outcome.&lt;/p&gt;
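&lt;p&gt;To make the dependence on parameterization concrete, here is a small sketch using the standard Beta-Bernoulli conjugate update.  A prior that is uniform in $p$ is a Beta(1, 1) prior, a prior that is flat in log-odds corresponds to the improper Beta(0, 0) prior on $p$, and Jeffreys prior is Beta(1/2, 1/2).  The counts are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
def posterior_mean(successes, trials, a, b):
    """Posterior mean of p under a Beta(a, b) prior after Bernoulli observations."""
    return (successes + a) / (trials + a + b)

successes, trials = 9, 10  # hypothetical data
priors = {
    "uniform in p (Beta(1, 1))": (1.0, 1.0),
    "uniform in log-odds (improper Beta(0, 0))": (0.0, 0.0),
    "Jeffreys (Beta(1/2, 1/2))": (0.5, 0.5),
}
for name, (a, b) in priors.items():
    print(f"{name}: posterior mean = {posterior_mean(successes, trials, a, b):.3f}")
```
&lt;/pre&gt;
&lt;p&gt;Same data, three different answers, and the only thing that changed is the coordinate system in which the prior was declared flat.&lt;/p&gt;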
&lt;p&gt;If what we care about is the amount of information you need to modify your beliefs, we should weigh our beliefs in proportion to the amount of evidence they would need to move. This is what led Jeffreys to his prior, in the form we see above.  He showed that this is proportional to the square root of the determinant of the &lt;a href="https://en.wikipedia.org/wiki/Fisher_information_metric"&gt;Fisher metric&lt;/a&gt;.  Regardless of your choice of parameterization, if you compute the determinant of the Fisher metric in that parameterization and take its square root, you'll recover Jeffreys prior.  It is parameterization independent in this sense.&lt;/p&gt;
&lt;p&gt;While Jeffreys found a principled motivation for how to define &lt;em&gt;uniformity&lt;/em&gt; in a reparameterization independent way, what we don't have yet is a sense of what a principled &lt;em&gt;parameterization&lt;/em&gt; is.  Not all parameterizations are created equal.  Percentages diverge at the extremes. We should be able to do better.&lt;/p&gt;
&lt;p&gt;Let's try a second common parameterization. What if we tried to work in terms of log-odds?&lt;/p&gt;
&lt;p&gt;$$ \chi = \log \frac{p}{1-p}, $$&lt;/p&gt;
&lt;p&gt;We get KL divergences that take the form:&lt;/p&gt;
&lt;p&gt;$$ D[\chi; \chi+\delta] \approx \frac{\delta^2}{4 + 4 \cosh \chi} + \cdots, $$&lt;/p&gt;
&lt;p&gt;which has the opposite problem as seen below.  There is no divergence at the ends, there is a disappearance.&lt;/p&gt;
&lt;figure id="logits"&gt;
  &lt;center&gt;
  &lt;img src="figures/KLsmallchange_logits.png"
    alt="Jeffrey's prior for the Bernoulli problem, in logit space."&gt;
  &lt;figcaption&gt;
  Figure 3. Unit infinitesimal changes in logits have different statistical effects.  They vanish at the extremes.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;aside&gt;&lt;sup id="jeffrey-logit"&gt;3&lt;/sup&gt; 
 Coincidentally, though it's not often discussed, this is the form that Jeffreys prior takes when expressed in terms of log-odds: $1/\sqrt{4 + 4 \cosh \chi}$.
&lt;/aside&gt;
&lt;p&gt;Now, moving from $0.00$ to $0.01$ in log-odds is 42 times farther a statistical distance than going from $5.00$ to $5.01$ in log-odds.&lt;sup&gt;&lt;a href="#jeffrey-logit"&gt;3&lt;/a&gt;&lt;/sup&gt; At the extremes, log-odds become &lt;em&gt;indistinguishable&lt;/em&gt;.  A log-odds of 7 is closer to 5 than 0.01 is to 0.00.&lt;/p&gt;
&lt;p&gt;Very small changes in percentage near 1.0 require massive amounts of evidence to justify.  Massive changes in log-odds away from 0 require very little evidence to justify. Neither of these is ideal.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="alternative"&gt;4&lt;/sup&gt; 
  In writing this post it occurred to me that this might actually be a better way to derive Jeffreys prior in the first place.  One could say that Jeffreys prior is a &lt;i&gt;uniform&lt;/i&gt; prior (in the Laplace sense) in the parameterization for which the KL divergence is also &lt;i&gt;uniform&lt;/i&gt;.  Transforming this uniform prior in the uniform parameterization to any other is what gives you the square root of the determinant of the metric form we are used to seeing.
&lt;/aside&gt;
&lt;p&gt;The question then becomes: &lt;em&gt;What is the best parameterization?&lt;/em&gt;  How close to
uniform can we get? Is there a parameterization of degrees of belief for which
the statistical metric is flat? Equivalently, is there a
parameterization for which Jeffreys prior is uniform?&lt;a href="#alternative"&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Let's look for a transformation, $\theta(p)$, such that Jeffreys prior, $p(p) = 1/(\pi\sqrt{p(1-p)})$, transforms into the uniform prior: $p(\theta) = 1$.&lt;/p&gt;
&lt;p&gt;Densities transform like:&lt;/p&gt;
&lt;p&gt;$$ p(p)\, \mathrm{d}p = p(\theta)\, \mathrm{d}\theta. $$&lt;/p&gt;
&lt;p&gt;Substituting what we know, we want to solve:&lt;/p&gt;
&lt;p&gt;$$ \frac{\mathrm{d}p}{\pi \sqrt{p(1-p)}} = \mathrm{d}\theta . $$&lt;/p&gt;
&lt;p&gt;The solution takes the form (up to proportionality):&lt;/p&gt;
&lt;p&gt;$$ \theta = \arccos \sqrt p, \qquad p = \cos^2 \theta . $$&lt;/p&gt;
&lt;p&gt;This is the mapping we opened the post with.  In this parameterization, we have that the KL divergence is &lt;em&gt;flat&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;$$ D[\theta; \theta + \delta] \approx 2\delta^2  + \cdots . $$&lt;/p&gt;
&lt;p&gt;It is in this parameterization that a small change in the parameter means the same thing at every value of the parameter.  This parameterization is &lt;em&gt;uniform&lt;/em&gt; in a deep sense.  Jeffreys prior, expressed in this $\theta$ parameter, is uniform.&lt;/p&gt;
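&lt;p&gt;The three parameterizations can be compared side by side with a short sketch: take the same small step $\delta$ in probability, in log-odds, and in the angle $\theta$ (in radians), and compute the exact Bernoulli KL divergence each step produces.  The helper names and the sample points are assumptions made for illustration.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def bernoulli_kl(p, q):
    """KL divergence D[p; q] between Bernoulli(p) and Bernoulli(q), in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def identity(p):
    """Probabilities are already probabilities."""
    return p

def prob_from_log_odds(chi):
    """Inverse of chi = log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-chi))

def prob_from_angle(theta):
    """Inverse of theta = arccos(sqrt(p)), with theta in radians."""
    return math.cos(theta) ** 2

def kl_step(to_prob, value, delta):
    """KL divergence produced by moving a parameter from value to value + delta."""
    return bernoulli_kl(to_prob(value), to_prob(value + delta))

delta = 1e-3
print("probability steps:", [kl_step(identity, p, delta) for p in (0.5, 0.9, 0.99)])
print("log-odds steps:   ", [kl_step(prob_from_log_odds, c, delta) for c in (0.0, 2.0, 5.0)])
print("angle steps:      ", [kl_step(prob_from_angle, t, delta) for t in (0.3, 0.8, 1.2)])
print("flat prediction 2 * delta^2:", 2 * delta ** 2)
```
&lt;/pre&gt;
&lt;p&gt;The first row grows towards the edge, the second shrinks, and the third should stay close to $2\delta^2$ wherever the step is taken, up to the higher-order corrections.&lt;/p&gt;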
&lt;figure id="thetas"&gt;
  &lt;center&gt;
  &lt;img src="figures/KLsmallchange_theta.png"
    alt="Jeffrey's prior for the Bernoulli problem, in theta space."&gt;
  &lt;figcaption&gt;
  Figure 4. Unit infinitesimal changes in angles have uniform statistical effects.  
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;aside&gt;&lt;sup id="leading"&gt;5&lt;/sup&gt; 
 Technically, the KL divergence is only uniform to &lt;i&gt;leading order&lt;/i&gt;.  There are higher order corrections that show up and which are most extreme at the edges of the space. 
&lt;/aside&gt;
&lt;p&gt;This is, in some sense, the most natural parameterization of probabilities.  In terms of ordinary probabilities, the space is curved, the metric isn't flat, the world is distorted as we move around the space.  In terms of these &lt;em&gt;degrees&lt;/em&gt; ($\theta$), the metric is flat.  A 1° change means the same thing, statistically, regardless of where we start.&lt;a href="#leading"&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;Visualization&lt;/h2&gt;
&lt;p&gt;We will set the range of probabilities to be from 0° to 90°. This will allow us to visualize the whole space as a quarter circle, which conveniently resembles a meter when turned on its side.&lt;/p&gt;
&lt;p&gt;&lt;probability-meter id="interactive" probability="0.53"&gt;&lt;/probability-meter&gt;&lt;/p&gt;
&lt;div class="controls"&gt;
        &lt;input type="range" id="probabilitySlider" min="0" max="1" step="0.0001" value="0.53"&gt;
        &lt;input type="number" id="probabilityInput" min="0" max="1" step="0.0001" value="0.53"&gt;
&lt;/div&gt;
&lt;p&gt;A probability of &lt;span id="probVal"&gt;53&lt;/span&gt;% corresponds to an angle of &lt;span id="angVal"&gt;43.28&lt;/span&gt;°.&lt;/p&gt;
&lt;p&gt;This meter is interactive, you can adjust the probability with the slider or input box.&lt;/p&gt;
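&lt;p&gt;For readers without the interactive meter, the conversion itself is two lines.  This is a minimal sketch of the mapping $\theta = \arccos \sqrt p$ with the result expressed in degrees; the function names are hypothetical and are not taken from the meter's own source.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def prob_to_degrees(p):
    """Map a probability in [0, 1] to an angle in degrees: 0 is certain, 90 is impossible."""
    return math.degrees(math.acos(math.sqrt(p)))

def degrees_to_prob(deg):
    """Inverse map: p = cos^2(theta), with theta given in degrees."""
    return math.cos(math.radians(deg)) ** 2

for p in (0.09, 0.5, 0.53, 0.56, 0.97):
    print(f"p = {p:.2f}  corresponds to  {prob_to_degrees(p):.2f} degrees")
```
&lt;/pre&gt;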
&lt;aside&gt;&lt;sup id="quantum"&gt;6&lt;/sup&gt; 
 Bengtsson, Ingemar, and Karol Życzkowski. &lt;a href="https://www.google.com/books/edition/_/sYswDwAAQBAJ?hl=en&amp;gbpv=0"&gt;Geometry of quantum states: an introduction to quantum entanglement&lt;/a&gt;. Cambridge University Press, 2017.
&lt;/aside&gt;
&lt;aside&gt;&lt;sup id="quinn"&gt;7&lt;/sup&gt;
Quinn, Katherine N, et al. “Visualizing Probabilistic Models and Data with Intensive Principal Component Analysis.” Proceedings of the National Academy of Sciences, vol. 116, no. 28, 24 June 2019, pp. 13762–13767, &lt;a href="https://arxiv.org/abs/1810.02877"&gt;arxiv.org/abs/1810.02877&lt;/a&gt;, &lt;a href="https://doi.org/10.1073/pnas.1817218116"&gt;10.1073/pnas.1817218116&lt;/a&gt;. Accessed 23 Oct. 2024.
&lt;/aside&gt;
It turns out that the relative angle between two probabilities is related to the &lt;a href="https://en.wikipedia.org/wiki/Bhattacharyya_distance"&gt;Bhattacharyya distance&lt;/a&gt;.
If we take the straight line chordal distance between two probabilities on this arc, it is equivalent to the &lt;a href="https://en.wikipedia.org/wiki/Hellinger_distance"&gt;Hellinger distance&lt;/a&gt;.&lt;sup&gt;&lt;a href="#quantum"&gt;6&lt;/a&gt;,&lt;a href="#quinn"&gt;7&lt;/a&gt;&lt;/sup&gt;
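&lt;p&gt;Both relationships can be checked numerically for a pair of Bernoulli distributions.  In this sketch I assume the quarter circle has unit radius and that the Hellinger distance is normalized to lie between 0 and 1; with those conventions the angle between two probabilities is the arccosine of the Bhattacharyya coefficient (often called the Bhattacharyya angle), and the chord is $\sqrt 2$ times the Hellinger distance.&lt;/p&gt;
&lt;pre&gt;
```python
import math

def angle(p):
    """Angle in radians assigned to probability p: theta = arccos(sqrt(p))."""
    return math.acos(math.sqrt(p))

def bhattacharyya_coefficient(p, q):
    """Overlap of Bernoulli(p) and Bernoulli(q): the sum of sqrt(p_i * q_i)."""
    return math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))

def hellinger(p, q):
    """Hellinger distance between Bernoulli(p) and Bernoulli(q), in [0, 1]."""
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))

p, q = 0.53, 0.97  # two example probabilities
separation = abs(angle(p) - angle(q))
chord = 2 * math.sin(separation / 2)  # straight-line distance on a unit circle

print("angle between them:     ", separation)
print("arccos of the overlap:  ", math.acos(bhattacharyya_coefficient(p, q)))
print("chord length:           ", chord)
print("sqrt(2) times Hellinger:", math.sqrt(2) * hellinger(p, q))
```
&lt;/pre&gt;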
&lt;h2&gt;Intuitions&lt;/h2&gt;
&lt;p&gt;Having identified this mathematically elegant parameterization of degrees of belief, the question remains: is it practical for everyday use?&lt;/p&gt;
&lt;p&gt;Well, the more I think about it, the more I think this might actually be a decent idea.  People already are familiar with
angles and degrees.  We have a sense of how large 1° is, or 5° or 30°.  We can visualize where these would fall on the meter.&lt;/p&gt;
&lt;p&gt;Another benefit of angles is that we already have a strong sense that they are relative.&lt;/p&gt;
&lt;p&gt;When probabilities are close to certain, it would be most natural to measure them relative to the right:&lt;/p&gt;
&lt;p&gt;&lt;probability-meter id="off-one" probability="1.0" 
labels='{"angles": [15, 30, 45, 60, 75], "labels": ["15°", "30°", "45°", "60°", "75°"]}'&gt;
&lt;/probability-meter&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="texas"&gt;8&lt;/sup&gt; 
 As predicted by the &lt;a href="https://www.economist.com/interactive/us-2024-election/prediction-model/president/texas"&gt;Economist&lt;/a&gt; model, at the time of writing this post.
&lt;/aside&gt;
For instance, if I say it's 10° from certain that Trump will win Texas,&lt;a href="#texas"&gt;&lt;sup&gt;8&lt;/sup&gt;&lt;/a&gt; it's clear what I mean.
&lt;probability-meter probability="0.97" 
    labels='{"angles": [10, 80], "labels": ["Trump", "Harris"]}'&gt;
&lt;/probability-meter&gt;
&lt;p&gt;However, we can just as easily measure angles relative to the middle for things that are a toss up:
&lt;probability-meter id="off-middle" probability="0.5"
labels='{"angles": [15, 30, 45, 60, 75], "labels": ["+30°", "+15°", "0°", "-15°", "-30°"]}'&gt;
&lt;/probability-meter&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="economist"&gt;9&lt;/sup&gt; 
 I just refreshed &lt;a href="https://www.economist.com/interactive/us-2024-election/prediction-model/president"&gt;the Economist&lt;/a&gt; model and it has 56-44 in favor of Trump, at the time of writing the post.
&lt;/aside&gt;
&lt;p&gt;For instance, we might say that overall, the election is leaning 3.45° in favor of Trump. &lt;a href="#economist"&gt;&lt;sup&gt;9&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;probability-meter probability="0.56" 
labels='{"angles": [10, 80], "labels": ["Trump", "Harris"]}'&gt;
&lt;/probability-meter&gt;&lt;/p&gt;
&lt;p&gt;This is clear and easy to visualize and reason about.
3.45° tilted to the right off of vertical is the same as 41.55° out of 90°, but we have a much better intuitive sense of the former.
Meanwhile, in terms of percentages, we would say Trump has a 56% chance of winning, but we have a much harder time expressing this as a 6% advantage off-even (we might say he has a 12% edge over Harris).  This is the whole reason the NYTimes used their needle visualization in the first place.  The NYTimes needle provides a useful visual aid, but would be &lt;em&gt;misleading&lt;/em&gt; as the probabilities approach 0 or 1, since their linear mapping would distort the changes at the edges.  Our nonlinear map maintains a statistical &lt;em&gt;uniformity&lt;/em&gt; throughout the whole range.&lt;/p&gt;
&lt;p&gt;We could just as easily measure angles with respect to impossibility in the case of rare events:
&lt;probability-meter id="off-zero" probability="0.0"
labels='{"angles": [15, 30, 45, 60, 75], "labels": ["75°", "60°", "45°", "30°", "15°"]}'&gt;
&lt;/probability-meter&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="virginia"&gt;10&lt;/sup&gt; 
 As predicted by the &lt;a href="https://www.economist.com/interactive/us-2024-election/prediction-model/president/virginia"&gt;Economist&lt;/a&gt; model, at the time of writing this post.
&lt;/aside&gt;
&lt;p&gt;For instance, we might say there is an 18° chance of Trump winning Virginia:
&lt;probability-meter probability="0.09"
labels='{"angles": [10, 80], "labels": ["Trump", "Harris"]}'&gt;
&lt;/probability-meter&gt;&lt;/p&gt;
&lt;p&gt;This versatility comes at no additional mental cost. We already naturally re-orient our discussion of angles in this way.  Probabilities and their statistical metric are symmetric about even.  Probabilities very near 1 are similar to those very close to 0, but when we talk about percentages, this symmetry is obscured. Log-odds are better in this regard, but much less commonly used.&lt;/p&gt;
&lt;h3&gt;Kent's Words of Estimative Probability&lt;/h3&gt;
&lt;p&gt;In the meters on this page, as a visual aid, I've colored six bands of 15° increments.  It turns out that these perfectly line up with &lt;a href="https://en.wikipedia.org/wiki/Words_of_estimative_probability"&gt;Kent's words of Estimative Probability&lt;/a&gt;.&lt;/p&gt;
&lt;figure id="kent"&gt;
  &lt;center&gt;
  &lt;img src="figures/kent-needle.svg"
    alt="Kent's words of estimative probability line up perfectly on the degree scale."&gt;
  &lt;figcaption&gt;
  Figure 5. Kent's words of estimative probability line up perfectly on the degree scale.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;In an effort to better communicate uncertainty to a lay audience,
many people have tried to come up with intuitive names or mappings for different percentages.
These always end up corresponding to awkward, unevenly spaced probabilities.  For example, Kent said that 93% corresponds to what people consider "almost certain".  93% seems like a
strange value.  I always wondered where 93% came from, or why people's intuitions about probabilities were so unevenly spaced.  However, if you take Kent's thresholds and map them to &lt;em&gt;degrees&lt;/em&gt;, they are perfectly evenly spaced at 15° increments.  This suggests that people correct for the statistical unevenness of percentages through experience.  The words we use to describe certainty are uniform, even if our most popular &lt;em&gt;unit&lt;/em&gt; for measuring certainty is not. This suggests that human perceptions of probabilities might already be better aligned with degrees.&lt;/p&gt;
&lt;p&gt;More thoughts on human perception below in &lt;a href="#app-perception"&gt;Appendix D&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Approximate Calculation&lt;/h2&gt;
&lt;p&gt;While this mapping seems interesting, no one can compute $\arccos \sqrt p$ in their head.  Fortunately, as we show below in &lt;a href="#app-taylor"&gt;Appendix A&lt;/a&gt;, near the middle the map is linear and near the edges it looks like a square root, so if we want an accurate, easy to calculate, pencil and paper version of the mapping, we can split our probabilities into three regions, below 0.25, between 0.25 and 0.75, and above 0.75.&lt;/p&gt;
&lt;p&gt;Since $180/\pi \approx 60$, if we want to estimate the degrees off of even for a given probability near 50%, in our head we can use:
$$ \Delta\theta(p) \sim 60 \Delta p, $$
where $\Delta p$ is the distance from $p = 0.5$, while for $p$ values near the extremes, we can calculate the angle away from either completely certain or impossible as:
$$ \Delta\theta(p) \sim 60 \sqrt{\Delta p}, $$
where $\Delta p$ is now the distance from the nearest of 0 or 1.&lt;/p&gt;
&lt;p&gt;If you need a good way to mentally calculate a square root of $p$: take a guess $g$ for the square root, and then compute the average of $g$ and $p/g$.  You can iterate this many times to get as accurate as you desire.&lt;sup&gt;&lt;a href="#cook"&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="cook"&gt;11&lt;/sup&gt; 
    This is the &lt;a href="https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Heron's_method"&gt;Babylonian method&lt;/a&gt;, aka an application of &lt;a href="https://en.wikipedia.org/wiki/Newton%27s_method#Examples"&gt;Newton's root finding method&lt;/a&gt;.
    For this and a whole slew of useful mental arithmetic tips, see &lt;a href="https://www.johndcook.com/blog/mental-functions/"&gt;John D. Cook's Blog, The Endeavor&lt;/a&gt;. 
&lt;/aside&gt;
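&lt;p&gt;As a rough sketch of that averaging recipe (the function name &lt;code&gt;heronSqrt&lt;/code&gt; and its arguments are made up for illustration), a few passes are already plenty for our purposes:&lt;/p&gt;

```javascript
// Babylonian (Heron's) method: repeatedly average a guess g with p / g.
// `heronSqrt`, `guess`, and `iterations` are illustrative names.
function heronSqrt(p, guess, iterations) {
  let g = guess;
  for (let i = 0; i !== iterations; i += 1) {
    g = (g + p / g) / 2;
  }
  return g;
}

// Square root of 0.1 starting from a rough first guess of 0.3.
console.log(heronSqrt(0.1, 0.3, 1)); // one pass of averaging
console.log(heronSqrt(0.1, 0.3, 3)); // three passes
console.log(Math.sqrt(0.1));         // library value for comparison
```

&lt;p&gt;In your head you would stop after a single pass; each extra pass roughly doubles the number of correct digits.&lt;/p&gt;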
&lt;p&gt;This simple-to-compute approximate mapping turns out to be very accurate.  It is good to about half a degree over the whole range as shown below in Figure 6.&lt;/p&gt;
&lt;figure id="approx-error"&gt;
  &lt;center&gt;
  &lt;img src="figures/degree_approx.png"
    alt="Errors in the simple approximate method."&gt;
  &lt;figcaption&gt;
  Figure 6. Errors in the approximate mapping.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;For example, earlier we said the Economist model had Trump's probability of winning at 56%.  To estimate this in degrees we take $60 \times 0.06$ to get 3.6°, compared with the more exact 3.45°. If we think there is a 10% chance of rain, we say that is $60 \times \sqrt{0.10} = 60 \times \sqrt{10} / 10 \approx 19^\circ$, compared with the more exact 18.43°.  This method is very practical and very accurate.&lt;/p&gt;
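&lt;p&gt;If you would rather check the three-region recipe numerically than trust the figure, here is a minimal sketch.  The helper names &lt;code&gt;exactDegrees&lt;/code&gt; and &lt;code&gt;approxDegrees&lt;/code&gt; are hypothetical, and the switch points at 0.25 and 0.75 are the ones suggested above:&lt;/p&gt;

```javascript
// Exact map from the post: theta = arccos(sqrt(p)), converted to degrees.
function exactDegrees(p) {
  return Math.acos(Math.sqrt(p)) * 180 / Math.PI;
}

// Three-region pencil-and-paper version (hypothetical helper name).
// Uses 180 / pi ~ 60, the linear rule in the middle, and the
// square-root rule near the edges, switching at 0.25 and 0.75.
function approxDegrees(p) {
  if (0.25 > p) {
    return 90 - 60 * Math.sqrt(p);   // measured from impossible (90 degrees)
  }
  if (p > 0.75) {
    return 60 * Math.sqrt(1 - p);    // measured from certain (0 degrees)
  }
  return 45 - 60 * (p - 0.5);        // measured from even (45 degrees)
}

// Scan the whole range and record the largest disagreement.
let worst = 0;
for (let i = 0; i !== 1001; i += 1) {
  const p = i / 1000;
  worst = Math.max(worst, Math.abs(exactDegrees(p) - approxDegrees(p)));
}
console.log(worst.toFixed(2)); // largest error in degrees on this grid
```

&lt;p&gt;The two rules agree exactly with the true map at the switch points, which is part of why rounding $180/\pi$ up to 60 works so well here.&lt;/p&gt;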
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I don't know about you, but I'm convinced.  We should measure degrees of belief in degrees.&lt;/p&gt;
&lt;p&gt;This creates a very intuitive visual representation for probabilities, and one that is statistically uniform in an interesting and useful way.  It isn't all that hard to compute, especially if we are alright with a half degree accuracy as in the previous section.  With a little bit of time, I think we could come to intuit what a 1° or 5° or 10° or 30° shift in probabilities &lt;em&gt;felt&lt;/em&gt; like. Some might even say, we already do.
And, unlike with either probabilities or odds, that useful internal sense would work well for us regardless of the baseline rate. A 5° shift away from center means the same sort of thing as a 5° shift away from certainty.&lt;/p&gt;
&lt;p&gt;Give it a shot.  In &lt;a href="#app-widget"&gt;Appendix C&lt;/a&gt; I've made available the code for the widgets that appear on this page, which should make it easy for anyone to try.&lt;/p&gt;
&lt;h1&gt;Appendix A - Taylor Expansions&lt;/h1&gt;
&lt;p id="app-taylor"&gt;
If we Taylor expand this map near $p=1/2$, the map is approximately linear:
$$ \theta(p) \approx \frac{\pi}{4} - \left( p - \frac 12 \right) - \frac 23 \left( p - \frac 12 \right)^3 + \cdots . $$
&lt;/p&gt;
&lt;p&gt;Near $p=0$ it's square-root-like:
$$ \theta(p) \approx \frac{\pi}{2} - \sqrt p - \frac{p^{\frac 3 2}}{6} - \cdots . $$
And similarly near $p=1$:
$$ \theta(p) \approx \sqrt{1-p} + \frac{(1-p)^{\frac 3 2}}{6} + \cdots. $$&lt;/p&gt;
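&lt;p&gt;One way to see where the first expansion comes from (a sketch using only the double-angle identity): writing $p = \cos^2\theta$ gives $\cos 2\theta = 2p - 1$, so
$$ \theta(p) = \frac 12 \arccos\left(2p - 1\right) = \frac{\pi}{4} - \frac 12 \arcsin\left( 2 \left(p - \frac 12\right) \right), $$
and inserting $\arcsin x = x + \frac{x^3}{6} + \cdots$ with $x = 2\left(p - \frac 12\right)$ reproduces the linear term and the $\frac 23$ coefficient above.  The edge expansions follow the same way from $\theta = \frac{\pi}{2} - \arcsin\sqrt p$ near $p=0$ and $\theta = \arcsin\sqrt{1-p}$ near $p=1$.&lt;/p&gt;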
&lt;h1&gt;Appendix B - Categorical Generalization&lt;/h1&gt;
&lt;p&gt;This idea easily extends to Categorical distributions, where the flat statistical manifold corresponds to the positive octant of the n-sphere as discussed in Bengtsson et al.
&lt;a href="#quantum2"&gt;&lt;sup&gt;62&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="quantum2"&gt;62&lt;/sup&gt; 
 Bengtsson, Ingemar, and Karol Życzkowski. &lt;a href="https://www.google.com/books/edition/_/sYswDwAAQBAJ?hl=en&amp;gbpv=0"&gt;Geometry of quantum states: an introduction to quantum entanglement&lt;/a&gt;. Cambridge university press, 2017.
&lt;/aside&gt;
&lt;p&gt;&lt;span id="app-widget"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h1&gt;Appendix C - Widget&lt;/h1&gt;
&lt;p&gt;To kickstart its adoption, I've created a &lt;code&gt;WebComponents&lt;/code&gt; element, so that you can simply add the &lt;a href="https://blog.alexalemi.com/assets/Meter.js"&gt;script&lt;/a&gt; as a module to your page:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-html"&gt;&amp;lt;script type="module" src="https://blog.alexalemi.com/assets/Meter.js"&amp;gt;&amp;lt;/script&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;in your &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; section and later insert:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-html"&gt;&amp;lt;probability-meter probability="0.53"&amp;gt;&amp;lt;/probability-meter&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;elements to your page and it will render as:&lt;/p&gt;
&lt;p&gt;&lt;probability-meter id="appendix" probability="0.53"&gt;&lt;/probability-meter&gt;&lt;/p&gt;
&lt;p&gt;&lt;span id="app-perception"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;h1&gt;Appendix D - Human Perception&lt;/h1&gt;
&lt;aside&gt;&lt;sup id="good"&gt;13&lt;/sup&gt; 
Good, I. J. "Weight of evidence: A brief survey." Bayesian statistics 2 (1985): 249-270. &lt;a href="https://www.cs.tufts.edu/comp/150FP/archive/jack-good/weight-of-evidence.pdf"&gt;[pdf]&lt;/a&gt;
&lt;/aside&gt;
&lt;aside&gt;&lt;sup id="jaynes"&gt;14&lt;/sup&gt; 
Jaynes, Edwin T. Probability theory: The logic of science. Cambridge university press, 2003. &lt;a href="http://www-biba.inrialpes.fr/Jaynes/prob.html"&gt;[link]&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;It is generally claimed that human perception aligns well with log-odds.  Good&lt;a href="#good"&gt;&lt;sup&gt;13&lt;/sup&gt;&lt;/a&gt; and Jaynes&lt;a href="#jaynes"&gt;&lt;sup&gt;14&lt;/sup&gt;&lt;/a&gt; both advocated the use of &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Hartley_(unit)"&gt;decibans&lt;/a&gt;&lt;/em&gt;. These work great for accumulating evidence and doing Bayesian updates.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="ubiquitous"&gt;15&lt;/sup&gt; 
Zhang, Hang, and Laurence T. Maloney. "Ubiquitous log odds: a common representation of probability and frequency distortion in perception, action, and cognition." Frontiers in neuroscience 6 (2012): 1. &lt;a href="https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2012.00001/full"&gt;[link]&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;In the field of human perception, I've often seen references to Zhang et al.&lt;a href="#ubiquitous"&gt;&lt;sup&gt;15&lt;/sup&gt;&lt;/a&gt; to justify the claim that human perception is well aligned with log-odds.  In the paper they collected a bunch of human perceptual studies and showed that you can use a mapping that is linear in log odds to explain the data.  For example, here is Figure 1 from the paper:&lt;/p&gt;
&lt;figure&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/ubiquitouslogodds.jpg"
    alt="Figure 1 from the Ubiquitous log odds paper."&gt;
  &lt;figcaption&gt;
  Figure 7. Figure 1 from Zhang et al. showing the linear in log-odds fit to the perceptual data.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Here, the blue lines show fits of a two parameter function:&lt;/p&gt;
&lt;p&gt;$$ \textsf{Lo}(\pi) = \gamma \textsf{Lo}(p) + (1-\gamma) \textsf{Lo}(p_0), $$&lt;/p&gt;
&lt;p&gt;which describes a linear map acting on the log odds of the true probability and some baseline probability to describe the log-odds of the perceptual probability.  The paper considers it a success that they can use the simple two parameter function to get a mapping that shows good agreement with the experimental data.&lt;/p&gt;
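&lt;p&gt;To get a feel for what this two parameter family does, here is a small sketch of the map.  The function names and the example values of $\gamma$ and $p_0$ are made up for illustration and are not fits from the paper:&lt;/p&gt;

```javascript
// Log-odds (logit) and its inverse.
function logOdds(p) {
  return Math.log(p / (1 - p));
}
function fromLogOdds(x) {
  return 1 / (1 + Math.exp(-x));
}

// Linear-in-log-odds map: Lo(pi) = gamma * Lo(p) + (1 - gamma) * Lo(p0).
// `perceived`, `gamma`, and `p0` are illustrative names and values,
// not fits taken from the paper.
function perceived(p, gamma, p0) {
  return fromLogOdds(gamma * logOdds(p) + (1 - gamma) * logOdds(p0));
}

// With gamma below 1, estimates are pulled toward the baseline p0.
for (const p of [0.05, 0.25, 0.5, 0.75, 0.95]) {
  console.log(p, perceived(p, 0.6, 0.4).toFixed(3));
}
```

&lt;p&gt;Setting $\gamma = 1$ returns the true probability unchanged, and $p = p_0$ is always a fixed point, so $\gamma$ controls the slope and $p_0$ the crossover.&lt;/p&gt;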
&lt;p&gt;You know what these curves look like? They look like our arcsine transformation.  Without any parameters, here is a plot of:&lt;/p&gt;
&lt;p&gt;$$ \arcsin \sqrt p. $$&lt;/p&gt;
&lt;p&gt;This is the same as our proposed mapping (just with the opposite sign and a constant shift, since $\arcsin \sqrt p = \frac{\pi}{2} - \arccos \sqrt p$).&lt;/p&gt;
&lt;figure&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/arcsinetransform.png"
    alt="Arcsine transformation over the same ranges."&gt;
  &lt;figcaption&gt;
  Figure 8. Arcsine transformation over the same sort of ranges as in the Figure above.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Looks pretty good to me.&lt;/p&gt;
&lt;h1&gt;Appendix E - ArcSin Transformation&lt;/h1&gt;
&lt;p&gt;It seems as though there is a history of using the "arcsin" transformation to transform probabilities for statistical models.  It seems like this was more popular before the logistic model took off.&lt;/p&gt;
&lt;p&gt;I found several references in this direction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Double arcsin transform not appropriate for meta-analysis. Röver and Friede. &lt;a href="https://arxiv.org/abs/2203.04773"&gt;arXiv:2203.04773&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The arcsine is asinine: the analysis of proportions in ecology. Warton and Hui. &lt;em&gt;Ecology&lt;/em&gt; 92(1), 2011, pp. 3-10. &lt;a href="https://esajournals.onlinelibrary.wiley.com/doi/pdf/10.1890/10-0340.1"&gt;[link]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;The Square Root Transformation in Analysis of Variance. Bartlett. &lt;em&gt;Supplement to the Journal of the Royal Statistical Society&lt;/em&gt;. Vol 3. No 1. 1936. &lt;a href="https://www.jstor.org/stable/2983678"&gt;[link]&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Transformations Related to the Angular and the Square Root. Freeman and Tukey. &lt;em&gt;Ann. Math Statist.&lt;/em&gt; 21(4): 607-611 (1950). &lt;a href="https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-21/issue-4/Transformations-Related-to-the-Angular-and-the-Square-Root/10.1214/aoms/1177729756.full"&gt;[link]&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many of the references are critical of the "arcsine" transformation, and I would tend to agree.  For something like a logistic regression model, if you map the probabilities to a fixed interval, you're going to have difficulty interpreting the coefficients of your effects.  My understanding is that people were using this arcsine transformation and then fitting models of the form:&lt;/p&gt;
&lt;p&gt;$$ \theta \sim X \beta,  $$&lt;/p&gt;
&lt;p&gt;for some observations $X$, learning some coefficients $\beta$, but since $\theta$ is bounded, these models naturally make unphysical predictions if you extrapolate them.  The logistic model doesn't have the same problem, since log-odds are unbounded.&lt;/p&gt;
&lt;p&gt;While I agree that measuring degrees of belief in degrees doesn't work great for linear models, I still think it would work well for talking about and communicating probabilities.&lt;/p&gt;
&lt;h1&gt;Appendix F - Connection to Quantum Mechanics&lt;/h1&gt;
&lt;p&gt;The final connection I want to point out is easier to see if we recast the Bernoulli likelihood in terms of our new angles:&lt;/p&gt;
&lt;p&gt;$$ \Pr(X) = \begin{cases} \cos^2 \theta &amp;amp; X = 1 \\ \sin^2 \theta &amp;amp; X = 0 \end{cases} . $$&lt;/p&gt;
&lt;p&gt;The probability that we observe our random variable in state 1 is the squared cosine of some angle $\theta$.  This reminds me of &lt;a href="https://en.wikipedia.org/wiki/Qubit"&gt;qubits&lt;/a&gt;, and
the geometrical story of quantum mechanics and its relation to probability as told in Scott Aaronson's &lt;a href="https://www.scottaaronson.com/democritus/lec9.html"&gt;blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One could write this in &lt;a href="https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation"&gt;Dirac notation&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;$$ \left| \psi \right\rangle = \cos \theta \left| 1 \right\rangle + \sin \theta \left| 0 \right\rangle $$&lt;/p&gt;
&lt;p&gt;and use &lt;a href="https://en.wikipedia.org/wiki/Born_rule"&gt;Born's rule&lt;/a&gt; to derive the probabilities, i.e. you must take the square modulus of the amplitude to get the probability.&lt;/p&gt;
&lt;p&gt;I wonder whether there is more to this analogy...&lt;/p&gt;
&lt;script defer&gt;
    const slider = document.getElementById('probabilitySlider');
    const input = document.getElementById('probabilityInput');
    const meter = document.getElementById('interactive');
    const angVal = document.getElementById('angVal');
    const probVal = document.getElementById('probVal');

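    // Map a probability to its angle in radians: theta = arccos(sqrt(p)).
    function probToAngle(x) {
        return Math.acos(Math.sqrt(parseFloat(x)));
    }

    // Keep the slider, text input, meter, and readouts in sync.
    function updateProbability(value) {
        value = parseFloat(value);
        if (Number.isNaN(value)) return;  // ignore empty or unparseable input
        slider.value = value;
        input.value = value;
        meter.setAttribute('probability', value);
        probVal.innerHTML = (value * 100).toFixed(2);
        angVal.innerHTML = (probToAngle(value) * 180 / Math.PI).toFixed(2);
    }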
    slider.addEventListener('input', (e) =&gt; updateProbability(e.target.value));
    input.addEventListener('input', (e) =&gt; updateProbability(e.target.value));

&lt;/script&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/a-degree-of-certainty.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Wed, 14 Aug 2024 00:00:00 -0400</pubDate></item><item><title>A Quarter for your Thoughts</title><description>Get the precision of one sig fig essentially for free.</description><content:encoded>&lt;!-- 
Notes from Mallory, need a better opener.

Restructure, reorganize to be:

  - order of magnitude
  - one digit sigfig
  - two digit sigfig
  - actual sigfigs
  
Then transition to the fractional log things, can talk about familiar, look in log space, make things evenly spaced in log.

  - one few ten
  - quarters
  - decibels
  - semidecibels
  - centibels


Need a figure for one sig fig and for semidecibels, and maybe two sigfigs.
--&gt;
&lt;p&gt;Let's tour some fun ways to do mental math to various precisions.&lt;/p&gt;
&lt;p&gt;I like doing order of magnitude problems, so I own a wide array of books on the subject, including books full of fun estimation problems.  One of the books I own is &lt;a href="https://www.amazon.com/Maths-Back-Envelope-calculate-anything/dp/0008324581"&gt;Maths On The Back Of An Envelope&lt;/a&gt; by Rob Eastaway.  In it, he suggests a form of estimation he brands &lt;a href="https://www.theguardian.com/science/alexs-adventures-in-numberland/2013/apr/04/zequals-symbol-sums-mathematics"&gt;zequals&lt;/a&gt;, i.e. you round numbers to only a single significant digit and do your calculation that way.&lt;a href="#frontend"&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;a href="#numberphile"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;  This form of "ruthless" approximation is meant to make it easy to do mental arithmetic.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="frontend"&gt;1&lt;/sup&gt; 
 Trying to find other sources, I'm finding educational resources that refer to this as "front-end estimation", e.g. &lt;a href="https://study.com/academy/lesson/how-to-use-front-end-estimation.html#:~:text=all%20other%20digits.-,Estimation%20is%20finding%20an%20approximate%20value%20for%20a%20calculation%2C%20called,Calculate%20using%20the%20rounded%20values."&gt;here&lt;/a&gt;.
&lt;/aside&gt;
&lt;aside&gt;&lt;sup id="numberphile"&gt;2&lt;/sup&gt;
It also featured in an old &lt;a href="https://youtu.be/aOJOfh2_4PE?si=npC8b4px-B2XwHM1"&gt;Numberphile&lt;/a&gt; video.
&lt;/aside&gt;
&lt;p&gt;The branding is cute. I think trying to round things to a single significant digit is something people tend to do pretty naturally when they are estimating.  However, thinking about it again, I think there is a neat way to get the same or better precision with even less of the fuss.  We should use quarter orders of magnitude, or work directly in decibels.  Let's build up to that.&lt;/p&gt;
&lt;h2&gt;Order of Magnitude&lt;/h2&gt;
&lt;p&gt;When you get into &lt;a href="https://en.wikipedia.org/wiki/Fermi_problem"&gt;Fermi problems&lt;/a&gt; you often start by simply tracking the orders of magnitude, i.e. you round every number to its nearest power of 10 and only keep track of those.  This makes for very speedy estimates, but the resolution is obviously only good to about an order of magnitude.  This works great for trying to answer questions like my son asked the other day: "If we suddenly got an extra electron on each of our atoms, what would happen?"&lt;a href="#electrons"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="electrons"&gt;3&lt;/sup&gt; 
We can estimate the sudden energy increase: $\frac{ke\frac{200\text{ lbs}}{16\text{ g/mol}}\times N_A}{3\text{ ft}} \sim 10^{27}\text{ J}$. Or 10 times the energy the sun releases in a second, similar to the energy that would be released if the moon hit the earth. For lots of other fun problems like this, see &lt;a href="https://what-if.xkcd.com/"&gt;what-if?&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;We don't really care about accuracy that is better than a factor of 10; we are more interested in whether it would be like a punch to the gut, a bomb going off, or Armageddon.  Order of magnitude math is good for this kind of thing, and easy to do mentally or on paper.  You only need to track the powers of 10.  Multiplication and division become as easy as addition and subtraction:&lt;/p&gt;
&lt;p&gt;$$ 10^a 10^b / 10^c = 10^{a + b - c}. $$&lt;/p&gt;
&lt;h2&gt;One Significant Digit&lt;/h2&gt;
&lt;p&gt;The natural next step up in precision brings us back to Eastaway's &lt;em&gt;zequals&lt;/em&gt;, or one-significant-digit arithmetic.  Here we'll reduce every number to just its leading significant digit.&lt;/p&gt;
&lt;p&gt;For example, the speed of light: $299\,792\,458 \text{ m/s}$ becomes simply $3 \times 10^{8} \text{ m/s}$.&lt;/p&gt;
&lt;p&gt;The full multiplication table for this system is the one we all learned in grade school:&lt;/p&gt;
&lt;center&gt;
    &lt;table&gt;
        &lt;thead&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$3$&lt;/td&gt;&lt;td&gt;$4$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$6$&lt;/td&gt;&lt;td&gt;$7$&lt;/td&gt;&lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$9$&lt;/td&gt;
        &lt;/tr&gt;&lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr&gt;&lt;td&gt;$1$&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$3$&lt;/td&gt;&lt;td&gt;$4$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$6$&lt;/td&gt;&lt;td&gt;$7$&lt;/td&gt;&lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$9$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$2$&lt;/td&gt;
            &lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$4$&lt;/td&gt;&lt;td&gt;$6$&lt;/td&gt;&lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;&lt;td&gt;$12$&lt;/td&gt;&lt;td&gt;$14$&lt;/td&gt;&lt;td&gt;$16$&lt;/td&gt;&lt;td&gt;$18$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$3$&lt;/td&gt;
            &lt;td&gt;$3$&lt;/td&gt;&lt;td&gt;$6$&lt;/td&gt;&lt;td&gt;$9$&lt;/td&gt;&lt;td&gt;$12$&lt;/td&gt;&lt;td&gt;$15$&lt;/td&gt;&lt;td&gt;$18$&lt;/td&gt;&lt;td&gt;$21$&lt;/td&gt;&lt;td&gt;$24$&lt;/td&gt;&lt;td&gt;$27$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$4$&lt;/td&gt;
            &lt;td&gt;$4$&lt;/td&gt;&lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$12$&lt;/td&gt;&lt;td&gt;$16$&lt;/td&gt;&lt;td&gt;$20$&lt;/td&gt;&lt;td&gt;$24$&lt;/td&gt;&lt;td&gt;$28$&lt;/td&gt;&lt;td&gt;$32$&lt;/td&gt;&lt;td&gt;$36$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$5$&lt;/td&gt;
            &lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;&lt;td&gt;$15$&lt;/td&gt;&lt;td&gt;$20$&lt;/td&gt;&lt;td&gt;$25$&lt;/td&gt;&lt;td&gt;$30$&lt;/td&gt;&lt;td&gt;$35$&lt;/td&gt;&lt;td&gt;$40$&lt;/td&gt;&lt;td&gt;$45$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$6$&lt;/td&gt;
            &lt;td&gt;$6$&lt;/td&gt;&lt;td&gt;$12$&lt;/td&gt;&lt;td&gt;$18$&lt;/td&gt;&lt;td&gt;$24$&lt;/td&gt;&lt;td&gt;$30$&lt;/td&gt;&lt;td&gt;$36$&lt;/td&gt;&lt;td&gt;$42$&lt;/td&gt;&lt;td&gt;$48$&lt;/td&gt;&lt;td&gt;$54$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$7$&lt;/td&gt;
            &lt;td&gt;$7$&lt;/td&gt;&lt;td&gt;$14$&lt;/td&gt;&lt;td&gt;$21$&lt;/td&gt;&lt;td&gt;$28$&lt;/td&gt;&lt;td&gt;$35$&lt;/td&gt;&lt;td&gt;$42$&lt;/td&gt;&lt;td&gt;$49$&lt;/td&gt;&lt;td&gt;$56$&lt;/td&gt;&lt;td&gt;$63$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$8$&lt;/td&gt;
            &lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$16$&lt;/td&gt;&lt;td&gt;$24$&lt;/td&gt;&lt;td&gt;$32$&lt;/td&gt;&lt;td&gt;$40$&lt;/td&gt;&lt;td&gt;$48$&lt;/td&gt;&lt;td&gt;$56$&lt;/td&gt;&lt;td&gt;$64$&lt;/td&gt;&lt;td&gt;$72$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$9$&lt;/td&gt;
            &lt;td&gt;$9$&lt;/td&gt;&lt;td&gt;$18$&lt;/td&gt;&lt;td&gt;$27$&lt;/td&gt;&lt;td&gt;$36$&lt;/td&gt;&lt;td&gt;$45$&lt;/td&gt;&lt;td&gt;$54$&lt;/td&gt;&lt;td&gt;$63$&lt;/td&gt;&lt;td&gt;$72$&lt;/td&gt;&lt;td&gt;$81$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;figcaption&gt;
  Figure 1. Ordinary multiplication table.
  &lt;/figcaption&gt;
&lt;/center&gt;
&lt;p&gt;While you have likely had this table memorized since you were about five years
old, we can admit that it is somewhat complicated.  It requires memorizing 100
entries.  And while we can quickly do single digit multiplications, division is
much harder.  How many people know what 1/8 is, even to one significant digit?
Granted, the extra cost buys us increased precision.&lt;/p&gt;
&lt;p&gt;Unfortunately, because most of the type of math we do when we do order of
magnitude problems is multiplication and division, the relative error in our
system is set by the smallest &lt;em&gt;multiplicative&lt;/em&gt; factor between symbols.  In this
case, the gap between 1 and 2 is quite large, and this system struggles to
maintain accuracy to within a factor of 2 in general.  Here we are using 10
symbols per decade but only get accuracy to a factor of two.&lt;/p&gt;
&lt;h2&gt;Two Significant Digits&lt;/h2&gt;
&lt;p&gt;If we wanted to have even greater precision, we could do our calculations to
two significant digits.  This starts to be beyond most people's capability for
what they can do in their head.  If not for multiplication, certainly for
division.  It requires use of 100 symbols, though again, we have a lot of
practice with these symbols and how they multiply.  Doing two significant digit
arithmetic is a lot easier to do on paper and is what I would typically use back in
my undergrad physics classes.  Keeping around two significant digits ensures
that our answers are good to a factor of 1.1 or so, i.e. 10%.&lt;/p&gt;
&lt;p&gt;Can we do better?&lt;/p&gt;
&lt;h1&gt;More Exotic Logarithmic Systems&lt;/h1&gt;
&lt;p&gt;We can get higher accuracy with fewer symbols if we distribute our symbols evenly on a logarithmic scale.&lt;sup&gt;&lt;a href="#benford"&gt;1&lt;/a&gt;&lt;/sup&gt;
&lt;a href="#natural-log"&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="benford"&gt;1&lt;/sup&gt; 
  This is a better match for how numbers are naturally distributed, as we explored in a &lt;a href="benford.html"&gt;previous post&lt;/a&gt;.
&lt;/aside&gt;
&lt;aside&gt; &lt;sup id="natural-log"&gt;2&lt;/sup&gt;
There is also evidence to suggest we naturally think about numbers logarithmically.
&lt;a href="https://www.scientificamerican.com/article/a-natural-log/"&gt;
&lt;i&gt;A Natural Log: Our Innate Sense of Numbers is Logarithmic, Not Linear&lt;/i&gt;. Kurt Kleiner. Scientific American. Aug 2008.
&lt;/a&gt;
&lt;/aside&gt;
&lt;h2&gt;One - Few - Ten&lt;/h2&gt;
&lt;p&gt;One of the biggest bang-for-your-buck type systems is one I like to call one-few-ten.  If we simply track half-order-of-magnitude we can, with only two symbols, achieve ~30% relative errors.&lt;/p&gt;
&lt;p&gt;Now, instead of rounding each number to the nearest power of 10, you round each number to the nearest half-power of 10.  This sounds like it might be complicated but ends up being very simple in practice.&lt;/p&gt;
&lt;figure id="half"&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/halfdial.svg" alt="The layout of the half-orders of magnitude."&gt;
  &lt;figcaption&gt;
  Figure 2. Half orders of magnitude.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;When you play with slide rules, you quickly internalize how close $\pi$ is to the square root of 10.&lt;/p&gt;
&lt;p&gt;$$ \sqrt{10} = 10^{\frac 12} \approx 3.16 \approx 3.14 \approx \pi $$&lt;/p&gt;
&lt;p&gt;So, in practice you simply round each number to either the nearest power of 10 or $\pi$ times the nearest power of 10.&lt;/p&gt;
&lt;p&gt;For example:
$$
\begin{align*}
1.0 &amp;amp;\rightarrow 1 \\
1.2 &amp;amp;\rightarrow 1 \\
151.23 &amp;amp;\rightarrow 100 \\
42,000 &amp;amp;\rightarrow \pi \times 10^4 \\
0.0723 &amp;amp;\rightarrow 0.1 \\
40 &amp;amp;\rightarrow \pi \times 10.
\end{align*}
$$&lt;/p&gt;
&lt;p&gt;Looking at the logarithmic circular dial above, the precise rounding points are $10^{1/4} \approx 1.78$ and $10^{3/4} \approx 5.6$, which at the level of precision we are dealing with you could call 2 and 6.  To use the One-Few-Ten system then, if the first digit of a number is between 2 and 6, you call it $\pi$ times the relevant power of 10, and otherwise you just round it to the nearest power of 10.  Two $\pi$s make another ten: $\pi^2 \approx 10$.   The 'arithmetic' is quite simple:&lt;/p&gt;
&lt;figure id="pi-table"&gt;
&lt;center&gt;
    &lt;table&gt;
        &lt;thead&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$\pi$&lt;/td&gt;
        &lt;/tr&gt;&lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr&gt;&lt;td&gt;$1$&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$\pi$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$\pi$&lt;/td&gt;
            &lt;td&gt;$\pi$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;figcaption&gt;
  Figure 3. Half-orders-of-magnitude multiplication table.
  &lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;So, this doesn't really add any sort of mental burden, but increases our accuracy from being good to only an order of magnitude or factor of 10, to being good to a factor of $\pi$.&lt;/p&gt;
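&lt;p&gt;For the curious, the rounding rule is a one-liner in log space.  Here is a minimal sketch that tracks half-exponents rather than values; the helper names &lt;code&gt;halfOrder&lt;/code&gt;, &lt;code&gt;oneFewTen&lt;/code&gt;, and &lt;code&gt;multiply&lt;/code&gt; are invented for illustration:&lt;/p&gt;

```javascript
// Round a positive number to the nearest half order of magnitude.
// Returns the exponent in units of powers of ten, so 4.5 means
// "pi times 10^4" in one-few-ten language. `halfOrder` and
// `oneFewTen` are illustrative helper names.
function halfOrder(x) {
  return Math.round(2 * Math.log10(x)) / 2;
}

// Spell a number out as either 1 or pi times a power of ten.
function oneFewTen(x) {
  const h = halfOrder(x);
  const whole = Math.floor(h);
  return (h === whole ? "1" : "pi") + " x 10^" + whole;
}

// Multiplying rounded numbers is just adding their half-exponents.
function multiply(a, b) {
  return halfOrder(a) + halfOrder(b);
}

for (const x of [1.0, 1.2, 151.23, 42000, 0.0723, 40]) {
  console.log(x, "becomes", oneFewTen(x));
}
console.log(multiply(40, 42000)); // half-exponent of the product
```

&lt;p&gt;The loop runs through the same numbers as the examples above, so you can compare the output against the hand-rounded list.&lt;/p&gt;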
&lt;p&gt;I think everyone should be doing one-few-ten type arithmetic as a default, and wish it were more popular.  Can we do better than this without increasing the mental burden too much?&lt;/p&gt;
&lt;h2&gt;1-2-5&lt;/h2&gt;
&lt;p&gt;If we go from tracking half orders of magnitude to thirds and round, we get the common &lt;a href="https://en.wikipedia.org/wiki/Preferred_number#1-2-5_series"&gt;1-2-5 series&lt;/a&gt; used in many of the world's currencies.&lt;/p&gt;
&lt;figure id="third"&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/thirddial.svg" alt="The layout of the third-orders of magnitude."&gt;
  &lt;figcaption&gt;
  Figure 4. Third orders of magnitude.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;The multiplication rules here are quite simple but slightly unintuitive.&lt;/p&gt;
&lt;figure id="third-table"&gt;
&lt;center&gt;
    &lt;table&gt;
        &lt;thead&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;
        &lt;/tr&gt;&lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr&gt;&lt;td&gt;$1$&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$2$&lt;/td&gt;
            &lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$5$&lt;/td&gt;
            &lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;&lt;td&gt;$20$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;figcaption&gt;
  Figure 5. Third-orders-of-magnitude aka 1-2-5 multiplication table.
  &lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;In the worst case, we have a gap of 5/2 = 2.5 between subsequent symbols in this formulation, nearly as good as zequals but not quite.  Can we do better?&lt;/p&gt;
&lt;h2&gt;Quarter-Orders-of-Magnitude&lt;/h2&gt;
&lt;p&gt;Why yes, I believe we can.  Let's use quarter orders-of-magnitude!  If we split the decade into four pieces, we can achieve a relative accuracy of $10^{1/4} \sim 1.8$, better than the factor of 2 we got from the zequals system, and only using 4 symbols instead of 10.&lt;/p&gt;
&lt;figure id="scale"&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/quarters-dial.svg" alt="The layout of the quarter orders of magnitude."&gt;
  &lt;figcaption&gt;
  Figure 6. Quarter orders of magnitude.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;$$ \def\tripi{\tau\!\!\!\:\pi} $$&lt;/p&gt;
&lt;p&gt;To make this even easier to intuit, I suggest using the symbols $1, \tau, \pi,$ and $\tau\!\!\!\:\pi$ (pronounced "tripi").  Here $\pi = 10^{1/2} \approx 3.16$ is nearly equal to the $\pi$ we know and love, and $\tau = 10^{1/4} \approx 1.78$ while $\tau\!\!\!\:\pi = 10^{3/4} \approx 5.62$.   In terms of these symbols, the arithmetic table makes a lot of intuitive sense: you simply add the legs, with 4 legs equaling a factor of 10.&lt;/p&gt;
&lt;style&gt;
    td { text-align: center; }
    td:nth-child(1) { border-left: none; background-color: whitesmoke; }
    thead td { border-top: none; background-color: whitesmoke; }
    table { border: 1px solid black; }
&lt;/style&gt;
&lt;figure id="mult-table"&gt;
&lt;center&gt;
    &lt;table&gt;
        &lt;thead&gt;&lt;tr&gt;&lt;td&gt;&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$\tau$&lt;/td&gt;&lt;td&gt;$\pi$&lt;/td&gt;&lt;td&gt;$\tau\!\!\!\:\pi$&lt;/td&gt;
        &lt;/tr&gt;&lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr&gt;&lt;td&gt;$1$&lt;/td&gt;
            &lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$\tau$&lt;/td&gt;&lt;td&gt;$\pi$&lt;/td&gt;&lt;td&gt;$\tau\!\!\!\:\pi$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$\tau$&lt;/td&gt;
            &lt;td&gt;$\tau$&lt;/td&gt;&lt;td&gt;$\pi$&lt;/td&gt;&lt;td&gt;$\tau\!\!\!\:\pi$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$\pi$&lt;/td&gt;
            &lt;td&gt;$\pi$&lt;/td&gt;&lt;td&gt;$\tau\!\!\!\:\pi$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;&lt;td&gt;$\tau10$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;&lt;td&gt;$\tau\!\!\!\:\pi$&lt;/td&gt;
            &lt;td&gt;$\tau\!\!\!\:\pi$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;&lt;td&gt;$\tau10$&lt;/td&gt;&lt;td&gt;$\pi10$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;figcaption&gt;
  Figure 7. Quarter-orders-of-magnitude multiplication table.
  &lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;As can be seen from the dial, the appropriate thresholds to round at are $10^{1/8} \approx 1.33, 10^{3/8} \approx 2.37, 10^{5/8} \approx 4.22,$ and $10^{7/8} \approx 7.50$.  As a bit of a mnemonic for these thresholds, remember that we are doing things in quarters: the first threshold is 1.3, and 1+3 = 4; then come 2.4 and 4.2, which are mirror images of one another; and finally 7.5, which looks a lot like 3/4.  Now you can have even better precision than you get from single-significant-digit arithmetic, but without all of the mental cost.  Division is also very doable; again, you need only count the legs.  $10 / \tau = \tau\!\!\!\:\pi,  \tau\!\!\!\:\pi / \pi = \tau$, etc.&lt;/p&gt;
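&lt;p&gt;If it helps to see the leg counting spelled out, here is a minimal sketch. The function names and the plain-text symbol names are my own choices for illustration; rounding to the nearest quarter decade is the same thing as rounding at the thresholds above.&lt;/p&gt;

```python
import math

# Plain-text names for 0, 1, 2 and 3 legs (quarter decades).
SYMBOLS = ["1", "tau", "pi", "tripi"]

def to_legs(x):
    """Round a positive number to the nearest quarter order of magnitude."""
    return round(4 * math.log10(x))

def show(legs):
    """Render a leg count as a symbol times a power of ten."""
    decades, rest = divmod(legs, 4)
    return f"{SYMBOLS[rest]} x 10^{decades}"

def multiply(a, b):
    """Multiply two numbers by adding their legs."""
    return to_legs(a) + to_legs(b)

def divide(a, b):
    """Divide two numbers by subtracting their legs."""
    return to_legs(a) - to_legs(b)

print(show(multiply(3, 6)))     # 3 is 2 legs, 6 is 3 legs, 5 legs in total
print(show(divide(10, 1.78)))   # 4 legs minus 1 leg
```

&lt;p&gt;Because &lt;code&gt;divmod&lt;/code&gt; floors, negative leg counts come out sensibly too: $-1$ legs is shown as tripi times $10^{-1}$.&lt;/p&gt;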
&lt;p&gt;I haven't seen this "quarters" system described elsewhere. I've started using it myself and I think it has a lot of promise.  It seems to be a very good compromise between speed, ease of use, and accuracy.&lt;/p&gt;
&lt;h2&gt;Decibels&lt;/h2&gt;
&lt;p&gt;As a final, only slightly outlandish proposal, I'll suggest that if we wanted a system that was accurate to 25%, we could simply use &lt;a href="https://en.wikipedia.org/wiki/Decibel"&gt;decibels&lt;/a&gt;.  This amounts to writing each number as a single power of 10 and then multiplying the exponent by 10:&lt;/p&gt;
&lt;p&gt;$$ 299 792 458  \color{darkseagreen}{\text{ m/s}} = {\color{steelblue}{2.99792458}} \times 10^{\color{salmon}{8}} {\color{darkseagreen}{\text{ m/s}}} = 10^{{\color{salmon}8}{\color{steelblue}{.476820703}}} {\color{darkseagreen}{\text{m/s}}} = {\color{salmon}8}{\color{steelblue}{4.76820703}} \text{ dB}\{\color{darkseagreen}{\text{m/s}}\} $$&lt;/p&gt;
&lt;p&gt;If we round this to the nearest integer, we end up with an approximation that is good to a factor of $10^{1/10} \approx 1.26$, i.e. about 25%.&lt;/p&gt;
&lt;p&gt;$$ {\color{steelblue}{3.0}} \times 10^{\color{salmon}{8}} {\color{darkseagreen}{\text{ m/s}}} 
\approx {\color{salmon}8}{\color{steelblue}{5}} \text{ dB}\{\color{darkseagreen}{\text{m/s}}\} $$&lt;/p&gt;
&lt;aside&gt; &lt;sup id="notation"&gt;3&lt;/sup&gt; 
  This would be even more compact if we came up with a cute notational convention for expressing units of decibel quantities.  I'm going to suggest we use $\{ \cdot \}$ brackets for this.  If you see units in curly brackets, it means the number should be interpreted as being a decibel quantity, and we wouldn't have to write $\text{dB}$ everywhere.  At this point, I think it's actually less ink than the usual scientific notation, as we've eliminated the $\times 10$ from all of our numbers.
&lt;/aside&gt;
&lt;p&gt;We are used to doing calculations with numbers in scientific notation.  This already separates a number into two pieces, its power of 10 and its significand, or part that is left over.  Doing arithmetic to two significant digits means writing every number in scientific notation with two significant digits.  If we convert this into decibels, that power of 10 now becomes the 10s place for the number, while the units place represents what fraction of a decade the significand represents.  This ends up being roughly as much ink on the page as we would have used otherwise.&lt;a href="#notation"&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
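&lt;p&gt;For concreteness, the conversion is a one-liner in each direction. This is only a sketch; &lt;code&gt;to_db&lt;/code&gt; and &lt;code&gt;from_db&lt;/code&gt; are names I picked for the example.&lt;/p&gt;

```python
import math

def to_db(x):
    """Decibel value of a positive number: ten times its base-10 logarithm."""
    return 10 * math.log10(x)

def from_db(db):
    """Inverse of to_db."""
    return 10 ** (db / 10)

c = 299_792_458          # speed of light in m/s
exact = to_db(c)         # the tens digit is the power of ten
rounded = round(exact)   # nearest integer decibel
step = from_db(1)        # ratio between neighbouring integer decibels

print(round(exact, 2), rounded, round(step, 3))
```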
&lt;figure id="decibel-scale"&gt;
  &lt;center&gt;
  &lt;img width="100%" src="figures/decibeldial.svg" alt="The layout of the decibels."&gt;
  &lt;figcaption&gt;
  Figure 8. Decibels.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Granted, this does require memorizing how to convert numbers to decibels, but this isn't all that difficult, and it is a one-time cost.  As pointed out by &lt;a href="https://www.johndcook.com/blog/2022/02/28/power-two-lex/"&gt;John Cook&lt;/a&gt;, there is a clever way to approximately find the integer decibel values if you forget them.  First make a list of the first nine powers of two:&lt;/p&gt;
&lt;p&gt;$$ 2, 4, 8, 16, 32, 64, 128, 256, 512 $$
then lexicographically order them:
$$ 128, 16, 2, 256, 32, 4, 512, 64, 8 $$
then insert decimals after the first digits:
$$ 1.28, 1.6, 2, 2.56, 3.2, 4, 5.12, 6.4, 8 . $$
As you can see, this very well approximates the locations of the integer decibels on the scale above.  Without this trick, as long as you can remember that $3 \text{ dB} = 2$, and that $10 \text{ dB} = 10$ you can also create the approximate scale on the fly as I show in the appendix below.&lt;/p&gt;
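&lt;p&gt;The trick is short enough to check directly. A quick sketch, with variable names of my own choosing:&lt;/p&gt;

```python
powers = [2 ** k for k in range(1, 10)]      # 2, 4, ..., 512
ordered = sorted(powers, key=str)            # lexicographic (string) order
# Put the decimal point after the first digit.
estimates = [n / 10 ** (len(str(n)) - 1) for n in ordered]

for db, est in enumerate(estimates, start=1):
    exact = 10 ** (db / 10)
    print(db, est, round(exact, 3))
```

&lt;p&gt;Each estimate lands within a few percent of the exact value of $10^{n/10}$ for $n = 1, \ldots, 9$.&lt;/p&gt;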
&lt;p&gt;Fortunately, this turns multiplication and division into simple addition and subtraction of integers, something we are much better primed to do.  If we wanted to match the precision of two-significant-digit arithmetic we would need to track the nearest half decibel as well, but even this is pretty easy to do in our head. Quick, what is 4.5 + 8?  Now try to do 2.8 * 6.3 to two sigfigs.  How about 4.5 - 8 and 2.8 / 6.3?  Which of those was easier?  I think doing arithmetic with half integers is a lot easier, especially subtraction compared to division.&lt;/p&gt;
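&lt;p&gt;Here is the same comparison as a small sketch, tracking the nearest half decibel. The helper names are hypothetical; the numbers are the ones from the quiz above.&lt;/p&gt;

```python
import math

def to_half_db(x):
    """Round a positive number to the nearest half decibel."""
    return round(2 * 10 * math.log10(x)) / 2

def from_db(db):
    """Convert a decibel value back to an ordinary number."""
    return 10 ** (db / 10)

a, b = to_half_db(2.8), to_half_db(6.3)   # 4.5 and 8.0
product = from_db(a + b)     # compare with 2.8 * 6.3
quotient = from_db(a - b)    # compare with 2.8 / 6.3

print(a, b, round(product, 1), round(2.8 * 6.3, 1))
print(round(quotient, 2), round(2.8 / 6.3, 2))
```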
&lt;!--
There are corners of the internet where people argue whether [Seximal](https://www.seximal.net/), [Dozenal](https://en.wikipedia.org/wiki/Duodecimal) or [decimal](https://en.wikipedia.org/wiki/Decimal) are the best way to represent numbers&lt;a href="#others"&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;.   
&lt;aside&gt; &lt;sup id="others"&gt;4&lt;/sup&gt; 
  Amongst &lt;a href="https://en.wikipedia.org/wiki/Numeral_system"&gt;others&lt;/a&gt;.
&lt;/aside&gt;
--&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Try out the quarters thing.  It's fun and gives as good precision as single-significant-digit arithmetic.  If you are feeling more adventurous and want higher precision, try decibels.&lt;/p&gt;
&lt;h2&gt;Appendix: Recreating the logarithm scale&lt;/h2&gt;
&lt;p&gt;If we start with the facts that $0 \text{ dB} = 10^{0/10} = 1$ and $10 \text{ dB} = 10^{10/10} = 10$, and remember just two more facts, that $3 \text{ dB} = 10^{3/10} \approx 2$ and that $\sqrt{10} \approx \pi$, we can fill in the rest of the logarithms, at least approximately.&lt;/p&gt;
&lt;p&gt;The first thing we do is recognize that $3 \text{ dB} + 3 \text{ dB} = 6 \text{ dB} = 2 \times 2 = 4$ and then also that $9 \text{ dB} = 2 \times 2 \times 2 = 8$.&lt;/p&gt;
&lt;p&gt;Then we can also fill in $10 \text{ dB} - 3 \text{ dB} = 7 \text{ dB} = 10/2 = 5$ and $4 \text{ dB} = 5/2 = 2.5$ and $1 \text{ dB} = 10/2/2/2 = 1.25$.&lt;/p&gt;
&lt;p&gt;Finally we use the fact that $5 \text{ dB} = \sqrt{ 10 } \approx \pi$ to fill in that as well as $2 \text{ dB} = 5 \text{ dB} - 3 \text{ dB} = \frac \pi 2$ and $8 \text{ dB} = 2 \pi$.&lt;/p&gt;
&lt;figure id="decibel-table"&gt;
&lt;center&gt;
    &lt;table&gt;
        &lt;thead&gt;&lt;tr&gt;
            &lt;td&gt;$\text{dB}$&lt;/td&gt;&lt;td&gt;$0$&lt;/td&gt;&lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$3$&lt;/td&gt;&lt;td&gt;$4$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$6$&lt;/td&gt;&lt;td&gt;$7$&lt;/td&gt;&lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$9$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;
        &lt;/tr&gt;&lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr&gt;
            &lt;td&gt;$\approx$&lt;/td&gt;&lt;td&gt;$1$&lt;/td&gt;&lt;td&gt;$1.25$&lt;/td&gt;&lt;td&gt;$1.6$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$2.5$&lt;/td&gt;&lt;td&gt;$\pi$&lt;/td&gt;&lt;td&gt;$4$&lt;/td&gt;&lt;td&gt;$5$&lt;/td&gt;&lt;td&gt;$2\pi$&lt;/td&gt;&lt;td&gt;$8$&lt;/td&gt;&lt;td&gt;$10$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;hint&lt;/td&gt;&lt;td&gt;$10^0$&lt;/td&gt;&lt;td&gt;$\frac{10}{2^3}$&lt;/td&gt;&lt;td&gt;$\frac{\pi}{2}$&lt;/td&gt;&lt;td&gt;$2$&lt;/td&gt;&lt;td&gt;$\frac{10}{2^2}$&lt;/td&gt;&lt;td&gt;$\sqrt{10}$&lt;/td&gt;&lt;td&gt;$2^2$&lt;/td&gt;&lt;td&gt;$\frac{10}{2}$&lt;/td&gt;&lt;td&gt;$2\pi$&lt;/td&gt;&lt;td&gt;$2^3$&lt;/td&gt;&lt;td&gt;$10^1$&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
    &lt;/table&gt;
  &lt;figcaption&gt;
  Figure 9. Decibel estimates and their derivations. Remember $3 \text{ dB} = 2$ and $\sqrt{10} \approx \pi$.
  &lt;/figcaption&gt;
&lt;/center&gt;
&lt;/figure&gt;
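&lt;p&gt;As a sanity check on the table, we can rebuild it from the two remembered facts and compare against the exact values. This is only a sketch; the variable names are mine.&lt;/p&gt;

```python
import math

two = 2.0            # remembered fact: 3 dB is about 2
root10 = math.pi     # remembered fact: sqrt(10), i.e. 5 dB, is about pi

table = {
    0: 1.0,
    1: 10 / two ** 3,
    2: root10 / two,
    3: two,
    4: 10 / two ** 2,
    5: root10,
    6: two ** 2,
    7: 10 / two,
    8: two * root10,
    9: two ** 3,
    10: 10.0,
}

errors = {}
for db, estimate in table.items():
    exact = 10 ** (db / 10)
    errors[db] = estimate / exact - 1
    print(db, round(estimate, 3), round(exact, 3), f"{100 * errors[db]:+.1f}%")
```

&lt;p&gt;Every entry comes out within about one percent of the exact value, which is far tighter than the 25% spacing between integer decibels.&lt;/p&gt;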
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/quarters.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Mon, 25 Nov 2024 00:00:00 -0500</pubDate></item><item><title>Sliderules Rule</title><description>I made a zine about them and a digital sliderule you can use.</description><content:encoded>&lt;p&gt;Sliderules rule.  I'm rather enamored with them.&lt;/p&gt;
&lt;p&gt;I've collected several over the years.  One of my favorites
is my old &lt;a href="https://collection.maas.museum/object/383283"&gt;Soviet KL-1&lt;/a&gt;.&lt;/p&gt;
&lt;figure id="#benford"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/circular-slide-rule.jpg"
    alt="A Soviet KL-1 circular slide rule."&gt;
  &lt;figcaption&gt;
  Figure 1. This is a picture of one of my &lt;a href="https://collection.maas.museum/object/383283"&gt;Soviet KL-1 circular slide rules&lt;/a&gt;, which previously featured in &lt;a href="benfords.html"&gt;my post about Benford's law&lt;/a&gt;.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;Unlike a pocket
calculator, it feels like using a sliderule helps you develop a better number
sense, rather than rob you of one.
An elegant weapon for a more civilized age.&lt;/p&gt;
&lt;h2&gt;Digital Sliderule&lt;/h2&gt;
&lt;p&gt;In order to ensure that I always have a sliderule at my disposal, I recently built a &lt;em&gt;digital&lt;/em&gt; one, available at
&lt;a href="https://sliderule.alexalemi.com"&gt;sliderule.alexalemi.com&lt;/a&gt;.&lt;/p&gt;
&lt;figure id="#sliderule"&gt;
    &lt;center&gt;
    &lt;a href="https://alexalemi.com/random/sliderule/"&gt;
    &lt;img src="figures/mysliderule.png"
        style="opacity: 1; border: 1px solid black; transition: opacity 0.1s ease;"
        onmouseover="this.style.opacity='0.6'"
        onmouseout="this.style.opacity='1'" /&gt;
    &lt;/a&gt;
  &lt;figcaption&gt;
  Figure 2. The digital sliderule I made at &lt;a href="https://sliderule.alexalemi.com"&gt;sliderule.alexalemi.com&lt;/a&gt;.
  &lt;/figcaption&gt;
    &lt;/center&gt;
&lt;/figure&gt;
&lt;h2&gt;Zine&lt;/h2&gt;
&lt;p&gt;If you're new to slide rules, in keeping with the tangible, physical theme, I made a &lt;a href="https://en.wikipedia.org/wiki/Zine"&gt;Zine&lt;/a&gt; that attempts to introduce how they work.  You can read it here:&lt;/p&gt;
&lt;!--
&lt;img src="figures/sliderule-zine-1.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-2.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-3.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-4.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-5.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-6.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-7.png" /&gt;&lt;br&gt;
&lt;img src="figures/sliderule-zine-8.png" /&gt;&lt;br&gt;
--&gt;
&lt;style&gt;
    .zine-viewer {
        max-width: 600px;
        margin: 20px auto;
        text-align: center;
    }

    .zine-container {
        position: relative;
        display: inline-block;
        background: white;
        border: 1px solid #ddd;
        border-radius: 8px;
        overflow: hidden;
        cursor: pointer;
    }

    .zine-image {
        display: block;
        max-width: 100%;
        height: auto;
    }

    .click-area {
        position: absolute;
        top: 0;
        bottom: 0;
        width: 50%;
        cursor: pointer;
    }

    .click-left {
        left: 0;
    }

    .click-right {
        right: 0;
    }

    .click-area:hover {
        background: rgba(0,0,0,0.05);
    }

    .page-counter {
        margin-top: 10px;
        font-family: Arial, sans-serif;
        color: #666;
        font-size: 14px;
    }
&lt;/style&gt;
&lt;div class="zine-viewer"&gt;
    &lt;div class="zine-container"&gt;
        &lt;img id="zineImage" class="zine-image" src="figures/sliderule-zine-1.png" alt="Zine page"&gt;
        &lt;div class="click-area click-left" onclick="prevPage()"&gt;&lt;/div&gt;
        &lt;div class="click-area click-right" onclick="nextPage()"&gt;&lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="page-counter"&gt;
        Page &lt;span id="currentPage"&gt;1&lt;/span&gt; of 8
    &lt;/div&gt;
&lt;/div&gt;
&lt;script&gt;
    let currentPage = 1;
    const totalPages = 8;

    function updatePage() {
        document.getElementById('zineImage').src = `figures/sliderule-zine-${currentPage}.png`;
        document.getElementById('currentPage').textContent = currentPage;
    }

    function nextPage() {
        if (currentPage &lt; totalPages) {
            currentPage++;
            updatePage();
        }
    }

    function prevPage() {
        if (currentPage &gt; 1) {
            currentPage--;
            updatePage();
        }
    }

    // Keyboard navigation
    document.addEventListener('keydown', (e) =&gt; {
        if (e.key === 'ArrowLeft') prevPage();
        if (e.key === 'ArrowRight') nextPage();
    });
&lt;/script&gt;
&lt;p&gt;Or you can download a &lt;a href="assets/sliderule-zine.pdf"&gt;PDF&lt;/a&gt; copy that you can print and &lt;a href="https://www.42ndstreet.org.uk/media/etdlxppk/zine-guide-colour.jpg"&gt;fold&lt;/a&gt; yourself.&lt;/p&gt;
&lt;p&gt;If you want to learn more, there is an old &lt;a href="https://sliderulemuseum.com/Manuals/M220_AnEasyIntroductionToTheSlideRule_IsaacAsimov_1965.pdf"&gt;book by Isaac Asimov&lt;/a&gt;,  or a great &lt;a href="https://www.youtube.com/watch?v=oYQdKbQ-sgM"&gt;1957 Educational film&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Make your own&lt;/h2&gt;
&lt;p&gt;You can &lt;a href="https://www.sliderulemuseum.com/REF/scales/MakeYourOwnSlideRule_ScientificAmerican_May2006.pdf"&gt;print your own&lt;/a&gt; using instructions in an old &lt;a href="https://www.physics.wisc.edu/ingersollmuseum/wp-content/uploads/sites/10/2020/04/scientificamerican0506-80-WhenSlideRulesRuled.pdf"&gt;Scientific American article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Or there is an innovative &lt;a href="https://sliderulemuseum.com/SR_Scales.shtml#YingHum"&gt;circular slide rule&lt;/a&gt; created by Ying Hum that can be printed and fit in an old CD jewel case.&lt;/p&gt;
&lt;p&gt;Another option would be to 3D print your own. &lt;a href="https://www.youtube.com/watch?v=qTd03m8rsfg"&gt;Alex Desilets created plans&lt;/a&gt; for a series of 3D printed sliderules.&lt;/p&gt;
&lt;figure id="#printed-sliderule"&gt;
  &lt;center&gt;
  &lt;img src="figures/printed-sliderule.jpg" width="100%" alt='3d printed sliderule'/&gt;
  &lt;figcaption&gt;
  Figure 3. A circular sliderule I 3d printed.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/sliderules.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Fri, 20 Jun 2025 00:00:00 -0400</pubDate></item><item><title>Plaque</title><description>A dirt-simple bring-your-own-editor 'reactive' python notebook package.</description><content:encoded>&lt;p&gt;I'm a big fan of the new generation of &lt;em&gt;reactive&lt;/em&gt; notebooks that don't have internal state,
projects like &lt;a href="https://clerk.vision/"&gt;clerk&lt;/a&gt;&lt;sup&gt;&lt;a href="#clerktudes"&gt;1&lt;/a&gt;&lt;/sup&gt;
for clojure, &lt;a href="https://marimo.io/"&gt;marimo&lt;/a&gt; for python, &lt;a href="https://plutojl.org/"&gt;pluto.jl&lt;/a&gt; for julia and &lt;a href="https://observablehq.com/notebook-kit/kit"&gt;observable notebook kit&lt;/a&gt;&lt;sup&gt;&lt;a href="#notebook-kit"&gt;2&lt;/a&gt;&lt;/sup&gt; for javascript.  Each of these deal with the biggest complaint about &lt;a href="https://ipython.org/notebook.html"&gt;jupyter notebooks&lt;/a&gt; and &lt;a href="https://colab.research.google.com/"&gt;colab notebooks&lt;/a&gt;; namely that they have hidden state&lt;sup&gt;&lt;a href="#suck"&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
&lt;aside&gt; &lt;sup id="clerktudes"&gt;1&lt;/sup&gt; 
  E.g. see my &lt;a href="https://github.clerk.garden/alexalemi/clerktudes"&gt;clerktudes&lt;/a&gt;, mostly neat visual versions of some of &lt;a href="https://github.com/alexalemi/advent"&gt;my adventofcode solutions&lt;/a&gt;.
&lt;/aside&gt;
&lt;aside&gt; &lt;sup id="notebook-kit"&gt;2&lt;/sup&gt; 
  I just converted my old &lt;a href="https://blog.alexalemi.com/ob/nbs/leap-day.html"&gt;Leap Day post&lt;/a&gt; to the new &lt;a href="https://observablehq.com/blog/observable-2-0"&gt;Observable 2.0 Notebook Kit&lt;/a&gt;.
&lt;/aside&gt;
&lt;aside&gt; &lt;sup id="suck"&gt;3&lt;/sup&gt; 
  E.g. see &lt;a href="https://youtu.be/7jiPeIFXb6U?si=JCZSTm_zVKDUJcIQ"&gt;the Joel Grus' infamous talk&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;They each use what I think is the best trick in the book for solving a problem: if you don't want to be sensitive to some issue, make yourself invariant to it.  These projects don't look at your code linearly, from top to bottom; they instead see your code as organized into &lt;em&gt;cells&lt;/em&gt; and resolve a dependency graph for how they all relate, ensuring that they only ever run cells when they change or one of their parent cells changes.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://marimo.io/"&gt;Marimo&lt;/a&gt; is a really nice entry into the space, I enjoy it a lot, but it asks that I abandon my ordinary development environment&lt;sup&gt;&lt;a href="neovim"&gt;4&lt;/a&gt;&lt;/sup&gt; and use theirs instead.  &lt;a href="https://clerk.vision/"&gt;Clerk&lt;/a&gt; and the just-released
&lt;a href="https://observablehq.com/notebook-kit/kit"&gt;Observable Notebook Kit&lt;/a&gt; instead let you use your ordinary editor and simply watch for whenever you make updates to the code.  This means you get to keep all of the niceties of the editor environment you've perfected over the years, and get to take advantage of nice tools like &lt;a href="https://docs.astral.sh/ruff/"&gt;linters&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Language_Server_Protocol"&gt;language servers&lt;/a&gt;, tab completion, autocomplete, and potentially things like llm agents and &lt;a href="https://www.anthropic.com/claude-code"&gt;claude code&lt;/a&gt;.&lt;/p&gt;
&lt;aside&gt; &lt;sup id="neovim"&gt;4&lt;/sup&gt; 
  I like &lt;a href="https://zed.dev/"&gt;Zed&lt;/a&gt; and &lt;a href="https://neovim.io/"&gt;neovim&lt;/a&gt; these days.
&lt;/aside&gt;
&lt;p&gt;My old coworker &lt;a href="https://danijar.com/"&gt;Danijar&lt;/a&gt; had a nice project he called &lt;a href="https://github.com/danijar/handout"&gt;Handout&lt;/a&gt;, which let you build nice rendered python handouts and always executed fresh from top to bottom.  You got to use your own editor and didn't have any hidden state, but this, much like marimo, required writing nonstandard python code.&lt;/p&gt;
&lt;p&gt;Feeling like python was still missing something quite like &lt;a href="https://clerk.vision"&gt;clerk&lt;/a&gt;,
I decided to try to roll my own. I call it &lt;a href="https://github.com/alexalemi/plaque"&gt;Plaque&lt;/a&gt;, as in the ornamental tablet.  It uses ordinary python code and &lt;code&gt;# %%&lt;/code&gt; style cell boundaries like &lt;a href="https://jupytext.readthedocs.io/en/latest/"&gt;Jupytext&lt;/a&gt;.  The bit of inspiration I had is that we can get essentially all of the benefits of a full reactive environment with very little work if we simply look for cells that change and run only those and all of the cells further down the page. With a bit more code we can actually use the python &lt;code&gt;ast&lt;/code&gt; module to try to skip cells that don't reference any of the things we changed.  With very little code we end up with a system that I've found useful enough to begin using it regularly at work.&lt;/p&gt;
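&lt;p&gt;To make the idea concrete, here is a minimal sketch of that strategy. It is not Plaque's actual implementation: the helper names (&lt;code&gt;split_cells&lt;/code&gt;, &lt;code&gt;names_bound&lt;/code&gt;, &lt;code&gt;names_read&lt;/code&gt;, &lt;code&gt;cells_to_rerun&lt;/code&gt;) are hypothetical, and the name tracking is deliberately crude. It treats every assigned, defined or imported name in a cell as something the cell provides, and reruns a later cell only if it was edited or reads a name provided by a cell that reran.&lt;/p&gt;

```python
import ast

def split_cells(source):
    """Split a script into cells at lines that start with '# %%'."""
    cells, current = [], []
    for line in source.splitlines():
        if line.startswith("# %%") and current:
            cells.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        cells.append("\n".join(current))
    return cells

def names_bound(cell):
    """Names a cell assigns, defines or imports (an over-approximation)."""
    bound = set()
    for node in ast.walk(ast.parse(cell)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                bound.add((alias.asname or alias.name).split(".")[0])
    return bound

def names_read(cell):
    """Names a cell loads."""
    return {
        node.id
        for node in ast.walk(ast.parse(cell))
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }

def cells_to_rerun(old_cells, new_cells):
    """Indices of the cells to run again after an edit."""
    rerun, dirty = [], set()
    for i, cell in enumerate(new_cells):
        old = old_cells[i] if len(old_cells) > i else None
        edited = cell != old
        if edited or not names_read(cell).isdisjoint(dirty):
            rerun.append(i)
            dirty |= names_bound(cell)
            if old is not None:
                dirty |= names_bound(old)
    return rerun

old = "# %%\na = 1\n# %%\nb = 2\n# %%\nc = a + 1\n# %%\nd = b * 2\n"
new = old.replace("a = 1", "a = 5")
print(cells_to_rerun(split_cells(old), split_cells(new)))
```

&lt;p&gt;In this toy example only the edited cell and the cell that reads &lt;code&gt;a&lt;/code&gt; are selected, so the result is &lt;code&gt;[0, 2]&lt;/code&gt;. The simpler strategy of rerunning the first changed cell and everything below it would select all four.&lt;/p&gt;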
&lt;p&gt;Here's a short video demo:&lt;/p&gt;
&lt;center&gt;&lt;iframe width="560" height="315" src="https://www.youtube.com/embed/DlbA1aOMsFw?si=QJQgb0kG-TWA04PK" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen&gt;&lt;/iframe&gt;&lt;/center&gt;
&lt;p&gt;If you use &lt;a href="https://docs.astral.sh/uv/"&gt;&lt;code&gt;uv&lt;/code&gt;&lt;/a&gt; you should be able to try it out right now:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;uvx plaque serve your-nb.py&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Check it out and let me know what you think.&lt;/p&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/plaque.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Thu, 31 Jul 2025 00:00:00 -0400</pubDate></item><item><title>Dyson Spheres</title><description>The absurdity of dyson spheres on any timescale we'll experience.</description><content:encoded>&lt;p&gt;In 1960, Freeman Dyson wrote &lt;a href="https://fermatslibrary.com/s/search-for-artificial-stellar-sources-of-infrared-radiation"&gt;a single page&lt;/a&gt;
"little joke"&lt;sup&gt;&lt;a href="#littlejoke"&gt;1&lt;/a&gt;&lt;/sup&gt; about how an advanced civilization
undergoing exponential energy growth would quickly exceed the capacity of their planet
and would have to start harvesting a reasonable fraction of the energy of their host star.&lt;/p&gt;
&lt;aside&gt; &lt;sup id="littlejoke"&gt;1&lt;/sup&gt;
  As he says in &lt;a href="https://youtu.be/huAIfzUoyhU?si=UaM-gGpLH84-iHrs&amp;t=124"&gt;this interview&lt;/a&gt;,
	or as Angela Collier discusses in &lt;a href="https://youtu.be/fLzEX1TPBFM?si=8Xv9fb-mzu0RKWlz"&gt;her video&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;The paper goes through the basic math and physics for the energy and timescales required to build what has now become known as a &lt;a href="https://en.wikipedia.org/wiki/Dyson_sphere"&gt;Dyson sphere&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Dyson spheres started off as a far-fetched sci-fi concept, but increasingly seem to be taken seriously by various corners of the internet, to the point where people are willing to put real money down on the possibility of a Dyson sphere by 2030 on prediction markets like &lt;a href="https://manifold.markets/levifinkelstein/will-we-have-at-least-one-dyson-sph"&gt;manifold.markets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I think these people are crazy and wrong and I wanted to try to explain (for myself as much as anyone else) why.&lt;/p&gt;
&lt;h2&gt;Some Background Energy Numbers&lt;/h2&gt;
&lt;p&gt;Fundamentally, I don't think we have the energy budget to build one, and I worry that most people vastly underestimate the amount of energy required.&lt;/p&gt;
&lt;p&gt;Let's establish some baseline numbers.  As a human, you're burning energy at $ \sim 100 \, \textrm{W} $  continuously.&lt;sup&gt;&lt;a href="#lightbulb"&gt;2&lt;/a&gt;&lt;/sup&gt; &lt;sup&gt;&lt;a href="#percapita"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="lightbulb"&gt;2&lt;/sup&gt;
    You're like a light bulb ($ 2000 \,\textrm{kcal/day}$)
&lt;/aside&gt;
&lt;p&gt;All of human civilization currently consumes energy at a rate of ~ $21 \,\textrm{TW} = 2 \times 10^{13} \,\textrm{W}$.&lt;sup&gt;&lt;a href="#civenergy"&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt; &lt;sup id="civenergy"&gt;4&lt;/sup&gt;
    $186,000 \text{ TWh/year} = 21 \text{ TW}$. From
    Hannah Ritchie, Pablo Rosado, and Max Roser (2020) - “Energy Production and Consumption” Published online at OurWorldinData.org. Retrieved from: '&lt;a href="https://ourworldindata.org/energy-production-consumption"&gt;https://ourworldindata.org/energy-production-consumption&lt;/a&gt;' [Online Resource]
&lt;/aside&gt;
&lt;aside&gt; &lt;sup id="percapita"&gt;3&lt;/sup&gt;
    This is $\sim 2.6 \text{ kW}$ per person, or $\sim 26 \times$ how much energy you're radiating in heat.
&lt;/aside&gt;
&lt;p&gt;The sun has a total luminosity of ~ $400 \, \textrm{YW} = 4 \times 10^{26} \, \textrm{W}$.&lt;sup&gt;&lt;a href="#sunluminosity"&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="sunluminosity"&gt;5&lt;/sup&gt;
    From &lt;a href="https://en.wikipedia.org/w/index.php?title=Solar_luminosity&amp;oldid=1281026335"&gt;wikipedia: Solar luminosity&lt;/a&gt;.
&lt;/aside&gt;
&lt;p&gt;At $ 1 \,\textrm{AU}$ this is a solar flux of $ 1369 \,\textrm{W/m}^2$ so that the total solar flux incident on the earth is:&lt;/p&gt;
&lt;p&gt;$$
1369 \, \textrm{W/m}^2 \cdot \pi R^2 = 170 \,\textrm{PW} = 170\,000 \,\textrm{TW} = 1.7\times 10^{17} \,\textrm{W}
$$&lt;/p&gt;
&lt;p&gt;This is why the science fiction idea of a Dyson sphere is so alluring: as far as the sun is concerned, the Earth takes up a very small fraction of the sky.&lt;sup&gt;&lt;a href="#earthppm"&gt;6&lt;/a&gt;&lt;/sup&gt;  On this planet we are limited to getting access to &lt;em&gt;only&lt;/em&gt; 8,000 times the power we currently consume.&lt;sup&gt;&lt;a href="#solarlimit"&gt;7&lt;/a&gt;&lt;/sup&gt;  Meanwhile, if we could capture some reasonable fraction of the total solar output, this is another factor of 2 billion in terms of our potential power draw.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="earthppm"&gt;6&lt;/sup&gt;
    $$\frac{\pi R_{\oplus}^2}{4 \pi (1 \text{ au})^2} \sim 1/2 \,\textrm{ppb}$$
&lt;/aside&gt;
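&lt;p&gt;These ratios are easy to reproduce. In the sketch below the solar constant, solar luminosity and civilization power are the numbers quoted above, while the Earth radius and the length of an astronomical unit are standard values I am assuming.&lt;/p&gt;

```python
import math

SOLAR_CONSTANT = 1369.0   # W/m^2 at 1 AU, from the text
L_SUN = 4e26              # W, from the text
P_CIV = 21e12             # W, from the text
R_EARTH = 6.371e6         # m, assumed mean Earth radius
AU = 1.496e11             # m, assumed

# Sunlight intercepted by the Earth's cross-section.
incident = SOLAR_CONSTANT * math.pi * R_EARTH ** 2
# Fraction of the sun's sky that the Earth covers.
sky_fraction = (math.pi * R_EARTH ** 2) / (4 * math.pi * AU ** 2)

print(f"incident on Earth:       {incident:.2e} W")
print(f"incident / civilization: {incident / P_CIV:.0f}")
print(f"sun / incident:          {L_SUN / incident:.2e}")
print(f"fraction of the sky:     {sky_fraction:.2e}")
```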
&lt;p&gt;While these numbers seem large and out of the way, as laid out in the fantastic article &lt;a href="https://tmurphy.physics.ucsd.edu/papers/limits-econ-final.pdf"&gt;Limits to Economic Growth by Tom Murphy&lt;/a&gt;,&lt;sup&gt;&lt;a href="#murphyarticle"&gt;8&lt;/a&gt;&lt;/sup&gt; even these fantastical limits are very close in terms of exponential growth.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="murphyarticle"&gt;8&lt;/sup&gt;
    Murphy Jr, Thomas W. "Limits to economic growth." Nature Physics 18.8 (2022): 844-847.
&lt;/aside&gt;
&lt;p&gt;If we look at &lt;a href="https://ourworldindata.org/energy-production-consumption"&gt;ourworldindata&lt;/a&gt;, we can see that over the last 60 years or so, our primary energy consumption has been growing by between -1% and 4% per year.  Let's make it a round 2.3% annual growth so that in a century we get a factor of 10.&lt;sup&gt;&lt;a href="#century"&gt;9&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;aside&gt;&lt;sup id="murphyarticle"&gt;8&lt;/sup&gt;
    i.e. $e^{0.023 * 100} \sim 10$
&lt;/aside&gt;
&lt;figure id="owid"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/owid-change-energy-consumption.png"
    alt="Annual change in primary energy consumption."&gt;
  &lt;figcaption&gt;
  Figure 1. Annual change in primary energy consumption from &lt;a href="https://ourworldindata.org/grapher/change-energy-consumption"&gt;ourworldindata&lt;/a&gt;.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;As &lt;a href="https://tmurphy.physics.ucsd.edu/papers/limits-econ-final.pdf"&gt;Tom shows in his Figure 1&lt;/a&gt;, this is a great match for the growth in power consumption over a longer range.&lt;/p&gt;
&lt;figure id="tomarticlefig"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/tom-energy-growth.png"
    alt="Historical growth in primary energy production."&gt;
  &lt;figcaption&gt;
  Figure 2. Historical energy growth. From
  Murphy Jr, Thomas W. "Limits to economic growth." Nature Physics 18.8 (2022): 844-847. &lt;a href="https://tmurphy.physics.ucsd.edu/papers/limits-econ-final.pdf"&gt;[pdf]&lt;/a&gt;
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;The problem with exponentials, as we are well aware, is that if you extrapolate them they get very intense very quickly.  If you run this 2.3% growth exponential forward, it quickly becomes uncomfortable, as Tom shows in his &lt;a href="https://dothemath.ucsd.edu/2011/07/galactic-scale-energy/"&gt;accompanying blog post&lt;/a&gt;:&lt;/p&gt;
&lt;figure id="tomblogfig"&gt;
  &lt;center&gt;
  &lt;img width="95%" src="figures/galaxy-1024x768.png"
    alt="Projecting exponential energy growth into the future."&gt;
  &lt;figcaption&gt;
  Figure 3. Projecting 2.3% energy growth into the future. From &lt;a href="https://dothemath.ucsd.edu/2011/07/galactic-scale-energy/"&gt;Tom Murphy's blog post&lt;/a&gt;.
  &lt;/figcaption&gt;
  &lt;/center&gt;
&lt;/figure&gt;
&lt;p&gt;In a mere 275 years, i.e. as far forward as the year 1750 was back, we would exceed the total solar flux on land that we can reasonably capture (at 20% efficiency).  But if you want to ignore those limits, it's only 345 years to get to 100% of the total solar flux on land, and only 400 years to get to the total solar flux on the Earth… period.&lt;/p&gt;
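&lt;p&gt;A rough check of these timescales, assuming steady 2.3% growth from today's 21 TW. The function name is my own, and the thresholds are the round numbers quoted earlier, so expect answers that differ a little from the figure, which presumably starts from its own baseline.&lt;/p&gt;

```python
import math

GROWTH = 0.023      # per year, continuous growth rate from the text
P_NOW = 21e12       # W, current civilization power
P_EARTH = 1.7e17    # W, total solar flux incident on the Earth
P_SUN = 4e26        # W, total solar luminosity

def years_to_reach(target, start=P_NOW, rate=GROWTH):
    """Years of exponential growth needed to go from start to target."""
    return math.log(target / start) / rate

print(round(math.exp(GROWTH * 100), 1))   # growth factor per century
print(round(years_to_reach(P_EARTH)))     # all sunlight hitting the Earth
print(round(years_to_reach(P_SUN)))       # the sun's entire output
```

&lt;p&gt;With these inputs the Earth limit arrives in just under four centuries and the full solar output in a bit over thirteen.&lt;/p&gt;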
&lt;p&gt;Our civilization is already operating at a global scale, and we are already close to some fundamental physical limits to what we can do on this planet.  This naturally makes people wonder about alternatives, makes people look to the stars, as they represent the only possibilities for a future that would support the level of economic growth and development for the next several hundred years that could resemble the last several hundred years.&lt;/p&gt;
&lt;p&gt;Unfortunately, physics tends to stretch those dreams thin.&lt;/p&gt;
&lt;h2&gt;Dyson's original paper&lt;/h2&gt;
&lt;p&gt;If you haven't already, it's worth taking a moment to read
&lt;a href="https://fermatslibrary.com/s/search-for-artificial-stellar-sources-of-infrared-radiation"&gt;
Dyson's original letter to Science&lt;/a&gt; that laid out the idea of a Dyson sphere;&lt;sup&gt;&lt;a href="#dyson-paper"&gt;10&lt;/a&gt;&lt;/sup&gt; it's only a single page.  The point of the original paper was to make the case that in SETI style searches for intelligent life elsewhere in the galaxy, we should consider advanced civilizations that may have built a structure around their star, and in so doing, modified the spectral emissions of that star.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="dyson-paper"&gt;10&lt;/sup&gt;
    Dyson, Freeman J. "Search for artificial stellar sources of infrared radiation." Science 131.3414 (1960): 1667-1668.
    &lt;a href="https://fermatslibrary.com/s/search-for-artificial-stellar-sources-of-infrared-radiation"&gt;
        [fermat's library]&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;To do this, he tries to lay out the feasibility of such a civilizational program.  He points out that in our solar system, we &lt;em&gt;reasonably&lt;/em&gt; have access to the mass of Jupiter, $ 2\times 10^{30}\,\textrm{g} $, and the total solar luminosity of $4\times 10^{33} \,\textrm{erg/s} = 4\times 10^{26} \,\textrm{W}$ as above.&lt;/p&gt;
&lt;p&gt;The first thing Dyson points out is that while this kind of scale feels infeasible for us to achieve, with exponential growth being what it is, we might expect our civilization (if it had unencumbered growth) to reach that level in a mere 3000 years at 1% growth.  In the figure above, we see we reach total solar output in only 1350 years with the 2.3% rate of growth.&lt;/p&gt;
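&lt;p&gt;Both timescales come out of the same compounding arithmetic.  As a sketch, assume we start from roughly 20 TW today (an assumed baseline, the same one used further down) and grow to the full solar luminosity of $4\times 10^{26}\,\textrm{W}$:&lt;/p&gt;
&lt;p&gt;$$
\frac{\ln\left( 4\times 10^{26}\,\textrm{W} \,/\, 20\,\textrm{TW} \right)}{\ln(1.01)} \approx \frac{30.6}{0.00995} \approx 3100 \,\textrm{years},
\qquad
\frac{30.6}{\ln(1.023)} \approx \frac{30.6}{0.0227} \approx 1350 \,\textrm{years}
$$&lt;/p&gt;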
&lt;p&gt;Dyson does some back of the envelope calculations to show that a structure built out of Jupiter (or a Jupiter-equivalent of mass) at about 2 AU could be 2 to 3 meters thick (and so one could conceivably live in it).  But of course, to actually assemble such a structure would require disassembling Jupiter itself!&lt;/p&gt;
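&lt;p&gt;To see roughly where a thickness like that comes from, spread Jupiter's mass over a sphere of radius $r = 2\,\textrm{AU} \approx 3\times 10^{13}\,\textrm{cm}$.  The resulting surface density is about&lt;/p&gt;
&lt;p&gt;$$
\frac{M}{4 \pi r^2} = \frac{2\times 10^{30}\,\textrm{g}}{4\pi \,(3\times 10^{13}\,\textrm{cm})^2} \approx 180 \,\textrm{g/cm}^2
$$&lt;/p&gt;
&lt;p&gt;Assuming a material density near $1\,\textrm{g/cm}^3$ (an illustrative value, not one taken from the text above), that is a shell roughly 2 meters thick.  That still leaves the energy cost of taking Jupiter apart.&lt;/p&gt;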
&lt;p&gt;Dyson actually works this out for us, and says that to pull apart Jupiter would require about $ 10^{44} \,\textrm{erg}$, and then makes this seem reasonable by pointing out that this would only take 800 years of total solar output.  Let's examine these numbers a little closer.&lt;/p&gt;
&lt;p&gt;First, as a rough back of the envelope estimate of the binding energy of a planet, we can just do some dimensional analysis and say that it should be on the order of:&lt;/p&gt;
&lt;p&gt;$$
\frac{G M^2}{R} = G \frac{(2\times 10^{30} \,\textrm{g})^2}{70\,000 \,\textrm{km}} \sim 4\times 10^{43}\,\textrm{erg} = 4\times 10^{36}\,\textrm{J}
$$&lt;/p&gt;
&lt;p&gt;Where $M$ is the mass of the planet, $R$ is its radius and $G$ is the gravitational constant.  So far so good: this is within a factor of a few of Dyson's figure.  To pull apart a planet the size of Jupiter takes a lot of energy.  Going by Dyson's $10^{44}\,\textrm{erg}$, if we had at our disposal our entire solar output, it would take nearly a millennium to do it!&lt;/p&gt;
&lt;p&gt;Even with the smaller estimate above, if we had a Dyson sphere already, something that captured a reasonable fraction of our total solar output, and we then dedicated all of that power to the singular task of ripping Jupiter apart, it would still take ~300 years to do so.&lt;/p&gt;
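&lt;p&gt;Spelling out both numbers, with the solar luminosity of $4\times 10^{26}\,\textrm{W}$ and taking the full output as available:&lt;/p&gt;
&lt;p&gt;$$
\frac{10^{37}\,\textrm{J}}{4\times 10^{26}\,\textrm{W}} = 2.5\times 10^{10}\,\textrm{s} \approx 800 \,\textrm{years},
\qquad
\frac{4\times 10^{36}\,\textrm{J}}{4\times 10^{26}\,\textrm{W}} = 10^{10}\,\textrm{s} \approx 300 \,\textrm{years}
$$&lt;/p&gt;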
&lt;p&gt;What if we didn't have the Dyson sphere yet? What if we only had access to the total solar flux on our own planet?  Let's assume we have covered our entire planet in 100% efficient solar panels and put the entire production towards disassembling Jupiter. How long would that take?&lt;/p&gt;
&lt;p&gt;$$
\frac{4 \times 10^{36} \, \textrm{J}}{170\,\textrm{PW}} \sim 10^{12} \,\textrm{years}
$$&lt;/p&gt;
&lt;p&gt;Or roughly 50x the total age of the universe!&lt;/p&gt;
&lt;p&gt;However, you might object that we could use the power from the budding Dyson sphere (or swarm) to power further construction of the swarm.
Since $dm/dt \propto m $ for whatever mass $m$ you've already eaten, the growth is exponential and the final answer will be only a constant factor larger than the original estimate of ~300 years.  This is true, but this is one of those cases where it's illustrative to work out what that "constant factor" is. For Dyson's parameters, that constant factor is&lt;/p&gt;
&lt;p&gt;$$
\log \frac{(4\times 10^{36} \, \textrm{J}) (1369 \,\textrm{W/m}^2) (70\,000\, \textrm{km}) }{(20 \,\textrm{TW}) (2 \,\textrm{m}) (3 \,\textrm{g/cm}^3) G (2\times10^{30} \,\textrm{g})}  \sim \log (3\times10^{13} ) \sim 30
$$&lt;/p&gt;
&lt;p&gt;So it would &lt;em&gt;only&lt;/em&gt; take ~$ 30 \times 300 \sim 10\,000$ years.&lt;/p&gt;
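&lt;p&gt;For the curious, here is a sketch of where that logarithm comes from, under simplifying assumptions.  Suppose the energy cost per unit mass stays fixed at $E/M$ with $E = GM^2/R$, and that a swarm of mass $m$, thickness $d$ and density $\rho$ sitting at 1 AU collects a power $P(m) = S\, m / (\rho\, d)$, with $S$ the solar constant.  The symbols $E$, $S$, $d$, $\rho$, $P_0$, $m_0$ and $\tau$ are labels introduced just for this sketch.  Then&lt;/p&gt;
&lt;p&gt;$$
\frac{dm}{dt} = \frac{M}{E} \, P(m) = \frac{m}{\tau},
\qquad
\tau = \frac{E \,\rho\, d}{M S} = \frac{G M \rho\, d}{S R}
$$&lt;/p&gt;
&lt;p&gt;Starting from the mass $m_0 = P_0\, \rho\, d / S$ that today's $P_0 = 20\,\textrm{TW}$ would correspond to, the time to consume the whole planet is&lt;/p&gt;
&lt;p&gt;$$
t = \tau \ln \frac{M}{m_0} = \tau \ln \frac{M S}{P_0 \,\rho\, d} = \tau \ln \frac{E \, S \, R}{P_0 \, d \, \rho \, G M}
$$&lt;/p&gt;
&lt;p&gt;which is the same argument as in the logarithm above.  With these parameters $\tau$ works out to roughly 260 years, close to the ~300 year estimate.&lt;/p&gt;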
&lt;p&gt;Given that, even being unreasonably generous, we have at most another 275 years of sustained economic growth, I don't see how building a Dyson sphere is possible, or if it is, it would take many, many millennia.&lt;/p&gt;
&lt;h2&gt;Mercury&lt;/h2&gt;
&lt;p&gt;Some modern proposals for a Dyson swarm seem to recognize that disassembling Jupiter is perhaps a bit much, and so instead suggest more &lt;em&gt;modest&lt;/em&gt; plans, such as using 10% of the total mass of Mercury.  Let's imagine we try to do this in the next 25 years, as some people seem to believe is possible.&lt;/p&gt;
&lt;p&gt;First, let's do a rough order of magnitude calculation.  The gravitational energy will be roughly:&lt;/p&gt;
&lt;p&gt;$$
\frac{GMm}{R} = \frac{G \cdot 0.1 \cdot (3 \times 10^{23}\,\textrm{kg})^2}{2400 \,\textrm{km}} = 3\times 10^{29} \, \textrm{J}
$$&lt;/p&gt;
&lt;p&gt;If we try to expend this much energy in the next 25 years, this equates to a power of:&lt;/p&gt;
&lt;p&gt;$$
\frac{3\times 10^{29}\,\textrm{J}}{25 \,\textrm{years}} = 3\times 10^{20}\,\textrm{W} = 300\,000\,000 \,\textrm{TW}
$$&lt;/p&gt;
&lt;p&gt;This exceeds the total solar flux on the Earth by a factor of 2000!&lt;/p&gt;
&lt;p&gt;There is absolutely no way this can happen.  We would need a Dyson swarm to build one.&lt;/p&gt;
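&lt;p&gt;For reference, that factor comes from comparing the required power to the roughly 170 PW of sunlight intercepted by the Earth (the figure used in the Jupiter estimate), keeping only order-of-magnitude precision:&lt;/p&gt;
&lt;p&gt;$$
\frac{3\times 10^{20}\,\textrm{W}}{170\,\textrm{PW}} = \frac{3\times 10^{20}\,\textrm{W}}{1.7\times 10^{17}\,\textrm{W}} \approx 1800 \sim 2000
$$&lt;/p&gt;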
&lt;p&gt;The Earth's finite limits are within reach.  As Tom points out in the article, currently the waste heat from our own energy production is only 10x smaller than the $ \sim 1 \, \text{W/m}^2$ forcing we worry about from global warming.  That means, at 2.3% growth, within the next century the &lt;em&gt;waste heat&lt;/em&gt; from our own power production will match, and thereafter dominate over, the current effects of global warming in heating the planet.  We don't have much headroom left.&lt;/p&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Overall, I haven't seen a lot of discussion in the wild about the energy requirements for things like Dyson spheres.  I was pleasantly surprised to see that Dyson himself at least thought about these implications.  For him, it was largely about the search for other intelligent life in the universe.&lt;/p&gt;
&lt;p&gt;I worry that with a lot of exponential curves, it's very easy to continue the dotted line, but all exponentials have to come to an end. They must.  There are always finite limits, and even if those limits seem to be far away, with any exponential they come racing up exponentially fast.&lt;/p&gt;
&lt;p&gt;One thing I'd like to try to do personally is be more aware of the limits.  Even in the context of things like AI scaling, I think it's very easy for us to extrapolate a curve but not consider the costs associated with that extrapolation, and whether those costs are in keeping with physical constraints.&lt;/p&gt;
&lt;p&gt;Humanity is already operating at a global scale; we honestly do not have a lot of additional headroom left to grow.  Machine learning has been growing exponentially, but may not have much farther to go, given data or compute constraints.  It seems like now is the time to think more judiciously about how to make the best of what growth we have left.&lt;/p&gt;
&lt;p&gt;If you want to learn more about our finite planet and some of the challenges we'll face with energy, I highly recommend &lt;a href="https://open.umn.edu/opentextbooks/textbooks/980"&gt;Tom Murphy's book&lt;/a&gt;,&lt;sup&gt;&lt;a href="#murphybook"&gt;11&lt;/a&gt;&lt;/sup&gt; which is freely accessible online.&lt;/p&gt;
&lt;aside&gt;&lt;sup id="murphybook"&gt;11&lt;/sup&gt;
    Murphy Jr, Thomas W. Energy and human ambitions on a finite planet. 2021. &lt;a href="https://open.umn.edu/opentextbooks/textbooks/980"&gt;[website]&lt;/a&gt;
&lt;/aside&gt;
&lt;p&gt;The bottom line is that the next 300 years are going to have to look different than the previous 300 years.  Maybe they look different because we escape the bounds of this planet and start living some cyberpunk future, or maybe we'll face the consequences of hitting the finite bounds of this planet.  In either case, the present is quite special and very near a turning point.&lt;/p&gt;
</content:encoded><guid isPermaLink="true">https://blog.alexalemi.com/dyson-spheres.html</guid><category domain="https://alexalemi.com/posts/">posts</category><pubDate>Sun, 21 Sep 2025 00:00:00 -0400</pubDate></item><item><title>EP15: The Information Bottleneck and Scaling Laws with Alex Alemi</title><link>https://open.spotify.com/show/4gaj9tzCrNjP9e66pJQGnl</link><description>I appeared on an episode of the Information Bottleneck Podcast and talked about scaling laws. / Information Bottleneck Podcast</description><guid isPermaLink="true">https://alexalemi.com/talks/information-bottleneck-podcast.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sat, 01 Nov 2025 00:00:00 -0400</pubDate></item><item><title>Information Theory for Representation Learning</title><link>https://docs.google.com/presentation/d/1CfCGJd3DeyGEdR1tbkNiNWh32UffwSbamFfttSxyvEE/present?usp=sharing</link><description>An overview of how information theoretic principles motivate and advance representation learning, combining variational bounds on mutual information with deep neural networks across unsupervised, supervised, Bayesian, and predictive settings. / Emory TBIO Journal Club</description><guid isPermaLink="true">https://alexalemi.com/talks/infotheory-emory.html</guid><category domain="https://alexalemi.com/talks/">talks</category><pubDate>Sun, 01 Mar 2026 00:00:00 -0500</pubDate></item></channel></rss>