AlexAlemi.com (https://alexalemi.com/rss.xml)
Why KL?
Why is the KL divergence so special?
<p>The <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler
divergence</a>,
or KL divergence, or relative entropy, or relative information, or information
gain, or expected weight of evidence, or information divergence
(it goes by a lot of different names) is unique
among the ways to measure the difference between two probability
distributions. It holds a special and privileged place, being used to define
all of the core concepts in information theory, such as mutual information.</p>
<p>Why is the relative information so special and where does it come from?
How should you interpret it? What is a nat anyway? In this
note, I'll try to give a better understanding and set of intuitions about
what KL is, why it's interesting, where it comes from and what it's good for.</p>
<h2>Information Gain</h2>
<p>Let's see if we can motivate the form of the KL axiomatically.</p>
<p>Imagine we have some prior set of beliefs summarized as a probability distribution $q$.
In light of some kind of evidence, we update our beliefs to a new distribution $p$.
How <em>much</em> did we update our beliefs? How do we quantify
the <em>magnitude</em> of that update? What are some properties we might want this
hypothetical function to have? Let $I[p; q]$ denote the function that measures
how much we moved beliefs when we switch from beliefs $q$ to beliefs $p$. We'll
call this amount of update the <em>information gain</em> when we move from $q$ to $p$.
<sup><a href="#hobson">1</a></sup></p>
<aside> <sup id="hobson">1</sup>
What follows is my own reconstruction of the fabulous paper:
<a href="https://link.springer.com/article/10.1007/BF01106578">
<b>A New Theorem of Information Theory</b> by Arthur Hobson
</a>.
</aside>
<p>We want our information function to satisfy the following properties:</p>
<ol>
<li>It's <strong>continuous</strong>. A small change in the distributions makes a small change in the amount of information in the move.</li>
<li>It's permutation or <strong>reparameterization independent</strong>. It doesn't matter if we change the units we've specified our distributions in or if we relabel the sides of our dice, the answer shouldn't change.</li>
<li>We want it to be <strong>non-negative</strong> and have the value $I = 0$ if and only if $p = q$. If $p=q$ we haven't updated our beliefs and so have no information gain.</li>
<li>We want it to be <strong>monotonic</strong> in a natural sense. If we, for instance, start with some uniform distribution over the 24 people in a game of <a href="https://en.wikipedia.org/wiki/Guess_Who%3F">Guess Who?</a> and then update to only 5 remaining suspects, $I$ should be larger than if there were still 12 remaining suspects.</li>
<li>Finally, we want our information function to <strong>decompose</strong> in a natural and <strong>linear</strong> way.<sup><a href="#renyi">2</a></sup> In particular, we want to be able to relate the information between two joint distributions in terms of the information between their marginal and conditional distributions.</li>
</ol>
<aside> <sup id="renyi">2</sup>
If one relaxes the requirement for linear decomposition and instead just requires that our information
function decompose in a convex way, you get the generalized set of
<a href="https://en.wikipedia.org/wiki/R%C3%A9nyi_entropy#R%C3%A9nyi_divergence">Rényi divergences</a>.
See: <a href="https://projecteuclid.org/euclid.bsmsp/1200512181">
<i>On Measures of Entropy and Information</i> by Alfréd Rényi.</a>
</aside>
<p>These are all very natural properties for our information function to have. That last point about composition needs to be elaborated.
The point is that we have alternative ways we might express a probability distribution. Apropos of nothing, imagine we
are concerned that we might have been exposed to a disease and are thinking about getting a test done. There are two random variables
under consideration: $\mathcal{D}$ for whether we actually had the disease or not,
and $\mathcal{T}$ for whether
the test result is positive. Each of these random variables can take on two possible states, which we'll denote as
$\mathcal{D} \in \{ D, \overline D \}, \mathcal{T} \in \{ T, \overline T \}$.
$D$ represents the state of our having-had-the-disease random variable $\mathcal{D}$ being positive, meaning we actually
did have the disease. $\overline D$ denotes we actually didn't.
With two binary random variables, there are 4 possible outcomes $(\{ DT, D\overline T, \overline D T, \overline D \overline T\})$
and fully specifying our set of beliefs requires 3 independent probabilities.</p>
<aside> <sup id="kent">3</sup>
An “<i>Almost Certainly Not</i>” is 7% on
the <a href="https://en.wikipedia.org/wiki/Words_of_estimative_probability">Kent's words of Estimative Probability</a> list.
</aside>
<aside> <sup id="covid">4</sup>
See for instance the RDT Cellex Inc. <a href="https://www.centerforhealthsecurity.org/resources/COVID-19/serology/Serology-based-tests-for-COVID-19.html">SARS-COV-2 Test</a>.
</aside>
<p>What are our prior beliefs?
Let's imagine that, while we are concerned we might have had the disease, if we are being honest,
we almost certainly didn't,<sup><a href="#kent">3</a></sup>
so we'll put our prior belief in having had the disease at 7% $(q(D) = 0.07)$.
How do we expect the antibody test to go if we have it done?
You do a bit of research and discover
that if you had had the disease, the sensitivity or <em>true positive rate</em> of the
test you're about to take is 93.8% $(q(T|D) = 0.938)$.
The specificity or <em>true negative rate</em> of that
same test is 95.6% $(q(\overline T | \overline D) = 0.956)$. <sup><a href="#covid">4</a></sup></p>
<figure id="#conditional" class="right">
<center>
<img width="45%" src="figures/KLdiagram2.svg"
alt="Conditional characterization of distribution.">
<img width="45%" src="figures/KLdiagram.svg"
alt="Joint characterization of distribution.">
<figcaption>
Figure 1. Two equivalent ways to express the joint distribution $q(\mathcal{D}\mathcal{T})$.
</figcaption>
</center>
</figure>
We've just specified our prior beliefs with 3 numbers, imagining our process as having two steps:
first, we either had the disease or not $(q(\mathcal{D}))$, and then, conditioned on that,
we get the result of our test $(q(\mathcal{T}|\mathcal{D}))$.
Equivalently, we could have just given the joint probability distribution, as shown in Figure 1.
<p>The point now is that if we were to update our beliefs, the diagram on the right involves just a single
distribution $q(\mathcal{D},\mathcal{T})$, while the one on the left involves essentially three different distributions
$(q(\mathcal{D}), q(\mathcal{T}|D), q(\mathcal{T}| \overline D))$, and we want
some sort of <em>structural</em> consistency between the two sides:
$$
I[p(\mathcal{D},\mathcal{T}); q(\mathcal{D},\mathcal{T})] \quad \textrm{versus} \quad
I[p(\mathcal{D}); q(\mathcal{D})], I[p(\mathcal{T}|D); q(\mathcal{T}|D)],
I[p(\mathcal{T}|\overline D); q(\mathcal{T}|\overline D)] .
$$</p>
<p>The consistency we will require is that our information measure decomposes linearly between
these two different descriptions. The information between the joints should be a weighted
linear combination of the informations of the three constituent distributions.
In this particular case we will require:
$$ I[p(\mathcal{D},\mathcal{T}); q(\mathcal{D},\mathcal{T})] = I[p(\mathcal{D}); q(\mathcal{D})] + p(D) I[p(\mathcal{T}|D); q(\mathcal{T}|D)] + p(\overline D) I[p(\mathcal{T}|\overline D); q(\mathcal{T}|\overline D)] .
$$
In words: The information in the full joint update is the information update for
your belief in whether or not you had the disease $(q(\mathcal D))$ <em>plus</em> the informations
in the two conditional distributions, but weighted by how often we find ourselves in each of those
branches, as measured by our updated beliefs $(p(\mathcal{D}))$.</p>
<p>More generally we are requiring that our information function satisfies a natural <em>chain rule</em>:
$$ I[ p(X,Y); q(X,Y) ] = I[ p(X); q(X) ] + \mathbb{E}_{p(X)} \left[ I[ p(Y|X); q(Y|X) ] \right] $$</p>
<p>Notice that it is here, in this sort of structural independence that we make
our information function manifestly asymmetric. Here our $p$ distribution
becomes distinguished over our $q$ as it is the one we use to weight the child
contributions. This makes sense if we imagine that $p$ is the actual
distribution that events are drawn from, for it means that this will correspond
to the information we would observe in expectation.</p>
<p>The interesting thing is that if you want your information function to satisfy
all of these seemingly reasonable properties, that is enough to determine it
<em>uniquely</em>. The only function satisfying all of these properties is the
relative entropy, or KL divergence we all know and love:
$$
I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)}
$$</p>
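<p>As a sanity check, here is a minimal Python sketch that evaluates the discrete analogue of this formula, $\sum_x p(x) \log \frac{p(x)}{q(x)}$, and confirms numerically that it satisfies the linear decomposition required above on the disease-testing example. The prior numbers are the ones from the text; the updated beliefs <code>p_D</code>, <code>p_T_given_D</code> and <code>p_T_given_notD</code> are hypothetical values chosen purely for illustration, as are all of the function names.</p>

```python
import math


def kl(p, q):
    """Discrete information gain I[p; q] in nats (0 log 0 is taken to be 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def bernoulli(a):
    """Two-outcome distribution [a, 1 - a]."""
    return [a, 1 - a]


def joint(d, t_given_d, t_given_notd):
    """Joint over the outcomes (D T, D notT, notD T, notD notT)."""
    return [
        d * t_given_d,
        d * (1 - t_given_d),
        (1 - d) * t_given_notd,
        (1 - d) * (1 - t_given_notd),
    ]


# Prior beliefs q, taken from the text.
q_D = 0.07                   # q(D)
q_T_given_D = 0.938          # q(T | D), the sensitivity
q_T_given_notD = 1 - 0.956   # q(T | not D), one minus the specificity

# Hypothetical updated beliefs p (illustrative numbers, not from the text).
p_D = 0.30
p_T_given_D = 0.90
p_T_given_notD = 0.10

# Left-hand side: information gain between the two joint distributions.
lhs = kl(joint(p_D, p_T_given_D, p_T_given_notD),
         joint(q_D, q_T_given_D, q_T_given_notD))

# Right-hand side: marginal term plus the p-weighted conditional terms.
rhs = (kl(bernoulli(p_D), bernoulli(q_D))
       + p_D * kl(bernoulli(p_T_given_D), bernoulli(q_T_given_D))
       + (1 - p_D) * kl(bernoulli(p_T_given_notD), bernoulli(q_T_given_notD)))

print(lhs, rhs)
```

<p>The two printed numbers should agree up to floating point rounding, whichever updated beliefs are plugged in.</p>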
<p>See <a href="https://link.springer.com/article/10.1007/BF01106578">
<b>A New Theorem of Information Theory</b> by Arthur Hobson
</a> for a complete proof,
but here I'll offer a more colloquial argument like the one
given by Ariel Caticha.<sup><a href="#caticha">5</a></sup></p>
<aside> <sup id="caticha">5</sup>
<i>Lectures on Probability, Entropy and Statistical Physics</i> by
Ariel Caticha. <a href="https://arxiv.org/abs/0808.0012">arXiv:0808.0012</a>
</aside>
<p>We will start with and focus on the continuous setting, where we have two probability
distributions $p$ and $q$. We seek a functional that takes our two distributions
and gives back our information gain and we seek one that is <em>local</em> in the physics sense,
meaning that our <em>functional</em> can be written as the integral of a <em>function</em> depending
only on the values the probability densities take at each point:
$$ I[p;q] = \int \mathrm dx\, \mathcal{A}(x, p(x), q(x)). $$</p>
<p>Our requirement that our information gain be
<em>reparameterization independent</em> means it has to
be invariant to any remapping of our coordinates, or in other words,
it has to be dimensionless. Imagine $x$ has units of length; then our integral
measure $\mathrm dx$ has units of length, and the densities $p(x), q(x)$
have units of inverse length. In order to be dimensionally consistent
our functional must take the form:<sup><a href="#caveat">6</a></sup>
$$ I[p;q] = \int \mathrm dx\, p(x) f\left( \frac{p(x)}{q(x)} \right). $$</p>
<aside> <sup id="caveat">6</sup>
We could have just as well written it as $I[p;q] = \int \mathrm dx\, q(x) g\left( \frac{p(x)}{q(x)} \right)$ (that is, the form
of an <a href="https://en.wikipedia.org/wiki/F-divergence">f-divergence</a>), but
this is equivalent to the way we wrote it with $g(\mathcal{X}) = \mathcal{X} f(\mathcal X)$.
Putting the $p(x)$ as the integral measure better aligns with what we are about to do next.
</aside>
<p>Finally, our decomposability requirement above when written out in terms of
continuous densities takes the form:
$$ I[ p(x,y); q(x,y) ] = I[ p(x); q(x) ] + \int \mathrm dx\, p(x) I[p(y|x) ; q(y|x)] $$</p>
<p>Combining this linear decomposition requirement with the functional form
we just derived and pushing some equations around gives us:
$$
\begin{align}
I[ p(x,y); q(x,y) ] &= I[p(x); q(x)] + \int \mathrm dx\, p(x) I[p(y|x); q(y|x)] \\
\int \mathrm dx\, \mathrm dy\, p(x,y) f\left(\frac{p(x,y)}{q(x,y)} \right)&= \int \mathrm dx\, p(x) f\left(\frac{p(x)}{q(x)} \right) + \int \mathrm dx\, p(x) \int \mathrm dy\, p(y|x) f\left(\frac{p(y|x)}{q(y|x)} \right) \\
\int \mathrm dx\, \mathrm dy\, p(x) p(y|x) f\left(\frac{p(x)p(y|x)}{q(x)q(y|x)} \right)&= \int \mathrm dx\, \mathrm dy\, p(x) p(y|x) \left[ f\left(\frac{p(x)}{q(x)} \right) + f\left(\frac{p(y|x)}{q(y|x)} \right)\right] .
\end{align}
$$
Since this has to hold for arbitrary distributions, our function $f$ must satisfy the property:
$$ f(ab) = f(a) + f(b). $$
This well known functional equation has a unique (up to a multiplicative constant) <em>continuous</em> solution:
$$ f(x) = c \log x. $$
We can roll the choice of multiplicative constant into our choice of base for the logarithm and arrive at our final form
for our information gain:
$$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)}. $$</p>
<p id="#non-negative-proof">As for the non-negativity, our final form satisfies that property. Because we have that $\log x \leq x -1$:
$$ I[p;q] = \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x)} = -\int \mathrm dx \, p(x) \log \frac{q(x)}{p(x)} \geq
-\int \mathrm dx\, p(x) \left( \frac{q(x)}{p(x)} - 1 \right) = 0. $$
<aside>
<img width="100%" src="figures/logbound.svg"
alt="Visual demonstration of log x ≤ x - 1.">
</aside>
</p>
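<p>The same two facts can be spot-checked numerically. The sketch below is only an illustration, not a proof: it checks the bound $\log x \leq x - 1$ on a grid of points and evaluates the discrete information gain on randomly generated pairs of distributions. The helper names, the number of outcomes and the random seed are all arbitrary choices made for this example.</p>

```python
import math
import random


def kl(p, q):
    """Discrete information gain I[p; q] in nats (0 log 0 is taken to be 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_distribution(n, rng):
    """A random distribution over n outcomes with strictly positive weights."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# The bound log x <= x - 1 on a grid of positive x values.
grid = [0.01 * k for k in range(1, 1001)]
bound_holds = all(math.log(x) <= x - 1 + 1e-12 for x in grid)

# Information gain between many random pairs of distributions.
rng = random.Random(0)
gains = []
for _ in range(1000):
    p = random_distribution(6, rng)
    q = random_distribution(6, rng)
    gains.append(kl(p, q))

smallest_gain = min(gains)   # never negative
self_gain = kl(p, p)         # no update, so no information gain

print(bound_holds, smallest_gain, self_gain)
```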
<h2>Bayes' Rule</h2>
<aside> <sup id="caticha2">7</sup>
I first saw this form of motivation for Bayes Rule in
<i>Lectures on Probability, Entropy and Statistical Physics</i> by
Ariel Caticha. <a href="https://arxiv.org/abs/0808.0012">arXiv:0808.0012</a>
</aside>
Having identified the right way to measure how much information is gained when we update a distribution
from $q$ to $p$, let's put this to practical use and try to figure out how we
<i>ought</i> to update
our beliefs in light of evidence or observations.<sup><a href="#caticha2">7</a></sup>
<p>Returning to our disease testing example, let's say you get the test done and receive a
positive result $(\mathcal T = T)$.
What should your new distribution of beliefs be? First off, having observed the result of the test,
our updated beliefs should be consistent with that observation, so we set $p(T) = 1$.
But this doesn't fully specify $p$; we need two more numbers. How should
we set those?</p>
<p>Why don't we aim to be conservative and try to find a new set of beliefs
that are as close as possible to our prior beliefs while still being consistent with the
observation that we've made?<br />
Namely, let's look now for a joint distribution $p(\mathcal T, \mathcal D)$
that is as close as possible to $q(\mathcal T, \mathcal D)$ but for which we have that $p(T)=1$.
$$ \DeclareMathOperator{\argmin}{arg\,min} $$
$$ \argmin_{p(\mathcal D, \mathcal T)} I[p(\mathcal D, \mathcal T); q(\mathcal D, \mathcal T)] \quad \text{ s.t. }\quad p(T) = 1 $$
Now that we know
how to measure how much information is gained in updating our beliefs, we will
find the $p$ that minimizes this update while still being true to the observation we made.
Writing $p(\mathcal D,\mathcal T) = p(\mathcal T)p(\mathcal D|\mathcal T)$
and using our linear decomposition rule from above (the other way around), with the $\overline T$ branch dropping out because $p(\overline T) = 0$, we have:
$$ I[p(\mathcal D,\mathcal T); q(\mathcal D,\mathcal T)] = I[p(\mathcal T);q(\mathcal T)] + I[p(\mathcal D|T);q(\mathcal D|T)]. $$
Because we've decided to fix $p(T)=1$ in order to be consistent with our
observation, the first term is fixed (it equals $-\log q(T)$), and the way to minimize the information between the joints is to set $p(\mathcal D|T)=q(\mathcal D|T)$ so
that our second term vanishes. In this particular case this means:
$$ p(T)=1 $$
$$ p(D|T) = q(D|T) = \frac{q(T|D)q(D)}{q(T|D)q(D) + q(T|\overline D)q(\overline D)} = 0.616 $$</p>
<p>Furthermore, the marginal distribution of our updated beliefs about our disease status is:
$$ p(D) = p(D|T)p(T) = q(D|T) = 0.616$$
In this particular case our updated belief is only about 3 to 2 odds
that we actually had the disease, despite our positive test result. In Figure 2
we show both our prior in this factorization as well as our new beliefs.</p>
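<p>A brute-force check of this argument is sketched below, using only the numbers given in the text. With $p(T) = 1$ the only free number left in the updated joint is $a = p(D|T)$, so we can scan over $a$ and see where the information gain is smallest. The grid resolution and all variable names are arbitrary choices made for this illustration.</p>

```python
import math


def kl(p, q):
    """Discrete information gain I[p; q] in nats (0 log 0 is taken to be 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Prior beliefs from the text.
q_D, sensitivity, specificity = 0.07, 0.938, 0.956

# Prior joint over the outcomes (D T, D notT, notD T, notD notT).
q_joint = [
    q_D * sensitivity,
    q_D * (1 - sensitivity),
    (1 - q_D) * (1 - specificity),
    (1 - q_D) * specificity,
]

# Bayes' rule applied to the prior: q(D | T).
q_T = q_joint[0] + q_joint[2]
bayes_posterior = q_joint[0] / q_T


def gain(a):
    """Information gain of the updated joint with p(T) = 1 and p(D | T) = a."""
    return kl([a, 0.0, 1 - a, 0.0], q_joint)


# Scan a = p(D | T) over a fine grid and keep the minimizer.
candidates = [k / 10000 for k in range(1, 10000)]
best = min(candidates, key=gain)

print(bayes_posterior, best, gain(bayes_posterior), -math.log(q_T))
```

<p>The minimizer found by the scan should sit on the grid point next to $q(D|T)$, and the minimal information gain should match $-\log q(T)$, the part of the update that the observation forces on us.</p>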
<figure id="#posterior" class="right">
<center>
<img width="35%" src="figures/KLdiagram2q.svg"
alt="Prior distribution of beliefs.">
<img width="35%" src="figures/KLdiagram2p.svg"
alt="Posterior distribution of beliefs.">
<figcaption>
Figure 2. Our prior (left, blue, notice that we've swapped the order of the conditioning) and updated (right, orange) beliefs after observing that the test was positive.
</figcaption>
</center>
</figure>
<p>Notice what just happened. If we look for a new distribution that is as close as possible
to our previous distribution of beliefs (as measured by $I[p;q]$) which is also consistent
with our observations, we end up with an updated, or <em>posterior</em> set of beliefs given
by Bayes' Rule. Imagine we had some observable $x$ and some parameters $\theta$. Our
prior set of beliefs are described by the joint distribution $q(\theta,x) = q(x|\theta)q(\theta)$:
a <em>likelihood</em> $q(x|\theta)$ of how we expect the data to be distributed given
the parameter values and some <em>prior</em> $q(\theta)$ set of beliefs about what values
those parameters can take. If we make an observation and see some value for our observable $x=X$,
what ought our new beliefs be? If we search for the joint distribution $p(x,\theta)$ that is
as close as possible to our previous beliefs $q(x,\theta)$ but that no longer has any
uncertainty about the value the observable will take $(p(x) = \delta(x-X))$ we see
that minimizing the information gain:
$$ I[p;q] = I[p(x);q(x)] + \int \mathrm dx\, p(x) \, I[p(\theta|x); q(\theta|x)], $$
is accomplished if we set $p(\theta|x) = q(\theta|x)$, yielding the updated joint:
$$ p(x,\theta) = p(x)p(\theta|x) = \delta(x-X) q(\theta|x) $$
and the marginal beliefs about the parameters to be:
$$ p(\theta) = \int \mathrm dx\, p(x,\theta) = \int \mathrm dx\, \delta(x-X) q(\theta|x) = q(\theta|X), $$
or precisely what you probably thought it should have been anyway if you've heard
of Bayesian inference.</p>
<p>Although, if you stop to think about it, even though many of us know of and have
used <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes' Theorem</a>
for a long time, the way it's normally presented, it is just a trivial statement
about how joint distributions factor.
$$ q(\theta, D) = q(\theta) q(D|\theta) = q(D) q(\theta|D) \implies
q(\theta|D) = \frac{q(D|\theta) q(\theta)}{q(D)}. $$
But this is just a statement about the distribution
$q$, our prior beliefs. It tells us nothing about how we should update those
beliefs in light of observations. However, the previous argument demonstrates
that if you want to set your updated beliefs such that they are as close
as possible to your prior beliefs while being consistent with your
observations, you should set your updated beliefs according to
Bayes' rule run on the prior beliefs.</p>
<h2>Expected Weight of Evidence</h2>
<p>Traditionally, KL is interpreted from a coding perspective, a view I've included in an appendix below,
but here I offer a different perspective from the viewpoint of model selection.</p>
<p>Above we saw that we can motivate Bayesian inference as choosing a posterior belief distribution
that has the minimal information gain over our prior distribution of beliefs while being consistent
with our observations. This guides us towards forming better belief distributions, but what if we
just have two different belief distributions and wish to decide between them?</p>
<p>Really what we want to know is what is the probability that our beliefs are correct in light of evidence?
Symbolically you might write this as $p(P|E)$ where $P$ is some belief distribution and $E$ is some
evidence, data, or observations. If we run Bayes' Theorem we can see that:
$$ p(P|E) = \frac{p(E|P) p(P)}{p(E)}. $$
We can update our belief in our beliefs being correct by setting our updated
weight in the belief $p(P|E)$ to be proportional to our initial weight $p(P)$ times
the <em>likelihood</em> that the evidence we observed would have been generated if our belief was true $(p(E|P))$. The probability of the evidence given the belief $P$ is just the likelihood $P(E)$.
It is only proportional because we would also need to know how likely the evidence would be $(p(E))$ amongst all possible
beliefs. This last part, the <a href="https://en.wikipedia.org/wiki/Marginal_likelihood"><i>marginal likelihood</i></a>,
is notoriously difficult to compute. In principle, it is asking us to evaluate how likely
the evidence would be from all possible models.</p>
<p>However, we can make further progress if we content ourselves with not necessarily knowing the
absolute probability our model or beliefs are correct, but instead just its probability relative
to some other model. If we consider the ratio of two different models $P$ and $Q$ we have:
$$ \frac{p(P|E)}{p(Q|E)} = \frac{p(E|P)}{p(E|Q)} \frac{p(P)}{p(Q)}. $$
Notice that the marginal likelihoods cancel out. This is saying that whatever our prior relative odds for the two models
being correct, if we compute the <a href="https://en.wikipedia.org/wiki/Bayes_factor"><i>Bayes factor</i></a>
$\left( \frac{p(E|P)}{p(E|Q)} \right)$, it tells us how the relative probabilities of the two beliefs should update
in light of the evidence. Taking a log on both sides:
$$ \log \frac{p(P|E)}{p(Q|E)} = \log \frac{p(E|P)}{p(E|Q)} + \log \frac{p(P)}{p(Q)},$$
turns this multiplicative factor into an additive one.</p>
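<p>To make the additive form concrete, the following sketch reuses the numbers from the disease-testing example. Treating "had the disease" and "did not have the disease" as the two competing hypotheses and the positive test as the evidence is a choice made for this illustration; the variable names are likewise hypothetical.</p>

```python
import math

# Numbers from the disease-testing example.
prior = 0.07          # prior probability of having had the disease
sensitivity = 0.938   # probability of a positive test given the disease
specificity = 0.956   # probability of a negative test given no disease

# Prior log odds for "disease" over "no disease", in nats.
prior_log_odds = math.log(prior / (1 - prior))

# Log Bayes factor contributed by observing a positive test.
log_bayes_factor = math.log(sensitivity / (1 - specificity))

# The update is a simple addition on the log-odds scale.
posterior_log_odds = prior_log_odds + log_bayes_factor

# Convert the posterior log odds back into a probability.
posterior = 1 / (1 + math.exp(-posterior_log_odds))

print(prior_log_odds, log_bayes_factor, posterior_log_odds, posterior)
```

<p>The resulting probability should match the value of $q(D|T)$ obtained from Bayes' rule earlier.</p>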
<p>If what we are deciding between is two different probability distributions, you may recognize that this additive <i>weight of evidence</i>
for $p$ over $q$ when we observe $x$ is precisely the integrand in our information gain:
$$ w[x; p,q] = \log \frac{p(x)}{q(x)}. $$
The log ratio of two probability distributions measures by how much you should update your prior log odds between the two distributions being
correct. The KL divergence is just then the expected weight of evidence if we draw samples from $p(x)$ itself:
$$ I[p;q] = \mathbb{E}_p\left[ \log \frac{p(x)}{q(x)} \right] = \mathbb{E}_p \left[ w[x; p,q] \right]$$</p>
<p>So, one way to interpret the relative entropy is that if our data was actually coming from the distribution $p$ and we had some other
hypothesis $q$, then $I[p;q]$ measures on average how much we should believe $p$ over $q$ on each observation. In order to make that
statement more precise, we need a better language to talk about the magnitudes of these quantities.</p>
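<p>The "expected weight of evidence" reading can be checked by simulation. The sketch below draws samples from a distribution $p$, averages the weight of evidence $\log \frac{p(x)}{q(x)}$ over those samples, and compares the average with the exact information gain. The two three-outcome distributions, the sample size and the seed are made-up values for illustration only.</p>

```python
import math
import random

# Two hypothetical distributions over three outcomes (illustrative values).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]


def weight_of_evidence(x):
    """Weight of evidence for p over q, in nats, upon observing outcome x."""
    return math.log(p[x] / q[x])


# Exact information gain I[p; q].
exact = sum(px * math.log(px / qx) for px, qx in zip(p, q))

# Monte Carlo estimate: average the weight of evidence over samples from p.
rng = random.Random(0)
n = 100_000
samples = rng.choices(range(len(p)), weights=p, k=n)
estimate = sum(weight_of_evidence(x) for x in samples) / n

print(exact, estimate)
```

<p>Individual observations can carry positive, zero or negative weight of evidence for $p$; it is only the average under $p$ that is guaranteed to be non-negative.</p>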
<h2>How loud is the Evidence?</h2>
<p>Our measurement of the amount of information was only unique up to a choice of multiplicative constant. This is equivalent to
our choice of base for the logarithm. We can think of this as the <em>units</em> we use to measure our information. The traditional choices
would be to use the base-2 logarithm and measure the information in <em>bits</em>,<sup><a href="#bit">8</a></sup>
or to use the more mathematically convenient natural
logarithm and measure the information in <em>nats</em>. Another option is to measure the information in
<a href="https://en.wikipedia.org/wiki/Hartley_(unit)"><em>decibans</em></a> or <em>decibels</em> (tenths of a <em>hartley</em>), wherein
we use ten times the base-10 logarithm.</p>
<aside> <sup id="bit">8</sup>
<i>bit</i> being short for <i>binary digit</i>.
<i>nat</i> is then short for <i>natural digit</i>.
People sometimes suggest <i>dit</i> for the base-10 <i>decimal digit</i>.
Turing suggested <i>ban</i> as shorthand for the amount of evidence deduced about the setting
of the Enigma machine using the Banburismus method, itself named after the town of Banbury where
the team got their large card sheets used in the method.
For more discussion about the history and etymology of these and related units see section 4.8.1 of
<a href="https://books.google.com/books/about/Probability_Theory.html?id=tTN4HuUNXjgC&source=kp_book_description"><i>Probability Theory: The Logic of Science</i> by E.T. Jaynes</a>.
</aside>
$$ I[p;q] = 10 \int \mathrm dx\, p(x) \log_{10} \frac{p(x)}{q(x)}\, \textrm{dB} $$
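<p>Since the units differ only by a multiplicative constant, converting between them is a one-line affair. The sketch below computes the information gain for a made-up pair of two-outcome distributions in nats, converts it to bits and decibels, and checks the decibel value against a direct evaluation with ten times the base-10 logarithm. The distributions and the constant names are assumptions made for this example.</p>

```python
import math

# Conversion factors from nats.
NATS_TO_BITS = 1 / math.log(2)        # one nat is about 1.44 bits
NATS_TO_DECIBELS = 10 / math.log(10)  # one nat is about 4.34 dB

# Two hypothetical two-outcome distributions (illustrative values).
p = [0.5, 0.5]
q = [0.9, 0.1]

# Information gain in nats, using the natural logarithm.
gain_nats = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# The same quantity expressed in other units.
gain_bits = gain_nats * NATS_TO_BITS
gain_decibels = gain_nats * NATS_TO_DECIBELS

# Direct evaluation in decibels: ten times the base-10 logarithm.
direct_decibels = 10 * sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q))

print(gain_nats, gain_bits, gain_decibels, direct_decibels)
```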
<p>The nice thing about measuring information in decibans or <a href="https://en.wikipedia.org/wiki/Decibel">decibels</a>
is that people already have some familiarity with the unit, such as for measuring the <em>loudness</em> of sounds.
It's always a comparative measurement: for sound, we take $10 \log_{10} \frac{P}{P_0}$ of the power $P$
relative to some reference or baseline power $P_0$. In the same way, besides just measuring the KL between two distributions, we could
measure the comparative difference between any two probabilities on the log scale:
$$ 10 \log_{10} \frac{p(x)}{q(x)} \textrm{ dB}. $$</p>
<p>In particular, we could get some feeling for these quantities by comparing the probability something happens to the
probability it doesn't. Consider a simple binary outcome and take $q=1-p$. In this case, the weight of evidence
that the thing happens versus it doesn't, upon observing it happen once, is:
$$ 10 \log_{10} \frac{p}{1-p} \text{ dB}. $$
This essentially gives us a new scale to measure probabilities on.
Instead of expressing probabilities as a number between 0 and 1,
here we are computing the log <em>odds</em> of an event happening on the decibel scale.</p>
<p>Below in Table 1 is a summary of the correspondence between decibans and odds or probabilities, and
in Figure 3 is a large visual representation you can play with.</p>
<figure>
<center>
<table>
<thead><th>dB<th>odds<th>~odds<th>probability<th>spinner
<tr><td>0<td>1.00<td>1:1<td> 50%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-0" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>1<td>1.26<td>5:4<td>56%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-1" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>2<td>1.58<td>π:2<td>61%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-2" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>3<td>2.00<td>2:1<td>67%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-3" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>4<td>2.51<td>5:2<td>71.5%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-4" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>5<td>3.16<td>π:1<td>76%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-5" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>6<td>3.98<td>4:1<td>80%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-6" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>7<td>5.01<td>5:1<td>83%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-7" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>8<td>6.31<td>2π:1<td>86%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-8" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>9<td>7.94<td>8:1<td>89%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-9" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>10<td>10<td>10:1<td>91%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-10" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>11<td>12.6<td>4π:1<td>92.6%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-11" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>12<td>15.8<td>16:1<td>94%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-12" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
<tr><td>13<td>20<td>20:1<td>95%
<td><svg height="30" width="30" viewBox="0 0 20 20"> <circle r="10" cx="10" cy="10" fill="white" /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-13" stroke="#1f77b4" stroke-width="10" stroke-dasharray="0.942 2.200" /></svg>
</table>
</center>
<figcaption>
Table 1: A table of the correspondence between decibans/decibels and odds or probabilities.
</figcaption>
</figure>
<figure id="bigspin">
<center>
<svg height="300" width="300" viewBox="-2 -2 25 25"> <circle r="10" cx="10" cy="10" fill="white" stroke="black" stroke-width=0.2 /> <circle r="5" cx="10" cy="10" fill="whitesmoke" id="progress-100" stroke="#1f77b4" stroke-width="9.9" stroke-dasharray="3.141 3.141" /></svg>
<br />
<input value=0 type='number' style="width: 4em" id="percent" onchange="updatePercent();">
<label for="percent">dB</label>
<br/>
<input id="slider" style="width: 65%;" type="range" min="-23" step="0.1" max="23" value="0" class="slider"
oninput="updateSlider();" >
<figcaption>Figure 3: A larger visual representation of decibels as a probability that you can play with. Here the chosen number
of decibels measures the weight of evidence between the spinner giving a blue versus a white outcome.</figcaption>
</center>
</figure>
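<p>As a rough sketch of the arithmetic behind Table 1 and the spinner above, the map from decibels to odds and probabilities can be written in a few lines of Python. The function names are illustrative and not part of any library.</p>

```python
import math

def db_to_odds(db):
    """Odds ratio corresponding to a weight of evidence in decibels."""
    return 10.0 ** (db / 10.0)

def db_to_probability(db):
    """Probability corresponding to the odds at the given decibel value."""
    odds = db_to_odds(db)
    return odds / (1.0 + odds)

def probability_to_db(p):
    """Inverse map: decibels of evidence for an event of probability p."""
    return 10.0 * math.log10(p / (1.0 - p))

# Print the last few rows of the table: decibels, odds, probability.
for db in range(7, 14):
    print(f"{db:2d} dB  odds {db_to_odds(db):5.2f}:1  p = {db_to_probability(db):.1%}")
```

<p>The same two lines of arithmetic (decibels to odds, odds to probability) are what the page's spinner script uses to set its fill fraction.</p>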
<p>Another nice property of measuring evidence and probabilities in
decibels is that 1 dB seems to correspond roughly to the smallest change in an underlying distribution that people
notice: it is the difference between <i>even chance</i>
and 5 to 4 odds, described as <i>moderate probability</i> or <i>better than even chance</i>.</p>
<aside id="quantifying"><sup>9</sup>
<a href="https://projecteuclid.org/euclid.ss/1177012242"><i>Quantifying Probabilistic Expressions</i> by
Frederick Mosteller and Cleo Youtz</a>.
</aside>
Additionally, $10 \textrm{ dB}$ corresponds to 10 to 1 odds, or 91% probability, which people associate
with events being <i>almost certain</i> or happening <i>almost always</i>.<sup><a href="#quantifying">9</a></sup>
<p>The traditional statistical threshold for reported results is a <a href="https://en.wikipedia.org/wiki/P-value">p-value</a>
of 0.05, which is often <a href="https://en.wikipedia.org/wiki/Misuse_of_p-values">misinterpreted</a>
to mean that the probability that the null hypothesis is true is less than
5%. While this isn't what the p-value measures, if we obtain more than 13 dB of evidence against some
null hypothesis, this does mean that the relative odds that it is correct have decreased by more than a factor of 20,
leaving it at worse than 20 to 1 against if we started with even odds.</p>
<p>We have the conversions:
$$ 1 \textrm{ nat} = \frac{10}{\log 10} \textrm{ dB} = 4.34 \textrm{ dB} $$
$$ 1 \textrm{ bit} = \frac{10}{\log_2 10} \textrm{ dB} = 3.01 \textrm{ dB} $$</p>
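<p>For readers who prefer to check the conversion factors numerically, here is a minimal sketch. The constant and function names are hypothetical helpers chosen for illustration.</p>

```python
import math

# Conversion factors between units of information.
DB_PER_NAT = 10.0 / math.log(10.0)    # decibels per nat
DB_PER_BIT = 10.0 / math.log2(10.0)   # decibels per bit

def nats_to_db(nats):
    return nats * DB_PER_NAT

def bits_to_db(bits):
    return bits * DB_PER_BIT

def db_to_nats(db):
    return db / DB_PER_NAT

def db_to_bits(db):
    return db / DB_PER_BIT
```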
<h2>Examples and Magnitudes</h2>
<h3>Double-headed Coin</h3>
<p>Let's say I have two coins in my pocket: the first is an ordinary unbiased coin, and the second is double-headed.
I give you one of them and you start flipping the coin. You get a heads, then another heads, then another. How many
heads would you need to see in a row until you're sure you've been given the double-headed coin? Let's
work out the relative entropy between these two distributions. On the one hand we have $p(H)=1, p(\overline H) =0$,
and on the other $q(H) = q(\overline H)= 0.5$.</p>
<p>$$ I[p;q] = 10 \sum_i p_i \log_{10} \frac{p_i}{q_i} = 10 \log_{10} 2 = 3.01 \text{ dB} $$</p>
<p>The relative entropy of a sure thing and a coin flip is 3 decibels. Since 20 to 1 odds correspond to 13 dB and four heads give only 12 dB, if we want to be more sure than 20 to 1
that we have the double-headed coin we'd need to observe 5 heads in a row, giving us 15 dB of evidence.</p>
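<p>The coin calculation can be sketched as follows, assuming we treat each flip as an independent observation and ask how many flips push the accumulated evidence past 20 to 1 odds. The helper name <code>relative_entropy_db</code> is illustrative.</p>

```python
import math

def relative_entropy_db(p, q):
    """Relative entropy I[p;q] in decibels for two discrete distributions.

    Terms with p_i == 0 contribute nothing (0 log 0 = 0 by convention).
    """
    return 10.0 * sum(pi * math.log10(pi / qi) for pi, qi in zip(p, q) if pi > 0)

two_headed = [1.0, 0.0]   # p(H), p(not H) for the trick coin
fair = [0.5, 0.5]         # q(H), q(not H) for the ordinary coin

per_flip_db = relative_entropy_db(two_headed, fair)   # evidence per observed head
threshold_db = 10.0 * math.log10(20.0)                # 20 to 1 odds in decibels
heads_needed = math.ceil(threshold_db / per_flip_db)  # smallest count at or past the threshold
```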
<h3>Births</h3>
<p>Perhaps the first hypothesis test to be resolved with modern statistics was the question of whether more male or female
babies are born. Using data from 1745 to 1770, Laplace found that in those 26 years, 251,527 boys and 241,945 girls were born.
This gives a fraction of male births of $\sim 51\%$.
Is this just a statistical fluke, or are boys more common than girls at birth? What Laplace did was to analytically
work out the Bayesian posterior distribution for the probability that a male baby was born using a uniform prior, obtaining
a $\operatorname{Beta}(251528, 241946)$ distribution, for which the probability that the probability a male is born
is less than or equal to $1/2$ is
$$ \int_0^{1/2} \mathrm dx \, \operatorname{Beta}(x; 251528, 241946) \sim 10^{-42}$$
enough for Laplace to declare that he was <em>morally certain</em> that males
are born more frequently than females.</p>
<p>Let's work out the weight of evidence in this case. Say we were comparing two hypotheses: the first
that males are born 51% of the time, and the second that they are born 50% of the time. With Laplace's data, the
total weight of evidence in this case is:</p>
<p>$$ 2515270 \log_{10} \frac{0.51}{0.50} + 2419450 \log_{10} \frac{0.49}{0.50} = 404 \text{ dB} $$
a whopping 400 decibels of evidence for males being born 51% of the time rather than 50%.<br />
At the same time, I'm not sure most people are aware that males are born more often than females, and it doesn't
seem to affect most people's lives. Why is that? Well, let's evaluate the relative entropy between
a 51% Bernoulli and a 50% Bernoulli:
$$ I = 5.1 \log_{10}\frac{0.51}{0.50} + 4.9 \log_{10} \frac{0.49}{0.50} = 8.7 \times 10^{-4} \text{ dB}. $$
Notice that the relative entropy is quite small. On average, if the true distribution was 51%, the evidence
we accumulate on each observed birth is less than a thousandth of a decibel, roughly 87 <em>microbels</em>. This means that on average in order to be reasonably
sure that the 51% hypothesis is true, we'd have to observe $\sim \frac{13}{8.7 \times 10^{-4}} \sim 15,000$ births.
This makes clear how with enough data we could both be very sure that males are born with a higher frequency
than females, but at the same time, this could have very little impact on our individual lives.</p>
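<p>Both numbers in this example, the total evidence in Laplace's data and the expected evidence per birth, come from the same log ratio. A minimal sketch, using the counts quoted above and an illustrative helper name:</p>

```python
import math

boys, girls = 251_527, 241_945   # Laplace's counts, as quoted above

def log_ratio_db(p_a, p_b):
    """Evidence in decibels that one observation of probability p_a vs p_b provides."""
    return 10.0 * math.log10(p_a / p_b)

# Total weight of evidence for the 51% hypothesis over the 50% hypothesis.
total_db = boys * log_ratio_db(0.51, 0.50) + girls * log_ratio_db(0.49, 0.50)

# Expected evidence per birth if the 51% hypothesis is true (relative entropy).
per_birth_db = 0.51 * log_ratio_db(0.51, 0.50) + 0.49 * log_ratio_db(0.49, 0.50)

# Expected number of births needed to accumulate 13 dB.
births_for_13_db = 13.0 / per_birth_db
```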
<h3>Likelihoods and Learning</h3>
<p>What we would really like to do is learn a model of some real life distribution. If the true distribution of data is $p(x)$,
and we have some kind of parametric model $q(x;\theta)$, we would like to set our model parameters $\theta$ so that
we get as close as possible to the true distribution. In other words, we want to minimize the relative entropy from
the <em>real world</em> to our <em>model</em>:
$$\min_\theta I[p;q] = \min_\theta \int \mathrm dx\, p(x) \log \frac{p(x)}{q(x;\theta)}. $$
The biggest complication is that we don't actually know what the true distribution of the data is. We can, however, sample data. Luckily for us, as far as using this as an objective for $\theta$ goes, we can treat the entropy of $p(x)$ as
a constant. This motivates the traditional maximum likelihood objective:
$$ \max_\theta \int \mathrm dx \, p(x) \log q(x;\theta), $$
where in practice the expectation over $p(x)$ is estimated by an average over sampled data.</p>
<aside id="gpt3"><sup>10</sup>
For instance, the latest <a href="https://arxiv.org/abs/2005.14165">GPT-3</a> model trained by OpenAI,
was trained on less than half of the training set. (See Table 2.2 in the paper.)
</aside>
If we had an infinite dataset, maximum likelihood would be the same as minimizing the relative entropy between the real world and
our model. Unfortunately, we don't often have infinite datasets.<sup><a href="#gpt3">10</a></sup>
On finite datasets, maximum likelihood can still be interpreted as minimizing a KL divergence, but now
the KL divergence between the <em>empirical distribution</em> $\hat p(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - x_i)$
and our model $q(x;\theta)$.
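<p>A discrete toy version makes the connection concrete. In this sketch, which assumes a small hypothetical dataset of letters and a hypothetical uniform model, the average negative log-likelihood exceeds the entropy of the empirical distribution by exactly the KL divergence from the empirical distribution to the model, so minimizing one minimizes the other.</p>

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Relative frequencies of each distinct value in the dataset."""
    counts = Counter(samples)
    return {x: c / len(samples) for x, c in counts.items()}

def mean_neg_log_likelihood(samples, q):
    """Average negative log-likelihood (in nats) of the samples under model q."""
    return -sum(math.log(q[x]) for x in samples) / len(samples)

def entropy(p):
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def kl(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

samples = list("AAAABBCD")                               # hypothetical dataset
q_model = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}   # hypothetical model

p_hat = empirical_distribution(samples)
nll = mean_neg_log_likelihood(samples, q_model)
gap = nll - entropy(p_hat)   # equals kl(p_hat, q_model)
```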
<p>Unfortunately, the cross entropy is no longer reparameterization invariant, a
point I elaborate on in an appendix below, and so it is difficult to interpret
directly. But if we take the difference of any two cross entropies, we can
still interpret that as the weight of evidence for one model with respect to
the other. Because of the lack of reparameterization independence, care must
be taken to ensure that the likelihoods of the two models are evaluated using
the same measure, but provided they are:</p>
<p>$$ L_1 - L_2 = \mathbb{E}\left[ \log q_1(x) \right] - \mathbb{E}\left[ \log q_2(x) \right] = \mathbb{E}\left[ \log \frac{q_1(x)}{q_2(x)} \right] $$</p>
<aside id="mnist"><sup>11</sup>
The entirety of which can fit in a <a href="https://twitter.com/alemi/status/1042658244609499137">tweet</a>.
</aside>
Given the size of the test sets we have for modern image datasets, this means that very small changes in likelihood can be
interpreted as strong evidence for the superiority of one model over another. Take for instance something as simple as binary static MNIST.<sup><a href="#mnist">11</a></sup> Here, with 10,000 test set images, a difference in average per-image log-likelihood of 0.0013 dB, or 0.0003 nats, corresponds to 13 dB of evidence for the one model over the second.
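<p>The scaling with test set size can be sketched as below, under the assumption that the test examples are independent so that per-example log-likelihood differences simply add. The function names are illustrative.</p>

```python
import math

DB_PER_NAT = 10.0 / math.log(10.0)

def total_evidence_db(mean_gap_nats, n_examples):
    """Total weight of evidence, in decibels, from an average per-example
    log-likelihood gap (in nats) accumulated over n_examples test examples."""
    return mean_gap_nats * n_examples * DB_PER_NAT

def gap_needed_nats(target_db, n_examples):
    """Average per-example gap (in nats) needed to reach target_db in total."""
    return target_db / DB_PER_NAT / n_examples

# Per-image gap needed for 13 dB on a 10,000 image test set.
gap_nats = gap_needed_nats(13.0, 10_000)
gap_db = gap_nats * DB_PER_NAT
```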
<h2>Appendix A: Whither Continuous Entropy</h2>
<p>The relative entropy really is the proper way to define entropy. For all
of the things that Shannon got right, he flubbed a bit when he defined the
entropy of a distribution as:
$$ H(P) = -\sum_i p_i \log p_i $$</p>
<p>Why do I say he flubbed? Because this notion of entropy doesn't generalize
to continuous distributions. The continuous analog:
$$ H(P) = -\int \mathrm dx\, p(x) \log p(x) $$
isn't <em>reparameterization independent</em>. Consider for instance the distribution
of adult human heights: <sup><a href="#bimodal">12</a></sup></p>
<figure>
<center>
<img src="figures/adult_heights.svg"
alt="Distribution of adult heights.">
<figcaption>Figure 1. Distribution of adult heights. <sup><a href="#ourworld">13</a></sup></figcaption>
</center>
</figure>
<aside> <sup id="bimodal">12</sup>
Note that you may have heard that
<a href="https://www.johndcook.com/blog/2008/07/20/why-heights-are-normally-distributed/">heights are normally distributed</a>.
Adult male (or female) heights are normally distributed, but differ in their means and variances, making the
<a href="https://www.johndcook.com/blog/2008/11/25/distribution-of-adult-heights/">distribution of adult heights a mixture distribution</a>.
</aside>
<aside> <sup id="ourworld">13</sup>
Data taken from
<a href="https://ourworldindata.org/human-height">ourworldindata.org</a>.
</aside>
<p>If you compute the continuous entropy of this distribution with heights measured
in centimeters you get 5.4 bits. If you instead measure
the same distribution in feet you get 0.43 bits. If you instead
were to measure heights in meters it would be -1.3 bits! <sup><a href="#negative">14</a></sup></p>
<aside> <sup id="negative">14</sup>
It seems strange to have a negative entropy, but in this case, it is basically
reflecting the fact that in terms of meters, the human height distribution doesn't
span a whole meter in breadth, so it actually takes fewer <em>relative</em> bits
to specify a human height in meters than it would take to specify any
quantity in meters, because its uncertainty is less than a whole meter.
</aside>
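<p>To see the unit dependence in isolation, here is a sketch that swaps the real height data for a single Gaussian with a hypothetical 10 cm standard deviation. That is an assumption for illustration only, not a fit to the figure. The continuous entropy shifts by the log of the unit conversion factor, while the relative entropy between two such Gaussians does not change at all.</p>

```python
import math

def gaussian_entropy_bits(sigma):
    """Differential entropy of a 1D Gaussian in bits: 0.5 * log2(2 pi e sigma^2)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)

def gaussian_kl_bits(mu_p, sigma_p, mu_q, sigma_q):
    """Relative entropy I[p;q] between two 1D Gaussians, in bits."""
    nats = (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)
    return nats / math.log(2.0)

CM_PER_UNIT = {"cm": 1.0, "ft": 30.48, "m": 100.0}

# Hypothetical parameters in centimeters, chosen only for illustration.
entropies = {unit: gaussian_entropy_bits(10.0 / scale)
             for unit, scale in CM_PER_UNIT.items()}
kls = {unit: gaussian_kl_bits(170.0 / scale, 10.0 / scale, 165.0 / scale, 8.0 / scale)
       for unit, scale in CM_PER_UNIT.items()}
```

<p>With these assumed parameters the entropy in meters comes out negative, while every entry of <code>kls</code> agrees.</p>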
<h2>Appendix B: Coding Interpretation</h2>
<p>The traditional interpretation offered for the KL is from the coding
perspective.
Imagine we have a simple 4-letter
alphabet that we want to communicate over the wire.
If the four letters occurred with different probabilities:
$p(A)=1/2, p(B)=1/4, p(C)=p(D)=1/8$, with an optimally designed <a
href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman Code</a> we could
encode our letters with a variable length code: $A:0, B:10, C:110, D:111$, and
on average we'd only be spending $1/2 + 2/4 + 3/8 + 3/8 = 7/4$ bits per letter.</p>
<figure>
<center>
<table>
<thead><th><th>A<th>B<th>C<th>D
<tr><td>$p$<td>1/2<td>1/4<td>1/8<td>1/8
<tr><td>p-code<td>0<td>10<td>110<td>111
<tr><td>$q$<td>1/4<td>1/4<td>1/4<td>1/4
<tr><td>q-code<td>00<td>01<td>10<td>11
</table>
<figcaption>
Table 2: A simple example of two different distributions over a 4 letter alphabet.
</figcaption>
</center>
</figure>
<p>Imagine however we didn't know what the true distribution of letters was and instead
designed an optimal code using a different distribution $q$. If we believed
each of the 4 letters were equally likely $(q(A)=q(B)=q(C)=q(D)=1/4)$, the optimal way to
encode messages would just assign a two bit code to each letter $(A : 00, B:01, C:10,
D:11)$. If we used this suboptimal code to send messages that were actually distributed
as $p$ it would cost $2/2 + 2/4 + 2/8 + 2/8 = 2$ bits per letter. Our incorrect
belief leads to an inefficiency of $2 - 7/4 = 1/4$ of a bit per letter. For these two distributions,
it shouldn't come as a surprise that the information gain is precisely 1/4 bits:
$$ I[p;q] = \sum_i p_i \log_2 \frac{p_i}{q_i} = 1/4 \textrm{ bits}. $$</p>
<p>For an optimally designed code, the code lengths go as $-\log p(x)$ for any symbol $x$.
Our information gain can be interpreted as a difference in expected code lengths under $p$:
$$ I[p;q] = \mathbb{E}_p[ -\log q ] - \mathbb{E}_p[-\log p ]. $$
The information gain $I[p;q]$ measures the <em>excess encoding cost</em> for trying to encode messages
from $p$ using a code designed for $q$.</p>
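<p>The numbers in Table 2 can be checked directly. This sketch uses exact fractions for the code lengths and compares the excess cost of the $q$-code with the information gain; the variable names are illustrative.</p>

```python
import math
from fractions import Fraction

p = {"A": Fraction(1, 2), "B": Fraction(1, 4), "C": Fraction(1, 8), "D": Fraction(1, 8)}
q = {letter: Fraction(1, 4) for letter in p}
p_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
q_code = {"A": "00", "B": "01", "C": "10", "D": "11"}

def expected_length(dist, code):
    """Average number of bits per letter when letters follow dist."""
    return sum(dist[s] * len(code[s]) for s in dist)

def kl_bits(p, q):
    """Information gain I[p;q] in bits."""
    return sum(float(p[s]) * math.log2(float(p[s] / q[s])) for s in p if p[s] > 0)

# Extra bits per letter paid for using the code designed for q on data from p.
excess_bits = expected_length(p, q_code) - expected_length(p, p_code)
```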
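<script type='text/javascript'>
// Each spinner is a circle of radius RADIUS drawn with a thick dashed stroke;
// the dash pattern repeats SEGMENTS times around the circumference.
const SEGMENTS = 5;
const RADIUS = 5;
const CIRCUMFERENCE = 2 * Math.PI * RADIUS;
// Set spinner `progress-i` so that its filled fraction matches the
// probability corresponding to `db` decibels of evidence.
function fraction(i, db) {
const progress = document.getElementById('progress-' + i);
const odds = Math.pow(10.0, db / 10.0); // decibels to odds
const p = odds / (1 + odds); // odds to probability
const fill = CIRCUMFERENCE / SEGMENTS * p;
const space = CIRCUMFERENCE / SEGMENTS * (1 - p);
progress.style.strokeDasharray = fill + " " + space;
}
// Initialize the small spinners in the table, one per integer decibel value.
for (let i = 0; i <= 13; i++) {
fraction(i, i);
}
// Keep the slider, the number box, and the large spinner in sync.
function updateSlider() {
const value = document.getElementById("slider").value;
fraction(100, value);
document.getElementById("percent").value = value;
}
function updatePercent() {
const value = document.getElementById("percent").value;
document.getElementById("slider").value = value;
fraction(100, value);
}
</script>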
postsFri, 07 Aug 2020 00:00:00 -0400A Path to the Variational Diffusion LossDeriving the (Variational) Diffusion and VAE losses from the non-negativity of KL.<p>Diffusion models have made quite a splash, especially after the open-source release of <a href="https://huggingface.co/spaces/stabilityai/stable-diffusion">Stable Diffusion</a>. What are diffusion models, where does the loss come from and what does a simple example look like? I've recently helped open-source a simple, pedagogical, self-contained
<a href="https://colab.research.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb">example colab</a>
of a diffusion model trained on EMNIST, which you can find as part of the <a href="https://arxiv.org/abs/2107.00630">Variational Diffusion Models (VDM)</a> <a href="https://github.com/google-research/vdm">github page</a>. In this post, I wanted to give some more background and a simple way to motivate where the loss function comes from.</p>
<h2>Non-negativity of KL</h2>
<aside><sup id="#p-and-q">1</sup>
I tend to reverse the use of $p$ and $q$ with respect to the rest of the world. Most people use $p$ for the generative model and $q$ for the approximate posterior. They do this because, for most people, the generative model is the star of the show and the approximate posterior is playing second fiddle. My reversal of the letters is deliberate. To me, the <i>forward process</i> $p(x,z)=p(x)p(z|x)$ composed of the <i>true image distribution</i> $p(x)$ and the <i>encoder</i> $p(z|x)$ is the star of the show. $p$ is the joint distribution that exists in the real world, $q$ is our approximation to it.
</aside>
<p>Let's say we want to build a latent-variable model $q(x, z)$ that assigns high marginal likelihood, $\log q(x)$, to data drawn from the true distribution $p(x)$. Unfortunately, computing $\log q(x)$ involves an intractable integral over the latent variable, $z$.<sup><a href="#p-and-q">1</a></sup></p>
<aside><sup id="#kl">2</sup>
<a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler Divergence</a>, for more background on KL see my <a href="kl.html">other post</a>.
</aside>
<aside><sup id="#brakets">3</sup>
I use <a href="https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation">brakets</a> to show expectations and unless noted, always with respect to the full $p$ distribution.
$$ \left\langle \cdot \right\rangle_p = \mathbb{E}_p \left[ \cdot \right] = \int dx\, p(x) [\cdot] $$
<p>If I don't denote the distribution the expectation is with respect to on the brakets, it's always the full joint $p(x,\cdots)$. Notice that this works even if there
are fewer variables or conditioning variables left inside the terms in the brakets, as any excess variables will just marginalize out without issue in the expectation and any variables being conditioned on will be evaluated in expectation as desired.</p>
</aside>
<p>We can derive the tractable objective used to train these models using the observation that the KL<sup><a href="#kl">2</a></sup> divergence is non-negative and monotonic. The <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler (KL) divergence</a> between any two distributions is non-negative:<sup><a href="#brakets">3</a></sup>
$$ \left\langle \log \frac{p(x)}{q(x)} \right\rangle_p \geq 0. $$</p>
<p>If we marginalize out some subset of random variables, the KL divergence of the marginal distributions can be no larger than that of the joints. For any two random variables:
$$ \begin{align}
\left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle &= \left\langle \log \frac{p(x)p(z|x)}{q(x)q(z|x)} \right\rangle \\
&= \left\langle \log \frac{p(x)}{q(x)} \right\rangle + \left\langle \log \frac{p(z|x)}{q(z|x)} \right\rangle \\
&\geq \left\langle \log \frac{p(x)}{q(x)}\right\rangle \geq 0
\end{align} $$
Intuitively, if we think about KL divergence as a "distance" between probability distributions, two joint distributions always have to be at least as far apart as their marginals. As we just saw, the KL of the joint is the sum of the KL between the marginals and the expected KL between the conditional distributions (which has to be non-negative, as all KLs are).</p>
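<p>The decomposition is easy to check numerically on a small discrete example. The two joint distributions below are hypothetical tables made up for illustration; the sketch verifies that the joint KL splits into the marginal KL plus the expected conditional KL.</p>

```python
import math

# Hypothetical joints over x in {0, 1} and z in {0, 1}, indexed as joint[x][z].
p_joint = [[0.30, 0.20], [0.10, 0.40]]
q_joint = [[0.40, 0.20], [0.20, 0.20]]

def marginal_x(joint):
    return [sum(row) for row in joint]

def kl_joint(p, q):
    """<log p(x,z)/q(x,z)> under p."""
    return sum(p[x][z] * math.log(p[x][z] / q[x][z])
               for x in range(len(p)) for z in range(len(p[x])) if p[x][z] > 0)

def kl_marginal(p, q):
    """<log p(x)/q(x)> under p."""
    return sum(a * math.log(a / b)
               for a, b in zip(marginal_x(p), marginal_x(q)) if a > 0)

def expected_kl_conditional(p, q):
    """<log p(z|x)/q(z|x)> under the joint p."""
    px, qx = marginal_x(p), marginal_x(q)
    return sum(p[x][z] * math.log((p[x][z] / px[x]) / (q[x][z] / qx[x]))
               for x in range(len(p)) for z in range(len(p[x])) if p[x][z] > 0)
```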
<h2>VAEs</h2>
<p>Imagine designing these joint distributions to have different flavors. Think of $p(x,z)$ as a <em>forward</em> process $p(x) p(z|x)$ that takes an image from some natural image distribution $p(x)$ and then encodes it into some representation $z$ with an encoder $p(z|x)$. This is a joint distribution over the two variables. Running the forward process would give us $(x,z)$ pairs, pairs of natural images and their encodings.
Next, imagine a different joint distribution, a <em>reverse process</em> $q(x,z)$ that takes some sample from a <em>prior</em> $q(z)$ and then runs it through a <em>decoder</em> $q(x|z)$ to generate a synthetic image. This is a generative model of the kind we might be used to building. This is also a fully-fledged joint distribution that we could sample from, in order to generate $(x,z)$ pairs. At initialization, these two distributions are very different. The goal of generative modeling is to bring these two joint distributions into alignment.</p>
<p>Based on the properties of the KL divergence, these two joint distributions must have a non-negative KL divergence that is monotonic under marginalizing out one of the variables:
$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log \frac{p(x) p(z|x)}{q(z) q(x|z)} \right\rangle \geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle \geq 0 $$
Notice what this is saying. The KL divergence between the joint distributions here is the expected log density ratio of the forward to the reverse model's likelihood, where the expectation -- the samples -- are taken with respect to the <em>forward</em> process $p(x,z)$. This joint KL is itself an upper bound for the KL divergence between the marginal distributions $p(x)$ and $q(x)$. $p(x)$ was our original image distribution, while $q(x)$ is the distribution of synthetic images drawn from the generative model that is our reverse process:
$$ q(x) = \int dz\, q(x|z) q(z) $$</p>
<p>So, by minimizing the KL between our forward and reverse process -- by aligning the two joint distributions -- we can ensure that we make progress towards learning a good generative model of our images $q(x)$. We can ensure that we are aligning the marginals $q(x)$ and $p(x)$.</p>
<p>The tightness of this bound is controlled by how close together the remaining conditional distributions are:</p>
<p>$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log \frac{p(x)}{q(x)} \right\rangle + \left\langle \log \frac{p(z|x)}{q(z|x)} \right\rangle $$
In other words: the degree to which our encoding distribution ($p(z|x)$) matches the Bayesian posterior of our generative model ($q(z|x)$) determines the tightness of our bound.</p>
<p>So, again, all we started with is the idea of two different processes, the <em>forward</em> process that takes images and encodes them and a <em>reverse</em> process that samples some latents from a known distribution and decodes them. If we try to minimize the KL divergence between these two processes, forward to reverse, we can ensure that this is a valid bound on the marginal KL between the true image distribution $p(x)$ and the marginal of our generative model $q(x)$. That is, by learning to make the two joint processes look alike we are also as a consequence learning a good generative model of images.</p>
<aside><sup id="#ELBO">4</sup>
<a href="https://en.wikipedia.org/wiki/Evidence_lower_bound">Evidence Lower BOund</a>
</aside>
We've just derived the ordinary ELBO:<sup><a href="#ELBO">4</a></sup>
$$ \left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log p(x) -\log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle, $$
up to a constant outside our control, the entropy of the true image distribution $p(x)$. Notice that this term cancels out on both sides if we wish to target
the cross-entropy from our true $p(x)$ to our model's $q(x)$ rather than the KL.
<p>$$\begin{align}
\left\langle \log \frac{p(x,z)}{q(x,z)} \right\rangle = \left\langle \log p(x) - \log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle &\geq \left\langle \log \frac{p(x)}{q(x)} \right\rangle \\
\left\langle -\log q(x|z) + \log \frac{p(z|x)}{q(z)} \right\rangle &\geq \left\langle -\log q(x) \right\rangle \\
\left\langle \log q(x) \right\rangle &\geq \left\langle \log q(x|z) - \log \frac{p(z|x)}{q(z)} \right\rangle
\end{align}$$</p>
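<p>As a sanity check on the final inequality, here is a sketch with a tiny discrete forward process $p(x)p(z|x)$ and reverse process $q(z)q(x|z)$. All of the probability tables are hypothetical. The bound holds, and the slack is exactly the expected KL between the encoder $p(z|x)$ and the model's posterior $q(z|x)$.</p>

```python
import math

p_x = [0.7, 0.3]                          # data distribution p(x)
p_z_given_x = [[0.9, 0.1], [0.2, 0.8]]    # encoder p(z|x), indexed [x][z]
q_z = [0.5, 0.5]                          # prior q(z)
q_x_given_z = [[0.8, 0.2], [0.3, 0.7]]    # decoder q(x|z), indexed [z][x]

def model_marginal(x):
    """q(x) = sum_z q(x|z) q(z)."""
    return sum(q_x_given_z[z][x] * q_z[z] for z in range(len(q_z)))

def expected_log_marginal():
    """<log q(x)> under p(x)."""
    return sum(p_x[x] * math.log(model_marginal(x)) for x in range(len(p_x)))

def elbo():
    """<log q(x|z) - log p(z|x)/q(z)> under the forward process."""
    total = 0.0
    for x in range(len(p_x)):
        for z in range(len(q_z)):
            weight = p_x[x] * p_z_given_x[x][z]
            total += weight * (math.log(q_x_given_z[z][x])
                               - math.log(p_z_given_x[x][z] / q_z[z]))
    return total

def posterior_gap():
    """<log p(z|x)/q(z|x)> under the forward process: the slack in the bound."""
    total = 0.0
    for x in range(len(p_x)):
        for z in range(len(q_z)):
            q_posterior = q_x_given_z[z][x] * q_z[z] / model_marginal(x)
            total += p_x[x] * p_z_given_x[x][z] * math.log(p_z_given_x[x][z] / q_posterior)
    return total
```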
<p>At the end of the day, the hope and the dream we seem to have in doing latent variable modeling is that maybe we will somehow be more successful in learning a reverse process $q(z)q(x|z)$ to match some forward process $p(x)p(z|x)$ than we would have been in modeling the density $q(x)$ directly. We are hoping that by expanding the problem, making it a harder or larger modeling task, it'll become easier for us to optimize or learn.</p>
<h2>Diffusion</h2>
<p>For diffusion models, honestly, there isn't much to add except many more steps.
Instead of a two-step forward process, in diffusion we imagine a many-stepped (or potentially continuous) forward and reverse process.</p>
<p>In particular, in most diffusion models we fix the forward process to be a Markov chain:
$$ p(x, z_0, z_1, z_2, \cdots, z_{T-1}, z_T) = p(x) p(z_0|x) p(z_1|z_0) \cdots p(z_T|z_{T-1}), $$
which starts with a sample from a natural image distribution $p(x)$ and then repeatedly applies additive Gaussian noise, $p(z_t| z_{t-1}) = \mathcal N(\alpha_{t} z_{t-1}, \sigma_{t}^2 I)$.</p>
<figure id="#diffusion-forward">
<img src="figures/diffusion-forward.svg"
alt="Graphical model showing the forward process for diffusion.">
<figcaption>
Figure 1. The graphical model for the forward process in diffusion.
</figcaption>
</figure>
<aside><sup id="#variance-preserving">5</sup>
In a lot of the diffusion work, the process is taken to be <em>variance preserving</em> by setting:
$$ \alpha^2 = 1 - \sigma^2 $$
</aside>
<p>This takes an ordinary image and then adds more and more noise to it until it looks more or less indistinguishable from just isotropic Gaussian noise.<sup><a href="#variance-preserving">5</a></sup></p>
<figure id="#forward-diffusion">
<center>
<img src="figures/forward-diffusion.png"
alt="Illustration of standard forward diffusion process as additive Gaussian noise.">
<figcaption>
Figure 2. A demonstration of the typical forward process in diffusion models.
</figcaption>
</center>
</figure>
<p>One particularly nice thing about using Gaussians for every step of the forward process here is that the composition of a bunch of conditional Gaussians is itself Gaussian so we will have a closed form for the marginal distribution at any intermediate time:
$$ p(z_t|x) = \mathcal N(\tilde \alpha_t x, \tilde \sigma_t^2 I ).$$</p>
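<p>The composed coefficients follow a simple recursion: the scale multiplies, $\tilde\alpha_t = \alpha_t \tilde\alpha_{t-1}$, and the variance is rescaled and then incremented, $\tilde\sigma_t^2 = \alpha_t^2 \tilde\sigma_{t-1}^2 + \sigma_t^2$. A minimal sketch for scalar noise levels, assuming a made-up variance-preserving schedule (the specific numbers are hypothetical):</p>

```python
import math

def marginal_coefficients(alphas, sigmas):
    """Compose per-step Gaussian noising steps into the marginals p(z_t | x).

    Returns (alpha_tilde, sigma_tilde) such that z_t given x is Gaussian with
    mean alpha_tilde[t] * x and standard deviation sigma_tilde[t].
    """
    alpha_tilde, sigma_tilde = [], []
    scale, variance = 1.0, 0.0
    for alpha, sigma in zip(alphas, sigmas):
        scale = alpha * scale
        variance = alpha ** 2 * variance + sigma ** 2
        alpha_tilde.append(scale)
        sigma_tilde.append(math.sqrt(variance))
    return alpha_tilde, sigma_tilde

# Hypothetical schedule with alpha_t^2 = 1 - sigma_t^2 at every step.
sigmas = [0.05 + 0.01 * t for t in range(50)]
alphas = [math.sqrt(1.0 - s ** 2) for s in sigmas]
alpha_tilde, sigma_tilde = marginal_coefficients(alphas, sigmas)
```

<p>For a variance-preserving schedule like this one, the composed coefficients stay variance preserving: $\tilde\alpha_t^2 + \tilde\sigma_t^2 = 1$ at every step.</p>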
<p>With a forward process defined, we parameterize or learn the reverse process, a Markov chain that operates in the opposite direction:
$$ q(x,z_0,z_1,\cdots,z_T) = q(z_T) q(z_{T-1}|z_T) \cdots q(z_1|z_2)q(z_0|z_1)q(x|z_0) $$</p>
<figure id="#diffusion-reverse">
<img src="figures/diffusion-backward.svg"
alt="Graphical model showing the reverse process for diffusion.">
<figcaption>
Figure 3. The graphical model for the reverse process in diffusion.
</figcaption>
</figure>
<aside><sup id="#extra-entropy">6</sup>
Aside, again, from the constant entropy of the data outside our control which we can ignore for purposes of optimization.
</aside>
<p>The VDM loss is<sup><a href="#extra-entropy">6</a></sup> simply the KL between these two joints, which serves as an upper bound on the KL of the image marginals:
$$ \left\langle \log \frac{p(x,z_0,z_1,\cdots,z_T)}{q(x,z_0,z_1,\cdots,z_T)} \right\rangle \geq \left\langle \log \frac{p(x)}{q(x)}\right\rangle $$</p>
<aside><sup id="#deep-unsupervised">7</sup>
See <a href="https://arxiv.org/abs/1503.03585">Deep Unsupervised Learning Using Nonequilibrium Thermodynamics</a> by Sohl-Dickstein et al.
</aside>
<p>Just as in the case of a VAE, here, the hope is that it might actually be easier to model the larger joint distribution than it was to try to model the density directly. In the case of simple diffusion models, the forward process is fixed additive Gaussian noise. If we make enough steps in the forward process we believe we ought to be able to learn the reverse process exactly.<sup><a href="#deep-unsupervised">7</a></sup></p>
<h3>Various Sundry Tricks</h3>
<p>The joint KL is equivalent to the VDM loss. However, in practice, to make this loss efficient to train, diffusion models leverage a lot of the known structure
of the forward process to power a very clever parameterization of the reverse process. This requires some tricky rearranging of terms and some stochastic approximation to make the whole thing efficient.<br />
To see the code, please check out the <a href="https://colab.research.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb">example colab</a> as well as its accompanying text, which walks through some of these steps in more detail.</p>
<p>To utilize our knowledge of the forward process, we're actually going to rewrite the forward process not as a sequence of conditional Gaussian steps (a <em>bottom-up</em> forward process):
$$ p(x,z_0,z_1,z_2,\cdots,z_T) = p(x) p(z_0|x) p(z_1|z_0) p(z_2|z_1) \cdots p(z_T|z_{T-1}) $$
but instead we'll rearrange this to be a product of a bunch of conditional reverse steps (as a <em>top-down</em> forward process):
$$
\begin{align}
p(x, z_0, z_1, z_2,\cdots, z_T) &= p(z_0,z_1,z_2,\cdots, z_T|x) p(x) \\
&= p(z_0|z_1,\cdots,z_T,x)p(z_1|z_2,\cdots,z_T,x)\cdots p(z_T|x)p(x) \\
&= p(z_0|z_1,x)p(z_1|z_2,x)\cdots p(z_{T-1}|z_{T},x)p(z_T|x)p(x)
\end{align}$$
For the Gaussian diffusion, we can analytically work out what these conditional reverse steps of the forward process, $p(z_{t-1}|z_t,x)$, should be. These distributions give the probability of the less noisy image from the previous step, $z_{t-1}$, given that we observe both the current noisy image $z_t$ and the original image $x$.</p>
<figure id="#diffusion-forward-reverse">
<img src="figures/diffusion-forward-backward.svg"
alt="Graphical model showing the top-down generative process for diffusion.">
<figcaption>
Figure 4. The graphical model for the top-down forward process in diffusion.
</figcaption>
</figure>
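<p>For scalar Gaussians the conditional reverse step $p(z_{t-1}|z_t,x)$ is a standard Gaussian posterior: the marginal $p(z_{t-1}|x)$ plays the role of the prior and the forward step $p(z_t|z_{t-1})$ plays the role of the likelihood. The sketch below computes it two ways, by combining precisions and by conditioning the joint Gaussian, as a check. The function names, argument names, and numbers are hypothetical, and this is not the parameterization used in the colab.</p>

```python
import math

def reverse_step_posterior(z_t, x, alpha_t, sigma_t, alpha_tilde_prev, sigma_tilde_prev):
    """Mean and variance of p(z_{t-1} | z_t, x) by combining precisions."""
    prior_mean = alpha_tilde_prev * x
    prior_precision = 1.0 / sigma_tilde_prev ** 2
    likelihood_precision = alpha_t ** 2 / sigma_t ** 2
    variance = 1.0 / (prior_precision + likelihood_precision)
    mean = variance * (prior_precision * prior_mean + alpha_t * z_t / sigma_t ** 2)
    return mean, variance

def reverse_step_by_conditioning(z_t, x, alpha_t, sigma_t, alpha_tilde_prev, sigma_tilde_prev):
    """Same quantity, from the joint Gaussian of (z_{t-1}, z_t) given x."""
    mean_prev = alpha_tilde_prev * x
    var_prev = sigma_tilde_prev ** 2
    mean_t = alpha_t * mean_prev
    var_t = alpha_t ** 2 * var_prev + sigma_t ** 2
    covariance = alpha_t * var_prev
    mean = mean_prev + covariance / var_t * (z_t - mean_t)
    variance = var_prev - covariance ** 2 / var_t
    return mean, variance

# Hypothetical values for a single step.
example = dict(z_t=0.3, x=1.0, alpha_t=0.95, sigma_t=math.sqrt(1.0 - 0.95 ** 2),
               alpha_tilde_prev=0.8, sigma_tilde_prev=0.6)
mean_a, var_a = reverse_step_posterior(**example)
mean_b, var_b = reverse_step_by_conditioning(**example)
```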
<p>We'll then parameterize our reverse process $q(z_{t-1}|z_t)$ to have this same <em>functional form</em>:
$$ q(z_{t-1}|z_t) \leftarrow p(z_{t-1}|z_t, \hat x(z_t, t)). $$
We'll model the reverse process as if it were the exact reversed conditional forward process, but of course, in the true reverse process we don't get to observe the original image. Still, we'll use the same functional form and spend our modeling budget on trying to impute the original clean image $\hat x$ after observing the noisy image $z_t$ and the step $t$ we are on.</p>
<aside><sup id="#epshat">8</sup>
The two are affinely related:
$$ \hat x_t = (z_t - \sigma_t \hat \epsilon_t) / \alpha_t $$
</aside>
<p>The actual parametric model in a diffusion model is this bit, $\hat x(z_t, t)$. It is a neural network that takes as input the noisy image $z_t$ and the step we are on in the diffusion process $t$ and has the job of trying to predict what the corresponding clean image was that generated the noisy image. In most diffusion models this is implemented as a <a href="https://en.wikipedia.org/wiki/U-Net">U-Net</a> style architecture. In practice, it's been found that if instead of predicting the clean image $\hat x$, you predict the noise $\hat \epsilon$ from the noisy image, you get better-looking samples.<sup><a href="#epshat">8</a></sup> The full reverse generative model then consists of many steps of looking at a noisy image and trying to infer the clean one; rinse and repeat.</p>
<p>With these choices in place, we can now look at the full joint KL and organize terms.</p>
<p>$$ \left\langle \log p(x) - \log q(x|z_0) + \log \frac{p(z_T|x)}{q(z_T)} + \sum_{i=0}^{T-1} \log \frac{p(z_i|z_{i+1},x)}{q(z_i|z_{i+1})} \right\rangle_p $$</p>
<p>The last trick we're going to use is that we're going to avoid computing all of the terms in our sum by simply not computing all of the terms in our sum. We'll approximate the sum with Monte Carlo: we'll simply randomly choose one of the terms and upweight it appropriately.
At that point, we have the loss function used to train VDM models. A very nice thing about the VDM loss is that it is clear that we are optimizing a bound on the marginal likelihood of our generative model. As you can learn in the <a href="https://arxiv.org/abs/2107.00630">VDM Paper</a>, many of the diffusion models you've heard about correspond to a <em>weighted</em> form of this same objective, where different terms in the sum get different weights.</p>
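<p>The Monte Carlo trick for the sum can be sketched in isolation. Assuming the index is drawn uniformly, picking one term and multiplying it by the number of terms gives an unbiased estimate of the full sum. The list of per-step terms below is a made-up stand-in for the per-step KL contributions.</p>

```python
import random

def single_term_estimate(terms, rng):
    """One-sample estimate of sum(terms): choose an index uniformly at random
    and upweight the chosen term by the number of terms."""
    index = rng.randrange(len(terms))
    return len(terms) * terms[index]

def exact_expectation(terms):
    """Expected value of single_term_estimate under the uniform index choice."""
    return sum(len(terms) * term for term in terms) / len(terms)

rng = random.Random(0)
terms = [1.0 + 0.01 * t for t in range(100)]   # hypothetical per-step loss terms
full_sum = sum(terms)

# Averaging many one-term estimates approaches the full sum.
estimates = [single_term_estimate(terms, rng) for _ in range(20000)]
mc_mean = sum(estimates) / len(estimates)
```

<p>In training, each gradient step would use one such single-term estimate rather than an average, relying on the optimizer to average over steps.</p>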
<p>After going through all of the fancy math, the analytic KL divergences involved in the diffusion loss simplify quite nicely:
$$ \left\langle \log p(x) - \log q(x|z_0) + \log \frac{p(z_T|x)}{q(z_T)} + \frac 1 2 \sum_{t=0}^{T-1} \beta_t \left\lVert \epsilon - \hat \epsilon(z_t,t) \right\rVert^2 \right\rangle $$
For variational diffusion the weight terms $\beta_t$ depend on your choice of <em>noise schedule</em>. For most other diffusion models in the wild, these $\beta_t$ weights are conventionally set to 1.</p>
<h2>Closing Thoughts</h2>
<p>So, why are diffusion models so interesting? Well, first and foremost, the reason they are drawing so much attention is that they have shown tremendous performance. It feels like for the first time we have models that are able to generate very high resolution, very high fidelity natural images. Projects like <a href="https://openai.com/dall-e-2/">DALL-E2</a>, <a href="https://imagen.research.google/">Imagen</a>, and <a href="https://github.com/CompVis/stable-diffusion#stable-diffusion-v1">Stable Diffusion</a> show really impressive results. What is the magic driving these models?</p>
<p>At a high level, I think we can say that diffusion models start to realize the dream of latent variable models. Sometimes, when you are faced with a problem that is too difficult, you can crack it if you consider an even harder, related problem. As I tried to demonstrate here, even for simple latent variable models like VAEs and especially for diffusion models, one reason we can point to for their success is that instead of directly modeling the distribution over images, they model a much larger joint distribution. That larger joint distribution is strictly speaking a bigger thing to attempt to model, but here we get to design the forward process in such a way that even if there are many pieces to the forward process, those pieces individually are easier to tackle.</p>
<p>However, if that were the case, shouldn't we have expected deep hierarchical models to perform similarly awesomely? Probably, though here I think there is another real trick that diffusion has up its sleeve. For a general deep hierarchical generative model, even if by splitting the problem up into smaller pieces you might have split it up into easier-to-model tasks, to evaluate the joint KL you still need to evaluate all of those terms. That is, as your model becomes richer and more computationally expressive because of its depth, so does the cost of training your model, as you have to evaluate all of the layers at each step in the training process.</p>
<p>Diffusion models avoid this by structuring their forward process in such a way that all of the steps share a great deal of structural similarity. This allows diffusion to approximate a sum of a potentially large number of steps by a single randomly chosen step. If each step looks more or less the same, you can get a good estimate for the whole sum by looking at an individual, random, term.</p>
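<p>As a rough illustration of that last point, here is a minimal sketch of estimating a long sum from single randomly chosen terms. The per-step values in <code>terms</code> are made up for the example (in a real model each one would require a network evaluation), and the number of steps and draws are arbitrary assumptions.</p>

```python
import math
import random

# Hypothetical per-step loss terms; they vary only mildly from step to step.
T = 1000
terms = [1.0 + 0.1 * math.sin(t / 50.0) for t in range(T)]
exact_sum = sum(terms)

# Unbiased estimate of the sum: pick one step uniformly at random, scale by T.
rng = random.Random(0)
num_draws = 20000
estimate = sum(T * terms[rng.randrange(T)] for _ in range(num_draws)) / num_draws

relative_error = abs(estimate - exact_sum) / exact_sum
print(exact_sum, estimate, relative_error)
```

<p>Because the terms are so similar to one another, each single-term estimate is already close to the full sum, which is why averaging over training steps works well.</p>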
<p>The last trick up its sleeve is, even if you managed to design a deep hierarchical generative model with this structural homogeneity property, if you wanted to get to some intermediate position in the hierarchy you'd still have to run roughly half of the full forward process. That would still be expensive in general. Here, diffusion avoids that entirely.<br />
As boring as a sequence of conditional Gaussians is as a forward process, it is also beautiful: it enables exact analytic marginalization to intermediate steps. You can very quickly mimic the result of adding hundreds of steps of additive Gaussian noise by simply adding a moderate amount of Gaussian noise in a single shot.</p>
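<p>To make the one-shot claim concrete, here is a small sketch that assumes a variance-preserving step of the form $z_t = \sqrt{1-b}\, z_{t-1} + \sqrt{b}\, \epsilon$ with a constant per-step noise $b$. Both the form of the step and the value of <code>noise_per_step</code> are assumptions for illustration, not necessarily the schedule used in the post. It tracks the mean scaling and noise variance step by step and compares them to the closed form.</p>

```python
import math

noise_per_step = 0.01   # assumed variance injected at each small step
num_steps = 300

# Track z_t = scale * x + sqrt(variance) * (unit Gaussian noise), one step at a time.
scale, variance = 1.0, 0.0
for _ in range(num_steps):
    scale = math.sqrt(1.0 - noise_per_step) * scale
    variance = (1.0 - noise_per_step) * variance + noise_per_step

# Closed form for jumping to the same step in a single shot.
alpha_bar = (1.0 - noise_per_step) ** num_steps
one_shot_scale = math.sqrt(alpha_bar)
one_shot_variance = 1.0 - alpha_bar

print(scale, one_shot_scale)
print(variance, one_shot_variance)
```

<p>Since a composition of Gaussian steps is again Gaussian, matching the scale and variance is enough to show that the single-shot sample has the same distribution as the step-by-step one.</p>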
<p>So, ultimately, what do I think is one of the main reasons diffusion models do so well? I think it's because they <em>can</em> do so well! I think it's because they are very powerful, expressive, generative models. Sampling from them is generally rather expensive. Drawing a sample means running the full reverse process, which might mean calling the central score net a thousand or so times. That is a very powerful and very expressive generative model, but magically, we can train that generative model's likelihood without ever having to actually instantiate the full generative process at training time due to our set of sundry tricks.</p>
<p>I'm excited to see where this all goes and hope this post and the <a href="https://colab.sandbox.google.com/github/google-research/vdm/blob/main/colab/SimpleDiffusionColab.ipynb">colab</a> help to introduce these magical models to a wider audience.</p>
<p><small>Special thanks to <a href="https://twitter.com/poolio">Ben Poole</a>, <a href="https://twitter.com/pavel_izmailov">Pavel Izmailov</a>, <a href="https://twitter.com/def_chris_suter">Christopher Suter</a>, Sergey Ioffe, and <a href="https://twitter.com/itfische">Ian Fischer</a> for helpful feedback on this post.</small></p>
postsThu, 15 Sep 2022 00:00:00 -0400Non-equilibrium Thermodynamics Results Seemingly from NothingDeriving some classic results in non-equilibrium thermodynamics from seemingly nothing.<p>Let's see if we can very quickly prove the Jarzynski equality and related results in non-equilibrium statistical mechanics. Just as the mathematical underpinnings of thermodynamics are quite simple (e.g. the existence of a convex surface on which mixed partial derivatives commute), I believe most of the results in non-equilibrium statistical mechanics similarly come from a rhetorical reinterpretation of a simple mathematical manipulation.</p>
<p>This post will assume some familiarity with physics.</p>
<h2>Basic Facts</h2>
<p>The underlying math in our case consists of two facts. First, probability distributions are normalized:
$$ \int dx\, p(x) = 1. $$</p>
<aside><sup id="#concave">1</sup>
The proof is straightforward given that $\log$ is concave and Jensen's inequality holds; see <a href="kl.html#non-negative-proof">my other blog post</a> for the details.
</aside>
Second, the KL divergence is non-negative:<sup><a href="#concave">1</a></sup>
$$ \int dx\, p(x) \log \frac{p(x)}{q(x)} \geq 0. $$
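<p>For completeness, here is a sketch of how the second fact follows from the first. It assumes both $p$ and $q$ are normalized and uses only Jensen's inequality for the concave function $\log$:</p>
$$
\begin{align}
-\int dx\, p(x) \log \frac{p(x)}{q(x)} &= \int dx\, p(x) \log \frac{q(x)}{p(x)} \\
&\leq \log \int dx\, p(x) \frac{q(x)}{p(x)} \\
&= \log \int dx\, q(x) = \log 1 = 0.
\end{align}
$$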
<h2>Density Ratios</h2>
<p>To generate the classic non-equilibrium statistical mechanics results we start by considering a simple ratio of two joint probability distributions:
$$ \frac{q(x_0, x_1)}{p(x_0, x_1)} $$
Clearly we have tremendous freedom here in our choices for the distributions $p$ and $q$. Mathematically it's uninteresting, but we can start to build some rhetorical weight by factoring our two distributions in two distinct ways:
$$ \frac{q(x_1) q(x_0|x_1)}{p(x_0)p(x_1|x_0)} $$
Despite still not having done anything, we can start to build an interpretation here. Imagine $x_0$ and $x_1$ as being two configurations of a system, with $x_1$ happening <em>after</em> $x_0$. Now, though we're allowed by the chain rule to factor distributions any way we wish, here we've chosen to factor $p$ to be suggestive of some kind of <em>forward process</em> wherein we first sample some $x_0$ from a distribution $p(x_0)$ and then evolve it according to some potentially stochastic process to generate our next state $x_1$ conditioned on the first: $p(x_1|x_0)$. At the same time, we've factored $q$ the other way, evocative of a <em>reverse process</em> that starts at $x_1$ and then evolves backward to $x_0$.</p>
<p>To make further progress, let's specialize a bit. Let's imagine that $x_0$ and $x_1$ are configurations of a physical system evolving according to Hamiltonian dynamics, with a Hamiltonian governed by some kind of control parameter $\lambda$. Let's further <em>imagine</em> that at the beginning of either our forward or reverse process our system is in thermodynamic equilibrium at the same temperature, and in particular in a <a href="https://en.wikipedia.org/wiki/Canonical_ensemble">canonical ensemble</a>:<sup><a href="#beta">2</a></sup></p>
<aside><sup id="#beta">2</sup>
$\beta$ is the <a href="https://en.wikipedia.org/wiki/Thermodynamic_beta">inverse temperature</a> $1/(k_B T)$
</aside>
$$
\begin{align}
p(x_0) &= \frac{1}{Z(\beta,\lambda_0)} e^{-\beta H(x_0, \lambda_0)} \\
q(x_1) &= \frac{1}{Z(\beta, \lambda_1)} e^{-\beta H(x_1, \lambda_1)}.
\end{align}
$$
<p>Simply substituting these expressions into our density ratio we find:</p>
<p>$$ \frac{q(x_0,x_1)}{p(x_0,x_1)} = \frac{Z(\beta,\lambda_0)}{Z(\beta, \lambda_1)} e^{-\beta \left( H(x_1,\lambda_1) - H(x_0, \lambda_0) \right)} \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$</p>
<p>We can clean this up a bit and give it a cleaner physical interpretation. Let's identify the change in the Hamiltonian with the work:
$$ W \equiv H(x_1,\lambda_1) - H(x_0, \lambda_0). $$
And let's use the standard definition of the free energy:
$$ \beta F = -\log Z, $$
to rewrite the ratio of partition functions in terms of the difference in free energies, $\Delta F \equiv F(\beta,\lambda_1) - F(\beta,\lambda_0)$:
$$ e^{\beta \Delta F} = e^{\log Z(\beta,\lambda_0) -\log Z(\beta,\lambda_1)} = \frac{Z(\beta,\lambda_0)}{Z(\beta,\lambda_1)}. $$
Combining these results gives:
$$ \frac{q(x_0,x_1)}{p(x_0,x_1)} = e^{-\beta (W - \Delta F)} \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$
I'm going to anticipate some of the things we're going to talk about below and define the log of the reverse over the forward transition probabilities, measured in units of $k_B T$, as the <em>heat</em>:
$$ \beta Q = \log \frac{q(x_0|x_1)}{p(x_1|x_0)}. $$
With this final identification we end up with the general statement:
$$ \frac{q_R}{p_F} = e^{-\beta (W - Q - \Delta F)}. $$
The density ratio of the reverse process (shortened here as $q_R$) to the forward process $p_F$ is given by the exponential of $-\beta (W - Q - \Delta F)$.</p>
<h2>Hamiltonian Dynamics</h2>
<p>First, if we assume that our dynamics is Hamiltonian, and thus deterministic and reversible, we know that the probability that we start at $x_0$ and end up at $x_1$ if we evolve forward in time is the same as the probability that we start at $x_1$ and end up at $x_0$ if we reverse our time evolution, ($q(x_0|x_1) = p(x_1|x_0)$)<sup><a href="#heat-caveat">3</a></sup></p>
<aside><sup id="#heat-caveat">3</sup>
Alternatively, if you trust our identification of heat, you could imagine an isolated system where the heat flow is zero.
</aside>
so the ratio of conditional probabilities cancels and we generate the <a href="https://en.wikipedia.org/wiki/Crooks_fluctuation_theorem">Crooks Fluctuation Theorem</a>:
$$ \frac{q_R}{p_F} = e^{-\beta (W - \Delta F)}. $$
The ratio of the reverse process probability to the forward probability for a given initial and final point is given by the exponential $e^{-\beta (W - \Delta F)}$. If we now take the expectation of this ratio with respect to the forward process, we generate the <a href="https://en.wikipedia.org/wiki/Jarzynski_equality">Jarzynski equality</a>:<sup><a href="#langle">4</a></sup>
<aside><sup id="#langle">4</sup>
We've also introduced the $\langle \cdot \rangle$ notation for expectations to clean up the notation a bit.
</aside>
$$ \int dx_0\, dx_1\, p(x_0,x_1) \frac{q(x_0,x_1)}{p(x_0,x_1)} = 1 = \left\langle e^{-\beta (W - \Delta F)} \right\rangle_p, $$
<aside><sup id="#free-energy">5</sup>
The free energy depends only on the partition function $Z$, which is a constant, so it can be taken outside the expectation.
</aside>
which simplifies to<sup><a href="#free-energy">5</a></sup>:
$$ \left\langle e^{-\beta W}\right\rangle_p = e^{-\beta \Delta F}. $$
So, recapping, what have we just done?
Since we can take density ratios of arbitrary probability distributions, we could choose those two densities to mean something we care about. Consider $p$ the forward, Hamiltonian evolution of a system from $x_0$ to $x_1$ and $q$ the reverse process. If we imagine that both the forward and reverse processes start in a state of canonical equilibrium, we can generate both the Crooks Fluctuation Theorem and the Jarzynski equality.
<p>The power of this result is that it allows us to relate an expectation computed with respect to non-equilibrium processes (the exponential of the beta weighted stochastic work needed for a bunch of non-equilibrium realizations of our trajectory) to a pure equilibrium quantity (a difference of equilibrium free energies).
In the context of the physical sciences, this lets us perform non-equilibrium simulations or experiments, and provided we measure the work performed over many such runs, even with the system driven far from equilibrium, we can estimate equilibrium free energy differences.</p>
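<p>As a quick numerical sanity check of $\langle e^{-\beta W} \rangle = e^{-\beta \Delta F}$, here is a minimal sketch for a toy system that is my own choice rather than anything from the derivation above: a one-dimensional harmonic potential $H(x,k) = k x^2/2$ whose spring constant is switched suddenly from <code>k0</code> to <code>k1</code>. With a sudden switch the state has no time to move, so the work is just the change in the Hamiltonian at fixed $x$ (any momentum term is unchanged and cancels).</p>

```python
import math
import random

beta = 1.0
k0, k1 = 1.0, 2.0          # assumed spring constants before and after the switch
rng = random.Random(0)
num_samples = 200000

def work(x):
    # Sudden switch: W = H(x, k1) - H(x, k0) evaluated at the same point x.
    return 0.5 * (k1 - k0) * x * x

# Initial states drawn from the canonical ensemble of H(x, k0) = k0 x^2 / 2.
sigma0 = 1.0 / math.sqrt(beta * k0)
works = [work(rng.gauss(0.0, sigma0)) for _ in range(num_samples)]

jarzynski_average = sum(math.exp(-beta * w) for w in works) / num_samples
mean_work = sum(works) / num_samples

# Exact equilibrium answer: Z is proportional to 1/sqrt(beta k), so Z1/Z0 = sqrt(k0/k1).
delta_F = 0.5 * math.log(k1 / k0) / beta
exact = math.exp(-beta * delta_F)

print(jarzynski_average, exact)   # non-equilibrium average vs. equilibrium ratio
print(mean_work, delta_F)         # average work is at least the free energy change
```

<p>The process here is as far from quasi-static as possible, yet the exponential average of the work still recovers the equilibrium free energy difference, while the plain average of the work overshoots it.</p>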
<h2>Stochastic Dynamics</h2>
<p>But let's say you don't like the assumption that the dynamics are Hamiltonian. We can imagine something else: stochastic dynamics, discretized into steps. We still need to make some kind of assumption; in this case, we'll imagine that our process consists of $N$ steps, each of which is governed by a Markov transition kernel. Finally, we'll assume that each transition kernel has a stationary distribution and satisfies detailed balance.</p>
<p>What this means is that we'll imagine that our forward process now takes the form:
$$
\begin{align}
p_F &= p(x_0) p(x_1|x_0) p(x_2|x_1) \cdots p(x_N|x_{N-1}) \\
&= p(x_0) T_1(x_1|x_0) T_2(x_2|x_1) \cdots T_N(x_N|x_{N-1})
\end{align}
$$
Here we've denoted the intermediate conditional distributions as being governed by our transition kernels, labeled with the corresponding stationary distribution. Saying that our kernels have a stationary distribution that they respect according to detailed balance means that:
$$ T_k(x'|x) \sigma_k(x) = T_{k}(x|x') \sigma_k(x'), $$
for the stationary distribution $\sigma_k$.</p>
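<p>To make the detailed balance condition concrete, here is a small sketch using a Metropolis kernel on a four-state system. The kernel construction, the helper name <code>metropolis_kernel</code>, and the energies are all assumptions chosen for illustration; any kernel satisfying detailed balance would do.</p>

```python
import math

def metropolis_kernel(energies, beta):
    """T[x][y] = probability of moving x -> y: uniform proposal, Metropolis acceptance."""
    n = len(energies)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                accept = min(1.0, math.exp(-beta * (energies[y] - energies[x])))
                T[x][y] = accept / (n - 1)
        T[x][x] = 1.0 - sum(T[x])   # rejected proposals stay put
    return T

beta = 1.0
energies = [0.0, 0.5, 2.0, 1.2]     # hypothetical E_k(x) for a four-state system
n = len(energies)
Z = sum(math.exp(-beta * e) for e in energies)
sigma = [math.exp(-beta * e) / Z for e in energies]
T = metropolis_kernel(energies, beta)

# Detailed balance: T(y|x) sigma(x) == T(x|y) sigma(y) for every pair of states.
max_violation = max(abs(T[x][y] * sigma[x] - T[y][x] * sigma[y])
                    for x in range(n) for y in range(n))
# Stationarity follows by summing over x: sum_x sigma(x) T(y|x) == sigma(y).
stationary_error = max(abs(sum(sigma[x] * T[x][y] for x in range(n)) - sigma[y])
                       for y in range(n))
print(max_violation, stationary_error)
```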
<p>We've defined our forward process, now we need to define our reverse process. We'll imagine that the reverse process is governed by the same transition kernels but running in reverse:<sup><a href="#reverse">6</a></sup></p>
<aside><sup id="#reverse">6</sup>
Notice that by *reverse* here we mean that the kernels are actually designed to be the ones targeting the stationary distribution for the step we're on, rather than the one we are heading to.
</aside>
$$
\begin{align}
q_R &= q(x_N) q(x_{N-1}|x_N) \cdots q(x_1|x_2) q(x_0|x_1) \\
&= q(x_N) T_{N}(x_{N-1}|x_N) \cdots T_2(x_1|x_2) T_1(x_0|x_1).
\end{align}
$$
<p>Now if we look at the ratio of our reverse to our forward process, things simplify a bit:
$$
\begin{align}
\frac{q_R}{p_F} &= \frac{q(x_N)T_N(x_{N-1}|x_N)\cdots T_2(x_1|x_2)T_1(x_0|x_1)}{p(x_0)T_1(x_1|x_0)T_2(x_2|x_1)\cdots T_N(x_N|x_{N-1})} \\
&= \frac{q(x_N)}{p(x_0)} \frac{T_1(x_0|x_1)}{T_1(x_1|x_0)} \frac{T_2(x_1|x_2)}{T_2(x_2|x_1)} \cdots \frac{T_N(x_{N-1}|x_N)}{T_N(x_N|x_{N-1})} \\
&= \frac{q(x_N)}{p(x_0)} \frac{\sigma_1(x_0)}{\sigma_1(x_1)} \frac{\sigma_2(x_1)}{\sigma_2(x_2)} \cdots \frac{\sigma_N(x_{N-1})}{\sigma_N(x_N)} .
\end{align}
$$</p>
<p>Finally, as we did above, let's imagine that all of these marginal distributions take the form of a canonical distribution.<sup><a href="#stationary">7</a></sup></p>
<aside><sup id="#stationary">7</sup>
Notice that this isn't the same as assuming that our process is always in equilibrium, we are still describing a potentially non-equilibrium process, the only assumption here is that the dynamics is Markov and *stationary* with some stationary distribution that we can characterize.
</aside>
$$
\begin{align}
q(x_N) &\equiv \frac{1}{Z_N} e^{-\beta H_N(x_N)} \\
p(x_0) &\equiv \frac{1}{Z_0} e^{-\beta H_0(x_0)} \\
\sigma_k(x_j) &\equiv \frac{1}{Z_k} e^{-\beta E_k(x_j)}.
\end{align}
$$
Notice that the nice simplification that happens here is that since we imagined our reverse process as being the reverse of the forward process, in all but one of these fractions, the partition function of the intermediate stationary processes will cancel out. Putting this all together we obtain the general result:
$$ \frac{q_R}{p_F} = e^{-\beta(W - Q - \Delta F)}, $$
if we identify $W$ with the total energy change of the system ($H_N(x_N) - H_0(x_0)$), $\Delta F$ with the change in free energy (from $\beta F = -\log Z$ we have $\beta \Delta F = \log Z_0/Z_N$) and now identify the <i>heat</i> as the additional energy changes in each of the intermediate processes:<sup><a href="#serious">8</a></sup>
<aside><sup id="#serious">8</sup>
I don't think we should take this identification with the heat too seriously; some of the literature just calls this the total work for the trajectory.
</aside>
$$ Q \equiv \sum_{k=1}^{N} Q_k \qquad Q_k = \Delta E_k = E_k(x_k) - E_k(x_{k-1}) . $$
And I believe we've done it. Taking the expectation of this quantity with respect to the forward process will give us the Jarzynski equality again<sup><a href="#ais">9</a></sup>:
$$ \left\langle e^{-\beta(W - Q)} \right\rangle = e^{-\beta \Delta F}. $$
<aside><sup id="#ais">9</sup>
We've also just reinvented <a href="https://arxiv.org/abs/physics/9803008">Annealed Importance Sampling (AIS)</a>. For more details of how these non-equilibrium results relate to AIS see <a href="https://papers.nips.cc/paper/2017/hash/4da04049a062f5adfe81b67dd755cecc-Abstract.html"><i>Model Evidence from nonequilibrium simulations</i></a> by Habeck, NeurIPS2017.
</aside>
<p>Taking the logarithm of the inverse ratio and then the expectation gives the KL divergence between the forward and reverse processes, which we know must be non-negative:
$$ D(p_F; q_R) = \left\langle \log \frac{p_F}{q_R} \right\rangle_F = \beta \left\langle W - Q \right\rangle - \beta \Delta F \geq 0 $$
which naturally generates the inequality (a version of the second law):
$$ \left\langle W - Q \right\rangle \geq \Delta F. $$
As a reminder, in this case our initial distributions were canonical, but our dynamics were generalized to any sequence of Markovian transition kernels, provided only that those kernels have a stationary distribution and satisfy detailed balance.</p>
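<p>Here is a small sketch that checks the stochastic version of the equality exactly, by enumerating every path of a three-state system. It uses the conventions $W = H_N(x_N) - H_0(x_0)$, $Q = \sum_k \left[ E_k(x_k) - E_k(x_{k-1}) \right]$, and $e^{-\beta \Delta F} = Z_N/Z_0$. The energies and the Metropolis kernels (the helper <code>metropolis_kernel</code>) are hypothetical choices for illustration; the check only relies on each kernel satisfying detailed balance with respect to its own $\sigma_k$.</p>

```python
import itertools
import math

def metropolis_kernel(energies, beta):
    """T[x][y] = probability of moving x -> y: uniform proposal, Metropolis acceptance."""
    n = len(energies)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                accept = min(1.0, math.exp(-beta * (energies[y] - energies[x])))
                T[x][y] = accept / (n - 1)
        T[x][x] = 1.0 - sum(T[x])
    return T

beta = 1.0
H0 = [0.0, 1.0, 2.0]                      # hypothetical initial Hamiltonian
HN = [1.5, 0.0, 0.3]                      # hypothetical final Hamiltonian
E = [[0.0, 0.8, 1.5],                     # E_1, E_2, E_3: energies defining sigma_k
     [0.7, 0.3, 0.9],
     [1.5, 0.0, 0.3]]
N = len(E)
kernels = [metropolis_kernel(e, beta) for e in E]

Z0 = sum(math.exp(-beta * h) for h in H0)
ZN = sum(math.exp(-beta * h) for h in HN)

# Exact forward-process expectation of exp(-beta (W - Q)) over all paths x_0..x_N.
average = 0.0
for path in itertools.product(range(3), repeat=N + 1):
    p_forward = math.exp(-beta * H0[path[0]]) / Z0
    Q = 0.0
    for k in range(1, N + 1):
        p_forward *= kernels[k - 1][path[k - 1]][path[k]]
        Q += E[k - 1][path[k]] - E[k - 1][path[k - 1]]
    W = HN[path[N]] - H0[path[0]]
    average += p_forward * math.exp(-beta * (W - Q))

exact = ZN / Z0   # equals exp(-beta * delta_F)
print(average, exact)
```

<p>Replacing the exact enumeration with sampled forward trajectories turns this into the usual Monte Carlo estimator of a ratio of partition functions.</p>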
<h2>Generalized Landauer Bound</h2>
<p><a href="https://youtu.be/r33Wj8FF_EQ?t=356">Wolpert says</a> that, from stochastic thermodynamics, we know:</p>
<p>\begin{equation}
-\Delta Q = \Delta \Sigma + S(p_0) - S(p_1)
\end{equation}</p>
<p>Which, with $\Delta \Sigma \geq 0$ gives us the <em>generalized Landauer bound</em></p>
<p>\begin{equation}
-\Delta Q \geq S(p_0) - S(p_1)
\end{equation}</p>
<p>For the classic case of bit erasure the change in entropy is $\log 2$ and we get Landauer's bound:</p>
<p>\begin{equation}
-\Delta Q \geq kT \log 2
\end{equation}</p>
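<p>To get a feel for the size of this bound, here is a short sketch that evaluates $kT \log 2$ numerically. The temperature of 300 K is an assumed value for "room temperature" and is not taken from the text.</p>

```python
import math

k_B = 1.380649e-23            # Boltzmann constant in J/K (exact SI value)
T = 300.0                     # assumed room temperature in kelvin
eV = 1.602176634e-19          # one electronvolt in joules (exact SI value)

landauer_joules = k_B * T * math.log(2)
landauer_eV = landauer_joules / eV
print(landauer_joules, landauer_eV)
```

<p>The result is a few zeptojoules per erased bit, or roughly 18 meV.</p>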
<p>So, where does this come from? It doesn't seem like there is much to it, honestly. Imagine two joint distributions $p(x_0, x_1)$ and $q(x_0, x_1)$ describing a <em>forward</em> and <em>reverse</em> process that moves between two states. The KL divergence between these two is non-negative and <em>monotonic</em> under marginalization:</p>
<p>\begin{equation}
\left\langle \log \frac{p(x_0,x_1)}{q(x_0,x_1)} \right\rangle_p \geq \left\langle \log \frac{p(x_1)}{q(x_1)} \right\rangle_p \geq 0
\end{equation}</p>
<p>We can simply rearrange terms. Subtracting $\langle \log p(x_1)/q(x_1) \rangle_p$ from both sides, we first find the entropy production:
\begin{equation}
\Delta\Sigma \equiv \left\langle \log \frac{p(x_1|x_0)p(x_0)}{q(x_0|x_1)p(x_1)} \right\rangle \geq 0
\end{equation}</p>
<p>and we can establish the identity:
\begin{equation}
\left\langle \log \frac{p(x_1|x_0)p(x_0)}{q(x_0|x_1)p(x_1)} \right\rangle_p = \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle_p + \left\langle \log \frac{p(x_0)}{p(x_1)} \right\rangle_p
\end{equation}</p>
<p>If we simply identify terms, we recover the Wolpert form:</p>
<p>\begin{equation}
\Delta \Sigma = -\Delta Q + S(p_1)-S(p_0)
\end{equation}</p>
<p>To make these identifications, we can see that:
\begin{equation}
S(p_0) = -\left\langle \log p(x_0) \right\rangle \qquad S(p_1) = -\left\langle \log p(x_1) \right\rangle
\end{equation}</p>
<p>And for the <em>entropy rate</em>:
\begin{equation}
-\Delta Q \equiv \left\langle \log \frac{p(x_1|x_0)}{q(x_0|x_1)} \right\rangle
\end{equation}
which appears to be the likelihood ratio of our forward and reverse conditional processes, i.e. some characterization of the irreversibility of our system.</p>
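<p>These identifications are easy to check numerically. The sketch below draws two arbitrary joint distributions over a pair of three-state variables (the random distributions and the helper <code>random_joint</code> are purely illustrative) and verifies both the monotonicity of the KL divergence and the identity $\Delta \Sigma = -\Delta Q + S(p_1) - S(p_0)$.</p>

```python
import math
import random

rng = random.Random(0)
n = 3

def random_joint(size):
    # Strictly positive random table, normalized to sum to one.
    weights = [[rng.random() + 0.05 for _ in range(size)] for _ in range(size)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

p = random_joint(n)   # p[x0][x1], the forward joint
q = random_joint(n)   # q[x0][x1], the reverse joint
pairs = [(i, j) for i in range(n) for j in range(n)]

p0 = [sum(p[i]) for i in range(n)]
p1 = [sum(p[i][j] for i in range(n)) for j in range(n)]
q1 = [sum(q[i][j] for i in range(n)) for j in range(n)]

kl_joint = sum(p[i][j] * math.log(p[i][j] / q[i][j]) for i, j in pairs)
kl_marginal = sum(p1[j] * math.log(p1[j] / q1[j]) for j in range(n))
entropy_production = kl_joint - kl_marginal

# -Delta Q = < log p(x1|x0) / q(x0|x1) >_p
minus_delta_Q = sum(
    p[i][j] * math.log((p[i][j] / p0[i]) / (q[i][j] / q1[j])) for i, j in pairs)
S0 = -sum(a * math.log(a) for a in p0)
S1 = -sum(a * math.log(a) for a in p1)

print(entropy_production, minus_delta_Q + S1 - S0)
print(minus_delta_Q, S0 - S1)   # generalized Landauer bound: first >= second
```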
<p>If we happen to be in a system that satisfies local detailed balance, we know that there should be some kind of steady state distribution for which:
\begin{equation}
p(x_1|x_0) \pi(x_0) = q(x_0|x_1) \pi(x_1)
\end{equation}
so that:
\begin{equation}
\log \frac{p(x_1|x_0)}{q(x_0|x_1)} = \log \frac{\pi(x_1)}{\pi(x_0)}
\end{equation}
and if we further imagine that the steady state distribution is Boltzmann-like and the system is in contact with some kind of heat bath, we see that:
\begin{equation}
\log \frac{\pi(x_1)}{\pi(x_0)} = \log \frac{\frac{1}{Z_1}e^{-\beta H_1}}{\frac{1}{Z_0} e^{-\beta H_0}} = \log \frac{Z_0}{Z_1} - \beta (H_1 - H_0) = \beta \Delta F - \beta \Delta U = -\Delta Q
\end{equation}
so we can identify the log ratio of the forward to the reverse transition probabilities with the heat flow into the bath, measured in units of $k_B T$.</p>
<h2>Variational Autoencoder</h2>
<p>To show some of the generality of what we're doing here, let's do it again but for a completely different kind of system, this time a <a href="https://en.wikipedia.org/wiki/Variational_autoencoder">Variational Autoencoder</a>. In a variational autoencoder there are two joint distributions at play, one a <em>representational model</em> $p(x,z) = p(x) p(z|x)$ which starts with a draw from some <em>true</em> data distribution $p(x)$ and then uses an <em>encoder</em> to map that datum to some kind of representative code, or summary, or representation $z$: $p(z|x)$. The other joint distribution consists of a <em>generative model</em> $q(x,z) = q(z)q(x|z)$ that imagines a joint distribution over the same space but works in <em>reverse</em>. First, we generate a <em>latent variable</em> $z$ from some <em>prior distribution</em> $q(z)$ and then we use a <em>decoder</em> to stochastically turn that latent variable into a generated datum $x$: $q(x|z)$.</p>
<p>We can easily imagine the ratio of these two densities:
$$ \frac{q(x,z)}{p(x,z)} = \frac{q(z)q(x|z)}{p(x)p(z|x)}. $$</p>
<p>As we saw above, the way to generate an inequality here is to turn this into a KL divergence:
$$
\begin{align}
D( p(x,z) ; q(x,z) ) &= \left\langle \log \frac{p(x) p(z|x)}{q(z) q(x|z)} \right\rangle_p \\
&= -\left\langle -\log p(x) \right\rangle_p + \left\langle -\log q(x|z) \right\rangle_p + \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p \\
&\equiv -H + D + R \geq 0
\end{align}
$$
Here, just as above we've only rearranged terms, but this time organized them into three contributions, the <em>entropy</em> of the true data generating process:
$$ H \equiv \left\langle -\log p(x) \right\rangle_p, $$
the <em>distortion</em>, a measure of how likely we are to encode and then decode an image back to the one we started with:
$$ D \equiv \left\langle - \log q(x|z) \right\rangle_p = -\int dx\, p(x) \int dz\, p(z|x) \log q(x|z), $$
and the <em>rate</em>, a measure of the excess cost required to communicate this message $z$ over a wire designed to be optimal for the prior $q(z)$:
$$ R \equiv \left\langle \log \frac{p(z|x)}{q(z)} \right\rangle_p = \left\langle D(p(z|x); q(z)) \right\rangle_{p(x)}. $$
We've just rederived the <em>ELBO</em><sup><a href="#elbo">10</a></sup></p>
<aside><sup id="#elbo">10</sup>
For Evidence Lower BOund.
</aside>
rendered in the form presented in <i>Fixing a Broken ELBO</i><sup><a href="#brokenelbo">11</a></sup>
<aside><sup id="#brokenelbo">11</sup>
<i>Fixing a Broken ELBO</i> by AA Alemi, B Poole, I Fischer, JV Dillon, RA Saurous and K Murphy, ICML 2018. <a href="https://arxiv.org/abs/1711.00464">1711.00464</a>
</aside>
$$ -\textsf{ELBO} \equiv D + R \geq H. $$
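<p>As a sanity check on this decomposition, here is a minimal sketch with a discrete toy model. The data distribution, encoder, prior, and decoder are all random tables invented for the example (built by the hypothetical helper <code>random_distribution</code>); the point is only that $D + R - H$ equals the joint KL divergence and is therefore never negative.</p>

```python
import math
import random

rng = random.Random(1)

def random_distribution(size):
    # Strictly positive random vector, normalized to sum to one.
    w = [rng.random() + 0.05 for _ in range(size)]
    total = sum(w)
    return [v / total for v in w]

nx, nz = 4, 3
pairs = [(x, z) for x in range(nx) for z in range(nz)]
p_x = random_distribution(nx)                             # data distribution p(x)
encoder = [random_distribution(nz) for _ in range(nx)]    # p(z|x), indexed [x][z]
prior = random_distribution(nz)                           # q(z)
decoder = [random_distribution(nx) for _ in range(nz)]    # q(x|z), indexed [z][x]

H = -sum(p_x[x] * math.log(p_x[x]) for x in range(nx))
D = -sum(p_x[x] * encoder[x][z] * math.log(decoder[z][x]) for x, z in pairs)
R = sum(p_x[x] * encoder[x][z] * math.log(encoder[x][z] / prior[z]) for x, z in pairs)

# KL divergence between the two joints p(x) p(z|x) and q(z) q(x|z).
kl_joint = sum(
    p_x[x] * encoder[x][z]
    * math.log(p_x[x] * encoder[x][z] / (prior[z] * decoder[z][x]))
    for x, z in pairs)

print(H, D, R, kl_joint)
```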
<h2>Conclusion</h2>
<p>We've managed to derive several non-equilibrium statistical mechanical equalities and inequalities seemingly from nothing. All of these results were powered by the facts we opened with: that probability distributions integrate to one and that KL divergences are non-negative. The only challenge here was one of semantics. Getting power out of such trivial mathematical manipulations required us to make judicious choices in how we interpreted them.</p>
<p><small>Special thanks to Sam Schoenholz, Srinivas Vasudevan, Yasaman Bahri and Jim Sethna for helpful feedback on this post.</small></p>
postsFri, 16 Sep 2022 00:00:00 -0400Uncertainty in VIBhttps://docs.google.com/presentation/d/1PjEaRIeDOwVYKEmyLBIBiKS8bYPYwwKcXki_Gb8i42c/presentVIB classifiers capture uncertainty effectively. / UAI UDL Workshop 2018https://alexalemi.com/talks/uaivib.htmltalksWed, 01 Aug 2018 00:00:00 -0400Fixing a BrokenELBOhttps://docs.google.com/presentation/d/11ToIFlOLrcP3GTl8u6Lv-jPntifO0iohIp2VIlTsqj8/presentA representational reinterpretation of VAEs that help clarify issues such as posterior collapse. / ICML2018https://alexalemi.com/talks/fixing-broken-elbo.htmltalksSun, 01 Jul 2018 00:00:00 -0400Thermodynamics and Machine Learninghttps://docs.google.com/presentation/d/1B2xbdhFRByzIJOdehPVGm5xrCpbUXkijnFRPEwVMKtk/presentAn earlier talk relating thermodynamics and machine learning for a physics audience. / Cornell Physics Colloquiumhttps://alexalemi.com/talks/thermodynamics-and-ml.htmltalksThu, 01 Nov 2018 00:00:00 -0400Focusing on the Representationhttps://docs.google.com/presentation/d/1Zd_-R6vVWkPegm_oEXTnlTvFbPTcdb71AtrHYk4-JpM/presentAn overview of my work, which often amounts to reinterpreting existing techniques in a representational light. / Cornell AI Seminarhttps://alexalemi.com/talks/focusing-on-the-representation.htmltalksThu, 01 Nov 2018 00:00:00 -0400TherMLhttps://docs.google.com/presentation/d/1Uhr4oJwTm2yI7FAvkjMbdK6s_HNwD9T61j6Ccz_eBmc/presentDrawing an analogy between Thermodynamics and modern deep variational latent variable generative modelling / Aspen: Machine Learning and Physicshttps://alexalemi.com/talks/therml.htmltalksTue, 01 Jan 2019 00:00:00 -0500A Case for Compressionhttps://docs.google.com/presentation/d/1rAZToLv1dfCXfzlzgTiYXBxf563qv0esx_i7y9vYt5c/presentI offer arguments both for and against learning compressed representations in the form of a generalized information bottleneck. 
/ NeurIPS 2019 Workshop on Information Theory and Machine Learninghttps://alexalemi.com/talks/case-for-compression.htmltalksSun, 01 Dec 2019 00:00:00 -0500Variational Predictive Information Bottleneckhttps://docs.google.com/presentation/d/1wlQzWYr2cHu081NWPL9Cfp6z1cC4wIxe_qfFUhAfWcg/presentI attempt to show that most modern forms of inference can be viewed as optimizing a variational bound on a predictive information bottleneck objective. / Information Theory and Applications Workshophttps://alexalemi.com/talks/ita-pib.htmltalksSat, 01 Feb 2020 00:00:00 -0500TherMLhttps://docs.google.com/presentation/d/1LiovZcyZfh-P6mluB9fnyGFkOz4FJ7Y6V_hOuUx_j0A/presentAnother version of my TherML talk. / American Physical Society Topical Group on Data Sciencehttps://alexalemi.com/talks/therml-aps.htmltalksMon, 01 Jun 2020 00:00:00 -0400Machine Learning and Thermodynamicshttps://docs.google.com/presentation/d/1zG0pU33e6SnIhyYR926Y6JNhBdP4kqM3-vMz1vNdTQk/presentThermodynamics from a Probabilistic perspective and machine learning from a thermodynamic perspective. / University of Maryland - Informal Statistical Physics Seminarhttps://alexalemi.com/talks/ml-and-thermo.htmltalksMon, 01 Jun 2020 00:00:00 -0400VIB is Half Bayeshttps://youtu.be/JGDAZ4joUX8The Variational Information Bottleneck can be viewed as a sort of half-Bayesian approach. / Advances in Approximate Bayesian Inference Symposium 2021https://alexalemi.com/talks/vib-is-half-bayes.htmltalksMon, 01 Feb 2021 00:00:00 -0500Machine Learning and Thermodynamicshttps://docs.google.com/presentation/d/1tIGTRRE0gKjBIySrQOUO-qlIDlj1nXhDybamG0YP3YI/presentAnother version of the relationship between thermodynamics and machine learning. / Scientific Machine Learning Mini-Course (SciML) @ CMUhttps://alexalemi.com/talks/sciml-thermo.htmltalksThu, 01 Jul 2021 00:00:00 -0400PACm Bayes - Your Model is Wrong Workshophttps://youtu.be/HHu7fclYlVgBayesian inference doesn't optimize for prediction in mispecified models. 
/ Your Model is Wrong Workshop - NeurIPS 2021https://alexalemi.com/talks/pacm-talk.htmltalksMon, 01 Nov 2021 00:00:00 -0400Inferential Engineshttps://docs.google.com/presentation/d/1WjyaZxYD6jf_bkK4QIhgM73MuO8nXiLZqt0P8s726VQ/present?usp=share_link&resourcekey=0-IuSicQOQrSCn-kQsu3NyQQViewing VAEs as four stroke engines. / Theoretical Physics for Machine Learning - Aspenhttps://alexalemi.com/talks/inferential-engines.htmltalksWed, 01 Feb 2023 00:00:00 -0500Why Venus has no moonhttps://alexalemi.com/publications/venus.pdfUndergraduate research investigating whether two collisions in the opposite direction could explain Venus' lack of moon and slow rotation. / AA Alemi, DJ Stevenson / / AAS Oralhttps://alexalemi.com/publications/venus.pdfpublicationsFri, 01 Sep 2006 00:00:00 -0400NEMS Couplinghttps://alexalemi.com/publications/nems.pdfUndergraduate research project on synchronization in nano cantilevers. / AA Alemi / / https://alexalemi.com/publications/nems.pdfpublicationsMon, 01 Sep 2008 00:00:00 -0400Laplace-Runge-Lenz Vectorhttps://alexalemi.com/publications/laplace.pdfUndergraduate project on the history of the Runge Vector. / AA Alemi / / https://alexalemi.com/publications/laplace.pdfpublicationsMon, 01 Jun 2009 00:00:00 -0400Near-field radiative heat transfer between macroscopic planar surfaceshttps://arxiv.org/abs/1103.2389Exploration of quantum tunnelling as a mechanism for cooling the next generation LIGO detectors. / RS Ottens, Volker Quetschke, Stacy Wise, AA Alemi, Ramsey Lundock, Guido Mueller, David H Reitze, David B Tanner, Bernard F Whiting / 1103.2389 / Phys Rev Letthttps://alexalemi.com/publications/heat.pdfpublicationsTue, 01 Mar 2011 00:00:00 -0500Growth and form of melanoma cell colonieshttps://arxiv.org/abs/1308.6037Simple models of skin cancer growth. 
/ MM Baraldi, AA Alemi, JP Sethna, S Caracciolo, CAM La Porta, S Zapperi / 1308.6037 / JSMhttps://alexalemi.com/publications/melanoma.pdfpublicationsThu, 01 Aug 2013 00:00:00 -0400Imaging atomic rearrangements in two-dimensional silica glass: watching silica's dancehttps://alexalemi.com/publications/glass.pdfApplying elastic theory to the atomic scale. / PY Huang, S Kurasch, JS Alden, A Shekhawat, AA Alemi, PL McEuen, JP Sethna, U Kaiser, DA Muller / / Sciencehttps://alexalemi.com/publications/glass.pdfpublicationsTue, 01 Oct 2013 00:00:00 -0400Knowledgebase of Interatomic Models application programming interface as a standard for molecular simulationshttps://alexalemi.com/publications/openkim2.pdfBuilding a website to collect interatomic potentials and score them. / R Elliott, E Tadmor, D Karls, A Ludvik, J Sethna, M Bierbaum, AA Alemi, T Wennblom / / https://alexalemi.com/publications/openkim2.pdfpublicationsWed, 01 Oct 2014 00:00:00 -0400Ensuring reliability, reproducibility and transferability in atomistic simulations: The knowledgebase of interatomic models (https://openkim.org)https://alexalemi.com/publications/openkim-abs.pdf / E Tadmor, R Elliott, D Karls, A Ludvik, J Sethna, M Bierbaum, AA Alemi, T Wennblom / / https://alexalemi.com/publications/openkim-abs.pdfpublicationsWed, 01 Oct 2014 00:00:00 -0400Mechanical properties of growing melanocytic nevi and the progression to melanomahttps://arxiv.org/abs/1404.4116Elastic models of skin cancer. / A Taloni, AA Alemi, E Ciusani, JP Sethna, S Zapperi, CAM La Porta / 1404.4116 / PloS Onehttps://alexalemi.com/publications/cancer.pdfpublicationsTue, 01 Apr 2014 00:00:00 -0400Text segmentation based on semantic word embeddingshttps://arxiv.org/abs/1503.05543Using word2vec vectors to do automatic text segmentation. 
/ AA Alemi, P Ginsparg / 1503.05543 / https://alexalemi.com/publications/segmentation.pdfpublicationsSun, 01 Mar 2015 00:00:00 -0500Clustering via Content-Augmented Stochastic Blockmodelshttps://arxiv.org/abs/1505.06538Better clustering through content conditioning. / JM Cashore, X Zhao, AA Alemi, Y Liu, PI Frazier / 1505.06538 / https://alexalemi.com/publications/blockmodels.pdfpublicationsFri, 01 May 2015 00:00:00 -0400Zombies Reading Segmented Graphene Articles On The Arxivhttps://alexalemi.com/publications/thesis.pdfA collection of four of my graduate student projects. / AA Alemi / / Thesishttps://alexalemi.com/publications/thesis.pdfpublicationsSat, 01 Aug 2015 00:00:00 -0400You can run, you can hide: The epidemiology and statistical mechanics of zombieshttps://arxiv.org/abs/1503.01104A fun pedadogical introduction to epidemiology and statistical mechanics. / AA Alemi, M Bierbaum, CR Myers, JP Sethna / 1503.01104 / Phys Rev Ehttps://alexalemi.com/publications/zombies.pdfpublicationsSun, 01 Nov 2015 00:00:00 -0400SPARTA: Fast global planning of collision-avoiding robot trajectorieshttps://alexalemi.com/publications/sparta.pdfUsing ADMM to do fast trajectory planning. / CJM Mathy, F Gonda, D Schmidt, N Derbinsky, AA Alemi, J Bento, FM Delle Fave, JS Yedidia / / https://alexalemi.com/publications/sparta.pdfpublicationsTue, 01 Dec 2015 00:00:00 -0500DeepMath-deep sequence models for premise selectionhttps://arxiv.org/abs/1606.04442Using neural networks to improve automatic theorem proving. / G Irving, C Szegedy, AA Alemi, N Eén, F Chollet, J Urban / 1606.04442 / NeurIPShttps://alexalemi.com/publications/deep_math.pdfpublicationsWed, 01 Jun 2016 00:00:00 -0400Improving inception and image classification in tensorflowhttps://ai.googleblog.com/2016/08/improving-inception-and-image.htmlBlogpost accompanying open source release of Inception Resnet V2. 
/ AA Alemi / / Google Research Bloghttps://alexalemi.com/publications/inceptionblog.htmlpublicationsWed, 01 Jun 2016 00:00:00 -0400Tree-Structured Variational Autoencoderhttps://alexalemi.com/publications/tree_vae.pdfAttempting to learn tree-structured representations. / R Shin, AA Alemi, G Irving, O Vinyals / / https://alexalemi.com/publications/tree_vae.pdfpublicationsTue, 01 Nov 2016 00:00:00 -0400Improved generator objectives for ganshttps://arxiv.org/abs/1612.02780You can target separate divergences for the generator and discriminator of a GAN. / B Poole, AA Alemi, J Sohl-Dickstein, A Angelova / 1612.02780 / NeurIPS Adversarial Workshophttps://alexalemi.com/publications/improved_gan.pdfpublicationsThu, 01 Dec 2016 00:00:00 -0500Deep Variational Information Bottleneckhttps://arxiv.org/abs/1612.00410A modern formulation of the Information Bottleneck which is friendly towards neural networks. / AA Alemi, I Fischer, JV Dillon, K Murphy / 1612.00410 / ICLRhttps://alexalemi.com/publications/vib.pdfpublicationsWed, 01 Mar 2017 00:00:00 -0500Inception-v4, inception-resnet and the impact of residual connections on learninghttps://arxiv.org/abs/1602.07261Residual connections improve the inception family of classifiers. / C Szegedy, S Ioffe, V Vanhoucke, AA Alemi / 1602.07261 / AAAIhttps://alexalemi.com/publications/inceptionv4.pdfpublicationsWed, 01 Feb 2017 00:00:00 -0500Motion prediction under multimodality with conditional stochastic networkshttps://arxiv.org/abs/1705.02082Pedestrian motion is stochastic which creates certain challenges. / K Fragkiadaki, J Huang, AA Alemi, S Vijayanarasimhan, S Ricco, R Sukthankar / 1705.02082 / https://alexalemi.com/publications/motion.pdfpublicationsMon, 01 May 2017 00:00:00 -0400Jeffrey's prior sampling of deep sigmoidal networkshttps://arxiv.org/abs/1705.10589Jeffrey's prior doesn't really work for neural networks. 
/ LX Hayden, AA Alemi, PH Ginsparg, JP Sethna / 1705.10589 / https://alexalemi.com/publications/jeffrey.pdfpublicationsMon, 01 May 2017 00:00:00 -0400Light microscopy at maximal precisionhttps://arxiv.org/abs/1702.07336Better featuring of colloids. / M Bierbaum, BD Leahy, AA Alemi, I Cohen, JP Sethna / 1702.07336 / Phys Rev Xhttps://alexalemi.com/publications/peri.pdfpublicationsWed, 01 Feb 2017 00:00:00 -0500Tensorflow distributionshttps://arxiv.org/abs/1711.10604Paper accompanying library. / JV Dillon, I Langmore, D Tran, E Brevdo, S Vasudevan, D Moore, B Patton, AA Alemi, M Hoffman, RA Saurous / 1711.10604 / https://alexalemi.com/publications/tfd.pdfpublicationsWed, 01 Nov 2017 00:00:00 -0400Fixing a Broken ELBOhttps://arxiv.org/abs/1711.00464Adopting a representational view of VAEs can help explain away some of their problems. / AA Alemi, B Poole, I Fischer, JV Dillon, RA Saurous, K Murphy / 1711.00464 / ICMLhttps://alexalemi.com/publications/broken_elbo.pdfpublicationsTue, 01 May 2018 00:00:00 -0400GILBO: one metric to measure them allhttps://arxiv.org/abs/1802.04874A variational lower bound on the mutual informations in GANs highlights some of their problems. / AA Alemi, I Fischer / 1802.04874 / NeurIPShttps://alexalemi.com/publications/gilbo.pdfpublicationsSat, 01 Dec 2018 00:00:00 -0500Watch your step: Learning node embeddings via graph attentionhttps://arxiv.org/abs/1710.09599Building better graph representations. / S Abu-El-Haija, B Perozzi, R Al-Rfou, AA Alemi / 1710.09599 / NeurIPShttps://alexalemi.com/publications/watch_step.pdfpublicationsSat, 01 Dec 2018 00:00:00 -0500Uncertainty in the Variational Information Bottleneckhttps://arxiv.org/abs/1807.00906VIB builds robust classifiers which are aware of what they don't know. 
/ AA Alemi, I Fischer, JV Dillon / 1807.00906 / UAI UDL Workshophttps://alexalemi.com/publications/uncert_vib.pdfpublicationsSun, 01 Jul 2018 00:00:00 -0400TherML: Thermodynamics of Machine Learninghttps://arxiv.org/abs/1807.04162Modern variational latent variable modelling looks a lot like Thermodynamics. / AA Alemi, I Fischer / 1807.04162 / ICML2018 TFADGM Workshophttps://alexalemi.com/publications/therml.pdfpublicationsSun, 01 Jul 2018 00:00:00 -0400WAIC, but Why? Generative Ensembles for Robust Anomaly Detectionhttps://arxiv.org/abs/1810.01392Even though it shouldn't work, robust likelihoods can detect OOD data in practice. / H Choi, E Jang, AA Alemi / 1810.01392 / https://alexalemi.com/publications/waic.pdfpublicationsMon, 01 Oct 2018 00:00:00 -0400Canonical Sectors and Evolution of Firms in the US Stock Marketshttps://arxiv.org/abs/1503.06205Matrix factorization gives automatic and continuous sector assignments to stocks. / LX Hayden, R Chachra, AA Alemi, PH Ginsparg, JP Sethna / 1503.06205 / Quantitative Financehttps://alexalemi.com/publications/stocks.pdfpublicationsMon, 01 Oct 2018 00:00:00 -0400β-VAEs can retain label information even at high compressionhttps://arxiv.org/abs/1812.02682Some rich decoder VAEs can magically focus on salient information. / E Fertig, A Arbabi, AA Alemi / 1812.02682 / NeurIPS BDL Workshophttps://alexalemi.com/publications/beta_retain.pdfpublicationsSat, 01 Dec 2018 00:00:00 -0500On the Use of ArXiv as a Datasethttps://arxiv.org/abs/1905.0075More people should use the ArXiv as a dataset. / CB Clement, M Bierbaum, KP O'Keeffe, AA Alemi / 1905.0075 / ICLR workshop RLGMhttps://alexalemi.com/publications/arxiv.pdfpublicationsWed, 01 May 2019 00:00:00 -0400Variational Autoencoders with Tensorflow Probability Layershttps://medium.com/tensorflow/variational-autoencoders-with-tensorflow-probability-layers-d06c658931b7TFP makes VAEs easy. 
/ I Fischer, AA Alemi, JV Dillon, TFP Team / / Tensorflow Bloghttps://alexalemi.com/publications/vaetfp.htmlpublicationsFri, 01 Mar 2019 00:00:00 -0500Dueling Decoders: Regularizing Variational Autoencoder Latent Spaceshttps://arxiv.org/abs/1905.07478Sometimes a worse decoder gives better representations. / B Seybold, E Fertig, AA Alemi, I Fischer / 1905.07478 / https://alexalemi.com/publications/dueling.pdfpublicationsWed, 01 May 2019 00:00:00 -0400On Variational Bounds of Mutual Informationhttps://arxiv.org/abs/1905.06922Overview of recent advances in variationally bounding mutual information. / B Poole, S Ozair, A van den Oord, AA Alemi, G Tucker / 1905.06922 / ICMLhttps://alexalemi.com/publications/vmibounds.pdfpublicationsWed, 01 May 2019 00:00:00 -0400Thermodynamic Computinghttps://arxiv.org/abs/1911.01968A position paper on the future of thermodynamic computing. / T Conte, E DeBenedictis, N Ganesh, T Hylton, JP Strachan, RS Williams, AA Alemi, L Altenberg, G Crooks, J Crutchfield, L del Rio, J Deutsch, M DeWeese, K Douglas, M Esposito, M Frank, R Fry, P Harsha, M Hill, C Kello, J Krichmar, S Kumar, SC Liu, S Lloyd, M Marsili, I Nemenman, A Nugent, N Packard, D Randall, P Sadowski, N Santhanam, R Shaw, A Stieg, E Stopnitzky, C Teuscher, C Watkins, D Wolpert, J Yang, Y Yufik / 1911.01968 / CCChttps://alexalemi.com/publications/thermodynamic.pdfpublicationsFri, 01 Nov 2019 00:00:00 -0400On Predictive Information in RNNshttps://arxiv.org/abs/1910.09578Modern RNNs do not optimally capture predictive information in sequences. / Z Dong, D Oktay, B Poole, AA Alemi / 1910.09578 / https://alexalemi.com/publications/salamander.pdfpublicationsTue, 01 Oct 2019 00:00:00 -0400CEB Improves Model Robustnesshttps://arxiv.org/abs/2002.05380A class conditional version of VIB shows good robustness. 
/ I Fischer, AA Alemi / 2002.05380 / Entropyhttps://alexalemi.com/publications/cebrobust.pdfpublicationsTue, 01 Oct 2019 00:00:00 -0400Information in Infinite Ensembles of Infinitely-Wide Networkshttps://arxiv.org/abs/1911.09189While they seem complex, infinite ensembles of infinitely-wide networks are simple enough to enable tractable calculations of many information theoretic quantities. / R Shwartz-Ziv, AA Alemi / 1911.09189 / AABI 2019 - PMLRhttps://alexalemi.com/publications/infiniteinfo.pdfpublicationsTue, 01 Oct 2019 00:00:00 -0400Variational Predictive Information Bottleneckhttps://arxiv.org/abs/1910.10831Most modern inference procedures can be rederived as a simple variational bound on a predictive information bottleneck objective. / AA Alemi / 1910.10831 / AABIhttps://alexalemi.com/publications/pib.pdfpublicationsTue, 01 Oct 2019 00:00:00 -0400Neural Tangents: Fast and Easy Infinite Neural Networks in Pythonhttps://arxiv.org/abs/1912.02803Simple to use python package for training infinitely wide neural networks. / R Novak, L Xiao, J Hron, J Lee, AA Alemi, J Sohl-Dickstein, SS Schoenholz / 1912.02803 / ICLRhttps://alexalemi.com/publications/neural_tangents.pdfpublicationsSun, 01 Dec 2019 00:00:00 -0500The OpenKIM Processing Pipeline: A Cloud-Based Automatic Materials Property Computation Enginehttps://arxiv.org/abs/2005.09062Database for Interatomic Potentials. / DS Karls, M Bierbaum, AA Alemi, RS Elliot, JP Sethna, EB Tadmor / 2005.09062 / Journal of Chemical Physicshttps://alexalemi.com/publications/openkim.pdfpublicationsFri, 01 May 2020 00:00:00 -0400Density of States Estimation for Out-of-Distribution Detectionhttps://arxiv.org/abs/2006.09273Simple density-of-states inspired out of distribution detection. 
/ WR Morningstar, C Ham, AG Gallagher, B Lakshminarayanan, AA Alemi, JV Dillon / 2006.09273 / AISTATS 2021 Oralhttps://alexalemi.com/publications/dose.pdfpublicationsMon, 01 Jun 2020 00:00:00 -0400PACᵐ-Bayes: Narrowing the Empirical Risk Gap in the Misspecified Bayesian Regimehttps://arxiv.org/abs/2010.09629Multisample bound that does better than Bayes at prediction for misspecified models. / WR Morningstar, AA Alemi, JV Dillon / 2010.09629 / AISTATS2022https://alexalemi.com/publications/pacm.pdfpublicationsThu, 01 Oct 2020 00:00:00 -0400VIB is Half Bayeshttps://arxiv.org/abs/2011.08711VIB can be rederived as a half-Bayesian half-Maximum likelihood method. / AA Alemi, WR Morningstar, B Poole, I Fischer, JV Dillon / 2011.08711 / AABI 2021 Oralhttps://alexalemi.com/publications/pacvib.pdfpublicationsSun, 01 Nov 2020 00:00:00 -0400Does Knowledge Distillation Really Work?https://arxiv.org/abs/2106.05945Knowledge distillation doesn't seem to work as well as people assume it does. / S Stanton, P Izmailov, P Kirichenko, AA Alemi, AG Wilson / 2106.05945 / NeurIPS2021https://alexalemi.com/publications/distillation.pdfpublicationsTue, 01 Jun 2021 00:00:00 -0400A Closer Look at the Adversarial Robustness of Information Bottleneck Modelshttps://arxiv.org/abs/2107.05712Looking more carefully, IB models aren't fully robust to adversarial examples. / I Korshunova, D Stutz, AA Alemi, O Wiles, S Gowal / 2107.05712 / ICML 2021 AML Workshop Posterhttps://alexalemi.com/publications/robustness.pdfpublicationsTue, 01 Jun 2021 00:00:00 -0400Bayesian Imitation Learning for End-to-End Mobile Manipulationhttps://arxiv.org/abs/2202.07600Using VIB to help robots open doors. 
/ Y Du, D Ho, AA Alemi, E Jang, M Khansari / 2202.07600 / ICML 2022https://alexalemi.com/publications/endtoend.pdfpublicationsTue, 01 Feb 2022 00:00:00 -0500Trajectory ensembling for fine tuning - performance gains without modifying traininghttps://alexalemi.com/publications/traj-ensemble.pdfEnsembling within a trajectory gives some simple gains. / L Anderson-Conway, V Birodkar, S Singh, H Mobahi, AA Alemi / / HITY Workshop NeurIPS 2022https://alexalemi.com/publications/traj-ensemble.pdfpublicationsThu, 01 Sep 2022 00:00:00 -0400Weighted Ensemble Self-Supervised Learninghttps://arxiv.org/abs/2211.09981Ensembling the heads of SSL methods gives nice gains. / Y Ruan, S Singh, WR Morningstar, AA Alemi, S Ioffe, I Fischer, JV Dillon / 2211.09981 / ICLR 2023https://alexalemi.com/publications/weighted-ssl.pdfpublicationsTue, 01 Nov 2022 00:00:00 -0400Variational Predictionhttps://alexalemi.com/publications/variational-prediction.pdfTargeting the predictive distribution directly with a variational method. / AA Alemi, B Poole / / TBDhttps://alexalemi.com/publications/variational-prediction.pdfpublicationsSun, 01 May 2022 00:00:00 -0400