The Reparameterization Trick

A common explanation for the reparameterization trick in variational autoencoders is that we cannot backpropagate through a stochastic node. I provide a more formal justification.

Informal justifications

In Auto-Encoding Variational Bayes (Kingma & Welling, 2013), Kingma presents an unbiased, differentiable, and scalable estimator for the ELBO in variational inference. A key idea behind this estimator is the reparameterization trick. But why do we need this trick in the first place? When first learning about variational autoencoders (VAEs), I tried to find an answer online but found the explanations too informal. Here are a few examples:

StackExchange: We need the reparameterization trick in order to backpropagate through a random node.

Reddit: The “trick” part of the reparameterization trick is that you make the randomness an input to your model instead of something that happens “inside” it, which means you never need to differentiate with respect to sampling (which you can’t do).

Quora: The problem is because backpropagation cannot flow through a random node.

I found these unsatisfactory. What is a “random node”, and what does it mean for backprop to “flow” or not flow through such a node?

The goal of this post is to provide a more formal answer for why we need the reparameterization trick. I assume the reader is familiar with variational inference and variational autoencoders. Otherwise, I recommend (Blei et al., 2017) and (Doersch, 2016) as introductions.

Undifferentiable expectations

Let’s say we want to take the gradient w.r.t. $\theta$ of the following expectation,

$$
\mathbb{E}_{p(z)}[f_{\theta}(z)]
$$

where $p$ is a density. Provided we can differentiate $f_{\theta}(z)$, we can easily compute the gradient:

$$
\begin{aligned}
\nabla_{\theta} \mathbb{E}_{p(z)}[f_{\theta}(z)]
&= \nabla_{\theta} \Big[ \int_{z} p(z) f_{\theta}(z) \, dz \Big] \\
&= \int_{z} p(z) \Big[ \nabla_{\theta} f_{\theta}(z) \Big] dz \\
&= \mathbb{E}_{p(z)} \Big[ \nabla_{\theta} f_{\theta}(z) \Big]
\end{aligned}
$$
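To see this identity numerically, here is a minimal numpy sketch. The toy choices $f_{\theta}(z) = (z - \theta)^2$ and $p(z) = \mathcal{N}(0, 1)$ are my own assumptions for illustration; analytically, $\mathbb{E}_{p(z)}[f_{\theta}(z)] = \theta^2 + 1$, so the true gradient is $2\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5

# Toy example: f_theta(z) = (z - theta)^2 with p(z) = N(0, 1).
# E_p[f_theta(z)] = theta^2 + 1, so the true gradient is 2 * theta = 3.0.
z = rng.normal(0.0, 1.0, size=100_000)  # p(z) does not depend on theta
grad_f = -2.0 * (z - theta)             # per-sample gradient of f w.r.t. theta

print(grad_f.mean())  # ~3.0: the expectation of the gradient recovers the gradient
```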

In words, the gradient of the expectation is equal to the expectation of the gradient. But what happens if our density $p$ is also parameterized by $\theta$?

$$
\begin{aligned}
\nabla_{\theta} \mathbb{E}_{p_{\theta}(z)}[f_{\theta}(z)]
&= \nabla_{\theta} \Big[ \int_{z} p_{\theta}(z) f_{\theta}(z) \, dz \Big] \\
&= \int_{z} \nabla_{\theta} \Big[ p_{\theta}(z) f_{\theta}(z) \Big] dz \\
&= \int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z) \, dz + \int_{z} p_{\theta}(z) \nabla_{\theta} f_{\theta}(z) \, dz \\
&= \underbrace{\int_{z} f_{\theta}(z) \nabla_{\theta} p_{\theta}(z) \, dz}_{\text{What about this?}} + \mathbb{E}_{p_{\theta}(z)} \Big[ \nabla_{\theta} f_{\theta}(z) \Big]
\end{aligned}
$$

The first term of the last equation is not an expectation with respect to $p_{\theta}(z)$, so we cannot estimate it by averaging over samples from $p_{\theta}(z)$: Monte Carlo methods require that we can sample from $p_{\theta}(z)$, not that we can take its gradient. This would not be a problem if we had an analytic solution for $\nabla_{\theta} p_{\theta}(z)$, but that is not available in general.
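We can watch this failure numerically. Below is a small sketch (my own toy example, not from the paper) with $p_{\theta}(z) = \mathcal{N}(\theta, 1)$ and $f(z) = z^2$. Since $\mathbb{E}_{p_{\theta}(z)}[z^2] = \theta^2 + 1$, the true gradient is $2\theta$, but the naive “expectation of the gradient” estimator returns zero because $f$ has no direct dependence on $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5

# Toy example: p_theta(z) = N(theta, 1) and f(z) = z^2.
# E[z^2] = theta^2 + 1, so the true gradient is 2 * theta = 3.0.
z = rng.normal(theta, 1.0, size=100_000)

# Naive approach: sample z, then differentiate f w.r.t. theta with z held fixed.
# f(z) = z^2 does not mention theta, so every per-sample gradient is zero.
naive_grad = np.zeros_like(z)

print(naive_grad.mean())  # 0.0 no matter how many samples we draw;
                          # the dropped first term carried the entire gradient
```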

Now that we have a better understanding of the problem, let’s see what happens when we apply the reparameterization trick to our simple example. To be consistent with Kingma, I’ll switch to bold text for vectors, denote the $i$th sample of a vector $\textbf{v}$ as $\textbf{v}^{(i)}$, and use $l \in \{1, \dots, L\}$ to index Monte Carlo samples:

$$
\begin{aligned}
\boldsymbol{\epsilon} &\sim p(\boldsymbol{\epsilon}) \\
\textbf{z} &= g_{\boldsymbol{\theta}}(\boldsymbol{\epsilon}, \textbf{x}) \\
\mathbb{E}_{p_{\boldsymbol{\theta}}(\textbf{z})}[f(\textbf{z}^{(i)})] &= \mathbb{E}_{p(\boldsymbol{\epsilon})} \big[ f(g_{\boldsymbol{\theta}}(\boldsymbol{\epsilon}, \textbf{x}^{(i)})) \big] \\
\nabla_{\boldsymbol{\theta}} \mathbb{E}_{p_{\boldsymbol{\theta}}(\textbf{z})}[f(\textbf{z}^{(i)})] &= \nabla_{\boldsymbol{\theta}} \mathbb{E}_{p(\boldsymbol{\epsilon})} \big[ f(g_{\boldsymbol{\theta}}(\boldsymbol{\epsilon}, \textbf{x}^{(i)})) \big] && \text{(1)} \\
&= \mathbb{E}_{p(\boldsymbol{\epsilon})} \big[ \nabla_{\boldsymbol{\theta}} f(g_{\boldsymbol{\theta}}(\boldsymbol{\epsilon}, \textbf{x}^{(i)})) \big] && \text{(2)} \\
&\approx \frac{1}{L} \sum_{l=1}^{L} \nabla_{\boldsymbol{\theta}} f(g_{\boldsymbol{\theta}}(\boldsymbol{\epsilon}^{(l)}, \textbf{x}^{(i)})) && \text{(3)}
\end{aligned}
$$

In my mind, the above line of reasoning is key to understanding VAEs. We use the reparameterization trick to express the gradient of an expectation $(1)$ as the expectation of a gradient $(2)$. Provided $g_{\boldsymbol{\theta}}$ is differentiable, something Kingma emphasizes, we can then use Monte Carlo methods to estimate $\nabla_{\boldsymbol{\theta}} \mathbb{E}_{p_{\boldsymbol{\theta}}(\textbf{z})}[f(\textbf{z}^{(i)})]$ $(3)$.
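Continuing the toy example from the sketch above, the reparameterization $z = g_{\theta}(\epsilon) = \theta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ yields the same distribution $z \sim \mathcal{N}(\theta, 1)$, but now the Monte Carlo estimator in $(3)$ converges to the correct gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5

# Reparameterize: z = g_theta(eps) = theta + eps with eps ~ N(0, 1),
# so z ~ N(theta, 1) exactly as before, but the sampling no longer
# depends on theta.
eps = rng.normal(0.0, 1.0, size=100_000)

# f(g_theta(eps)) = (theta + eps)^2, so the per-sample gradient w.r.t.
# theta is 2 * (theta + eps); average per equation (3).
reparam_grad = 2.0 * (theta + eps)

print(reparam_grad.mean())  # ~3.0: matches the true gradient 2 * theta
```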

Sanity check

It is worth noting that this explanation aligns with Kingma’s own justification:

Kingma: This reparameterization is useful for our case since it can be used to rewrite an expectation w.r.t. $q_{\phi}(\textbf{z} \mid \textbf{x})$ such that the Monte Carlo estimate of the expectation is differentiable w.r.t. $\phi$.

The issue is not that we cannot backprop through a “random node” in any technical sense. Rather, naively backpropagating through the samples computes something that is not an estimate of the gradient we want. Without the reparameterization trick, we have no guarantee that sampling ever more values of $\textbf{z}$ will converge to the right estimate of $\nabla_{\theta}$.

Furthermore, this is exactly the problem we face with the ELBO we want to estimate:

$$
\begin{aligned}
\text{ELBO}(\boldsymbol{\theta}, \boldsymbol{\phi}) &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\textbf{z})} \big[ \log p_{\boldsymbol{\theta}}(\textbf{x}, \textbf{z}) - \log q_{\boldsymbol{\phi}}(\textbf{z} \mid \textbf{x}) \big] && \text{(4)} \\
\nabla_{\boldsymbol{\theta}, \boldsymbol{\phi}} \text{ELBO}(\boldsymbol{\theta}, \boldsymbol{\phi}) &= \underbrace{\nabla_{\boldsymbol{\theta}, \boldsymbol{\phi}} \, \mathbb{E}_{q_{\boldsymbol{\phi}}(\textbf{z})} \big[ \log p_{\boldsymbol{\theta}}(\textbf{x}, \textbf{z}) - \log q_{\boldsymbol{\phi}}(\textbf{z} \mid \textbf{x}) \big]}_{\text{Gradient w.r.t. } \boldsymbol{\phi} \text{ of an expectation w.r.t. } \boldsymbol{\phi}}
\end{aligned}
$$

At this point, you might have noticed that the above equation does not look like what we compute in a standard VAE. In the paper, Kingma actually presents two estimators, which he denotes $\mathcal{L}^A$ and $\mathcal{L}^B$. Equation $(4)$ is $\mathcal{L}^A$, while $\mathcal{L}^B$ is an estimator we can use when we have an analytic solution for the KL-divergence term in the ELBO, for example when we assume both the prior $p_{\boldsymbol{\theta}}(\textbf{z})$ and the posterior approximation $q_{\boldsymbol{\phi}}(\textbf{z} \mid \textbf{x})$ are Gaussian:

$$
\nabla_{\boldsymbol{\theta}, \boldsymbol{\phi}} \mathcal{L}^B = - \nabla_{\boldsymbol{\theta}, \boldsymbol{\phi}} \overbrace{\text{KL}\big[ q_{\boldsymbol{\phi}}(\textbf{z} \mid \textbf{x}^{(i)}) \,\big\lVert\, p_{\boldsymbol{\theta}}(\textbf{z}) \big]}^{\text{Analytically compute this}} + \nabla_{\boldsymbol{\theta}, \boldsymbol{\phi}} \overbrace{\frac{1}{L} \sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}(\textbf{x}^{(i)} \mid \textbf{z}^{(l)})}^{\text{Monte Carlo estimate this}}
$$
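For the all-Gaussian case, the paper derives the closed form $-\text{KL} = \frac{1}{2} \sum_{j} \big( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \big)$. As a small sketch in numpy (parameterizing by $\log \sigma^2$, a common but assumed implementation choice):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL[ N(mu, diag(sigma^2)) || N(0, I) ] via the closed form
    -KL = 0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# Sanity check: the KL is zero when the posterior equals the prior.
print(gaussian_kl(np.zeros(3), np.zeros(3)))  # 0.0
```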

Now that we can compute the full loss through a sequence of differentiable operations, we can use our favorite gradient-based optimization technique to maximize the ELBO.

Implementation

I always find it useful to close the loop and talk about implementation. The equation above is what the standard VAE implements (example) because Kingma derives an analytic solution for the KL term in Appendix B. A common framing for this version of the model is to think of the likelihood as a “decoder” and the approximate posterior as an “encoder”:

$$
\mathcal{L}^B = - \text{KL}\big[ \overbrace{q_{\boldsymbol{\phi}}(\textbf{z} \mid \textbf{x}^{(i)})}^{\text{Encoder}} \,\big\lVert\, \overbrace{p_{\boldsymbol{\theta}}(\textbf{z})}^{\text{Fixed}} \big] + \frac{1}{L} \sum_{l=1}^{L} \log \overbrace{p_{\boldsymbol{\theta}}(\textbf{x}^{(i)} \mid \textbf{z}^{(l)})}^{\text{Decoder}}
$$

I think of it like this: the KL-divergence term encourages the approximate posterior to be close to the prior $p_{\boldsymbol{\theta}}(\textbf{z})$. If the approximate posterior could exactly match both the true posterior and the prior, then Bayes’ rule would tell us that $p(\textbf{x}) = p(\textbf{x} \mid \textbf{z})$. This is exactly what we would want from a generative model: we could sample $\textbf{z}$ using the reparameterization trick and then condition on $\textbf{z}$ to generate a realistic sample $\textbf{x}$.
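To spell out that step: starting from Bayes’ rule and substituting the assumption $p(\textbf{z} \mid \textbf{x}) = p(\textbf{z})$,

$$
p(\textbf{z} \mid \textbf{x}) = \frac{p(\textbf{x} \mid \textbf{z}) \, p(\textbf{z})}{p(\textbf{x})} \quad \Longrightarrow \quad p(\textbf{z}) = \frac{p(\textbf{x} \mid \textbf{z}) \, p(\textbf{z})}{p(\textbf{x})} \quad \Longrightarrow \quad p(\textbf{x}) = p(\textbf{x} \mid \textbf{z}).
$$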

Concretely, here is one pass through the computational graph when the prior and approximate posterior are Gaussian, written as a loss $L$ to minimize (equivalent to maximizing $\mathcal{L}^B$ up to constants):

$$
\begin{aligned}
\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x &= M(\textbf{x}), \Sigma(\textbf{x}) && \text{Push } \textbf{x} \text{ through encoder} \\
\boldsymbol{\epsilon} &\sim \mathcal{N}(\textbf{0}, \textbf{I}) && \text{Sample noise} \\
\textbf{z} &= \boldsymbol{\epsilon} \odot \boldsymbol{\sigma}_x + \boldsymbol{\mu}_x && \text{Reparameterize} \\
\textbf{x}_r &= p_{\boldsymbol{\theta}}(\textbf{x} \mid \textbf{z}) && \text{Push } \textbf{z} \text{ through decoder} \\
\text{recon. loss} &= \text{MSE}(\textbf{x}, \textbf{x}_r) && \text{Compute reconstruction loss} \\
\text{var. loss} &= \text{KL}\big[ \mathcal{N}(\boldsymbol{\mu}_x, \boldsymbol{\sigma}_x) \,\big\lVert\, \mathcal{N}(\textbf{0}, \textbf{I}) \big] && \text{Compute variational loss} \\
L &= \text{recon. loss} + \text{var. loss} && \text{Combine losses}
\end{aligned}
$$
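For completeness, here is a minimal PyTorch sketch of this pass. The layer sizes, module names, and the MSE reconstruction term are illustrative assumptions on my part, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """A minimal VAE mirroring the computational graph above."""

    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, z_dim)       # M(x)
        self.log_var_head = nn.Linear(h_dim, z_dim)  # Sigma(x), as log sigma^2
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.encoder(x)                        # push x through encoder
        mu, log_var = self.mu_head(h), self.log_var_head(h)
        eps = torch.randn_like(mu)                 # sample noise eps ~ N(0, I)
        z = eps * torch.exp(0.5 * log_var) + mu    # reparameterize: z = eps * sigma + mu
        x_r = self.decoder(z)                      # push z through decoder
        recon_loss = ((x - x_r) ** 2).sum(dim=1)   # MSE reconstruction loss
        kl = -0.5 * (1 + log_var - mu**2 - log_var.exp()).sum(dim=1)  # analytic KL
        return (recon_loss + kl).mean()            # combine losses

vae = VAE()
loss = vae(torch.rand(32, 784))
loss.backward()  # gradients flow through the reparameterized z
```

Because the sampling of $\boldsymbol{\epsilon}$ sits outside the path from the parameters to the loss, `loss.backward()` computes exactly the reparameterized Monte Carlo gradient from equation $(3)$ (here with $L = 1$).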

Since we computed every variable in this computational graph through a sequence of differentiable operations, we can use a method like backpropagation to compute the required gradients. I find this easiest to see via a diagram:

In the diagram, the input or root nodes are in yellow. I like to denote the network parameters as inputs because it emphasizes how a tool like autograd (Maclaurin et al., 2015) actually works: for a given parameter $w$, we can compute $\partial L / \partial w$ at the graph node that directly takes $w$ as input.

What’s in a name?

When I first read Kingma’s paper, I wondered why it focused on the stochastic gradient variational Bayes (SGVB) estimator and its associated algorithm, while the now-famous variational autoencoder was given only as an example halfway through the paper.

But with a better understanding of the differentiability of this Monte Carlo estimator, we can understand the focus of the paper and the name of the estimator. Variational Bayes refers to approximating the intractable integrals that arise in Bayesian inference. The method is stochastic because it approximates an expectation with random samples. And a VAE built from neural networks is one example of a model you could train with the SGVB estimator because the estimator is gradient-based.


Acknowledgements

I want to thank Bianca Dumitrascu for many helpful conversations on this topic.

  1. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  2. Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.
  3. Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
  4. Maclaurin, D., Duvenaud, D., & Adams, R. P. (2015). Autograd: Effortless gradients in NumPy. ICML 2015 AutoML Workshop.