I learned very early the difference between knowing the name of something and knowing something.

Richard Feynman

Standard Errors and Confidence Intervals

16 February 2021

How do we know when a parameter estimate from a random sample is significant? I discuss the use of standard errors and confidence intervals to answer this question.

A Python Implementation of the Multivariate Skew Normal

29 December 2020

I needed a Python implementation of the multivariate skew normal. I wrote one based on SciPy's multivariate distributions module.

Understanding Dirichlet–Multinomial Models

24 December 2020

The Dirichlet distribution is really a multivariate beta distribution. I discuss this connection and then derive the posterior, marginal likelihood, and posterior predictive distributions for Dirichlet–multinomial models.

For a project, I needed to compute the log PDF of a vector $\mathbf{x}$ for multiple pairs of parameters, $\{(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \dots, (\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)\}$. I discuss a fast Python implementation.

Why Shouldn't I Invert That Matrix?

09 December 2020

A standard claim in textbooks and courses in numerical linear algebra is that one should not invert a matrix to solve for $\mathbf{x}$ in $\mathbf{Ax} = \mathbf{b}$. I explore why this is typically true.

Inference for Hidden Markov Models

28 November 2020

Expectation–maximization for hidden Markov models is called the Baum–Welch algorithm, and it relies on the forward–backward algorithm for efficient computation. I review HMMs and then present these algorithms in detail.

19 November 2020

The unscented transform, most commonly associated with the nonlinear Kalman filter, was proposed by Jeffrey Uhlmann to estimate a nonlinear transformation of a Gaussian. I illustrate the main idea.

Conjugate Analysis for the Multivariate Gaussian

18 November 2020

I work through Bayesian parameter estimation of the mean for the multivariate Gaussian.

A Python Demonstration that Mutual Information Is Symmetric

11 November 2020

I provide a numerical demonstration that the mutual information of two random variables, the observations and latent variables in a Gaussian mixture model, is symmetric.

Proof that Mutual Information Is Symmetric

10 November 2020

The mutual information (MI) of two random variables quantifies how much information (in bits or nats) is obtained about one random variable by observing the other. I discuss MI and show it is symmetric.

From Entropy Search to Predictive Entropy Search

28 October 2020

In Bayesian optimization, a popular acquisition function is predictive entropy search, which is a clever reframing of another acquisition function, entropy search. I rederive the connection and explain why this reframing is useful.

A Unifying Review of EM for Gaussian Latent Factor Models

25 October 2020

The expectation–maximization (EM) updates for several Gaussian latent factor models (factor analysis, probabilistic principal component analysis, probabilistic canonical correlation analysis, and inter-battery factor analysis) are closely related. I explore these relationships in detail.

Implementing Bayesian Online Changepoint Detection

20 October 2020

I annotate my Python implementation of the framework in Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection".

01 September 2020

I derive the entropy for the univariate and multivariate Gaussian distributions.

Bayesian Inference for Beta–Bernoulli Models

19 August 2020

I derive the posterior, marginal likelihood, and posterior predictive distributions for beta–Bernoulli models.

05 August 2020

Thoughts on John Carmack's theory of antifragile idea generation.

Gaussian Process Dynamical Models

24 July 2020

Wang and Fleet's 2008 paper, "Gaussian Process Dynamical Models for Human Motion", introduces a Gaussian process latent variable model with Gaussian process latent dynamics. I discuss this paper in detail.

Matrix Multiplication as the Sum of Outer Products

17 July 2020

The transpose of a matrix times itself is equal to the sum of outer products created by the rows of the matrix. I prove this identity.

From Probabilistic PCA to the GPLVM

14 July 2020

A Gaussian process latent variable model (GPLVM) can be viewed as a generalization of probabilistic principal component analysis (PCA) in which the latent maps are Gaussian-process distributed. I discuss this relationship.

05 July 2020

The physics of Hamiltonian Monte Carlo, part 3: In the final post in this series, I discuss Hamiltonian Monte Carlo, building off previous discussions of the Euler–Lagrange equation and Hamiltonian dynamics.

Gaussian Processes with Multinomial Observations

03 July 2020

Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", introduces a Gibbs sampler for Gaussian processes with multinomial observations. I discuss this model in detail.

02 July 2020

The sum of two equations that are quadratic in $\mathbf{x}$ is a single quadratic form in $\mathbf{x}$. I work through this derivation in detail.

Following Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", I show that a multinomial density can be represented as a product of binomial densities.

21 June 2020

I have received a number of compliments on my blog's style or theme and even more requests for details on the blogging environment. So here's how I built my blog.

Lagrangian and Hamiltonian Mechanics

14 June 2020

The physics of Hamiltonian Monte Carlo, part 2: Building off the Euler–Lagrange equation, I discuss Lagrangian mechanics, the principle of stationary action, and Hamilton's equations.

10 May 2020

The physics of Hamiltonian Monte Carlo, part 1: Lagrangian and Hamiltonian mechanics are based on the principle of stationary action, formalized by the calculus of variations and the Euler–Lagrange equation. I discuss this result.

11 April 2020

Why are a distribution's moments called "moments"? How does the equation for a moment capture the shape of a distribution? Why do we typically only study four moments? I explore these and other questions in detail.

Gibbs Sampling Is a Special Case of Metropolis–Hastings

23 February 2020

Gibbs sampling is a computationally convenient Bayesian inference algorithm that is a special case of the Metropolis–Hastings algorithm. I discuss Gibbs sampling in the broader context of Markov chain Monte Carlo methods.

09 February 2020

Normalizing vectors of log probabilities is a common task in statistical modeling, but it can result in under- or overflow when exponentiating large values. I discuss the log-sum-exp trick for resolving this issue.

04 February 2020

Linear models, part 3. I discuss Bayesian linear regression, that is, classical linear regression with a prior on the parameters. Using a particular prior as an example, I provide intuition and detailed derivations for the full model.

31 January 2020

Linear models, part 2. Before discussing regularization, I discuss what overfitting means for linear models.

A Python Implementation of the Multivariate t-distribution

20 January 2020

I needed a fast and numerically stable Python implementation of the multivariate t-distribution. I wrote one based on SciPy's multivariate distributions module.

12 January 2020

Writing has made me a better thinker and researcher. I expand on my reasons why.

Comparing Kernel Ridge with Gaussian Process Regression

06 January 2020

The posterior mean from a Gaussian process regressor is related to the prediction of a kernel ridge regressor. I explore this connection in detail.

04 January 2020

Linear models, part 1. I discuss classical linear regression with an emphasis on multiple interpretations of the model.

23 December 2019

Rahimi and Recht's 2007 paper, "Random Features for Large-Scale Kernel Machines", introduces a framework for randomized, low-dimensional approximations of kernel functions. I discuss this paper in detail with a focus on random Fourier features.

Implicit Lifting and the Kernel Trick

10 December 2019

I disentangle what I call the "lifting trick" from the kernel trick as a way of clarifying what the kernel trick is and does.

Asymptotic Normality of Maximum Likelihood Estimators

28 November 2019

Under certain regularity conditions, maximum likelihood estimators are "asymptotically efficient", meaning that they achieve the Cramér–Rao lower bound in the limit. I discuss this result.

Proof of the Cramér–Rao Lower Bound

27 November 2019

The Cramér–Rao lower bound allows us to derive uniformly minimum-variance unbiased estimators by finding unbiased estimators that achieve this bound. I derive the main result.

21 November 2019

I document several properties of the Fisher information, which is the variance of the derivative of the log likelihood.

Proof of the Rao–Blackwell Theorem

15 November 2019

I walk the reader through a proof of the Rao–Blackwell Theorem.

15 November 2019

In numerical analysis, the Lagrange polynomial is the polynomial of least degree that exactly coincides with a set of data points. I provide the geometric intuition and proof of correctness for this idea.

Proof of the Law of Total Expectation

14 November 2019

I discuss a straightforward proof of the law of total expectation with three standard assumptions.

Approximate Counting with Morris's Algorithm

11 November 2019

Robert Morris's algorithm for counting large numbers using 8-bit registers is an early example of a sketch or data structure for efficiently processing a data stream. I introduce the algorithm and analyze its probabilistic behavior.

10 November 2019

For many latent variable models, maximizing the complete log likelihood is easier than maximizing the log likelihood. The expectation–maximization (EM) algorithm leverages this fact to construct and optimize a tight lower bound. I rederive EM.

02 November 2019

Many authors introduce Metropolis–Hastings through its acceptance criterion without explaining why such a criterion allows us to sample from our target distribution. I provide a formal justification.

Naively computing the probability density function for the multivariate normal can be slow and numerically unstable. I work through SciPy's implementation.

A Romantic View of Markov Chains

28 October 2019

A Markov chain is ergodic if and only if it has at most one recurrent class and is aperiodic. A sketch of a proof of this theorem hinges on an intuitive probabilistic idea called "coupling" that is worth understanding.

Interpreting Expectations and Medians as Minimizers

04 October 2019

I show how several properties of the distribution of a random variable—the expectation, conditional expectation, and median—can be viewed as solutions to optimization problems.

20 September 2019

Bayesian inference for models with binomial likelihoods is hard, but in a 2013 paper, Nicholas Polson and his coauthors introduced a new method for fast Bayesian inference using Gibbs sampling. I discuss their main results in detail.

18 September 2019

This operation, while useful in elementary algebra, also arises frequently when manipulating Gaussian random variables. I review and document both the univariate and multivariate cases.

A Poisson–Gamma Mixture Is Negative-Binomially Distributed

16 September 2019

We can view the negative binomial distribution as a Poisson distribution with a gamma prior on the rate parameter. I work through this derivation in detail.

A Practical Implementation of Gaussian Process Regression

12 September 2019

I discuss Rasmussen and Williams's Algorithm 2.1 for an efficient implementation of Gaussian process regression.

Sampling: Two Basic Algorithms

01 September 2019

Numerical sampling uses randomized algorithms to sample from and estimate properties of distributions. I explain two basic sampling algorithms, rejection sampling and importance sampling.

Bayesian Online Changepoint Detection

13 August 2019

Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection", introduces a modular Bayesian framework for online estimation of changes in the generative parameters of sequential data. I discuss this paper in detail.

Gaussian Process Regression with Code Snippets

27 June 2019

The definition of a Gaussian process is fairly abstract: it is an infinite collection of random variables, any finite number of which are jointly Gaussian. I work through this definition with an example and provide several complete code snippets.

08 May 2019

Laplace's method is used to approximate a distribution with a Gaussian. I explain the technique in general and work through an exercise by David MacKay.

Bayesian Inference for the Gaussian

04 April 2019

I work through several cases of Bayesian parameter estimation of Gaussian models.

19 March 2019

Probability distributions that are members of the exponential family have mathematically convenient properties for Bayesian inference. I provide the general form, work through several examples, and discuss several important properties.

Conjugacy in Bayesian Inference

16 March 2019

Conjugacy is an important property in exact Bayesian inference. I work through Bishop's example of a beta conjugate prior for the binomial distribution and explore why conjugacy is useful.

Random Noise and the Central Limit Theorem

01 February 2019

Many probabilistic models assume random noise is Gaussian distributed. I explain at least part of the motivation for this, which is grounded in the Central Limit Theorem.

The KL Divergence: From Information to Density Estimation

22 January 2019

The KL divergence, also known as "relative entropy", is a commonly used measure for density estimation. I re-derive the relationships between probabilities, entropy, and relative entropy for quantifying similarity between distributions.

Floating Point Precision with Log Likelihoods

18 January 2019

Computing the log likelihood is a common task in probabilistic machine learning, but it can easily under- or overflow. I discuss one such issue and its resolution.

Randomized Singular Value Decomposition

17 January 2019

Halko, Martinsson, and Tropp's 2011 paper, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions", introduces a modular framework for randomized matrix decompositions. I discuss this paper in detail with a focus on randomized SVD.

11 January 2019

Bessel's correction is the division of the sample variance by $N - 1$ rather than $N$. I walk the reader through a quick proof that this correction results in an unbiased estimator of the population variance.

Proof of the Singular Value Decomposition

20 December 2018

I walk the reader carefully through Gilbert Strang's existence proof of the singular value decomposition.

Singular Value Decomposition as Simply as Possible

10 December 2018

Singular Value Decomposition (SVD) is a powerful and ubiquitous tool for matrix factorization, but explanations often provide little intuition. My goal is to explain SVD as simply as possible before working towards the formal definition.

Woodbury Matrix Identity for Factor Analysis

30 November 2018

In factor analysis, the Woodbury matrix identity allows us to invert the covariance matrix of our data $\mathbf{x}$ in $O(k^3)$ time rather than $O(p^3)$ time, where $k$ and $p$ are the latent and data dimensions respectively. I explain and implement the technique.

Modeling Repulsion with Determinantal Point Processes

06 November 2018

Determinantal point processes are point processes characterized by the determinant of a positive semi-definite matrix, but what this means is not necessarily obvious. I explain how such a process can model repulsive systems.

A Geometrical Understanding of Matrices

24 October 2018

My college course on linear algebra focused on systems of linear equations. I present a geometrical understanding of matrices as linear transformations, which has helped me visualize and relate concepts from the field.

Probabilistic Canonical Correlation Analysis in Detail

10 September 2018

Probabilistic canonical correlation analysis is a reinterpretation of CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define and derive its solution in detail.

08 August 2018

Factor analysis is a statistical method for modeling high-dimensional data using a smaller number of latent variables. It is deeply related to other probabilistic models such as probabilistic PCA and probabilistic CCA. I define the model and how to fit it in detail.

Canonical Correlation Analysis in Detail

17 July 2018

Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.

Dot Product: Equivalence of Definitions

26 June 2018

The dot product has two definitions, one algebraic and one geometric. The relationship between the two may not be immediately obvious. I explain why they make sense relative to each other and then prove that they are equivalent.

An Example of Probabilistic Machine Learning

13 June 2018

Probabilistic machine learning is a useful framework for handling uncertainty and modeling generative processes. I explore this approach by comparing two models, one with and one without a clear probabilistic interpretation.

29 April 2018

A common explanation for the reparameterization trick with variational autoencoders is that we cannot backpropagate through a stochastic node. I provide a more formal justification.

15 April 2018

Backpropagation is an algorithm that computes the gradient of a neural network, but it may not be obvious why the algorithm uses a backward pass. The answer allows us to reconstruct backprop from first principles.

From Convolution to Neural Network

24 February 2017

Most explanations of CNNs assume the reader understands the convolution operation and how it relates to image processing. I explore convolutions in detail and explain how they are implemented as layers in a neural network.
