I learned very early the difference between knowing the name of something and knowing something.

Richard Feynman

Probability and statistics

Discrete-Time Martingales

The Kalman Filter

Brownian Motion

Simulating Geometric Brownian Motion

I work through a simple Python implementation of geometric Brownian motion and check it against the theoretical model.
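
A minimal sketch of such a simulation (illustrative parameters, not the post's full code), using NumPy and the exact lognormal update, with a check of the terminal mean and variance against the theoretical values:

import numpy as np

# Illustrative parameters: drift mu, volatility sigma, initial value s0, horizon T.
mu, sigma, s0, T = 0.05, 0.2, 1.0, 1.0
n_steps, n_paths = 250, 100_000
dt = T / n_steps

rng = np.random.default_rng(0)
z = rng.standard_normal((n_paths, n_steps))

# Exact update: S_{t+dt} = S_t * exp((mu - sigma^2 / 2) dt + sigma sqrt(dt) Z).
log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
s_T = s0 * np.exp(log_increments.sum(axis=1))

# Compare simulated moments of S_T with the theoretical ones for GBM.
print(s_T.mean(), s0 * np.exp(mu * T))
print(s_T.var(), s0**2 * np.exp(2 * mu * T) * (np.exp(sigma**2 * T) - 1))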

Bienaymé's Identity

In probability theory, Bienaymé's identity is a formula for the variance of random variables which are themselves sums of random variables. I provide a little intuition for the identity and then prove it.
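
In its general form, the identity reads

\operatorname{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + \sum_{i \neq j} \operatorname{Cov}(X_i, X_j),

which reduces to the sum of the individual variances when the variables are pairwise uncorrelated.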

Lognormal Distribution

I derive some basic properties of the lognormal distribution.
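
For reference, if X \sim \mathcal{N}(\mu, \sigma^2) and Y = e^{X}, then Y is lognormal with

\mathbb{E}[Y] = \exp\left(\mu + \tfrac{\sigma^2}{2}\right), \qquad \operatorname{Var}(Y) = \left(e^{\sigma^2} - 1\right)\exp\left(2\mu + \sigma^2\right).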

High-Dimensional Variance

A useful view of a covariance matrix is that it is a natural generalization of variance to higher dimensions. I explore this idea.
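
Concretely, for a random vector \mathbf{x} with mean \boldsymbol{\mu}, the covariance matrix is

\boldsymbol{\Sigma} = \mathbb{E}\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^{\top}\right],

whose diagonal entries are the variances of the individual components, mirroring the scalar definition \operatorname{Var}(X) = \mathbb{E}[(X - \mu)^2].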

Moving Averages

I discuss moving or rolling averages, which are algorithms to compute means over different subsets of sequential data.
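
A sketch of the simplest variant, a simple moving average over a fixed trailing window (the data and window size below are illustrative):

import numpy as np

def simple_moving_average(x, window):
    """Mean over each trailing window of `window` consecutive points."""
    x = np.asarray(x, dtype=float)
    cumsum = np.cumsum(np.insert(x, 0, 0.0))
    return (cumsum[window:] - cumsum[:-window]) / window

print(simple_moving_average([1, 2, 3, 4, 5], window=3))  # [2. 3. 4.]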

The Gauss–Markov Theorem

I discuss and prove the Gauss–Markov theorem, which states that under certain conditions, the least squares estimator is the minimum-variance linear unbiased estimator of the model parameters.
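
In symbols: for the linear model with zero-mean, uncorrelated, homoskedastic errors,

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y},

the least squares estimator \hat{\boldsymbol{\beta}} has the smallest variance among all linear unbiased estimators of \boldsymbol{\beta}.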

Standard Errors and Confidence Intervals

How do we know when a parameter estimate from a random sample is significant? I discuss the use of standard errors and confidence intervals to answer this question.
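
For the sample mean, for example, the standard error and the usual approximate 95% confidence interval are

\operatorname{SE}(\bar{x}) = \frac{s}{\sqrt{n}}, \qquad \bar{x} \pm 1.96 \cdot \operatorname{SE}(\bar{x}),

where s is the sample standard deviation and n the sample size.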

A Python Demonstration that Mutual Information Is Symmetric

I provide a numerical demonstration that the mutual information of two random variables (here, the observations and latent variables in a Gaussian mixture model) is symmetric.
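
The post's demonstration uses a Gaussian mixture model; the sketch below makes the same point with a small, made-up discrete joint distribution, computing the mutual information from the joint table and from its transpose and checking that the two values agree:

import numpy as np

# Made-up joint distribution p(x, y) over a 2 x 3 grid (sums to one).
p_xy = np.array([[0.10, 0.20, 0.15],
                 [0.25, 0.05, 0.25]])

def mutual_information(p_joint):
    """Mutual information (in bits) of the row and column variables of a joint table."""
    p_row = p_joint.sum(axis=1, keepdims=True)
    p_col = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return np.sum(p_joint[mask] * np.log2((p_joint / (p_row * p_col))[mask]))

print(mutual_information(p_xy))    # I(X; Y)
print(mutual_information(p_xy.T))  # I(Y; X), the same value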

Proof that Mutual Information Is Symmetric

The mutual information (MI) of two random variables quantifies how much information (in bits or nats) is obtained about one random variable by observing the other. I discuss MI and show it is symmetric.
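
Symmetry is visible directly in the definition, since the summand treats x and y identically:

I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = I(Y; X).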

Entropy of the Gaussian

I derive the entropy for the univariate and multivariate Gaussian distributions.
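
The results, in nats, are

H(X) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right), \qquad H(\mathbf{X}) = \frac{1}{2} \log\left((2 \pi e)^{D} \lvert \boldsymbol{\Sigma} \rvert\right),

for a univariate Gaussian with variance \sigma^2 and a D-dimensional Gaussian with covariance \boldsymbol{\Sigma}, respectively.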

Understanding Moments

Why are a distribution's moments called "moments"? How does the equation for a moment capture the shape of a distribution? Why do we typically only study four moments? I explore these and other questions in detail.
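
The central object is the n-th central moment,

\mu_n = \mathbb{E}\left[\left(X - \mathbb{E}[X]\right)^{n}\right],

with the mean (the first raw moment) and the variance, skewness, and kurtosis (the second central moment and the standardized third and fourth moments) being the four usually studied.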

Asymptotic Normality of Maximum Likelihood Estimators

Under certain regularity conditions, maximum likelihood estimators are "asymptotically efficient", meaning that they achieve the Cramér–Rao lower bound in the limit. I discuss this result.
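
The headline result is that, under those regularity conditions,

\sqrt{n}\left(\hat{\theta}_{\text{MLE}} - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, \; \mathcal{I}(\theta_0)^{-1}\right),

where \mathcal{I}(\theta_0) is the Fisher information of a single observation.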

Proof of the Cramér–Rao Lower Bound

The Cramér–Rao lower bound allows us to derive uniformly minimum-variance unbiased estimators by finding unbiased estimators that achieve this bound. I derive the main result.
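
For an unbiased estimator \hat{\theta} of \theta, the bound states

\operatorname{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)},

where \mathcal{I}(\theta) is the Fisher information of the sample.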

The Fisher Information

I document several properties of the Fisher information, which is the variance of the derivative of the log likelihood.
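
Two equivalent forms, the second holding under standard regularity conditions, are

\mathcal{I}(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X; \theta)\right)^{2}\right] = -\mathbb{E}\left[\frac{\partial^{2}}{\partial \theta^{2}} \log f(X; \theta)\right].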

Proof of the Rao–Blackwell Theorem

I walk the reader through a proof of the Rao–Blackwell Theorem.
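
In brief, the theorem says that conditioning any estimator on a sufficient statistic T can only reduce (never increase) its mean squared error:

\hat{\theta}^{\ast} = \mathbb{E}\left[\hat{\theta} \mid T\right] \quad \Longrightarrow \quad \mathbb{E}\left[\left(\hat{\theta}^{\ast} - \theta\right)^{2}\right] \leq \mathbb{E}\left[\left(\hat{\theta} - \theta\right)^{2}\right].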

Proof of the Law of Total Expectation

I discuss a straightforward proof of the law of total expectation with three standard assumptions.
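
The statement itself is compact:

\mathbb{E}[X] = \mathbb{E}\left[\mathbb{E}[X \mid Y]\right].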

Interpreting Expectations and Medians as Minimizers

I show how several properties of the distribution of a random variable—the expectation, conditional expectation, and median—can be viewed as solutions to optimization problems.
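
For example, the expectation minimizes expected squared error while the median minimizes expected absolute error:

\mathbb{E}[X] = \arg\min_{c} \, \mathbb{E}\left[(X - c)^{2}\right], \qquad \operatorname{median}(X) \in \arg\min_{c} \, \mathbb{E}\left[\lvert X - c \rvert\right].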

The Exponential Family

Probability distributions that are members of the exponential family have mathematically convenient properties for Bayesian inference. I provide the general form, work through several examples, and discuss several important properties.
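
The general form is

p(x \mid \boldsymbol{\theta}) = h(x) \exp\left(\boldsymbol{\eta}(\boldsymbol{\theta})^{\top} \mathbf{T}(x) - A(\boldsymbol{\theta})\right),

where \mathbf{T}(x) is the sufficient statistic, \boldsymbol{\eta} the natural parameter, A the log-partition function, and h the base measure.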

Random Noise and the Central Limit Theorem

Many probabilistic models assume random noise is Gaussian distributed. I explain at least part of the motivation for this, which is grounded in the Central Limit Theorem.
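
The relevant statement: for i.i.d. random variables X_1, \dots, X_n with mean \mu and finite variance \sigma^2,

\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1),

so an aggregate of many small, independent disturbances looks approximately Gaussian.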

The KL Divergence: From Information to Density Estimation

The KL divergence, also known as "relative entropy", is a commonly used measure for density estimation. I re-derive the relationships between probabilities, entropy, and relative entropy for quantifying similarity between distributions.
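
For discrete distributions P and Q, the divergence is

D_{\text{KL}}(P \,\|\, Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)},

which is non-negative, equals zero only when P = Q, and is not symmetric in its arguments.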

Proof of Bessel's Correction

Bessel's correction is the use of N - 1 rather than N in the denominator of the sample variance. I walk the reader through a quick proof that this correction results in an unbiased estimator of the population variance.
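
That is, the estimator and its expectation are

s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(x_i - \bar{x}\right)^{2}, \qquad \mathbb{E}\left[s^{2}\right] = \sigma^{2}.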