I learned very early the difference between knowing the name of something and knowing something.
Richard Feynman

21 December 2024
Thoughts on learning to memorize the first one hundred digits of pi.
An Intuitive Explanation of Black–Scholes
28 September 2024
I explain the Black–Scholes formula using only basic probability theory and calculus, with a focus on the big picture and intuition over technical details.
Expectation of the Truncated Lognormal Distribution
18 August 2024
I derive the expected value of a random variable that is left-truncated and lognormally distributed.
24 April 2024
The prime factorization of 999999 allows us to compute repeating decimals for some common fractions. I work through this idea.
Simulating Geometric Brownian Motion
13 April 2024
I work through a simple Python implementation of geometric Brownian motion and check it against the theoretical model.
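As a quick illustration of the idea (a minimal sketch with arbitrary example parameters, not the post's implementation), geometric Brownian motion can be simulated directly from its exact solution:

```python
import numpy as np

# Simulate geometric Brownian motion via its exact solution:
#   S_t = S_0 * exp((mu - 0.5 * sigma^2) * t + sigma * W_t)
rng = np.random.default_rng(0)
mu, sigma, s0 = 0.05, 0.2, 100.0       # drift, volatility, initial price (example values)
T, n_steps, n_paths = 1.0, 252, 1000
dt = T / n_steps

# Brownian increments and cumulative Brownian motion W_t.
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(dW, axis=1)
t = np.linspace(dt, T, n_steps)

S = s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

# Sanity check against the theoretical mean E[S_T] = S_0 * exp(mu * T).
print(S[:, -1].mean(), s0 * np.exp(mu * T))
```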
04 January 2024
In probability theory, Bienaymé's identity is a formula for the variance of random variables which are themselves sums of random variables. I provide a little intuition for the identity and then prove it.
17 December 2023
I derive some basic properties of the lognormal distribution.
09 December 2023
A useful view of a covariance matrix is that it is a natural generalization of variance to higher dimensions. I explore this idea.
29 October 2023
A mean–variance optimizer will hedge correlated assets. I explain why and then work through a simple example.
08 October 2023
In finance, the "Greeks" refer to the partial derivatives of an option pricing model with respect to its inputs. They are important for understanding how an option's price may change. I discuss the Black–Scholes Greeks in detail.
10 September 2023
The VIX is a benchmark for market-implied volatility. It is computed from a weighted average of variance swaps. I first derive the fair strike for a variance swap and then discuss the VIX's approximation of this formula.
19 August 2023
I work through a well-known approximation of the Black–Scholes price of at-the-money (ATM) options.
Proof the Binomial Model Converges to Black–Scholes
03 June 2023
The binomial options-pricing model converges to Black–Scholes as the number of steps in fixed physical time goes to infinity. I present Chi-Cheng Hsia's 1983 proof of this result.
Binomial Options-Pricing Model
03 June 2023
I present a simple yet useful model for pricing European-style options, called the binomial options-pricing model. It provides good intuition into pricing options without any advanced mathematics.
13 May 2023
I describe the process of using ChatGPT-3.5 to write a program that uses OpenAI's API. The program generates LLM fortunes a la the Unix command 'fortune'.
Problem Solving with Dimensional Analysis
11 February 2023
Dimensional analysis is the technique of analyzing relationships through their base quantities. I demonstrate the power of this approach by approximating a Gaussian integral without calculus.
Estimating Square Roots in Your Head
01 February 2023
I explore an ancient algorithm, sometimes called Heron's method, for estimating square roots without a calculator.
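For a flavor of the method (a toy sketch of my own, not the post's derivation): start from any positive guess and repeatedly average it with S divided by the guess.

```python
def heron_sqrt(s, x0=None, n_iter=5):
    """Estimate sqrt(s) by repeatedly averaging x and s / x (Heron's method)."""
    x = x0 if x0 is not None else s / 2.0  # any positive starting guess works
    for _ in range(n_iter):
        x = 0.5 * (x + s / x)
    return x

print(heron_sqrt(10))  # ~3.1623, close to 10 ** 0.5
```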
26 January 2023
In the options-pricing literature, the Carr–Madan formula equates a derivative's nonlinear payoff function with a portfolio of options. I describe and prove this relationship.
07 December 2022
The binomial options-pricing model is a numerical method for valuing options. I explore this model over a single time period and focus on two key ideas, the no-arbitrage condition and risk-neutral pricing.
17 September 2022
Principal component analysis (PCA) is a simple, fast, and elegant linear method for data analysis. I explore PCA in detail, first with pictures and intuition, then with linear algebra and detailed derivations, and finally with code.
Matrices as Functions, Matrices as Data
28 August 2022
I discuss two views of matrices: matrices as linear functions and matrices as data. The second view is particularly useful in understanding dimension reduction methods.
Scaling Factors for Hidden Markov Models
13 August 2022
Inference for hidden Markov models (HMMs) is numerically unstable. A standard approach to resolving this instability is to use scaling factors. I discuss this idea in detail.
09 August 2022
Weighted least squares (WLS) is a generalization of ordinary least squares in which each observation is assigned a weight, which scales the squared residual error. I discuss WLS and then derive its estimator in detail.
29 June 2022
The Sharpe ratio measures a financial strategy's performance as the ratio of its reward to its variability. I discuss this metric in detail, particularly its relationship to the information ratio and t-statistics.
How Dangerous Is Biking in New York?
18 June 2022
I estimate my probability of serious injury or death from bike commuting to work in New York, using public data from the city's Department of Transportation.
04 June 2022
I discuss moving or rolling averages, which are algorithms to compute means over different subsets of sequential data.
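As a small sketch of one such algorithm (illustrative only, not the post's code), a trailing moving average over a window of size w can be computed with a cumulative sum:

```python
import numpy as np

def moving_average(x, w):
    """Simple moving average of x over a trailing window of size w."""
    x = np.asarray(x, dtype=float)
    c = np.cumsum(np.insert(x, 0, 0.0))  # prepend 0 so differences give window sums
    return (c[w:] - c[:-w]) / w

print(moving_average([1, 2, 3, 4, 5], 3))  # [2. 3. 4.]
```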
24 May 2022
A common heuristic for time-aggregating volatility is the square root of time rule. I discuss the big idea for this rule and then provide the mathematical assumptions underpinning it.
17 May 2022
Many phenomena can be modeled as exponential decay. I discuss this model in detail, focusing on natural exponential decay (base e) and various useful properties.
12 April 2022
I discuss multi-factor modeling, which generalizes many early financial models into a common prediction and risk framework.
27 March 2022
During my PhD, I went hiking alone in a remote region of Iceland. Over the years, I've come to view this trip as analogous to the PhD process. Graduate school was hard, but on the warm days, the views were spectacular.
20 March 2022
Conjugate gradient descent (CGD) is an iterative algorithm for minimizing quadratic functions. CGD uses a kind of orthogonality (conjugacy) to efficiently search for the minimum. I present CGD by building it up from gradient descent.
The Capital Asset Pricing Model
06 March 2022
In finance, the capital asset pricing model (CAPM) was the first theory to measure systematic risk. The CAPM argues that there is a single type of risk, market risk. I derive the CAPM from the mean–variance framework of modern portfolio theory.
03 March 2022
I discuss generalized least squares (GLS), which extends ordinary least squares by assuming heteroscedastic errors. I prove some basic properties of GLS, particularly that it is the best linear unbiased estimator, and work through a complete example.
Understanding Positive Definite Matrices
27 February 2022
I discuss a geometric interpretation of positive definite matrices and how this relates to various properties of them, such as positive eigenvalues, positive determinants, and decomposability. I also discuss their importance in quadratic programming.
08 February 2022
I discuss and prove the Gauss–Markov theorem, which states that under certain conditions, the least squares estimator is the minimum-variance linear unbiased estimator of the model parameters.
06 February 2022
I discuss prices, returns, cumulative returns, and log returns, with a special focus on some nice mathematical properties of log returns.
Breusch–Pagan Test for Heteroscedasticity
31 January 2022
I discuss the Breusch–Pagan test, a simple hypothesis test for heteroscedasticity in linear models. I also implement the test in Python and demonstrate that it can detect heteroscedasticity in a toy example.
30 January 2022
The ordinary least squares estimator is inefficient when the homoscedasticity assumption does not hold. I provide a simple example of a nonsensical t-statistic from data with heteroscedasticity and discuss why this happens in general.
Consistency of the OLS Estimator
29 January 2022
A consistent estimator converges in probability to the true value. I discuss this idea in general and then prove that the ordinary least squares estimator is consistent.
16 January 2022
The locus defined by a convex combination of two points is the line between them. I provide some geometric intuition for this fact and then prove it.
Geometry of the Efficient Frontier
09 January 2022
Some important financial ideas are encoded in the geometry of the efficient frontier, such as the tangency portfolio and the Sharpe ratio. The goal of this post is to re-derive these ideas geometrically, showing that they arise from the mean–variance analysis framework.
06 January 2022
Autoregressive (AR) models represent random processes in which each observation is a linear function of some of its previous values, plus noise. I present the main ideas behind AR models, including when they are stationary and how to fit them with the Yule–Walker equations.
Blockchain in 19 Lines of Python
19 September 2021
I walk through a simple Python implementation of blockchain technology, commonly used for public ledgers in cryptocurrencies.
09 September 2021
When can we be confident in our estimated coefficients when using OLS? We typically use a t-statistic to quantify whether an inferred coefficient was likely to have happened by chance. I discuss hypothesis testing and t-statistics for OLS.
02 September 2021
I describe a useful trick for computing geometric series without closed-form solutions.
Residual Sum of Squares in Terms of Pearson's Correlation
01 September 2021
I re-derive a relationship between the residual sum of squares in simple linear regression and Pearson's correlation coefficient.
27 August 2021
Drawdown measures the decline of a time series variable from a historical peak. I explore visualizing and computing drawdown-based metrics.
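A minimal sketch of the computation (example prices, not the post's data or code): drawdown is each value's fractional decline from the running historical peak.

```python
import numpy as np

def drawdown(prices):
    """Fractional decline of each price from the running historical peak."""
    prices = np.asarray(prices, dtype=float)
    running_peak = np.maximum.accumulate(prices)
    return prices / running_peak - 1.0

prices = [100, 110, 105, 90, 95, 120]
print(drawdown(prices))        # 0 at new highs, negative otherwise
print(drawdown(prices).min())  # maximum drawdown, here (90 - 110) / 110
```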
Sampling Distribution of the OLS Estimator
26 August 2021
I derive the mean and variance of the OLS estimator, as well as an unbiased estimator of the OLS estimator's variance. I then show that the OLS estimator is normally distributed if we assume the error terms are normally distributed.
Simple Linear Regression and Correlation
25 August 2021
In simple linear regression, the slope parameter is a simple function of the correlation between the targets and predictors. I derive this result and discuss a few consequences.
09 August 2021
In ordinary least squares, the coefficient of determination quantifies the variation in the dependent variables that can be explained by the model. However, this interpretation has a few assumptions which are worth understanding. I explore this metric and the assumptions in detail.
12 July 2021
Multicollinearity is when two or more predictors are linearly dependent. This can impact the interpretability of a linear model's estimated coefficients. I discuss this phenomenon in detail.
Portfolio Theory: Why Diversification Matters
04 May 2021
The casual investor knows that diversification matters. This intuition is grounded in the mathematics of modern portfolio theory. I define diversification and formalize how diversification helps maximize risk-adjusted returns.
I formalize and visualize several important concepts in linear algebra: linear independence and dependence, orthogonality and orthonormality, and basis. Finally, I discuss the Gram–Schmidt algorithm, an algorithm for converting a basis into an orthonormal basis.
The ELBO in Variational Inference
16 April 2021
I derive the evidence lower bound (ELBO) in variational inference and explore its relationship to the objective in expectation–maximization and the variational autoencoder.
Standard Errors and Confidence Intervals
16 February 2021
How do we know when a parameter estimate from a random sample is significant? I discuss the use of standard errors and confidence intervals to answer this question.
A Python Implementation of the Multivariate Skew Normal
29 December 2020
I needed a Python implementation of the multivariate skew normal. I wrote one based on SciPy's multivariate distributions module.
Understanding Dirichlet–Multinomial Models
24 December 2020
The Dirichlet distribution is really a multivariate beta distribution. I discuss this connection and then derive the posterior, marginal likelihood, and posterior predictive distributions for Dirichlet–multinomial models.
For a project, I needed to compute the log PDF of a vector for multiple pairs of mean and variance parameters. I discuss a fast Python implementation.
Why Shouldn't I Invert That Matrix?
09 December 2020
A standard claim in textbooks and courses in numerical linear algebra is that one should not invert a matrix to solve for x in Ax = b. I explore why this is typically true.
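As a hedged illustration of the usual advice (a random example, not the post's experiment), compare a factorization-based solve with an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 500))
b = rng.normal(size=500)

x_solve = np.linalg.solve(A, b)  # factorization-based solve (preferred)
x_inv = np.linalg.inv(A) @ b     # explicit inverse (more work, usually less accurate)

print(np.linalg.norm(A @ x_solve - b))  # residuals; solve is typically smaller
print(np.linalg.norm(A @ x_inv - b))
```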
Inference for Hidden Markov Models
28 November 2020
Expectation–maximization for hidden Markov models is called the Baum–Welch algorithm, and it relies on the forward–backward algorithm for efficient computation. I review HMMs and then present these algorithms in detail.
19 November 2020
The unscented transform, most commonly associated with the nonlinear Kalman filter, was proposed by Jeffrey Uhlmann to estimate a nonlinear transformation of a Gaussian. I illustrate the main idea.
Conjugate Analysis for the Multivariate Gaussian
18 November 2020
I work through Bayesian parameter estimation of the mean for the multivariate Gaussian.
A Python Demonstration that Mutual Information Is Symmetric
11 November 2020
I provide a numerical demonstration that the mutual information of two random variables, the observations and latent variables in a Gaussian mixture model, is symmetric.
Proof that Mutual Information Is Symmetric
10 November 2020
The mutual information (MI) of two random variables quantifies how much information (in bits or nats) is obtained about one random variable by observing the other. I discuss MI and show it is symmetric.
From Entropy Search to Predictive Entropy Search
28 October 2020
In Bayesian optimization, a popular acquisition function is predictive entropy search, which is a clever reframing of another acquisition function, entropy search. I rederive the connection and explain why this reframing is useful.
A Unifying Review of EM for Gaussian Latent Factor Models
25 October 2020
The expectation–maximization (EM) updates for several Gaussian latent factor models (factor analysis, probabilistic principal component analysis, probabilistic canonical correlation analysis, and inter-battery factor analysis) are closely related. I explore these relationships in detail.
Implementing Bayesian Online Changepoint Detection
20 October 2020
I annotate my Python implementation of the framework in Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection".
01 September 2020
I derive the entropy for the univariate and multivariate Gaussian distributions.
Bayesian Inference for Beta–Bernoulli Models
19 August 2020
I derive the posterior, marginal likelihood, and posterior predictive distributions for beta–Bernoulli models.
05 August 2020
Thoughts on John Carmack's theory of antifragile idea generation.
Gaussian Process Dynamical Models
24 July 2020
Wang and Fleet's 2008 paper, "Gaussian Process Dynamical Models for Human Motion", introduces a Gaussian process latent variable model with Gaussian process latent dynamics. I discuss this paper in detail.
Matrix Multiplication as the Sum of Outer Products
17 July 2020
The transpose of a matrix times itself is equal to the sum of outer products created by the rows of the matrix. I prove this identity.
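The identity is easy to check numerically; here is a small sketch (mine, not the post's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# X^T X equals the sum of outer products of the rows of X.
sum_of_outer = sum(np.outer(row, row) for row in X)
print(np.allclose(X.T @ X, sum_of_outer))  # True
```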
From Probabilistic PCA to the GPLVM
14 July 2020
A Gaussian process latent variable model (GPLVM) can be viewed as a generalization of probabilistic principal component analysis (PCA) in which the latent maps are Gaussian-process distributed. I discuss this relationship.
05 July 2020
The physics of Hamiltonian Monte Carlo, part 3: In the final post in this series, I discuss Hamiltonian Monte Carlo, building off previous discussions of the Euler–Lagrange equation and Hamiltonian dynamics.
Gaussian Processes with Multinomial Observations
03 July 2020
Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", introduces a Gibbs sampler for Gaussian processes with multinomial observations. I discuss this model in detail.
02 July 2020
The sum of two equations that are quadratic in x is a single quadratic form in x. I work through this derivation in detail.
Following Linderman, Johnson, and Adams's 2015 paper, "Dependent multinomial models made easy: Stick-breaking with the Pólya-gamma augmentation", I show that a multinomial density can be represented as a product of binomial densities.
21 June 2020
I have received a number of compliments on my blog's style or theme and even more requests for details on the blogging environment. So here's how I built my blog.
Lagrangian and Hamiltonian Mechanics
14 June 2020
The physics of Hamiltonian Monte Carlo, part 2: Building off the Euler–Lagrange equation, I discuss Lagrangian mechanics, the principle of stationary action, and Hamilton's equations.
10 May 2020
The physics of Hamiltonian Monte Carlo, part 1: Lagrangian and Hamiltonian mechanics are based on the principle of stationary action, formalized by the calculus of variations and the Euler–Lagrange equation. I discuss this result.
11 April 2020
Why are a distribution's moments called "moments"? How does the equation for a moment capture the shape of a distribution? Why do we typically only study four moments? I explore these and other questions in detail.
Gibbs Sampling Is a Special Case of Metropolis–Hastings
23 February 2020
Gibbs sampling is a computationally convenient Bayesian inference algorithm that is a special case of the Metropolis–Hastings algorithm. I discuss Gibbs sampling in the broader context of Markov chain Monte Carlo methods.
09 February 2020
Normalizing vectors of log probabilities is a common task in statistical modeling, but it can result in under- or overflow when exponentiating large values. I discuss the log-sum-exp trick for resolving this issue.
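A minimal sketch of the trick (illustrative values, not the post's code): shift by the maximum before exponentiating.

```python
import numpy as np

def log_normalize(log_p):
    """Normalize log probabilities without overflow: log_p - logsumexp(log_p)."""
    m = np.max(log_p)                              # shift by the maximum value
    log_z = m + np.log(np.sum(np.exp(log_p - m)))  # log-sum-exp
    return log_p - log_z

log_p = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp overflows here
print(np.exp(log_normalize(log_p)))         # [0.0900 0.2447 0.6652], sums to 1
```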
04 February 2020
I discuss Bayesian linear regression or classical linear regression with a prior on the parameters. Using a particular prior as an example, I provide intuition and detailed derivations for the full model.
31 January 2020
We know that regularization is important for linear models, but what does overfitting mean in this context? I discuss this question.
A Python Implementation of the Multivariate t-distribution
20 January 2020
I needed a fast and numerically stable Python implementation of the multivariate t-distribution. I wrote one based on SciPy's multivariate distributions module.
12 January 2020
Writing has made me a better thinker and researcher. I expand on my reasons why.
Comparing Kernel Ridge with Gaussian Process Regression
06 January 2020
The posterior mean from a Gaussian process regressor is related to the prediction of a kernel ridge regressor. I explore this connection in detail.
04 January 2020
I discuss ordinary least squares or linear regression when the optimal coefficients minimize the residual sum of squares. I discuss various properties and interpretations of this classic model.
23 December 2019
Rahimi and Recht's 2007 paper, "Random Features for Large-Scale Kernel Machines", introduces a framework for randomized, low-dimensional approximations of kernel functions. I discuss this paper in detail with a focus on random Fourier features.
Implicit Lifting and the Kernel Trick
10 December 2019
I disentangle what I call the "lifting trick" from the kernel trick as a way of clarifying what the kernel trick is and does.
Asymptotic Normality of Maximum Likelihood Estimators
28 November 2019
Under certain regularity conditions, maximum likelihood estimators are "asymptotically efficient", meaning that they achieve the Cramér–Rao lower bound in the limit. I discuss this result.
Proof of the Cramér–Rao Lower Bound
27 November 2019
The Cramér–Rao lower bound allows us to derive uniformly minimum–variance unbiased estimators by finding unbiased estimators that achieve this bound. I derive the main result.
21 November 2019
I document several properties of the Fisher information or the variance of the derivative of the log likelihood.
Proof of the Rao–Blackwell Theorem
15 November 2019
I walk the reader through a proof of the Rao–Blackwell theorem.
15 November 2019
In numerical analysis, the Lagrange polynomial is the polynomial of least degree that exactly coincides with a set of data points. I provide the geometric intuition and proof of correctness for this idea.
Proof of the Law of Total Expectation
14 November 2019
I discuss a straightforward proof of the law of total expectation with three standard assumptions.
Approximate Counting with Morris's Algorithm
11 November 2019
Robert Morris's algorithm for counting large numbers using 8-bit registers is an early example of a sketch or data structure for efficiently processing a data stream. I introduce the algorithm and analyze its probabilistic behavior.
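As a rough sketch of the core idea (my toy version, which abstracts away Morris's 8-bit register details): keep an exponent c, increment it with probability 2^(-c), and report 2^c - 1.

```python
import random

def morris_count(n_events, seed=0):
    """Approximately count n_events using only a small exponent register."""
    random.seed(seed)
    c = 0
    for _ in range(n_events):
        if random.random() < 2.0 ** (-c):  # increment with probability 2^-c
            c += 1
    return 2 ** c - 1                      # unbiased estimate of the count

print(morris_count(100_000))  # roughly 1e5, with substantial variance
```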
10 November 2019
For many latent variable models, maximizing the complete log likelihood is easier than maximizing the log likelihood. The expectation–maximization (EM) algorithm leverages this fact to construct and optimize a tight lower bound. I rederive EM.
02 November 2019
Many authors introduce Metropolis–Hastings through its acceptance criterion without explaining why this criterion allows us to sample from our target distribution. I provide a formal justification.
Naively computing the probability density function for the multivariate normal can be slow and numerically unstable. I work through SciPy's implementation.
28 October 2019
A Markov chain is ergodic if and only if it has at most one recurrent class and is aperiodic. A sketch of a proof of this theorem hinges on an intuitive probabilistic idea called "coupling" that is worth understanding.
Interpreting Expectations and Medians as Minimizers
04 October 2019
I show how several properties of the distribution of a random variable—the expectation, conditional expectation, and median—can be viewed as solutions to optimization problems.
20 September 2019
Bayesian inference for models with binomial likelihoods is hard, but in a 2013 paper, Nicholas Polson and his coauthors introduced a new method for fast Bayesian inference using Gibbs sampling. I discuss their main results in detail.
18 September 2019
Completing the square, while useful in elementary algebra, also arises frequently when manipulating Gaussian random variables. I review and document both the univariate and multivariate cases.
A Poisson–Gamma Mixture Is Negative-Binomially Distributed
16 September 2019
We can view the negative binomial distribution as a Poisson distribution with a gamma prior on the rate parameter. I work through this derivation in detail.
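The claim is easy to check by simulation; here is a sketch with arbitrary parameter choices (not from the post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shape, scale = 3.0, 2.0                 # gamma prior on the Poisson rate

# Draw a rate from the gamma prior, then a count from the Poisson.
rates = rng.gamma(shape, scale, size=200_000)
counts = rng.poisson(rates)

# Compare with the negative binomial with r = shape, p = 1 / (1 + scale).
nb = stats.nbinom(shape, 1.0 / (1.0 + scale))
print(counts.mean(), nb.mean())         # both ~ shape * scale = 6
print(counts.var(), nb.var())           # both ~ shape * scale * (1 + scale) = 18
```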
A Practical Implementation of Gaussian Process Regression
12 September 2019
I discuss Rasmussen and Williams's Algorithm 2.1 for an efficient implementation of Gaussian process regression.
Sampling: Two Basic Algorithms
01 September 2019
Numerical sampling uses randomized algorithms to sample from and estimate properties of distributions. I explain two basic sampling algorithms, rejection sampling and importance sampling.
Bayesian Online Changepoint Detection
13 August 2019
Adams and MacKay's 2007 paper, "Bayesian Online Changepoint Detection", introduces a modular Bayesian framework for online estimation of changes in the generative parameters of sequential data. I discuss this paper in detail.
Gaussian Process Regression with Code Snippets
27 June 2019
The definition of a Gaussian process is fairly abstract: it is an infinite collection of random variables, any finite number of which are jointly Gaussian. I work through this definition with an example and provide several complete code snippets.
08 May 2019
Laplace's method is used to approximate a distribution with a Gaussian. I explain the technique in general and work through an exercise by David MacKay.
Bayesian Inference for the Gaussian
04 April 2019
I work through several cases of Bayesian parameter estimation of Gaussian models.
19 March 2019
Probability distributions that are members of the exponential family have mathematically convenient properties for Bayesian inference. I provide the general form, work through several examples, and discuss several important properties.
Conjugacy in Bayesian Inference
16 March 2019
Conjugacy is an important property in exact Bayesian inference. I work though Bishop's example of a beta conjugate prior for the binomial distribution and explore why conjugacy is useful.
Random Noise and the Central Limit Theorem
01 February 2019
Many probabilistic models assume random noise is Gaussian distributed. I explain at least part of the motivation for this, which is grounded in the Central Limit Theorem.
The KL Divergence: From Information to Density Estimation
22 January 2019
The KL divergence, also known as "relative entropy", is a commonly used metric for density estimation. I re-derive the relationships between probabilities, entropy, and relative entropy for quantifying similarity between distributions.
Floating Point Precision with Log Likelihoods
18 January 2019
Computing the log likelihood is a common task in probabilistic machine learning, but it can easily under- or overflow. I discuss one such issue and its resolution.
Randomized Singular Value Decomposition
17 January 2019
Halko, Martinsson, and Tropp's 2011 paper, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions", introduces a modular framework for randomized matrix decompositions. I discuss this paper in detail with a focus on randomized SVD.
11 January 2019
Bessel's correction is the division of the sample variance by n − 1 rather than n. I walk the reader through a quick proof that this correction results in an unbiased estimator of the population variance.
Proof of the Singular Value Decomposition
20 December 2018
I walk the reader carefully through Gilbert Strang's existence proof of the singular value decomposition.
Singular Value Decomposition as Simply as Possible
10 December 2018
The singular value decomposition (SVD) is a powerful and ubiquitous tool for matrix factorization but explanations often provide little intuition. My goal is to explain the SVD as simply as possible before working towards the formal definition.
Woodbury Matrix Identity for Factor Analysis
30 November 2018
In factor analysis, the Woodbury matrix identity allows us to invert the covariance matrix of our data in O(k³) time rather than O(p³) time, where k and p are the latent and data dimensions respectively. I explain and implement the technique.
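The identity itself is easy to verify numerically; here is a small sketch (example dimensions and my own variable names, not the post's code):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 500, 5                            # data and latent dimensions (example values)
W = rng.normal(size=(p, k))              # factor loadings
psi = rng.uniform(0.5, 2.0, size=p)      # diagonal noise covariance

# Woodbury: (diag(psi) + W W^T)^{-1}
#   = diag(1/psi) - diag(1/psi) W (I_k + W^T diag(1/psi) W)^{-1} W^T diag(1/psi)
psi_inv = 1.0 / psi
M = np.eye(k) + (W.T * psi_inv) @ W      # only a small k x k matrix is inverted
woodbury = np.diag(psi_inv) - (psi_inv[:, None] * W) @ np.linalg.solve(M, W.T * psi_inv)

direct = np.linalg.inv(np.diag(psi) + W @ W.T)
print(np.allclose(woodbury, direct))     # True
```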
Modeling Repulsion with Determinantal Point Processes
06 November 2018
Determinantal point processes are point processes characterized by the determinant of a positive semi-definite matrix, but what this means is not necessarily obvious. I explain how such a process can model repulsive systems.
A Geometrical Understanding of Matrices
24 October 2018
My college course on linear algebra focused on systems of linear equations. I present a geometrical understanding of matrices as linear transformations, which has helped me visualize and relate concepts from the field.
Probabilistic Canonical Correlation Analysis in Detail
10 September 2018
Probabilistic canonical correlation analysis is a reinterpretation of CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define and derive its solution in detail.
08 August 2018
Factor analysis is a statistical method for modeling high-dimensional data using a smaller number of latent variables. It is deeply related to other probabilistic models such as probabilistic PCA and probabilistic CCA. I define the model and how to fit it in detail.
Canonical Correlation Analysis in Detail
17 July 2018
Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.
26 June 2018
The dot product is often presented as both an algebraic and a geometric operation. The relationship between these two ideas may not be immediately obvious. I prove that they are equivalent and explain why the relationship makes sense.
An Example of Probabilistic Machine Learning
13 June 2018
Probabilistic machine learning is a useful framework for handling uncertainty and modeling generative processes. I explore this approach by comparing two models, one with and one without a clear probabilistic interpretation.
29 April 2018
A common explanation for the reparameterization trick with variational autoencoders is that we cannot backpropagate through a stochastic node. I provide a more formal justification.
15 April 2018
Backpropagation is an algorithm that computes the gradient of a neural network, but it may not be obvious why the algorithm uses a backward pass. The answer allows us to reconstruct backprop from first principles.
From Convolution to Neural Network
24 February 2017
Most explanations of CNNs assume the reader understands the convolution operation and how it relates to image processing. I explore convolutions in detail and explain how they are implemented as layers in a neural network.