Statistical modeling

I learned very early the difference between knowing the name of something and knowing something.

13 April 2024

I work through a simple Python implementation of geometric Brownian motion and check it against the theoretical model.
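
As a rough illustration of this kind of check, here is a minimal sketch and not necessarily the implementation from the post. It samples geometric Brownian motion from its exact log-normal solution and compares the empirical terminal mean with the theoretical value $S_0 e^{\mu T}$. The function name `simulate_gbm` and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def simulate_gbm(s0, mu, sigma, T, n_steps, n_paths, rng):
    """Simulate GBM paths using the exact solution of the SDE.

    Hypothetical helper: each log-increment is Gaussian with mean
    (mu - sigma^2 / 2) * dt and standard deviation sigma * sqrt(dt).
    """
    dt = T / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    log_increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_paths = np.log(s0) + np.cumsum(log_increments, axis=1)
    return np.exp(log_paths)

# Illustrative parameter values (assumptions, not taken from the post).
s0, mu, sigma, T = 1.0, 0.05, 0.2, 1.0
rng = np.random.default_rng(0)
paths = simulate_gbm(s0, mu, sigma, T, n_steps=50, n_paths=20000, rng=rng)

# Theory says E[S_T] = s0 * exp(mu * T).
empirical_mean = paths[:, -1].mean()
theoretical_mean = s0 * np.exp(mu * T)
```

With enough paths, `empirical_mean` should land close to `theoretical_mean`, up to Monte Carlo error.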

17 September 2022

Principal component analysis (PCA) is a simple, fast, and elegant linear method for data analysis. I explore PCA in detail, first with pictures and intuition, then with linear algebra and detailed derivations, and finally with code.

Scaling Factors for Hidden Markov Models

13 August 2022

Inference for hidden Markov models (HMMs) is numerically unstable. A standard approach to resolving this instability is to use scaling factors. I discuss this idea in detail.

Generalized Least Squares

03 March 2022

I discuss generalized least squares (GLS), which extends ordinary least squares by assuming heteroscedastic errors. I prove some basic properties of GLS, particularly that it is the best linear unbiased estimator, and work through a complete example.

The Gauss–Markov Theorem

08 February 2022

I discuss and prove the Gauss–Markov theorem, which states that under certain conditions, the least squares estimator is the minimum-variance linear unbiased estimator of the model parameters.

Breusch–Pagan Test for Heteroscedasticity

31 January 2022

I discuss the Breusch–Pagan test, a simple hypothesis test for heteroscedasticity in linear models. I also implement the test in Python and demonstrate that it can detect heteroscedasticity in a toy example.

OLS with Heteroscedasticity

30 January 2022

The ordinary least squares estimator is inefficient when the homoscedasticity assumption does not hold. I provide a simple example of a nonsensical $t$-statistic from data with heteroscedasticity and discuss why this happens in general.

Consistency of the OLS Estimator

29 January 2022

A consistent estimator converges in probability to the true value. I discuss this idea in general and then prove that the ordinary least squares estimator is consistent.

Autoregressive Model

06 January 2022

Autoregressive (AR) models represent random processes in which each observation is a linear function of some of its previous values, plus noise. I present the main ideas behind AR models, including when they are stationary and how to fit them with the Yule–Walker equations.

Hypothesis Testing for OLS

09 September 2021

When can we be confident in our estimated coefficients when using OLS? We typically use a $t$-statistic to quantify whether an inferred coefficient was likely to have happened by chance. I discuss hypothesis testing and $t$-statistics for OLS.

Residual Sum of Squares in Terms of Pearson's Correlation

01 September 2021

I re-derive a relationship between the residual sum of squares in simple linear regression and Pearson's correlation coefficient.

Sampling Distribution of the OLS Estimator

26 August 2021

I derive the mean and variance of the OLS estimator, as well as an unbiased estimator of the OLS estimator's variance. I then show that the OLS estimator is normally distributed if we assume the error terms are normally distributed.

Simple Linear Regression and Correlation

25 August 2021

In simple linear regression, the slope parameter is a simple function of the correlation between the targets and predictors. I derive this result and discuss a few consequences.

Coefficient of Determination

09 August 2021

In ordinary least squares, the coefficient of determination quantifies the variation in the dependent variables that can be explained by the model. However, this interpretation has a few assumptions which are worth understanding. I explore this metric and the assumptions in detail.

Multicollinearity

12 July 2021

Multicollinearity is when two or more predictors are linearly dependent. This can impact the interpretability of a linear model's estimated coefficients. I discuss this phenomenon in detail.

A Python Implementation of the Multivariate Skew Normal

29 December 2020

I needed a Python implementation of the multivariate skew normal. I wrote one based on SciPy's multivariate distributions module.

Fast Computation of the Multivariate Normal PDF for Multiple Parameters

12 December 2020

For a project, I needed to compute the log PDF of a vector for multiple pairs of mean and variance parameters. I discuss a fast Python implementation.

Inference for Hidden Markov Models

28 November 2020

Expectation–maximization for hidden Markov models is called the Baum–Welch algorithm, and it relies on the forward–backward algorithm for efficient computation. I review HMMs and then present these algorithms in detail.

The Unscented Transform

19 November 2020

The unscented transform, most commonly associated with the nonlinear Kalman filter, was proposed by Jeffrey Uhlmann to estimate a nonlinear transformation of a Gaussian. I illustrate the main idea.

The Log-Sum-Exp Trick

09 February 2020

Normalizing vectors of log probabilities is a common task in statistical modeling, but it can result in under- or overflow when exponentiating large values. I discuss the log-sum-exp trick for resolving this issue.
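
The idea can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that at least one log probability is finite; the helper names `logsumexp` and `log_normalize` are hypothetical and defined here only for the example.

```python
import numpy as np

def logsumexp(log_p):
    """Compute log(sum(exp(log_p))) by factoring out the maximum.

    Assumes at least one entry of log_p is finite.
    """
    log_p = np.asarray(log_p, dtype=float)
    m = np.max(log_p)
    # After subtracting m, the largest exponent is exactly 0, so exp()
    # cannot overflow and at least one term in the sum equals 1.
    return m + np.log(np.sum(np.exp(log_p - m)))

def log_normalize(log_p):
    """Normalize a vector of log probabilities so that they sum to 1."""
    return np.asarray(log_p, dtype=float) - logsumexp(log_p)

log_p = np.array([-1000.0, -1001.0, -1002.0])

# Naive approach: exp(-1000) underflows to 0, so the log of the sum is -inf.
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(log_p)))

# Stable approach: the normalized probabilities are well defined.
probs = np.exp(log_normalize(log_p))
```

Here `naive` is `-inf`, while `probs` is a valid probability vector.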

Bayesian Linear Regression

04 February 2020

I discuss Bayesian linear regression, or classical linear regression with a prior on the parameters. Using a particular prior as an example, I provide intuition and detailed derivations for the full model.

Can Linear Models Overfit?

31 January 2020

We know that regularization is important for linear models, but what does overfitting mean in this context? I discuss this question.

A Python Implementation of the Multivariate t-distribution

20 January 2020

I needed a fast and numerically stable Python implementation of the multivariate t-distribution. I wrote one based on SciPy's multivariate distributions module.

Ordinary Least Squares

04 January 2020

I discuss ordinary least squares, or linear regression when the optimal coefficients minimize the residual sum of squares. I cover various properties and interpretations of this classic model.

Expectation–Maximization

10 November 2019

For many latent variable models, maximizing the complete log likelihood is easier than maximizing the log likelihood. The expectation–maximization (EM) algorithm leverages this fact to construct and optimize a tight lower bound. I rederive EM.

A Fast and Numerically Stable Implementation of the Multivariate Normal PDF

30 October 2019

Naively computing the probability density function for the multivariate normal can be slow and numerically unstable. I work through SciPy's implementation.

Floating Point Precision with Log Likelihoods

18 January 2019

Computing the log likelihood is a common task in probabilistic machine learning, but it can easily under- or overflow. I discuss one such issue and its resolution.

Woodbury Matrix Identity for Factor Analysis

30 November 2018

In factor analysis, the Woodbury matrix identity allows us to invert the covariance matrix of our data $\textbf{x}$ in $O(k^3)$ time rather than $O(p^3)$ time, where $k$ and $p$ are the latent and data dimensions, respectively. I explain and implement the technique.
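
A minimal sketch of the identity, assuming the usual factor analysis covariance $\Sigma = W W^{\top} + \Psi$ with a $p \times k$ loading matrix $W$ and diagonal $\Psi$. The symbols $W$ and $\Psi$, the function name `woodbury_inverse`, and the dimensions are assumptions for illustration rather than the post's own notation. The only matrix inverted by a dense solver is $k \times k$.

```python
import numpy as np

def woodbury_inverse(W, psi):
    """Invert W W^T + diag(psi) via the Woodbury matrix identity.

    Hypothetical helper. W is (p, k) and psi holds the p positive
    diagonal entries of Psi. Uses
        Sigma^{-1} = Psi^{-1} - Psi^{-1} W (I_k + W^T Psi^{-1} W)^{-1} W^T Psi^{-1}.
    """
    psi_inv = 1.0 / psi                    # inverting a diagonal matrix is elementwise
    PW = psi_inv[:, None] * W              # Psi^{-1} W, shape (p, k)
    M = np.eye(W.shape[1]) + W.T @ PW      # the only dense system is k x k
    return np.diag(psi_inv) - PW @ np.linalg.solve(M, PW.T)

# Illustrative dimensions and random parameters (assumptions).
rng = np.random.default_rng(0)
p, k = 200, 5
W = rng.normal(size=(p, k))
psi = rng.uniform(0.5, 2.0, size=p)

sigma = W @ W.T + np.diag(psi)
direct = np.linalg.inv(sigma)              # O(p^3) reference computation
fast = woodbury_inverse(W, psi)
```

The two results should agree to numerical precision. Materializing the full $p \times p$ inverse still costs more than $O(k^3)$, so in practice the factored form is usually applied directly to vectors.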

Probabilistic Canonical Correlation Analysis in Detail

10 September 2018

Probabilistic canonical correlation analysis is a reinterpretation of CCA as a latent variable model, which has benefits such as generative modeling, handling uncertainty, and composability. I define and derive its solution in detail.

Factor Analysis in Detail

08 August 2018

Factor analysis is a statistical method for modeling high-dimensional data using a smaller number of latent variables. It is deeply related to other probabilistic models such as probabilistic PCA and probabilistic CCA. I define the model and how to fit it in detail.

Canonical Correlation Analysis in Detail

17 July 2018

Canonical correlation analysis is conceptually straightforward, but I want to define its objective and derive its solution in detail, both mathematically and programmatically.