Autoregressive Model
Autoregressive (AR) models represent random processes in which each observation is a linear function of some of its previous values, plus noise. I present the main ideas behind AR models, including when they are stationary and how to fit them with the Yule–Walker equations.
In statistics and econometrics, an autoregressive (AR) model is a statistical model of a particular type of random process, one in which a time-dependent scalar random variable is a linear function of some of its previous values, plus a random term. Formally, the model is

$$x_t = c + \varphi_1 x_{t-1} + \varphi_2 x_{t-2} + \dots + \varphi_p x_{t-p} + \varepsilon_t.$$

This is called an autoregressive model of order $p$ and is denoted AR($p$). The terms $\varphi_1, \dots, \varphi_p$ are the model's parameters, $c$ is a constant bias, and $\varepsilon_t$ is white noise.
An AR($0$) process is simply random noise, with no dependence between time points. As $p$ increases, the induced AR($p$) model becomes more complex, as it accounts for more previous terms in the estimation of the current term. See the accompanying figure for synthetic data from AR processes of increasing order.
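To make the recursion concrete, here is a minimal simulation sketch of my own. The simulate_ar helper below is hypothetical and separate from the fuller AR class in A2; it simply applies the AR($p$) recursion with a zero bias term:

import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(coefs, size=200, noise_std=1.0):
    """Simulate a zero-bias AR(p) process with coefficients [phi_1, ..., phi_p]."""
    p = len(coefs)
    xs = np.zeros(size + p)  # The first p entries act as zero initial values.
    for t in range(p, size + p):
        lagged = xs[t-p:t][::-1]  # [x_{t-1}, ..., x_{t-p}]
        xs[t] = np.dot(coefs, lagged) + rng.normal(0, noise_std)
    return xs[p:]

ar0 = simulate_ar([])          # White noise.
ar1 = simulate_ar([0.9])       # AR(1).
ar2 = simulate_ar([0.5, 0.3])  # AR(2).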
For what it's worth, according to Greene (2003), "The received empirical literature is overwhelmingly dominated by the AR(1) model, which is partly a matter of convenience. Processes more involved than this model are usually extremely difficult to analyze…" He goes on to write, "The first-order autoregression has withstood the test of time and experimentation as a reasonable model for underlying processes that probably, in truth, are impenetrably complex. AR(1) works as a first pass."
Stationarity
An important property of an AR process is that it has finite variance if the sum of the absolute values of its coefficients is less than one, i.e.

$$\sum_{i=1}^{p} |\varphi_i| < 1.$$

Intuitively, this means that provided this condition holds, the value of the AR process does not rapidly diverge.
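To see why, it helps to unroll the simplest case. For an AR(1) process with the bias term dropped (a small illustration of my own), repeated substitution gives

$$x_t = \varphi_1 x_{t-1} + \varepsilon_t = \varphi_1^2 x_{t-2} + \varphi_1 \varepsilon_{t-1} + \varepsilon_t = \dots = \varphi_1^{t} x_0 + \sum_{i=0}^{t-1} \varphi_1^{i}\, \varepsilon_{t-i}.$$

When $|\varphi_1| < 1$, the influence of the initial value and of past shocks decays geometrically; when $|\varphi_1| > 1$, both terms grow without bound.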
However, this condition is actually just a consequence of a more important property of AR processes and time series more generally: stationarity. As I understand it, stationarity is a statement about the sequence of moments of a distribution. In particular, consider the definition of weak stationarity, sometimes called covariance stationarity:
Weak stationarity: The random process $\{x_t\}$ is weakly stationary if $\mathbb{E}[x_t]$ is finite and independent of $t$, $\mathbb{V}[x_t]$ is finite for all $t$, and the autocovariance $\mathrm{Cov}(x_t, x_{t-j})$ is finite and only depends on the lag $j$.
Covariance stationarity is important here because if we assume that both $\{x_t\}$ and $\{\varepsilon_t\}$ are weakly stationary, then we can derive the first and second moments of $x_t$. First, let's assume the following about the sequence of error terms $\{\varepsilon_t\}$:

$$\mathbb{E}[\varepsilon_t] = 0, \qquad \mathbb{V}[\varepsilon_t] = \sigma^2, \qquad \mathrm{Cov}(\varepsilon_t, \varepsilon_s) = 0 \;\;\text{for } t \neq s.$$
It's clear from these assumptions that $\{\varepsilon_t\}$ is a covariance stationary random process. We can use these assumptions to derive the conditions required for a finite mean and finite variance for the AR process $\{x_t\}$. The mean is

$$\mathbb{E}[x_t] = c + \sum_{i=1}^{p} \varphi_i\,\mathbb{E}[x_{t-i}] = c + \mathbb{E}[x_t] \sum_{i=1}^{p} \varphi_i \quad\Longrightarrow\quad \mathbb{E}[x_t] = \frac{c}{1 - \sum_{i=1}^{p} \varphi_i},$$

and the variance is

$$\mathbb{V}[x_t] = \sum_{i=1}^{p} \varphi_i\,\mathrm{Cov}(x_t, x_{t-i}) + \sigma^2 = \frac{\sigma^2}{1 - \sum_{i=1}^{p} \varphi_i \rho_i},$$

where $\rho_i$ denotes the lag-$i$ autocorrelation of $\{x_t\}$ (defined in the next section). So both the mean and variance of $x_t$ are fixed properties of the AR process: functions of $c$, the coefficients $\varphi_i$, and $\sigma^2$. For our assumption of weak stationarity to hold for $\{x_t\}$, it must be the case that the condition above holds, i.e. that $\sum_{i=1}^{p} |\varphi_i| < 1$.
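As a quick sanity check (a numerical example of my own), consider an AR(1) process with $c = 1$, $\varphi_1 = 0.5$, and $\sigma^2 = 1$. Then

$$\mathbb{E}[x_t] = \frac{1}{1 - 0.5} = 2, \qquad \mathbb{V}[x_t] = \frac{1}{1 - 0.5\,\rho_1} = \frac{1}{1 - 0.5^2} = \frac{4}{3},$$

using the fact (derived in the Yule–Walker section below) that $\rho_1 = \varphi_1$ for an AR(1) process. Both moments are finite, as expected, since $|\varphi_1| < 1$.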
Inference via the Yule–Walker equations
AR processes can be fit in a variety of ways, such as with maximum likelihood estimation, least squares regression (OLS), and the method of moments (the Yule–Walker equations). I'll present the Yule–Walker equations because they are simple to derive and program and, in my mind, nicely capture the intuition for AR processes. In contrast, OLS estimation is not causal and is biased. See A1 for a discussion of this.
Much like how the Baum–Welch algorithm is an instance of expectation–maximization applied to hidden Markov models, the Yule–Walker equations are an instance of the method of moments applied to AR models. Thus, the goal of this approach is to estimate moments (here, autocorrelations) of the AR process.
The Yule–Walker equations are not particularly hard to understand, but deriving them is tedious. To make the material easier to grok, let’s break the discussion into three parts: some preliminaries, including helpful notation; a discussion of the big idea; and then detailed derivations.
Preliminaries
First, let's assume an AR model that is covariance stationary. Second, let's ignore the bias term $c$; de-meaning the data is easy to do, and this will clean up the notation. Now consider the covariance matrix $\mathbf{R}$ of $p$ consecutive observations, with the following notation. In general, let $r_i$ be defined as

$$r_i \triangleq \mathrm{Cov}(x_t, x_{t-i}) = \mathbb{E}[x_t x_{t-i}],$$

where the second equality holds because the data are de-meaned. So the $p \times p$ matrix $\mathbf{R}$ can be written as:

$$\mathbf{R} = \begin{bmatrix} r_0 & r_1 & \cdots & r_{p-1} \\ r_1 & r_0 & \cdots & r_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & \cdots & r_0 \end{bmatrix}.$$

The key things to note are that $r_0 = \mathbb{V}[x_t]$ by definition and that $r_{-i} = r_i$ for all $i$. This is simply because of the symmetry of covariance and stationarity,

$$r_{-i} = \mathrm{Cov}(x_t, x_{t+i}) = \mathrm{Cov}(x_{t+i}, x_t) = \mathrm{Cov}(x_t, x_{t-i}) = r_i.$$

Finally, consider a correlation matrix $\mathbf{P}$ where each element $\rho_i$ is defined as simply

$$\rho_i \triangleq \frac{\mathrm{Cov}(x_t, x_{t-i})}{\sqrt{\mathbb{V}[x_t]\,\mathbb{V}[x_{t-i}]}} \stackrel{\star}{=} \frac{r_i}{r_0}.$$

Step $\star$ follows from weak stationarity, i.e. each $x_t$ has the same variance, or $\mathbb{V}[x_t] = \mathbb{V}[x_{t-i}] = r_0$. To be explicit, this correlation matrix is just $\mathbf{R}$ normalized by $r_0$:

$$\mathbf{P} = \frac{1}{r_0}\,\mathbf{R} = \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \cdots & \rho_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p-1} & \rho_{p-2} & \cdots & 1 \end{bmatrix}.$$
With these preliminaries out of the way, let’s discuss the big idea of this approach.
Big idea
The basic idea of the Yule–Walker equations is to construct a system of linear equations,

$$\mathbf{P}\,\boldsymbol{\varphi} = \boldsymbol{\rho},$$

representing the relationships specified by the AR model. The matrix $\mathbf{P}$ is the correlation matrix from the preliminaries, and the vector $\boldsymbol{\rho}$ is the first column (or row) of $\mathbf{P}$ but shifted by one, i.e.,

$$\boldsymbol{\rho} = \begin{bmatrix} \rho_1 & \rho_2 & \cdots & \rho_p \end{bmatrix}^{\top}.$$

The unknowns in this equation are the AR coefficients $\boldsymbol{\varphi} = [\varphi_1, \dots, \varphi_p]^{\top}$. Intuitively, this system captures the autocorrelations encoded in the AR model, and solving for $\boldsymbol{\varphi}$ in this way gives us parameter estimates that capture these modeling assumptions.
Now here's something fun. Since $\mathbf{P}$ is a symmetric Toeplitz matrix, we actually only need to estimate a single column (or row) of $\mathbf{P}$ in order to construct the whole thing. Thus, we just need to estimate the autocorrelations $\rho_1, \dots, \rho_p$ from data. We can then construct $\mathbf{P}$ with the appropriate tiling and then solve the system of linear equations for $\boldsymbol{\varphi}$; a minimal sketch of this construction is shown below.
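Here is that sketch in NumPy/SciPy. It is my own condensed version of the idea, not the author's implementation; the function name yule_walker and the normalization by n (rather than n-i-1, as in the fuller implementation in A2) are choices of mine:

import numpy as np
from scipy.linalg import toeplitz

def yule_walker(xs, p):
    """Estimate AR(p) coefficients by tiling a Toeplitz system and solving it."""
    xs = xs - xs.mean()
    n = len(xs)
    # Lag-i autocovariances r_0, ..., r_p.
    r = np.array([(xs[:n-i] * xs[i:]).sum() / n for i in range(p + 1)])
    rho = r / r[0]                  # Autocorrelations rho_0, ..., rho_p.
    P = toeplitz(rho[:p])           # Symmetric Toeplitz correlation matrix.
    return np.linalg.solve(P, rho[1:p+1])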
Derivations
Now let's derive $\mathbf{P}$ and each component of $\boldsymbol{\rho}$. These derivations, and the notation $r_i$ and $\rho_i$, are from Gidon Eshel's lecture notes. Consider the following de-meaned AR($p$) process,

$$x_t = \varphi_1 x_{t-1} + \varphi_2 x_{t-2} + \dots + \varphi_p x_{t-p} + \varepsilon_t.$$

Estimating $\rho_1$. Let's multiply both sides by $x_{t-1}$ and then take the expectation of both sides. Since our noise is zero-mean and uncorrelated with past values, and since the coefficients $\varphi_i$ are non-random, we have

$$\mathbb{E}[x_t x_{t-1}] = \varphi_1\,\mathbb{E}[x_{t-1} x_{t-1}] + \varphi_2\,\mathbb{E}[x_{t-2} x_{t-1}] + \dots + \varphi_p\,\mathbb{E}[x_{t-p} x_{t-1}].$$

The important thing to note is that every term in this equation is an un-normalized cross-covariance: each expectation $\mathbb{E}[x_{t-i} x_{t-j}]$ is a covariance term $r_{|i-j|}$, and if we divide these by the variance $r_0$, we get autocorrelation terms $\rho_{|i-j|}$. Let's do that:

$$\begin{aligned}
r_1 &= \varphi_1 r_0 + \varphi_2 r_1 + \dots + \varphi_p r_{p-1}, \\
\rho_1 &= \varphi_1 + \varphi_2 \rho_1 + \dots + \varphi_p \rho_{p-1}.
\end{aligned}$$

An attentive reader may notice: the last line above is the first row (implicit) of the matrix equation in the big-idea section. Indeed, we're going to build that matrix equation "line by line", so to speak.
Estimating $\rho_2$. Let's do the same process as above, but multiplying the AR equation by $x_{t-2}$ rather than by $x_{t-1}$. As you might imagine, this will produce $\rho_2$:

$$\begin{aligned}
\mathbb{E}[x_t x_{t-2}] &= \varphi_1\,\mathbb{E}[x_{t-1} x_{t-2}] + \varphi_2\,\mathbb{E}[x_{t-2} x_{t-2}] + \dots + \varphi_p\,\mathbb{E}[x_{t-p} x_{t-2}], \\
\rho_2 &= \varphi_1 \rho_1 + \varphi_2 + \varphi_3 \rho_1 + \dots + \varphi_p \rho_{p-2}.
\end{aligned}$$

This is the second row in the matrix equation from the big-idea section.
Estimating $\rho_j$. We can see how this generalizes. To be thorough, let's compute $\rho_j$ for any row $j \in \{1, \dots, p\}$:

$$\begin{aligned}
\mathbb{E}[x_t x_{t-j}] &= \varphi_1\,\mathbb{E}[x_{t-1} x_{t-j}] + \varphi_2\,\mathbb{E}[x_{t-2} x_{t-j}] + \dots + \varphi_p\,\mathbb{E}[x_{t-p} x_{t-j}], \\
\rho_j &= \varphi_1 \rho_{j-1} + \varphi_2 \rho_{j-2} + \dots + \varphi_p \rho_{j-p},
\end{aligned}$$

where $\rho_{-k} = \rho_k$ because $r_{-k} = r_k$.
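Stacking these equations for $j = 1, \dots, p$, and using $\rho_{-k} = \rho_k$, recovers exactly the system from the big-idea section:

$$\begin{bmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_p \end{bmatrix} = \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \cdots & \rho_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p-1} & \rho_{p-2} & \cdots & 1 \end{bmatrix} \begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_p \end{bmatrix}, \qquad \text{i.e.}\;\; \boldsymbol{\rho} = \mathbf{P}\,\boldsymbol{\varphi}.$$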
Summary
Hopefully, how we mechanically construct the Yule–Walker equations is now clear. We simply estimate $\rho_j$ from the data for all $j \in \{0, 1, \dots, p\}$ (equivalently, the un-normalized $r_j$, as in the code in A2, since the scaling by $r_0$ cancels in the linear system). We use the last $p$ elements to construct the $p$-vector $\boldsymbol{\rho}$, and we use the first $p$ elements to construct the first column of $\mathbf{P}$. We then appropriately tile this column to construct a symmetric Toeplitz matrix. We can plug these quantities into the system $\mathbf{P}\,\boldsymbol{\varphi} = \boldsymbol{\rho}$ to solve for $\boldsymbol{\varphi}$.
Implementing the Yule–Walker equations in code is fairly straightforward. See the fit function in the Python implementation of an AR process in A2. I generated synthetic data points with the following true parameters,
Using my implementation of an AR process, I fit a new AR process to these data and inferred the following parameter estimates:
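For a sense of the workflow, here is a hypothetical end-to-end use of the AR class from A2. The parameter values below are placeholders of my own, not the ones from the experiment described above:

import numpy as np

# Hypothetical parameters for illustration only.
true_model = AR(coefs=[0.5, 0.2], noise_std=1)
xs = true_model.sample(size=1000)

new_model = AR(coefs=np.zeros(2))  # AR(2) with placeholder coefficients.
new_model.fit(xs)
print(new_model.coefs)             # Should be close to [0.5, 0.2].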
As always, my derivations or code may have mistakes (please email me if you find any!), but I believe the main idea here is correct.
Conclusion
Autoregressive processes are simple yet useful statistical models for random processes exhibiting autocorrelation. The parameters can be inferred in a variety of ways. Using the Yule–Walker equations is fairly straightforward both conceptually and in practice.
Appendix
A1. Inference via OLS
While fitting an AR model with OLS is fairly straightforward, the approach is not causal, and the estimator is not unbiased. To quote Greene (2003):
If the regression contains any lagged values of the dependent variable, then least squares will no longer be unbiased or consistent.
To see why, consider the AR model again. Clearly, the response $x_t$ and the noise $\varepsilon_t$ are correlated. This breaks the OLS assumption of strict exogeneity. Why? Because an implication of strict exogeneity is that each regressor is orthogonal to the error term for all observations:

$$\mathbb{E}[x_s \varepsilon_t] = 0 \quad \text{for all } s, t.$$

In an AR model this fails, since $x_t$ appears as a regressor for $x_{t+1}$ while being correlated with $\varepsilon_t$. Finally, strict exogeneity is required in the proof that the OLS estimator is unbiased. Taken together, if $x_t$ and $\varepsilon_t$ are correlated, the OLS estimator is not unbiased.
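This argument says nothing about the size of the bias, but a quick Monte Carlo sketch of my own (with arbitrary settings: $\varphi_1 = 0.9$ and series of length 50; the ols_ar1_estimate helper is hypothetical) illustrates the downward bias of OLS in small samples:

import numpy as np

rng = np.random.default_rng(0)

def ols_ar1_estimate(n, phi=0.9):
    """Simulate an AR(1) series of length n and estimate phi by OLS."""
    xs = np.zeros(n)
    for t in range(1, n):
        xs[t] = phi * xs[t-1] + rng.normal()
    y, x = xs[1:], xs[:-1]
    return (x @ y) / (x @ x)

# Average the OLS estimate over many short series; it sits noticeably
# below the true value of 0.9.
estimates = [ols_ar1_estimate(n=50) for _ in range(2000)]
print(np.mean(estimates))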
A2. Autoregressive model in Python
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import norm


class AR:
    """Autoregressive model of order p."""

    def __init__(self, coefs=[1], obs_mean=0, obs_bias=0, noise_std=1):
        self.coefs = np.asarray(coefs, dtype=float)
        self.order = len(self.coefs)
        self.obs_mean = obs_mean
        self.obs_bias = obs_bias
        self.noise_dist = norm(0, noise_std)

    def fit(self, xs):
        """Fit the AR model using the Yule–Walker equations.

        This is based off the statsmodels yule_walker function:
        https://www.statsmodels.org/stable/generated/
        statsmodels.regression.linear_model.yule_walker.html
        """
        # Demean data.
        xs = xs - xs.mean()
        n = xs.shape[0]
        r = np.zeros(self.order + 1)
        # We can't put this inside the loop, since xs[:-0] is an empty slice.
        r[0] = (xs * xs).sum() / (n - 1)
        for i in range(1, self.order + 1):
            # r_i is the (un-normalized) autocovariance between x_t and x_{t-i}.
            r[i] = (xs[:-i] * xs[i:]).sum() / (n - i - 1)
        R = toeplitz(r[:-1])
        beta = np.linalg.solve(R, r[1:])
        self.coefs = beta

    def sample(self, size=100):
        """Generate `size` random samples from the instantiated AR model."""
        c = self.obs_bias
        x0 = self.obs_mean
        xs = np.array([x0] * self.order)
        if self.order > 1:
            pad = np.zeros(shape=(self.order - 1))
            xs = np.append(pad, xs)
        for t in range(size):
            e = self.noise_dist.rvs()
            if self.order > 0:
                # The p most recent values come first in `xs`, so this computes
                # phi_1 * x_{t-1} + ... + phi_p * x_{t-p}.
                x = c + np.dot(xs[:self.order], self.coefs) + e
            else:
                # AR(0) is just white noise, possibly offset by c.
                x = c + e
            xs = np.append([x], xs)
        # I think of the most recent datum as the last element. We construct the
        # array in reverse to align with notation.
        return np.flip(xs)

    @property
    def stationary(self):
        """Return `True` if the AR model is weakly stationary, `False`
        otherwise.
        """
        return np.abs(self.coefs).sum() < 1
- Greene, W. H. (2003). Econometric analysis. Pearson Education India.