Autoregressive Model

Autoregressive (AR) models represent random processes in which each observation is a linear function of some of its previous values, plus noise. I present the main ideas behind AR models, including when they are stationary and how to fit them with the Yule–Walker equations.

In statistics and econometrics, an autoregressive (AR) model is a statistical model of a particular type of random process, one in which a time-dependent scalar random variable is a linear function of some of its previous values, plus a random term. Formally, the model is:

$$ x_t = \alpha + \sum_{i=1}^p \beta_i x_{t-i} + \varepsilon_t. \tag{1} $$

This is called an autoregressive model of order $p$ and is denoted AR($p$). $\boldsymbol{\beta} = \{ \beta_1, \dots, \beta_p \}$ are the model’s parameters, $\alpha$ is a constant bias, and $\varepsilon_t$ is white noise.

An AR(0) process is simply random noise, with no dependence between time points. As $p$ increases, the induced AR model becomes more complex, as it accounts for more previous terms in the estimation of the current term. See Figure 1 for synthetic data from AR processes of increasing order.

Figure 1. Autoregressive models with increasing model orders, $p \in \{0, 1, 2, 3\}$. AR(0) is white noise, while AR($p > 0$) processes exhibit autocorrelation with increasingly complex dependencies between $x_t$ and previous observations.
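
To make this concrete, here is a minimal NumPy sketch of how such data can be simulated. The simulate_ar helper and its coefficients are illustrative choices of mine, not the exact settings behind Figure 1:

import numpy as np

def simulate_ar(coefs, size=500, noise_std=1.0, seed=0):
    """Simulate x_t = sum_i coefs[i-1] * x_{t-i} + noise, seeded with zeros."""
    rng = np.random.default_rng(seed)
    p = len(coefs)
    xs = np.zeros(size + p)                   # The first p entries seed the lags.
    for t in range(p, size + p):
        lags = xs[t-p:t][::-1]                # x_{t-1}, ..., x_{t-p}.
        xs[t] = np.dot(coefs, lags) + rng.normal(scale=noise_std)
    return xs[p:]

# AR(0) is white noise; higher orders depend on more lagged values.
samples = {p: simulate_ar(coefs) for p, coefs in
           enumerate([[], [0.9], [0.5, 0.3], [0.4, 0.3, 0.2]])}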

For what it’s worth, according to Greene (2003), “The received empirical literature is overwhelmingly dominated by the AR(1) model, which is partly a matter of convenience. Processes more involved than this model are usually extremely difficult to analyze…” He goes on to write, “The first-order autoregression has withstood the test of time and experimentation as a reasonable model for underlying processes that probably, in truth, are impenetrably complex. AR(1) works as a first pass.”

Stationarity

An important property of an AR process is that it has finite variance if the sum of the absolute values of its coefficients is less than one, i.e.

$$ |\beta_1| + |\beta_2| + \dots + |\beta_p| < 1. \tag{2} $$

Intuitively, this means that provided Equation 2 holds, the value of the AR process doesn’t rapidly diverge (Figure 2).

Figure 2. AR(1) models with varying values for $\beta$. When $\beta > 1$, the process is nonstationary, meaning the variance of $x_t$ is not finite.
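
To see the divergence numerically, here is a small check using the simulate_ar sketch from above. The two values of $\beta$ are arbitrary, chosen to sit on either side of the boundary:

# Reusing simulate_ar from the sketch above.
stationary    = simulate_ar([0.95], size=2000, seed=1)
nonstationary = simulate_ar([1.05], size=2000, seed=1)

print(np.abs(stationary).max())     # Stays bounded.
print(np.abs(nonstationary).max())  # Grows geometrically with t.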

However, Equation 2 is actually just the consequence of a more important property of AR processes and time series more generally: stationarity. As I understand it, stationarity is a statement about the sequence of moments of a distribution. In particular, consider the definition of weak stationarity, sometimes called covariance stationarity:

Weak stationarity: The random process $\{z_t\}$ is weakly stationary if $\mathbb{E}[z_t]$ is finite and independent of $t$, $\mathbb{V}[z_t]$ is finite for all $t$, and $\text{Cov}(z_t, z_s)$ is finite and only depends on $t - s$.

Covariance stationarity is important here because if we assume that both $\{x_t\}$ and $\{\varepsilon_t\}$ are weakly stationary, then we can derive the first and second moments for $x_t$. First, let’s assume the following about the sequence of error terms $\{\varepsilon_t\}$:

$$ \begin{aligned} \mathbb{E}[\varepsilon_t] &= 0, \\ \mathbb{V}[\varepsilon_t] &= \sigma_{\varepsilon}^2, && \sigma_{\varepsilon}^2 < \infty, \\ \text{Cov}[\varepsilon_t, \varepsilon_s] &= 0, && t \neq s. \end{aligned} \tag{3} $$

It’s clear from Equation 3 that $\{\varepsilon_t\}$ is a covariance stationary random process. We can use these assumptions to derive the conditions required for finite mean and finite variance for the AR process $\{x_t\}$. The mean is

$$ \begin{aligned} \mathbb{E}[x_t] &= \mathbb{E}[\alpha] + \mathbb{E}\left[ \sum_{i=1}^p \beta_i x_{t-i} \right] + \mathbb{E}[\varepsilon_t] \\ &= \alpha + \sum_{i=1}^p \beta_i \mathbb{E}[x_{t-i}] \\ &\Downarrow && (\text{stationarity: } \mathbb{E}[x_{t-i}] = \mathbb{E}[x_t]) \\ \mathbb{E}[x_t] &= \frac{\alpha}{1 - \sum_{i=1}^p \beta_i}, \end{aligned} \tag{4} $$

and the variance, ignoring the cross-covariances between the lagged terms (this is exact for an AR(1) process), is

$$ \begin{aligned} \mathbb{V}[x_t] &= \mathbb{V}[\alpha] + \mathbb{V}\left[ \sum_{i=1}^p \beta_i x_{t-i} \right] + \mathbb{V}[\varepsilon_t] \\ &= \sum_{i=1}^p \beta_i^2 \mathbb{V}[x_{t-i}] + \sigma_{\varepsilon}^2 \\ &\Downarrow && (\text{stationarity: } \mathbb{V}[x_{t-i}] = \mathbb{V}[x_t]) \\ \mathbb{V}[x_t] &= \frac{\sigma_{\varepsilon}^2}{1 - \sum_{i=1}^p \beta_i^2}. \end{aligned} \tag{5} $$

So both the mean and variance of $\{x_t\}$ are fixed properties of the AR process: functions of $\boldsymbol{\beta}$, $\alpha$, and $\sigma_{\varepsilon}^2$. For our assumption of weak stationarity to hold for $\{x_t\}$, it must be the case that Equation 2 holds, i.e. that $\lVert \boldsymbol{\beta} \rVert_1 < 1$.
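
As a quick numerical sanity check of Equations 4 and 5 in the AR(1) case (where Equation 5 is exact), we can simulate a long series and compare the sample mean and variance against $\alpha / (1 - \beta)$ and $\sigma_{\varepsilon}^2 / (1 - \beta^2)$. The parameter values below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
alpha, beta, noise_std = 0.5, 0.8, 1.0

# Simulate a long AR(1) series: x_t = alpha + beta * x_{t-1} + eps_t.
n  = 200_000
xs = np.zeros(n)
for t in range(1, n):
    xs[t] = alpha + beta * xs[t-1] + rng.normal(scale=noise_std)

print(xs.mean(), alpha / (1 - beta))            # Both approximately 2.5 (Equation 4).
print(xs.var(), noise_std**2 / (1 - beta**2))   # Both approximately 2.78 (Equation 5).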

Inference via the Yule–Walker equations

AR processes can be fit in a variety of ways, such as with maximum likelihood estimation, ordinary least squares (OLS) regression, and the method of moments (the Yule–Walker equations). I’ll present the Yule–Walker equations because they are simple to derive and program and, in my mind, nicely capture the intuition for AR processes. In contrast, the OLS estimator is biased when applied to an AR model. See A1 for a discussion of this.

Much like how the Baum–Welch algorithm is an instance of expectation–maximization applied to hidden Markov models, the Yule–Walker equations are an instance of the method of moments applied to AR models. Thus, the approach estimates moments (here, autocorrelations) of the AR process from data and uses them to solve for the model parameters.

The Yule–Walker equations are not particularly hard to understand, but deriving them is tedious. To make the material easier to grok, let’s break the discussion into three parts: some preliminaries, including helpful notation; a discussion of the big idea; and then detailed derivations.

Preliminaries

First, let’s assume an AR model that is covariance stationary. Second, let’s ignore the bias term $\alpha$. De-meaning is easy to do, and this will clean up the notation. Now consider the covariance matrix $\mathbf{C}$ with the following notation. In general, let $c_k$ be defined as

$$ \mathbb{E}[x_t x_{t-k}] \triangleq c_{t-(t-k)} = c_{k}. \tag{6} $$

So the matrix $\mathbf{C}$ can be written as:

$$ \mathbf{C} = \begin{bmatrix} c_0 & c_{-1} & c_{-2} & \dots & c_{2-p} & c_{1-p} \\ c_{1} & c_0 & c_{-1} & \dots & c_{3-p} & c_{2-p} \\ c_{2} & c_{1} & c_0 & \dots & c_{4-p} & c_{3-p} \\ \vdots & \vdots & \vdots & \ddots & \ddots & \vdots \\ c_{p-2} & c_{p-3} & c_{p-4} & \dots & c_{0} & c_{-1} \\ c_{p-1} & c_{p-2} & c_{p-3} & \dots & c_{1} & c_{0} \end{bmatrix} \tag{7} $$

The key things to note are that $c_0 = \mathbb{V}[x_t] = \sigma_x^2$ by definition and that $c_k = c_{-k}$ for all $k$. This is simply because of the symmetry of covariance,

$$ \begin{aligned} \mathbb{E}[x_t x_{t-k}] &= \mathbb{E}[x_{t-k} x_t], \\ &\Downarrow \\ c_{k} &= c_{-k}. \end{aligned} \tag{8} $$

Finally, consider a correlation matrix $\mathbf{R}$ where each element $r_k$ is defined as simply

$$ r_{k} = \frac{c_{k}}{c_0} \stackrel{\star}{=} \frac{\mathbb{E}[x_t x_{t-k}]}{\sqrt{\mathbb{E}[x_t^2]}\sqrt{\mathbb{E}[x_{t-k}^2]}}. \tag{9} $$

Step $\star$ in Equation 9 follows from stationarity (Equation 5), i.e. each $x_t$ has the same variance, or

$$ \sqrt{\mathbb{E}[x_t^2]} \sqrt{\mathbb{E}[x_{t-k}^2]} = \sqrt{\sigma^2_x \sigma^2_x} = \sigma_x^2 = c_0. \tag{10} $$

To be explicit, this correlation matrix is just $\mathbf{C}$ normalized by $\sigma_x^2$:

$$ \mathbf{R} \triangleq \frac{\mathbf{C}}{c_0} = \begin{bmatrix} r_0 & r_{-1} & r_{-2} & \dots & r_{2-p} & r_{1-p} \\ r_{1} & r_0 & r_{-1} & \dots & r_{3-p} & r_{2-p} \\ r_{2} & r_{1} & r_0 & \dots & r_{4-p} & r_{3-p} \\ \vdots & \vdots & \vdots & \ddots & \ddots & \vdots \\ r_{p-2} & r_{p-3} & r_{p-4} & \dots & r_{0} & r_{-1} \\ r_{p-1} & r_{p-2} & r_{p-3} & \dots & r_{1} & r_{0} \end{bmatrix} \tag{11} $$

With these preliminaries out of the way, let’s discuss the big idea of this approach.

Big idea

The basic idea of the Yule–Walker equations is to construct a system of linear equations,

$$ \mathbf{r} = \mathbf{R}\boldsymbol{\beta}, \tag{12} $$

representing the relationships specified by the AR model. The matrix $\mathbf{R}$ is the correlation matrix in Equation 11 above, and the vector $\mathbf{r}$ is the first column (or row) of $\mathbf{R}$ but shifted by one lag, i.e.,

$$ \mathbf{r} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}. \tag{13} $$

The unknowns in this equation are $\boldsymbol{\beta}$. Intuitively, Equation 12 captures the autocorrelations encoded in the AR model, and solving for $\boldsymbol{\beta}$ in this way gives us parameter estimates that capture these modeling assumptions.

Now here’s something fun. Since $\mathbf{R}$ is a symmetric Toeplitz matrix, we actually only need to estimate a single column (or row) of $\mathbf{R}$ in order to construct the whole thing. Thus, we just need to estimate $r_0$ and $\mathbf{r}$ from data. We can then construct $\mathbf{R}$ with the appropriate tiling and then solve the system of linear equations for $\boldsymbol{\beta}$.
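
In code, the tiling step is a one-liner with scipy.linalg.toeplitz, which is also what the implementation in A2 uses. A small sketch with made-up autocorrelation values for $p = 3$:

import numpy as np
from scipy.linalg import toeplitz

# First column of R for p = 3: autocorrelations at lags 0, 1, and 2 (made-up values).
first_col = np.array([1.0, 0.9, 0.75])

# toeplitz tiles the column (and, by symmetry, the row) into the full matrix.
R = toeplitz(first_col)
print(R)
# [[1.   0.9  0.75]
#  [0.9  1.   0.9 ]
#  [0.75 0.9  1.  ]]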

Derivations

Now let’s derive $r_0$ and each component of $\mathbf{r}$. These derivations, and the notation $c_k$ and $r_k$, are from Gidon Eshel’s lecture notes. Consider the following AR process,

$$ x_t = \sum_{i=1}^p \beta_i x_{t-i} + \varepsilon_t. \tag{14} $$

Estimating $r_1$. Let’s multiply both sides by $x_{t-1}$ and then take the expectation of both sides. Since the noise is zero-mean and uncorrelated with past values of the process, and since the $\beta_i$ are non-random, we have

$$ \mathbb{E}[x_{t-1} x_t] = \sum_{i=1}^p \beta_i \mathbb{E}[x_{t-1} x_{t-i}]. \tag{15} $$

The important thing to note is that the left-hand side of Equation 15 is an un-normalized autocovariance term. If we estimate it from data and normalize by $(N-1)$, we get autocovariance terms $c_k$; and if we divide these by $c_0$, we get autocorrelation terms $r_k$. Let’s do that:

$$ \begin{aligned} \frac{c_{-1}}{c_0} &= \sum_{i=1}^p \beta_i \frac{c_{i-1}}{c_0}, \\ &\Downarrow \\ r_{1} &= \sum_{i=1}^p \beta_i r_{i-1}. \end{aligned} \tag{16} $$

An attentive reader may notice that the last line of Equation 16 is the (implicit) first row of the matrix equation in Equation 12. Indeed, we’re going to build Equation 12 row by row, so to speak.

Estimating $r_2$. Let’s do the same process as above, but only after multiplying Equation 14 by $x_{t-2}$ rather than by $x_{t-1}$. As you might imagine, this will produce $r_2$:

$$ \begin{aligned} \mathbb{E}[x_{t-2} x_t] &= \sum_{i=1}^p \beta_i \mathbb{E}[x_{t-2} x_{t-i}] \\ &\Downarrow \\ \frac{c_{-2}}{c_0} &= \sum_{i=1}^p \beta_i \frac{c_{i-2}}{c_0} \\ &\Downarrow \\ r_{2} &= \sum_{i=1}^p \beta_i r_{i-2}. \end{aligned} \tag{17} $$

This is the second row of the matrix equation in Equation 12.

Estimating $r_p$. We can see how this generalizes. To be thorough, let’s compute $r_k$ for any row $k$:

$$ \begin{aligned} \mathbb{E}[x_{t-k} x_t] &= \sum_{i=1}^p \beta_i \mathbb{E}[x_{t-k} x_{t-i}] \\ &\Downarrow \\ \frac{c_{-k}}{c_0} &= \sum_{i=1}^p \beta_i \frac{c_{i-k}}{c_0} \\ &\Downarrow \\ r_{k} &= \sum_{i=1}^p \beta_i r_{i-k}. \end{aligned} \tag{18} $$
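
One way to convince yourself of Equation 18 is to check it numerically: simulate an AR process, estimate its sample autocorrelations, and verify that $r_k \approx \sum_i \beta_i r_{i-k}$ for each lag $k$. A rough sketch with arbitrary AR(2) coefficients:

import numpy as np

rng  = np.random.default_rng(0)
beta = np.array([0.5, 0.3])              # Arbitrary stationary AR(2) coefficients.
p, n = len(beta), 200_000

# Simulate the AR(2) process with unit-variance white noise, then de-mean.
xs = np.zeros(n)
for t in range(p, n):
    xs[t] = beta @ xs[t-p:t][::-1] + rng.standard_normal()
xs -= xs.mean()

# Sample autocorrelation at lag k (r_{-k} = r_k by symmetry).
def r(k):
    k = abs(k)
    return (xs[:n-k] * xs[k:]).sum() / (xs * xs).sum()

# Each lag k should approximately satisfy Equation 18.
for k in range(1, p + 1):
    lhs = r(k)
    rhs = sum(beta[i-1] * r(i - k) for i in range(1, p + 1))
    print(k, round(lhs, 3), round(rhs, 3))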

Summary

Hopefully, the mechanics of constructing the Yule–Walker equations are now clear. We compute $\{ r_0, r_1, r_2, \dots, r_{p-1}, r_p \}$ from the data, where Equation 18 gives the relationship these autocorrelations must satisfy for each lag $k$. We use the last $p$ elements to construct the $p$-vector $\mathbf{r}$, and we use the first $p$ elements to construct the first column of $\mathbf{R}$. We then appropriately tile this column to construct a symmetric Toeplitz matrix. Finally, we plug these quantities into Equation 12 and solve for $\boldsymbol{\beta}$.
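
In code, the whole procedure amounts to a few lines. Here is a usage sketch of the AR class defined in A2, with the same true coefficients used in the experiment below:

# Usage sketch of the AR class from A2.
true_model = AR(coefs=[0.9, 0.05, 0.01], noise_std=1)
xs = true_model.sample(size=100_000)

new_model = AR(coefs=[0, 0, 0])    # Placeholder coefficients; only the order matters here.
new_model.fit(xs)
print(new_model.coefs)             # Yule–Walker estimates of the true coefficients.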

Implementing the Yule–Walker equations in code is fairly straightforward. See the fit function in the Python implementation of an AR process in A2. I generated 100,000 synthetic data points with the following true parameters,

$$ \boldsymbol{\beta} = \begin{bmatrix} 0.9 \\ 0.05 \\ 0.01 \end{bmatrix}. \tag{19} $$

Using my implementation of an AR process, I fit a new AR process to these data and inferred the following parameter estimates:

$$ \hat{\boldsymbol{\beta}} = \begin{bmatrix} 0.9002 \\ 0.0494 \\ 0.0092 \end{bmatrix}. \tag{20} $$
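
As a sanity check, these estimates can be compared against the yule_walker function in statsmodels, which the fit method in A2 is modeled on. A sketch, assuming statsmodels is installed; the simulation here is plain NumPy rather than the AR class:

import numpy as np
from statsmodels.regression.linear_model import yule_walker

rng  = np.random.default_rng(0)
beta = np.array([0.9, 0.05, 0.01])       # True coefficients from Equation 19.
p, n = len(beta), 100_000

# Simulate the AR(3) process with standard-normal white noise.
xs = np.zeros(n)
for t in range(p, n):
    xs[t] = beta @ xs[t-p:t][::-1] + rng.standard_normal()

# yule_walker returns the coefficient estimates and the estimated noise scale.
rho, sigma = yule_walker(xs, order=p, method="mle")
print(np.round(rho, 4))                  # Close to [0.9, 0.05, 0.01].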

As always, my derivations or code may have mistakes (please email me if you find any!), but I believe the main idea here is correct.

Conclusion

Autoregressive processes are simple yet useful statistical models for random processes exhibiting autocorrelation. The parameters can be inferred in a variety of ways. Using the Yule–Walker equations is fairly straightforward both conceptually and in practice.


Appendix

A1. Inference via OLS

While fitting an AR model with OLS is fairly straightforward, the resulting estimator is neither unbiased nor consistent. To quote Greene (2003):

If the regression contains any lagged values of the dependent variable, then least squares will no longer be unbiased or consistent.

To see why, consider Equation 1 again. The regressor $x_{t-1}$ is itself a function of $\varepsilon_{t-1}$, so the lagged values of the dependent variable are correlated with past noise terms. This breaks the OLS assumption of strict exogeneity. Why? Because an implication of strict exogeneity is that each regressor is orthogonal to the error term for all observations:

$$ \begin{aligned} \mathbb{E}[\varepsilon_t \mid \mathbf{X}] &= 0 && \text{(strict exogeneity)}, \\ &\Downarrow \\ \mathbb{E}[x_s \varepsilon_t] &= \mathbb{E}[\mathbb{E}[x_s \varepsilon_t \mid x_s]] \\ &= \mathbb{E}[x_s \mathbb{E}[ \varepsilon_t \mid x_s]] \\ &= \mathbb{E}[x_s \cdot 0] \\ &= 0. \end{aligned} $$

Finally, strict exogeneity is required in the proof that the OLS estimator is unbiased. Taken together, since the lagged regressors are correlated with some of the error terms, strict exogeneity fails and the OLS estimator is not unbiased.
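
One way to see this bias in practice is a small Monte Carlo experiment: repeatedly simulate a short AR(1) series, estimate $\beta$ by regressing $x_t$ on $x_{t-1}$, and average the estimates. The settings below are arbitrary, and the downward bias shrinks as the series length grows:

import numpy as np

rng = np.random.default_rng(0)
beta, n, trials = 0.8, 30, 5000

estimates = np.zeros(trials)
for j in range(trials):
    # Simulate a short, zero-mean AR(1) series.
    xs = np.zeros(n)
    for t in range(1, n):
        xs[t] = beta * xs[t-1] + rng.standard_normal()
    # OLS slope of x_t on x_{t-1} (no intercept, since the process is zero-mean).
    y, X = xs[1:], xs[:-1]
    estimates[j] = (X @ y) / (X @ X)

print(estimates.mean())   # Noticeably below the true beta = 0.8 for short series.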

A2. Autoregressive model in Python

import numpy as np
from   scipy.linalg import toeplitz
from   scipy.stats import norm


class AR:
    """Autoregressive model of order p.
    """

    def __init__(self, coefs=[1], obs_mean=0, obs_bias=0, noise_std=1):
        self.coefs      = np.asarray(coefs, dtype=float)
        self.order      = len(self.coefs)
        self.obs_mean   = obs_mean
        self.obs_bias   = obs_bias
        self.noise_dist = norm(0, noise_std)

    def fit(self, xs):
        """Fit the AR model using the Yule–Walker equations.

        This is based off the statsmodels yule_walker function:
        
          https://www.statsmodels.org/stable/generated/
            statsmodels.regression.linear_model.yule_walker.html
        """
        # Demean data.
        xs = xs - xs.mean()
        n = xs.shape[0]
        r = np.zeros(self.order+1)
        # We can't put this inside the loop, since xs[:-0] is an empty slice.
        r[0] = (xs * xs).sum() / (n-1)
        for i in range(1, self.order+1):
            # r_i is the sample autocovariance between x_t and x_{t-i}. We skip
            # normalizing by r[0], since the normalization cancels in the solve.
            r[i] = (xs[:-i] * xs[i:]).sum() / (n-i-1)
        R = toeplitz(r[:-1])
        beta = np.linalg.solve(R, r[1:])
        self.coefs = beta

    def sample(self, size=100):
        """Generate `size` random samples from the instantiated AR model.
        """
        c  = self.obs_bias
        x0 = self.obs_mean
        xs = np.array([x0] * self.order)

        if self.order > 1:
            pad = np.zeros(shape=(self.order-1))
            xs  = np.append(pad, xs)
        for t in range(size):
            e  = self.noise_dist.rvs()
            if self.order > 0:
                x = c + np.dot(xs[:self.order], self.coefs) + e
            else:
                # AR(0) is just white noise, possibly offset by c.
                x = c + e
            xs = np.append([x], xs)

        # The most recent datum is the first element, so flip the array to match
        # the time ordering in the text, and drop the initial seed values so
        # that exactly `size` samples are returned.
        return np.flip(xs)[-size:]

    @property
    def stationary(self):
        """Return `True` if the AR model is weakly stationary, `False`
        otherwise.
        """
        return np.abs(self.coefs).sum() < 1

1. Greene, W. H. (2003). Econometric analysis. Pearson Education India.