Why Shouldn't I Invert That Matrix?

A standard claim in textbooks and courses in numerical linear algebra is that one should not invert a matrix to solve for \mathbf{x} in \mathbf{Ax} = \mathbf{b}. I explore why this is typically true.

In a recent research meeting, I was told, “Never invert a matrix.” The person went on to explain that while we always use \mathbf{A}^{-1} to denote a matrix inversion in an equation, in practice, we don’t actually invert the matrix. Instead, we solve a system of linear equations. Let me first clarify this claim. Consider solving for \mathbf{x} in

\mathbf{A} \mathbf{x} = \mathbf{b}, \tag{1}

where \mathbf{A} is an n \times n matrix and \mathbf{x} and \mathbf{b} are n-vectors. One way to solve this equation is with the matrix inverse \mathbf{A}^{-1},

\mathbf{x} = \mathbf{A}^{-1} \mathbf{b}. \tag{2}

However, we could avoid computing \mathbf{A}^{-1} entirely by solving the system of linear equations directly.

So why and when is one approach better than the other? John Cook has a blog post on this topic, and while it is widely referenced, it is light on details. For example, Cook claims that “Solving the system is more numerically accurate than performing the matrix multiplication” but provides no explanation or evidence.

The goal of this post is to expand on why computing Eq. 2 explicitly is often undesirable. As always, the answer lies in the details.

LU decomposition

Before comparing matrix inversion to linear system solving, we need to talk about a powerful operation that is used in both calculations: the lower–upper decomposition. Feel free to skip this section if you are familiar with this material.

The LU decomposition uses Gaussian elimination to transform a full linear system into an upper-triangular system by applying linear transformations from the left. It is similar to the QR decomposition without the constraint that the left matrix is orthogonal. Both decompositions can be used for solving linear systems and inverting matrices, but I’ll focus on the LU decomposition because, at least as I understand it, it is typically preferred in practice.

The basic idea is to transform \mathbf{A} into an upper-triangular matrix \mathbf{U} by repeatedly introducing zeros below the diagonal: first add zeros to the first column, then the second, and so on:

\overbrace{\begin{bmatrix} \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \end{bmatrix}}^{\mathbf{A}} \rightarrow \overbrace{\begin{bmatrix} \times & \times & \times \\ \mathbf{0} & \times & \times \\ \mathbf{0} & \times & \times \end{bmatrix}}^{\mathbf{L}_1 \mathbf{A}} \rightarrow \overbrace{\begin{bmatrix} \times & \times & \times \\ \mathbf{0} & \times & \times \\ \mathbf{0} & \mathbf{0} & \times \end{bmatrix}}^{\mathbf{L}_2 \mathbf{L}_1 \mathbf{A}}. \tag{3}

On the i-th transformation, the algorithm introduces n-i zeros below the diagonal in column i by subtracting multiples of the i-th row from the rows beneath it (i+1, \dots, n).

Let’s look at an example that borrows heavily from (Trefethen & Bau III, 1997). Consider the matrix

\mathbf{A} = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix}. \tag{4}

First, we want to introduce zeros below the top-left 2. Trefethen and Bau picked the numbers in \mathbf{A} judiciously for convenient elimination. Namely, A_{1,0} = 2 A_{0,0} and A_{2,0} = 4 A_{0,0}. We can represent this elimination as a lower-triangular matrix \mathbf{L}_1:

\mathbf{L}_1 \mathbf{A} = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{bmatrix}. \tag{5}

Now we just need to eliminate the 3 in A_{2,1}, which we can do through another linear map:

\mathbf{L}_2 \mathbf{L}_1 \mathbf{A} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -3 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix}. \tag{6}

What this example highlights is that we can maintain the previously computed zeros by only operating on rows below the i-th row.

At this point, you may note that we have computed

\mathbf{L}_{n-1} \mathbf{L}_{n-2} \dots \mathbf{L}_1 \mathbf{A} = \mathbf{U}. \tag{7}

Therefore, the desired lower-triangular matrix requires n-1 inverses!

\mathbf{L} = \mathbf{L}_1^{-1} \mathbf{L}_2^{-1} \dots \mathbf{L}_{n-1}^{-1}. \tag{8}

As it turns out, the general formula for \mathbf{L}_i is such that computing its inverse is trivial: we just negate the values below the diagonal. Please see chapter 20 of (Trefethen & Bau III, 1997) for details.
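
To make this concrete, here is a minimal NumPy sketch of the elimination procedure, without the row pivoting that production implementations use for stability. The function name lu_no_pivot is my own, and the example matrix is the one from Eq. 4:

import numpy as np

def lu_no_pivot(A):
    """LU decomposition via Gaussian elimination, no pivoting (illustration only)."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for i in range(n - 1):
        for j in range(i + 1, n):
            # Multiplier that zeros out U[j, i]; it is the negated entry of L_i,
            # so storing it directly builds L = L_1^{-1} ... L_{n-1}^{-1}.
            L[j, i] = U[j, i] / U[i, i]
            U[j, i:] -= L[j, i] * U[i, i:]
    return L, U

A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])
L, U = lu_no_pivot(A)
assert np.allclose(L @ U, A)
print(L)  # [[1, 0, 0], [2, 1, 0], [4, 3, 1]]
print(U)  # [[2, 1, 1], [0, 1, 1], [0, 0, 2]]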

How many floating-point operations (flops) are required to compute an LU decomposition? You can find more detailed proofs elsewhere, but here’s the simple geometric intuition provided by Trefethen and Bau. On the i-th step, we update the (n-i) \times (n-i) block of entries below and to the right of the pivot. So each of the n-1 iterations of the algorithm does work on the order of (n-i) \times (n-i). We can visualize this as a pyramid of computation (Figure 1).

Figure 1. Geometric intuition for the complexity of the LU decomposition. On each iteration, we perform one fewer operation across both the rows and columns. This results in a step pyramid of computation.

Each unit of volume of this pyramid requires two flops (not discussed here), and the volume of the pyramid as n grows is approximately (1/3) n^3. So the work required for the LU decomposition is

\sim \frac{2}{3} n^3 \quad \text{flops}. \tag{9}

Solving a linear system

Now let’s talk about using the LU decomposition to solve a linear system. Let the notation \mathbf{A} \setminus \mathbf{b} denote the vector \mathbf{x} that solves \mathbf{A x} = \mathbf{b}. How do we compute \mathbf{x} without using a matrix inverse? A standard method is to use the LU decomposition of \mathbf{A} and then solve two easy systems of equations:

\begin{aligned} \mathbf{Ax} &= \mathbf{b}, \\ \mathbf{LUx} &= \mathbf{b}, && \text{(LU decomposition)} \\ \mathbf{L} \setminus \mathbf{b} &= \mathbf{y}, && \text{(forward substitution)} \\ \mathbf{U} \setminus \mathbf{y} &= \mathbf{x}, && \text{(backward substitution)} \\ &\Downarrow \\ \mathbf{LUx} &= \mathbf{Ly} = \mathbf{b}. \end{aligned} \tag{10}

Why is it easy to solve for \mathbf{y} and \mathbf{x} with respect to \mathbf{L} and \mathbf{U}? Because these are lower- and upper-triangular matrices, respectively. For example, consider solving for \mathbf{y} above:

\begin{bmatrix} 1 \\ \ell_{2,1} & 1 \\ \vdots & \dots & \ddots \\ \ell_{n-1,1} & \dots & \dots & 1 \\ \ell_{n,1} & \dots & \dots & \ell_{n,n-1} & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{n-1} \\ b_n \end{bmatrix} \tag{11}

For each value y_i, we have

y_i = b_i - \ell_{i,1} y_1 - \dots - \ell_{i,i-1} y_{i-1}. \tag{12}

Solving for \mathbf{y} in this way is called forward substitution because we compute y_1 and then forward-substitute the value into the next row, and so on. We can then run backward substitution on the upper-triangular system, starting at the bottom row and moving up, to solve for \mathbf{x}.
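
Here is a small sketch of both substitutions in NumPy, applied to the \mathbf{L} and \mathbf{U} from the worked example in Eqs. 5 and 6. The helper names forward_sub and backward_sub are mine, and forward_sub assumes the unit diagonal produced by the elimination above:

import numpy as np

def forward_sub(L, b):
    # Solve L y = b for unit-lower-triangular L, top row down (Eq. 12).
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    return y

def backward_sub(U, y):
    # Solve U x = y for upper-triangular U, bottom row up.
    n = len(y)
    x = np.zeros(n)
    for i in reversed(range(n)):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

L = np.array([[1., 0., 0.], [2., 1., 0.], [4., 3., 1.]])
U = np.array([[2., 1., 1.], [0., 1., 1.], [0., 0., 2.]])
b = np.array([1., 2., 3.])

x = backward_sub(U, forward_sub(L, b))
assert np.allclose(L @ U @ x, b)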

What’s the cost of this computation? Forward substitution requires i-1 multiplications and i-1 subtractions at the i-th step, or

2 \sum_{i=1}^n (i-1) = \big( 2 \sum_{i=1}^n i \big) - 2n = n^2 - n. \tag{13}

The cost of backward substitution is slightly larger because the diagonal elements of \mathbf{U} are not equal to one, so there are n additional divisions, for roughly n^2 flops. The total cost of solving a linear system of equations using the LU decomposition is therefore

\sim \frac{2}{3} n^3 + 2 n^2 \quad \text{flops}. \tag{14}
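
In SciPy, this factor-once-then-substitute pipeline is available directly through scipy.linalg.lu_factor and lu_solve (which also perform partial pivoting, a detail I have glossed over). A minimal sketch on an arbitrary random system:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factor A once, then forward- and backward-substitute.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

assert np.allclose(A @ x, b)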

Inverting a matrix

Now let’s consider inverting a matrix. Armed with the LU decomposition, matrix inversion is easy. We want to compute \mathbf{A}^{-1} in

\mathbf{A}\mathbf{A}^{-1} = \mathbf{I} = \left[\begin{array}{c|c|c|c} \mathbf{e}_1 & \mathbf{e}_2 & \cdots & \mathbf{e}_n \end{array}\right]. \tag{15}

Once we have the LU decomposition, we can simply solve for each column \mathbf{x}_i of \mathbf{A}^{-1}, using the corresponding column \mathbf{e}_i of the identity:

\mathbf{LUx}_i = \mathbf{e}_i. \tag{16}

The simplest algorithm would be to perform this solve n times. Since the LU decomposition requires (2/3) n^3 flops and solving for each \mathbf{x}_i requires 2n^2 flops, we see that matrix inversion requires

\sim \frac{8}{3} n^3 \quad \text{flops.} \tag{17}
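
Here is a sketch of this column-by-column inversion: one LU factorization followed by n triangular solves, one per column of the identity. (np.linalg.inv relies on LAPACK routines that organize the work differently, but the idea is the same.)

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))

# Factor once, then solve L U x_i = e_i for each column e_i of I (Eq. 16).
lu, piv = lu_factor(A)
A_inv = np.column_stack([lu_solve((lu, piv), e) for e in np.eye(n)])

assert np.allclose(A_inv, np.linalg.inv(A))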

That said, matrix inversion is a complicated topic, and there are faster approaches; exploiting the zeros in each \mathbf{e}_i, for example, brings the total down to roughly 2 n^3 flops. But the basic point stands: the inversion itself is no cheaper than the LU decomposition, which already does most of the work of solving for \mathbf{x}.

Finally, computing \mathbf{A}^{-1} is not the same thing as computing \mathbf{x}. To get \mathbf{x}, we still have to perform a matrix-vector multiplication, \mathbf{A}^{-1} \mathbf{b}, which costs another

\sim 2 n^2 \quad \text{flops.} \tag{18}

As we can see, inverting a matrix to solve \mathbf{Ax} = \mathbf{b} costs roughly three to four times as many flops as directly solving for \mathbf{x}, depending on how cleverly the inversion is implemented. This alone is justification for the claim that, unless you have good reasons, it is better to directly solve for \mathbf{x}.

Numerical errors and conditioning

As we’ll see, solving a linear system using a matrix inverse can also be less accurate. However, before discussing this, let’s review several concepts related to numerical methods. Consider the function

y = f(x), \tag{19}

and imagine we can only approximate its output, denoted \hat{y}. The forward error is the difference between the true and approximate outputs, \left| \hat{y} - y \right|. The backward error instead measures how much we would have to perturb the input to make our approximation exact: we want the smallest \Delta x such that

\hat{y} = f(x + \Delta x). \tag{20}

In other words, for what input data have we actually solved the problem? If \Delta x is very large, then our approximation \hat{y} is the exact answer only for an input far from the true input x. The quantity \left| \Delta x \right| is called the backward error.

For example, imagine we approximate \sqrt{2} with 1.4. The forward error is the absolute difference between 1.4 and \sqrt{2}, about 0.014. Since \sqrt{1.96} = 1.4, the backward error is the absolute difference between 2 and 1.96, which is 0.04.
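
The same example in code, treating 1.4 as our computed value:

import numpy as np

x, x_hat = 2.0, 1.4                        # approximate sqrt(2) by 1.4
forward_error  = abs(np.sqrt(x) - x_hat)   # |sqrt(2) - 1.4|, about 0.0142
backward_error = abs(x - x_hat ** 2)       # |2 - 1.96| = 0.04, since sqrt(1.96) = 1.4
print(forward_error, backward_error)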

Finally, we say that the problem y = f(x) is well-conditioned if a small change in x leads to a small change in y, and ill-conditioned otherwise. This is often quantified with a condition number. In numerical linear algebra, the condition number of a matrix, denoted by \kappa(\mathbf{A}), is

\kappa(\mathbf{A}) = \lVert \mathbf{A} \rVert \lVert \mathbf{A}^{-1} \rVert. \tag{21}

If \lVert \cdot \rVert is the matrix 2-norm, then Eq. 21 is just the ratio of the largest to smallest singular values,

\kappa(\mathbf{A}) = \frac{\lambda_1}{\lambda_n}. \tag{21}

See this StackExchange answer for a proof. In other words, an ill-conditioned matrix is one with a large spread in its singular values. Intuitively, this is not surprising, since matrices with sharp decays in their singular values are associated with many other instabilities, such as in control systems.
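
As a quick NumPy check of this equivalence (an illustration, not part of the experiments below):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

sigma = np.linalg.svd(A, compute_uv=False)  # singular values, largest first
kappa = np.linalg.cond(A, 2)                # 2-norm condition number

assert np.isclose(kappa, sigma[0] / sigma[-1])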

Accuracy

Let’s return to the main thread. A second problem with matrix inversion is that it can be less accurate. Let \mathbf{x}_{\texttt{LU}} denote the vector obtained by solving the system directly via the LU decomposition, and let \mathbf{x}_{\texttt{inv}} denote the vector obtained by computing \text{inv}(\mathbf{A}) \times \mathbf{b}, where \text{inv}(\mathbf{A}) is computed using the LU decomposition as discussed above. Now imagine that we computed the matrix inverse exactly, meaning we have no rounding errors in the computation of \text{inv}(\mathbf{A}). Then the best residual bound on the backward error, taken from section 14.1 of (Higham, 2002), is:

\left| \mathbf{b} - \mathbf{A} \mathbf{x}_{\texttt{inv}} \right| \leq \textcolor{#8e7cc3}{\gamma_n \left| \mathbf{A} \right|} \left| \mathbf{A}^{-1} \right| \left| \mathbf{b} \right|, \tag{22}

where \gamma_n is a small constant factor accounting for things like machine precision. For \mathbf{x}_{\texttt{LU}}, this bound is

\left| \mathbf{b} - \mathbf{A} \mathbf{x}_{\texttt{LU}} \right| \leq \textcolor{#8e7cc3}{\gamma_n \left| \mathbf{L} \right| \left| \mathbf{U} \right|} \left| \mathbf{x}_{\texttt{LU}} \right|. \tag{23}

If we assume the LU decomposition of \mathbf{A} is relatively good, something Higham says is “usually true,” then the purple terms in Eqs. 22 and 23 are roughly equal, and the bounds differ in the remaining factors: \left| \mathbf{x}_{\texttt{LU}} \right| versus \left| \mathbf{A}^{-1} \right| \left| \mathbf{b} \right|. Since \left| \mathbf{A}^{-1} \mathbf{b} \right| can be much smaller than \left| \mathbf{A}^{-1} \right| \left| \mathbf{b} \right|, the matrix inversion approach can have significantly worse backward error than directly solving for \mathbf{x}, particularly when \mathbf{A} is ill-conditioned.

That said, the forward errors of matrix inversion and direct solving can be much closer than these bounds suggest, especially for well-conditioned problems, a point argued by (Druinsky & Toledo, 2012). Thus, in practice, if you care only about the forward error, the folk wisdom that you should never invert a matrix may overstate the case. I don’t discuss the bounds on forward errors here, but see this StackExchange post for details.

As I see it, the upshot is still the same: solving a system of linear equations by performing a matrix inversion is typically less accurate than solving the system directly.

Numerical experiments

I ran a couple of simple numerical experiments to verify the results discussed above.

First, let’s explore whether, in practice, solving for \mathbf{x} directly is faster than using a matrix inversion. For increasing n, I ran the following experiment: generate a matrix \mathbf{A} and vector \mathbf{b}, then solve for \mathbf{x} using np.linalg.solve(A, b) and np.linalg.inv(A) @ b. I ran each experiment 10 times and recorded the mean run time (Figure 2). As expected, solving for \mathbf{x} directly is faster. While the absolute difference in seconds is trivial here, it is easy to imagine this mattering more in large-scale data processing tasks.

Figure 2. Mean run time to compute \mathbf{x} in \mathbf{Ax} = \mathbf{b} over 10 trials and increasing matrix dimension n, using both a matrix inversion and a linear solver.
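
The timing comparison is simple to reproduce. Here is a sketch for a single n; the matrix size and number of repetitions are arbitrary choices on my part:

import time
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

def mean_time(fn, reps=10):
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return np.mean(times)

print("solve:", mean_time(lambda: np.linalg.solve(A, b)))
print("inv  :", mean_time(lambda: np.linalg.inv(A) @ b))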

Next, let’s explore the forward and backward error for well- and ill-conditioned problems. I generated two matrices \mathbf{A}_w and \mathbf{A}_i in the following way:

import numpy as np
from scipy.stats import ortho_group

n = 100   # matrix dimension (example value)
b = 4     # exponent controlling the singular-value decay (varied in Figure 3)

# True solution vector.
x = np.random.normal(size=n)

# Well-conditioned A and resulting b.
Aw = np.random.normal(size=(n, n))
bw = Aw @ x

# Ill-conditioned A and resulting b: sharply decaying singular values.
lambs = np.power(np.linspace(0, 1, n), b)
inds  = np.argsort(lambs)[::-1]
lambs = lambs[inds]

Q  = ortho_group.rvs(dim=n)   # random orthogonal matrix
Ai = Q @ np.diag(lambs) @ Q.T
bi = Ai @ x

In words, the well-conditioned matrix is just a full-rank matrix with each element drawn from \mathcal{N}(0, 1). For the ill-conditioned matrix, I generated a matrix with sharply decaying singular values, where the decay rate is controlled by the exponent b. As we can see (Figure 3), both the forward and backward errors are relatively small and close to each other when the matrix is well-conditioned. However, when the matrix is ill-conditioned, the forward and backward errors increase, loosely as a function of the decay rate of the singular values.

Figure 3. (Top row) Forward and backward errors for \mathbf{x}_{\texttt{inv}} and \mathbf{x}_{\texttt{LU}} on a well-conditioned matrix. (Bottom row) Forward and backward errors for \mathbf{x}_{\texttt{inv}} and \mathbf{x}_{\texttt{LU}} on ill-conditioned matrices. The singular values decay with various exponents b (bottom left).
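
For reference, here is a sketch of how the errors in Figure 3 can be measured, reusing Aw, bw, Ai, bi, and x from the snippet above. I use the residual norm as a proxy for the backward error, since that is the quantity the bounds in Eqs. 22 and 23 control:

import numpy as np

def report_errors(A, b, x_true):
    # Compare the direct solve and the inversion route on one problem.
    x_lu  = np.linalg.solve(A, b)
    x_inv = np.linalg.inv(A) @ b
    for name, x_hat in [("solve", x_lu), ("inv  ", x_inv)]:
        fwd = np.linalg.norm(x_hat - x_true)  # forward error
        res = np.linalg.norm(b - A @ x_hat)   # residual (backward-error proxy)
        print(f"{name}  forward: {fwd:.2e}  residual: {res:.2e}")

# Aw, bw, Ai, bi, and x come from the experiment code above.
report_errors(Aw, bw, x)
report_errors(Ai, bi, x)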

Summary

When solving for \mathbf{x} in \mathbf{Ax} = \mathbf{b}, it is always faster and often more accurate to directly solve the system of linear equations, rather than inverting \mathbf{A} and doing a left multiplication with \mathbf{b}. This is a big topic, and I’m sure I’ve missed many nuances, but I think the main point stands: as a rule, one should be wary of inverting matrices.

  1. Trefethen, L. N., & Bau III, D. (1997). Numerical linear algebra (Vol. 50). SIAM.
  2. Higham, N. J. (2002). Accuracy and stability of numerical algorithms. SIAM.
  3. Druinsky, A., & Toledo, S. (2012). How accurate is inv(A)*b? arXiv preprint arXiv:1201.6035.