Linear Independence, Basis, and the Gram–Schmidt Algorithm

I formalize and visualize several important concepts in linear algebra: linear independence and dependence, orthogonality and orthonormality, and basis. Finally, I discuss the Gram–Schmidt algorithm, an algorithm for converting a basis into an orthonormal basis.

Definitions

A collection of one or more $d$-vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ is called linearly dependent if

$$\mathbf{0} = \alpha_1 \mathbf{a}_1 + \dots + \alpha_n \mathbf{a}_n, \tag{1}$$

for some scalars $\alpha_1, \dots, \alpha_n$ that are not all zero, where $\mathbf{0}$ is the $d$-vector of zeros. Linear independence is the opposite notion. A collection of $d$-vectors is linearly independent if the only way to combine them into the zero vector is with all coefficients equal to zero:

$$\mathbf{0} = \alpha_1 \mathbf{a}_1 + \dots + \alpha_n \mathbf{a}_n \quad\iff\quad \alpha_1 = \dots = \alpha_n = 0. \tag{2}$$

For example, here are two pairs of $2$-vectors, one linearly independent (left) and one linearly dependent (right):

$$\mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{u}_2 = \begin{bmatrix} 1 \\ -3 \end{bmatrix}, \qquad\qquad \mathbf{v}_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} 2 \\ 4 \end{bmatrix}. \tag{3}$$

The vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ are dependent since $-2 \mathbf{v}_1 + \mathbf{v}_2 = \mathbf{0}$.

Why do we use the word “dependence” to describe Equation 1? For a linearly dependent set, $\alpha_i \neq 0$ for at least one index $i$, so we can rewrite Equation 1 as

$$\mathbf{a}_i = (-\alpha_1/\alpha_i) \mathbf{a}_1 + \dots + (-\alpha_{i-1} / \alpha_i) \mathbf{a}_{i-1} + (-\alpha_{i+1} / \alpha_i) \mathbf{a}_{i+1} + \dots + (-\alpha_n/\alpha_i) \mathbf{a}_n. \tag{4}$$

In other words, any vector $\mathbf{a}_i$ for which $\alpha_i \neq 0$ can be written as a linear combination of the other vectors in the collection. This is the mathematical formulation for why we call the vectors “dependent”: at least one of them can be expressed in terms of the others.
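
As a quick numerical check, we can test linear dependence by stacking the vectors as the columns of a matrix and computing its rank: the vectors are independent exactly when the rank equals the number of vectors. Here is a minimal sketch in Python with NumPy, using the vectors from Equation 3 (the helper `is_independent` is mine, not a standard function):

```python
import numpy as np

def is_independent(vectors):
    """Return True if the given equal-length vectors are linearly independent."""
    A = np.column_stack(vectors)                    # each vector becomes a column of A
    return np.linalg.matrix_rank(A) == len(vectors)

u1, u2 = np.array([1, 0]), np.array([1, -3])
v1, v2 = np.array([1, 2]), np.array([2, 4])

print(is_independent([u1, u2]))  # True:  u1, u2 are linearly independent
print(is_independent([v1, v2]))  # False: v2 = 2 * v1
```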

Geometric interpretations

Let’s visualize linear dependence and independence. Recall that the vector addition $\mathbf{a} + \mathbf{b}$ can be visualized by first drawing the vector $\mathbf{a}$ from the origin and then drawing the vector $\mathbf{b}$ from the head of $\mathbf{a}$. The sum $\mathbf{a} + \mathbf{b}$ is the vector from the origin to the head of $\mathbf{b}$ (Figure 1). Note that we can flip the order, drawing $\mathbf{b}$ first and then $\mathbf{a}$, and still arrive at the same location.

Figure 1. Visualization of vector addition $\mathbf{a} + \mathbf{b}$.

Now what does it mean for two $2$-vectors to be scaled such that they add to $\mathbf{0} = [0, 0]$? It means they must lie on the same line (Figure 2). Intuitively, if they did not lie on the same line, there would be no way to scale and add the vectors such that the computation ends at the origin $\mathbf{0} = [0, 0]$.

Figure 2. Visualization of two linearly independent (left) and linearly dependent (right) vectors in $2$ dimensions.

Notice that if two $2$-vectors are linearly dependent, we can increase their dimension by appending zeros and still have linearly dependent vectors. For example, the linearly dependent vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ in Equation 3 remain linearly dependent in $3$ dimensions if we just append zeros:

$$\mathbf{v}_1 = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} 2 \\ 4 \\ 0 \end{bmatrix}. \tag{5}$$

Visually, appending zeros in this way is like embedding $2$-vectors that lie on a $2$-dimensional plane into a $3$-dimensional space.

This starts to suggest what linear dependence in $3$ dimensions looks like. Three linearly dependent $3$-vectors must lie on a plane that passes through the origin (Figure 3). Again, the visual intuition is that if this were not true, there would be no way, loosely speaking, to get back to the origin $\mathbf{0} = [0,0,0]$ using scaled versions of these vectors. Linear independence in three dimensions is precisely the opposite claim: the three $3$-vectors do not all lie on a plane that passes through the origin.

Figure 3. Visualization of three linearly independent (left) and linearly dependent (right) vectors in $3$ dimensions.

Uniqueness

If a vector $\mathbf{v}$ is a linear combination of linearly independent vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$,

$$\mathbf{v} = \alpha_1 \mathbf{a}_1 + \dots + \alpha_n \mathbf{a}_n, \tag{6}$$

then the coefficients $\alpha_1, \dots, \alpha_n$ are unique. This is easy to prove. Suppose there were other coefficients $\beta_1, \dots, \beta_n$ such that

$$\mathbf{v} = \beta_1 \mathbf{a}_1 + \dots + \beta_n \mathbf{a}_n. \tag{7}$$

Then clearly

$$\begin{aligned} \mathbf{0} &= \mathbf{v} - \mathbf{v} \\ &= (\alpha_1 \mathbf{a}_1 + \dots + \alpha_n \mathbf{a}_n) - (\beta_1 \mathbf{a}_1 + \dots + \beta_n \mathbf{a}_n) \\ &= (\alpha_1 - \beta_1) \mathbf{a}_1 + \dots + (\alpha_n - \beta_n) \mathbf{a}_n. \end{aligned} \tag{8}$$

Since we assumed that the vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ are linearly independent, each coefficient $\alpha_i - \beta_i$ must be zero, and therefore $\alpha_i = \beta_i$ for all $i$.

Dimension and basis

Independence–dimension inequality

Notice that if we have two $3$-vectors, they could lie on a plane that passes through the origin and still be linearly independent (Figure 4).

Figure 4. Embedding two linearly independent $2$-vectors into $3$ dimensions.

For example, we can increase the dimension of the vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ in Equation 3 by appending zeros, and the two resultant vectors would still be linearly independent:

$$\mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \mathbf{u}_2 = \begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}. \tag{9}$$

Furthermore, we could even add another vector to this collection such that the new collection is linearly independent, e.g.:

$$\mathbf{u}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \mathbf{u}_2 = \begin{bmatrix} 1 \\ -3 \\ 0 \end{bmatrix}, \quad \mathbf{u}_3 = \begin{bmatrix} 0 \\ 0 \\ 7 \end{bmatrix}. \tag{10}$$

However, we cannot add a new vector to the collection in Equation 10 and still have a linearly independent set. In general, we cannot have a collection of $n$ linearly independent $d$-vectors if $n > d$. I won’t prove this here, but I think it is an intuitive result. Imagine we had two linearly independent $2$-vectors, such as in Figure 2. Can you add a third vector to the set such that it is still impossible to scale and add the vectors to return to the origin?

This idea is called the independence–dimension inequality. Formally, if the $d$-vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ are linearly independent, then $n \leq d$. See (Boyd & Vandenberghe, 2018) for a proof. Therefore, a collection of linearly independent $d$-vectors cannot have more than $d$ elements. This immediately implies that any collection of $d+1$ or more $d$-vectors is linearly dependent.
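
We can sanity-check this numerically: the rank of a matrix is at most the smaller of its two dimensions, so any $d \times n$ matrix with $n > d$ columns must have linearly dependent columns. A small illustration (the random data here is arbitrary, just to exercise the claim):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                           # four 3-vectors: one more than the dimension
A = rng.normal(size=(d, n))           # columns are random 3-vectors

# rank(A) <= min(d, n) = 3 < 4, so the four columns cannot be independent.
print(np.linalg.matrix_rank(A) < n)   # True
```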

Basis

A basis is a collection of $d$ linearly independent $d$-vectors. Any $d$-vector $\mathbf{v}$ can be written as a linear combination of the vectors in a basis of $d$-vectors:

$$\mathbf{v} = \alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d. \tag{11}$$

The scalars $\alpha_1, \dots, \alpha_d$ are called the coordinates of $\mathbf{v}$ in the basis. As this definition suggests, you are already familiar with this concept. When we use Cartesian coordinates $(x, y)$ to specify the location of an object on the $xy$-plane, those coordinates implicitly rely on the basis vectors $[1, 0]$ and $[0, 1]$:

$$\begin{bmatrix} x \\ y \end{bmatrix} = x \begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{12}$$

Let’s prove that any $d$-vector $\mathbf{v}$ can be represented as a linear combination of a basis of $d$-vectors. Since a basis $\mathbf{a}_1, \dots, \mathbf{a}_d$ is linearly independent, we know that $\alpha_1 = \dots = \alpha_d = 0$ if and only if

$$\mathbf{0} = \alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d. \tag{13}$$

Now consider this basis plus an additional vector $\mathbf{v}$. By the independence–dimension inequality, this new collection of $d+1$ vectors is linearly dependent. Therefore, we can write

$$\mathbf{0} = \alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d + \nu \mathbf{v}, \tag{14}$$

where $\alpha_1, \dots, \alpha_d, \nu$ are not all zero. Clearly $\nu \neq 0$: if $\nu$ were zero, Equation 14 would reduce to Equation 13 with coefficients that are not all zero, contradicting the linear independence of the basis. And since $\nu \neq 0$, we can write the vector $\mathbf{v}$ as a linear combination of the basis vectors:

$$\mathbf{v} = (-\alpha_1 / \nu) \mathbf{a}_1 + \dots + (-\alpha_d / \nu) \mathbf{a}_d. \tag{15}$$

Furthermore, we proved that if $\mathbf{v}$ is a linear combination of independent vectors, then its coefficients are unique. Combining these two results, we can say: any $d$-vector $\mathbf{v}$ can be uniquely represented as a linear combination of basis vectors $\mathbf{a}_1, \dots, \mathbf{a}_d$.
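
Computationally, finding the coordinates of $\mathbf{v}$ in a basis amounts to solving a linear system: stack the basis vectors as the columns of a matrix $A$ and solve $A\boldsymbol{\alpha} = \mathbf{v}$. A minimal sketch, using the basis from Equation 3 (the vector `v` is just an arbitrary example):

```python
import numpy as np

# Basis vectors u1, u2 from Equation 3 as the columns of A.
A = np.column_stack([[1, 0], [1, -3]])
v = np.array([4.0, 6.0])          # an arbitrary 2-vector to represent

alpha = np.linalg.solve(A, v)     # unique coordinates of v in this basis
print(alpha)                      # [ 6. -2.]
print(np.allclose(A @ alpha, v))  # True: the coordinates reconstruct v
```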

Change of basis

Notice that we can describe the same vector using different bases. I won’t go into this concept in detail, but I think it is quite easy to visualize (Figure 5). A change of basis is an operation that re-expresses all vectors using a new basis or coordinate system. We’ll see in a bit how the Gram–Schmidt algorithm takes any basis and converts it into an orthonormal basis (discussed next).

Figure 5. A vector $\mathbf{a}$ is represented using two different bases.

Orthogonal and orthonormal vectors

Orthogonality and orthonormality

Two vectors $\mathbf{v}$ and $\mathbf{u}$ are orthogonal if their dot product is zero, i.e.

$$\mathbf{v} \cdot \mathbf{u} = 0. \tag{16}$$

A collection of vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ is orthogonal if Equation 16 holds for every pair of distinct vectors in the collection. A collection $\mathbf{a}_1, \dots, \mathbf{a}_n$ is orthonormal if it is orthogonal and if

$$\lVert \mathbf{a}_i \rVert = 1, \qquad i = 1, \dots, n. \tag{17}$$

That is, if every vector has unit length. We can express these two conditions as a single condition:

$$\mathbf{a}_i \cdot \mathbf{a}_j = \begin{cases} 1 & \text{if $i = j$,} \\ 0 & \text{if $i \neq j$.} \end{cases} \tag{18}$$
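
In matrix form, Equation 18 says that if the vectors are stacked as the columns of a matrix $A$, then $A^\top A$ is the identity. A quick check in NumPy (a sketch; the two example vectors are just an arbitrary orthonormal pair, and `is_orthonormal` is a name I made up):

```python
import numpy as np

def is_orthonormal(vectors, tol=1e-10):
    """Return True if the vectors, as columns of A, satisfy A^T A = I up to tol."""
    A = np.column_stack(vectors)
    return np.allclose(A.T @ A, np.eye(A.shape[1]), atol=tol)

# An orthonormal pair in 2D: unit vectors at 45 and 135 degrees.
a1 = np.array([1.0, 1.0]) / np.sqrt(2)
a2 = np.array([-1.0, 1.0]) / np.sqrt(2)
print(is_orthonormal([a1, a2]))  # True
```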

Orthonormal basis

We can easily prove that orthonormal $d$-vectors are linearly independent. Suppose that for some coefficients $\alpha_1, \dots, \alpha_d$,

$$\alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d = \mathbf{0}. \tag{20}$$

We can take the dot product of both sides of this equation with any $\mathbf{a}_i$ in the collection to get

$$\begin{aligned} \mathbf{0} &= \alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d \\ \mathbf{a}_i \cdot \mathbf{0} &= \mathbf{a}_i \cdot (\alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d) \\ 0 &= \alpha_1 (\mathbf{a}_i \cdot \mathbf{a}_1) + \dots + \alpha_d (\mathbf{a}_i \cdot \mathbf{a}_d) \\ &\Downarrow \\ 0 &= \alpha_i. \end{aligned} \tag{21}$$

The last step holds because of the conditions in Equation 18: every dot product vanishes except $\mathbf{a}_i \cdot \mathbf{a}_i = 1$. Thus, we have shown that for any $\mathbf{a}_i$ in the collection, its associated coefficient $\alpha_i$ is zero, i.e. orthonormal vectors are linearly independent. Notice that this immediately implies that any collection of $d$ orthonormal $d$-vectors is a basis.

Furthermore, notice that if $\mathbf{v}$ is a linear combination of orthonormal vectors, we can use the logic in Equation 21 to find the coefficients:

$$\begin{aligned} \mathbf{v} &= \alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d \\ \mathbf{a}_i \cdot \mathbf{v} &= \mathbf{a}_i \cdot (\alpha_1 \mathbf{a}_1 + \dots + \alpha_d \mathbf{a}_d) \\ &\Downarrow \\ \mathbf{a}_i \cdot \mathbf{v} &= \alpha_i. \end{aligned} \tag{22}$$

The dot product of each vector $\mathbf{a}_i$ in the orthonormal basis with $\mathbf{v}$ equals the associated coefficient.
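
This gives a much cheaper way to find coordinates in an orthonormal basis than solving a linear system: just take dot products. A sketch, reusing the arbitrary orthonormal pair from the earlier snippet:

```python
import numpy as np

# An orthonormal basis of R^2 (arbitrary example).
a1 = np.array([1.0, 1.0]) / np.sqrt(2)
a2 = np.array([-1.0, 1.0]) / np.sqrt(2)

v = np.array([3.0, -1.0])

# Equation 22: each coefficient is a dot product with the corresponding basis vector.
alpha1, alpha2 = a1 @ v, a2 @ v
print(np.allclose(alpha1 * a1 + alpha2 * a2, v))  # True: the coefficients reconstruct v
```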

The standard basis or standard unit vectors form a basis in which each vector is one-hot, i.e. all entries are zero except for a single $1$. For example, the standard basis for $d=2$ is

$$\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{23}$$

Gram–Schmidt algorithm

Imagine we had a collection of $d$-vectors $\mathbf{v}_1, \dots, \mathbf{v}_d$ that formed a basis but that were not necessarily orthonormal. Since orthonormal bases are both conventional and easier to work with, a common operation in numerical linear algebra is to convert a basis into an orthonormal basis. The Gram–Schmidt algorithm is a procedure for doing exactly that.

Vector projection

Before discussing the Gram–Schmidt algorithm, recall the definition of a vector projection. The vector projection of a vector $\mathbf{a}$ onto a vector $\mathbf{b}$, often denoted $\text{proj}_{\mathbf{b}}(\mathbf{a})$, is

$$\text{proj}_{\mathbf{b}}(\mathbf{a}) = \left( \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{b} \rVert} \right) \hat{\mathbf{b}}, \tag{24}$$

where $\hat{\mathbf{b}}$ is the unit vector in the direction of $\mathbf{b}$. This definition falls out of a little trigonometry and the definition of the dot product. We know from basic trigonometry that the length of the projection of $\mathbf{a}$ onto $\mathbf{b}$ is $\lVert \mathbf{a} \rVert \cos\theta$, where $\theta$ is the angle between $\mathbf{a}$ and $\mathbf{b}$ (Figure 6).

Figure 6. Two examples of vector projections of $\mathbf{a}$ onto $\mathbf{b}$. The length of the vector $\mathbf{x}$ is $\lVert \mathbf{a} \rVert \cos \theta$.

However, if we don’t know $\theta$, we can simply rewrite $\lVert \mathbf{a} \rVert \cos\theta$ using the definition of the dot product,

$$\begin{aligned} \mathbf{a} \cdot \mathbf{b} &:= \lVert \mathbf{a} \rVert \lVert \mathbf{b} \rVert \cos\theta \\ &\Downarrow \\ \lVert \mathbf{a} \rVert \cos\theta &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{b} \rVert}. \end{aligned} \tag{25}$$

This scalar value $(\mathbf{a} \cdot \mathbf{b}) / \lVert \mathbf{b} \rVert$ is the length of the vector $\mathbf{x}$ in Figure 6, but we must multiply it by a vector that has unit length and points in the direction of $\mathbf{b}$, hence the $\hat{\mathbf{b}}$ in Equation 24.
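
Equation 24 translates directly into code. Here is a minimal sketch of a `project` helper (the function name and example vectors are mine, not standard NumPy):

```python
import numpy as np

def project(a, b):
    """Vector projection of a onto b (Equation 24)."""
    b_hat = b / np.linalg.norm(b)   # unit vector in the direction of b
    return (a @ b_hat) * b_hat      # (a . b / ||b||) times b_hat

a = np.array([2.0, 1.0])
b = np.array([3.0, 0.0])
print(project(a, b))                # [2. 0.]: the component of a along b
```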

Algorithm

The Gram–Schmidt algorithm is fairly straightforward. It processes the vectors $\{ \mathbf{v}_1, \dots, \mathbf{v}_d \}$ one at a time while maintaining an invariant: all the previously processed vectors form an orthonormal set. For each vector $\mathbf{v}_i$, it first finds a new vector $\hat{\mathbf{v}}_i$ that is orthogonal to the previously processed vectors. It then normalizes $\hat{\mathbf{v}}_i$ to a vector we will call $\mathbf{u}_i$. There are plenty of proofs of the correctness of Gram–Schmidt, e.g. (Boyd & Vandenberghe, 2018). In this post, I will just focus on the intuition; I think it’s fairly straightforward to see why Gram–Schmidt should be correct.

Consider the first step of the algorithm, which processes the vector $\mathbf{v}_1$. Since no other vectors have been processed, $\mathbf{v}_1$ does not need to be changed: it is trivially orthogonal to the empty set. The algorithm simply normalizes $\mathbf{v}_1$, i.e. $\mathbf{u}_1 = \mathbf{v}_1 / \lVert \mathbf{v}_1 \rVert$ (Figures 7a and 7b).

Figure 7. Gray circles are unit circles. (a) The first vector $\mathbf{v}_1$. (b) $\mathbf{v}_1$ normalized to $\mathbf{u}_1$. (c) A second vector $\mathbf{v}_2$, decomposed into its vector projection onto $\mathbf{v}_1$ and the component orthogonal to that projection. (d) The new orthogonal vector $\hat{\mathbf{v}}_2$. (e) $\hat{\mathbf{v}}_2$ normalized to $\mathbf{u}_2$.

Next, consider the vector $\mathbf{v}_2$, which is not necessarily orthogonal to $\mathbf{u}_1$ (Figure 7c). This vector can be expressed as the sum of two vectors $\mathbf{x}$ and $\mathbf{y}$, where $\mathbf{x}$ is the projection of $\mathbf{v}_2$ onto $\mathbf{u}_1$ and where $\mathbf{y} = \mathbf{v}_2 - \mathbf{x}$ is orthogonal to that projection. Therefore, we can find a new vector $\hat{\mathbf{v}}_2$ that is orthogonal to $\mathbf{u}_1$ (Figure 7d) as

$$\hat{\mathbf{v}}_2 := \mathbf{v}_2 - \text{proj}_{\mathbf{u}_1}(\mathbf{v}_2). \tag{26}$$

Finally, we simply normalize $\hat{\mathbf{v}}_2$ to get $\mathbf{u}_2$ (Figure 7e). Gram–Schmidt then proceeds by finding a third vector, $\hat{\mathbf{v}}_3$, that is orthogonal to both $\mathbf{u}_1$ and $\mathbf{u}_2$ following the same logic, and so on. In general, the equation for $\hat{\mathbf{v}}_i$ is

$$\hat{\mathbf{v}}_i = \mathbf{v}_i - \text{proj}_{\mathbf{u}_1}(\mathbf{v}_i) - \dots - \text{proj}_{\mathbf{u}_{i-1}}(\mathbf{v}_i). \tag{27}$$

This is the basic idea of Gram–Schmidt. Besides proving it correct, there are some other details, such as how to handle the case in which the algorithm discovers that $\mathbf{v}_1, \dots, \mathbf{v}_d$ is not a basis (some $\hat{\mathbf{v}}_i$ comes out as the zero vector). However, I don’t think these details add much to the intuition desired here.
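
To make the procedure concrete, here is a minimal sketch of Gram–Schmidt following Equation 27. It is only a sketch: the function name and the dependence check are mine, and in practice one would typically reach for a numerically more careful variant or a QR factorization (e.g. `np.linalg.qr`) instead.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-12):
    """Convert a list of d-vectors into an orthonormal list via Equation 27.

    Raises ValueError if the input vectors are linearly dependent.
    """
    ortho = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        v_hat = v.copy()
        # Subtract the projection of v onto each previously computed u (Equation 27).
        for u in ortho:
            v_hat = v_hat - (v @ u) * u   # u has unit norm, so proj_u(v) = (v . u) u
        norm = np.linalg.norm(v_hat)
        if norm < tol:                    # v was a combination of earlier vectors
            raise ValueError("input vectors are not linearly independent")
        ortho.append(v_hat / norm)        # normalize to get u_i
    return ortho

# Example: orthonormalize the basis u1, u2 from Equation 3.
u1, u2 = gram_schmidt([np.array([1, 0]), np.array([1, -3])])
print(u1, u2)                             # [1. 0.] [ 0. -1.]
print(np.isclose(u1 @ u2, 0.0))           # True: the result is orthogonal
```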

  1. Boyd, S., & Vandenberghe, L. (2018). Introduction to applied linear algebra: Vectors, matrices, and least squares. Cambridge University Press.