Interpreting Expectations and Medians as Minimizers

I show how several properties of the distribution of a random variable—the expectation, conditional expectation, and median—can be viewed as solutions to optimization problems.

When most people first learn about expectations, they are given a definition such as

\mathbb{E}[g(Y)] = \int g(y) f(y) \text{d}y

for some random variable Y with density f and function g. The instructor might motivate the expectation as the average of Y over many repeated experiments and then observe that expectations are really averages.
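
To make the "averages" intuition concrete, here is a quick Monte Carlo sketch; the exponential distribution and g(y) = y^2 are arbitrary choices of mine, picked because the integral has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch: E[g(Y)] as an average over repeated draws.
# Here Y ~ Exponential(rate 1) and g(y) = y^2, so E[g(Y)] = 2 exactly.
y = rng.exponential(scale=1.0, size=1_000_000)
print(np.mean(y**2))  # ~2.0, matching the integral of y^2 exp(-y) over [0, inf)
```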

However, another interpretation of the expectation of Y is that it is the minimizer of a particular loss function, and this interpretation generalizes to other properties of Y’s distribution. The goal of this post is to give a few examples with detailed proofs. I find this interpretation useful because it helps explain why the mean squared error loss is so common.

Expectation and the squared loss

Let Y be a square-integrable random variable. Then we claim that

\mathbb{E}[Y] = \arg\!\min_{a \in \mathbb{R}} \mathbb{E}[(Y - a)^2].

We can prove this with a little clever algebra,

\begin{aligned} \mathbb{E}[(Y - a)^2] &= \mathbb{E}[(Y - \mathbb{E}[Y] + \mathbb{E}[Y] - a)^2] \\ &= \mathbb{E}[(Y - \mathbb{E}[Y])^2] + \mathbb{E}[(\mathbb{E}[Y] - a)^2] + 2 \mathbb{E}[(Y - \mathbb{E}[Y])(\mathbb{E}[Y] - a)]. \end{aligned} \tag{1}

The above works because

(\overbrace{\phantom{\big|}C - D}^{A} + \overbrace{\phantom{\big|}D - E}^{B})^2 = A^2 + B^2 + 2 AB.
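
If the overbrace notation is hard to parse, a two-line symbolic check (using SymPy, purely as an optional aside) confirms the expansion with A = C - D and B = D - E:

```python
import sympy as sp

C, D, E = sp.symbols("C D E")
A, B = C - D, D - E                                # A = C - D, B = D - E
diff = sp.expand((A + B) ** 2 - (A**2 + B**2 + 2 * A * B))
print(diff)                                        # 0: the identity is exact
```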

The first term in Equation 1 does not depend on a and therefore can be ignored in our optimization calculation. Furthermore, note that the cross term is equal to zero. This is because

\begin{aligned} 2 \mathbb{E}[(Y - \mathbb{E}[Y])(\mathbb{E}[Y] - a)] &= 2 \mathbb{E}[Y - \mathbb{E}[Y]](\mathbb{E}[Y] - a) \\ &= 2 (\mathbb{E}[Y] - \mathbb{E}[\mathbb{E}[Y]])(\mathbb{E}[Y] - a) \\ &= 2 (0) (\mathbb{E}[Y] - a) \\ &= 0. \end{aligned}

If any step is confusing, just recall that \mathbb{E}[Y] is nonrandom, and \mathbb{E}[c] = c for any constant c. What this means is that our original optimization problem reduces to

\arg\!\min_{a \in \mathbb{R}} \mathbb{E}[(\mathbb{E}[Y] - a)^2].

Since (\cdot)^2 is a convex function, a^{\star} = \mathbb{E}[Y] is the minimizer:

\begin{aligned} \frac{\partial}{\partial a} \mathbb{E}[(\mathbb{E}[Y] - a)^2] &= 0 \\ -2 \mathbb{E}[\mathbb{E}[Y] - a] &= 0 \\ -2 \mathbb{E}[Y] + 2a &= 0 \\ a &= \mathbb{E}[Y]. \end{aligned}
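
If you would rather see the claim numerically, here is a minimal sketch with an arbitrary gamma distribution of my choosing: approximate \mathbb{E}[(Y - a)^2] by a sample average over a grid of candidate values a, and check that the minimizer lands on the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed sample so that the mean and median differ noticeably.
y = rng.gamma(shape=2.0, scale=1.0, size=200_000)

# Approximate E[(Y - a)^2] by a sample average for each candidate constant a.
grid = np.linspace(0.0, 6.0, 601)
sq_loss = np.array([np.mean((y - a) ** 2) for a in grid])

print(grid[np.argmin(sq_loss)])  # close to 2.0, the Gamma(2, 1) mean
print(y.mean())                  # the sample mean lands in the same place
```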

Conditional expectation and the best predictor

Now let’s consider a more complicated example. Consider two square-integrable random variables Y and X. Let \mathcal{G} be the class of all square-integrable functions of X. What is the best function g \in \mathcal{G} such that the mean squared error is minimized, i.e.

g^{\star} = \arg\!\min_{g \in \mathcal{G}} \mathbb{E}[(Y - g(X))^2]?

In words, what is the best predictor of Y given X? It turns out to be the conditional expectation. The derivation is nearly the same as above. We add and subtract \mathbb{E}[Y \mid X], do a little algebra, and show that the cross term goes to zero:

\begin{aligned} \mathbb{E}[(Y - g(X))^2] &= \mathbb{E}[(Y - \mathbb{E}[Y \mid X])^2] + \mathbb{E}[(\mathbb{E}[Y \mid X] - g(X))^2] \\ &+ 2 \mathbb{E}[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - g(X))] \end{aligned} \tag{2}

Using two properties,

\begin{aligned} \mathbb{E}[Y] &= \mathbb{E}[\mathbb{E}[Y \mid X]] && \text{Law of total expectation} \\ \mathbb{E}[g(X) Y \mid X] &= g(X) \mathbb{E}[Y \mid X], && \text{Pull out known factors} \end{aligned}

it is straightforward to see that the cross term is zero:

\begin{aligned} \mathbb{E}[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - g(X))] &= \mathbb{E}\big( \mathbb{E}[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - g(X)) \mid X]\big) \\ &\stackrel{\star}{=} \mathbb{E}\big( \mathbb{E}[Y - \mathbb{E}[Y \mid X] \mid X] \big(\mathbb{E}[Y \mid X] - g(X)\big)\big) \\ &= \mathbb{E}\big( \big(\mathbb{E}[Y \mid X] - \mathbb{E}\{\mathbb{E}[Y \mid X] \mid X\}\big) \big(\mathbb{E}[Y \mid X] - g(X)\big)\big) \\ &\stackrel{\dagger}{=} \mathbb{E}\big( \big(\underbrace{\mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X]}_{=\;0}\big) \big(\mathbb{E}[Y \mid X] - g(X)\big)\big) \\ &= 0. \end{aligned}

Step \star holds because \mathbb{E}[Y \mid X] is a function of X but not of Y (intuitively, if we did not condition on the randomness in X, then \mathbb{E}[Y] would be nonrandom), and therefore (\mathbb{E}[Y \mid X] - g(X)) is a function of X alone and can be pulled out of the conditional expectation. In step \dagger, we use

\mathbb{E}\{\mathbb{E}[Y \mid X] \mid X\} = \mathbb{E}[Y \mid X] \mathbb{E}\{1 \mid X\} = \mathbb{E}[Y \mid X].

Once again, the first term in Equation 2 does not depend on g, and therefore

\arg\!\min_{g \in \mathcal{G}} \mathbb{E}[(Y - g(X))^2] = \arg\!\min_{g \in \mathcal{G}} \mathbb{E}[(\mathbb{E}[Y \mid X] - g(X))^2].

Again, we have a convex function, this time with a minimum at g^{\star} = \mathbb{E}[Y \mid X]. The first term in Equation 2, \mathbb{E}[(Y - \mathbb{E}[Y \mid X])^2], has a nice interpretation: given the best predictor g^{\star}, this is the lower bound on our loss. Our remaining loss is a function of how close the conditional expectation \mathbb{E}[Y \mid X] is to Y.
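
Here is a small numerical sketch of this result; the data-generating process Y = X^2 + \varepsilon is my own choice so that \mathbb{E}[Y \mid X] = X^2 is known in closed form. The conditional expectation should achieve the smallest mean squared error among the predictors we try, and the empirical cross term from Equation 2 should be roughly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated example where the conditional expectation is known exactly:
# X ~ N(0, 1) and Y = X^2 + eps with independent standard normal noise,
# so E[Y | X] = X^2.
n = 200_000
x = rng.normal(size=n)
y = x**2 + rng.normal(size=n)
cond_exp = x**2

# Mean squared error of a few competing square-integrable predictors g(X).
predictors = {
    "E[Y | X] = X^2": cond_exp,
    "constant E[Y] = 1": np.ones(n),  # also the best *linear* predictor here, since Cov(X, Y) = 0
    "g(X) = |X|": np.abs(x),
    "g(X) = 2 X^2": 2 * x**2,
}
for name, g in predictors.items():
    print(f"{name:20s} MSE = {np.mean((y - g) ** 2):.3f}")  # smallest for E[Y | X]

# The cross term from Equation 2 is approximately zero for an arbitrary g.
g = np.abs(x)
print(np.mean((y - cond_exp) * (cond_exp - g)))  # ~0 up to Monte Carlo error
```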

Median and the absolute loss

Finally, let’s look at a different loss function, the absolute value. Let X be a continuous random variable with Lebesgue density f(x) and CDF F(x). The median m of X is defined as m \triangleq F^{-1}(1/2). In words, it is the value that X is equally likely to fall above or below. However, we can show that m is also the minimizer of the expected absolute loss,

m = \arg\!\min_{a \in \mathbb{R}} \mathbb{E}[|X-a|]. \tag{3}

Since a \mapsto \mathbb{E}[|X - a|] is convex, this is equivalent to showing that

m = a^{\star} \quad\text{such that}\quad \frac{\partial}{\partial a} \mathbb{E}[|X - a|] \Big|_{a = a^{\star}} = 0.

Let’s first compute the derivative using Leibniz’s rule for improper integrals,

\begin{aligned} \frac{\partial}{\partial a} \mathbb{E}[|X - a|] &= \frac{\partial}{\partial a} \int_{-\infty}^{\infty} |x - a| f(x) \text{d}x \\ &= \frac{\partial}{\partial a} \Big( \int_{-\infty}^{a} (a - x) f(x) \text{d}x + \int_{a}^{\infty} (x - a) f(x) \text{d}x \Big) \\ &= \int_{-\infty}^{a} \frac{\partial}{\partial a} (a - x) f(x) \text{d}x + \int_{a}^{\infty} \frac{\partial}{\partial a} (x - a) f(x) \text{d}x \\ &= \int_{-\infty}^{a} f(x) \text{d}x - \int_{a}^{\infty} f(x) \text{d}x. \end{aligned}

Note that we can move the limit (derivative) inside the integrals because we have well-behaved Lebesgue integrals, and the boundary terms from Leibniz’s rule vanish because the integrand |x - a| f(x) is zero at x = a. Setting this derivative equal to 0, we get

\begin{aligned} \int_{-\infty}^{a} f(x) \text{d}x &= \int_{a}^{\infty} f(x) \text{d}x \\ F_X(a) &= 1 - F_X(a) \\ F_X(a) &= 1/2. \end{aligned}

Thus, a^{\star} = m.
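
As a numerical sketch (reusing the same arbitrary gamma distribution from the squared-loss check above), minimizing the empirical absolute loss over a grid of candidates recovers the sample median rather than the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# The same skewed Gamma(2, 1) example: mean = 2, median = F^{-1}(1/2) ~ 1.68.
x = rng.gamma(shape=2.0, scale=1.0, size=200_000)

# Approximate E[|X - a|] by a sample average for each candidate constant a.
grid = np.linspace(0.0, 6.0, 601)
abs_loss = np.array([np.mean(np.abs(x - a)) for a in grid])

print(grid[np.argmin(abs_loss)])  # close to 1.68, the median, not the mean
print(np.median(x), x.mean())     # sample median vs. sample mean
```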

Conclusion

In my mind, these are beautiful and deep results with interesting implications. For example, while there are a number of good reasons to prefer the squared loss function, such as differentiability and convexity, the first result above provides another significant one. Since the sample mean is an unbiased estimator of \mathbb{E}[Y], we have a straightforward way to estimate exactly the quantity that minimizes the expected squared loss.

The conditional expectation result shows that for any square-integrable random variables X and Y and for any function g in the massive class \mathcal{G} of square-integrable functions, the best possible predictor is the conditional expectation \mathbb{E}[Y \mid X]. This has many implications for other models, such as linear regression.

Finally, I first came across the fact that the median minimizes Equation 3 in a proof that the median is always within one standard deviation of the mean:

\begin{aligned} | \mu - m | &= | \mathbb{E}[X] - m| \\ &= | \mathbb{E}[X - m]| \\ &\stackrel{\star}{\leq} \mathbb{E}[|X - m|] \\ &\stackrel{\dagger}{\leq} \mathbb{E}[|X - \mu|] \\ &= \mathbb{E}\big[\sqrt{(X - \mu)^2}\big] \\ &\stackrel{\star}{\leq} \sqrt{\mathbb{E}[(X - \mu)^2]} \\ &= \sigma. \end{aligned}

The first step labeled \star is Jensen’s inequality applied to the convex absolute-value function, and the second is Jensen’s inequality applied to the concave square-root function. Step \dagger holds because the median m is the minimizer of the expected absolute loss.
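
As a closing sanity check of this bound (an illustration of mine using a gamma family of varying skewness), the gap between the mean and the median is indeed well within one standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Check |mean - median| <= standard deviation across several skewness levels.
for shape in (0.5, 1.0, 2.0, 10.0):
    x = rng.gamma(shape=shape, scale=1.0, size=500_000)
    gap, sd = abs(x.mean() - np.median(x)), x.std()
    print(f"shape = {shape:5.1f}   |mu - m| = {gap:.3f}   sigma = {sd:.3f}")
```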