Standard Errors and Confidence Intervals

How do we know when a parameter estimate from a random sample is significant? I discuss the use of standard errors and confidence intervals to answer this question.

The standard deviation is a measure of the variation or dispersion of data: how spread out the values are. The standard error is the standard deviation of the sampling distribution of a statistic. However, this does not mean that the standard error is the empirical standard deviation of the data.¹ Since the sampling distribution of a statistic is the distribution of that statistic computed over repeated random samples of size $n$, the standard error measures how much the statistic varies across these samples. The larger the sample size $n$ is, the smaller the standard error should be. Intuitively, the standard error answers the question: how accurate is a given statistic that we estimate from a finite sample?
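To make this distinction concrete, here is a minimal NumPy sketch (my own illustration with made-up numbers, not tied to any figure below) that simulates the sampling distribution of the mean and checks that its standard deviation matches $\sigma / \sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
sigma, n, n_trials = 2.0, 50, 10_000

# Draw many independent samples of size n and compute each sample mean.
means = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n)).mean(axis=1)

# The standard deviation of these sample means is the standard error of the mean.
print(means.std())          # empirical standard error, roughly 0.28
print(sigma / np.sqrt(n))   # theoretical value sigma / sqrt(n) ~= 0.283
```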

While the standard error can be estimated for other statistics, let's focus on the standard error of the mean. Let $X = (X_1, \dots, X_n)$ denote a random sample where $X_1, \dots, X_n$ are independent and identically distributed (i.i.d.) with population variance $\sigma^2$. Because of this i.i.d. assumption, the variance of the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is:

$$
\mathbb{V}[\bar{X}] = \mathbb{V}\left[\frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n^2} \mathbb{V}\left[\sum_{i=1}^n X_i \right] = \frac{1}{n^2} (n \sigma^2) = \frac{\sigma^2}{n}. \tag{1}
$$

The standard error of the mean is the square root of this variance, $\sigma_{\bar{x}} \triangleq \sqrt{\mathbb{V}[\bar{X}]} = \sigma / \sqrt{n}$.

If the population variance $\sigma^2$ is unknown, we can plug in the sample standard deviation $\sigma_x$ to approximate the standard error:

$$
\sigma_{\bar{x}} \approx \frac{\sigma_x}{\sqrt{n}}. \tag{2}
$$
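In code, estimating the standard error from a single observed sample is a one-liner. A small sketch of Eq. 2 in NumPy, with an arbitrary simulated sample standing in for real data:

```python
import numpy as np

rng = np.random.default_rng(1)                 # made-up data for illustration
x = rng.normal(loc=0.0, scale=1.0, size=100)   # one observed sample, n = 100
n = len(x)

sigma_x = x.std(ddof=1)      # sample standard deviation
se = sigma_x / np.sqrt(n)    # estimated standard error of the mean (Eq. 2)
print(x.mean(), se)
```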

To see the effect of dividing by $\sqrt{n}$, consider Figure 1, which plots the standard error as a function of $n$. To reduce a given standard error by half, we need four times the number of samples:

$$
\frac{1}{2}\left( \frac{\sigma}{\sqrt{n}} \right) = \frac{\sigma}{\sqrt{4n}}. \tag{3}
$$

For example, when $\sigma = 100$ and $n = 4$, we have a standard error of $50$. To reduce this standard error to $25$, we need $n = 16$ samples.

Figure 1. The standard error $\sigma / \sqrt{n}$ as a function of $n$ and population variance.
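A quick numerical check of this rule, using the $\sigma = 100$ example above:

```python
import numpy as np

sigma = 100.0
for n in [4, 16, 64]:
    # Quadrupling n halves the standard error: 50.0, 25.0, 12.5.
    print(n, sigma / np.sqrt(n))
```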

Put differently, think about what would happen if we did not divide our estimate by $\sqrt{n}$. In that case, we would just be estimating the standard deviation of the data. That may be a perfectly reasonable thing to do, perhaps even what we want to do, but it is not an estimate of the standard deviation of the sample mean itself.

Confidence intervals

Standard errors are related to confidence intervals. A confidence interval specifies a range of plausible values for a statistic and has an associated confidence level. Ideally, we want both narrow intervals and high confidence levels. Note that confidence intervals are random, since they are themselves functions of the random sample $X$.

This is a nuanced topic with many common statistical misconceptions, so let's stick to a single simple example that illustrates the relationship; please consult a textbook for a more thorough treatment. Imagine we want to estimate the population mean $\mu$ of a random variable, which we assume is normally distributed with known variance $\sigma^2$. We can compute a standard score $Z$ as

$$
Z \triangleq \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}. \tag{4}
$$

There are multiple definitions of a standard score; this version measures the difference between the sample mean $\bar{X}$ and the population mean $\mu$ in units of the standard error, which is why we normalize by the standard error rather than by the population standard deviation.

We can therefore compute numbers $-z$ and $z$ such that

$$
\mathbb{P}(-z \leq Z \leq z) = 1 - \alpha, \tag{5}
$$

where $1 - \alpha$ is our confidence level. If we set $\alpha = 0.05$, then we are requiring that the standard score fall between $-z$ and $z$ with $95\%$ probability. We can compute $z$ using the cumulative distribution function $\Phi$ of the standard normal distribution, since $Z$ has been standardized; by the symmetry of the normal distribution, $\mathbb{P}(Z \leq z)$ must equal $1 - \alpha/2 = 0.975$:

$$
\begin{aligned}
\mathbb{P}(Z \leq z) &= \Phi(z) = 0.975, \\
&\Downarrow \\
z &= \Phi^{-1}(\Phi(z)) = \Phi^{-1}(0.975) = 1.96.
\end{aligned} \tag{6}
$$
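If you want to verify this value numerically, the standard normal's inverse CDF (the percent-point function) is available in SciPy; a quick check, assuming SciPy is installed:

```python
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)   # inverse CDF of the standard normal at 0.975
print(z)                      # 1.959963...
```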

We're just backing out the value $z$ given a fixed confidence level specified by $\alpha$. In other words, we decide how confident we want to be and then determine how wide our interval must be for that desired confidence level. We can now solve for a confidence interval around the true population mean; it is a function of our sample mean and the standard error:

$$
\begin{aligned}
0.95 &= \mathbb{P}(-z \leq Z \leq z) \\
&= \mathbb{P}\left(-1.96 \leq \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \leq 1.96\right) \\
&= \mathbb{P} \left( \bar{X} - 1.96 \left( \frac{\sigma}{\sqrt{n}} \right) \leq \mu \leq \bar{X} + 1.96 \left( \frac{\sigma}{\sqrt{n}} \right) \right).
\end{aligned} \tag{7}
$$

As we can see, if we compute our sample mean $\bar{X}$ and then add and subtract roughly two standard errors ($1.96 \, \sigma / \sqrt{n}$), we get a confidence interval representing the range of plausible values for the true mean parameter, at a confidence level of $95\%$.
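Putting the pieces together, here is a short sketch of the whole computation on a simulated sample (the data and seed are made up for illustration; with real data you would replace `x`):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=200)   # stand-in for observed data

n, alpha = len(x), 0.05
z = norm.ppf(1 - alpha / 2)                    # 1.96 for a 95% interval
se = x.std(ddof=1) / np.sqrt(n)                # estimated standard error (Eq. 2)

# Eq. 7, with the sample standard deviation plugged in for sigma.
lo, hi = x.mean() - z * se, x.mean() + z * se
print(lo, hi)
```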

Example

As an example, imagine I wanted to compare two randomized trials. A common example is a medical trial, but in machine learning research, we might want to compare two randomized algorithms, say our method and a baseline. If we simply run both algorithms a few times and compare a mean metric, for example the mean accuracy, we may not be able to say anything about our model's performance relative to the baseline. Intuitively, we may not have enough precision in the metric; what we want to do is increase $n$ to increase our confidence in the estimate.

Figure 2. Estimated means and 95% confidence intervals (two standard errors) for samples from a standard normal (red) and a zero-mean normal with variance $\sigma^2 = 1.1$ (blue). The bounds of the confidence intervals are shown as dashed lines.

Consider Figure 2. To generate this plot, I drew realizations $x = (x_1, \dots, x_n)$ from two normal distributions, $\mathcal{N}(0, 1)$ and $\mathcal{N}(0, 1.1)$, for increasing values of $n$. I then computed the standard error (Eq. 2) and plotted the $95\%$ confidence interval around each sample mean (Eq. 7). As we can see, it is not possible to distinguish between the mean estimates of the two random samples even when $n = 1000$, because the confidence intervals overlap. However, when $n = 10000$, we have a statistically significant result.
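Something along these lines reproduces the quantities behind Figure 2; this is my reconstruction rather than the exact script used for the plot, so the particular numbers will vary with the random seed:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
z = norm.ppf(0.975)   # 1.96 for a 95% interval

for n in [10, 100, 1_000, 10_000]:
    a = rng.normal(0.0, 1.0, size=n)            # N(0, 1)
    b = rng.normal(0.0, np.sqrt(1.1), size=n)   # N(0, 1.1); scale is the standard deviation
    for name, x in [("N(0, 1)  ", a), ("N(0, 1.1)", b)]:
        se = x.std(ddof=1) / np.sqrt(n)
        print(f"n={n:6d}  {name}  mean={x.mean():+.4f}  "
              f"95% CI=({x.mean() - z * se:+.4f}, {x.mean() + z * se:+.4f})")
```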

Conclusion

Statistical significance is a complicated topic, and I'm by no means an expert. However, even the level of background in this post demonstrates why it's such an important topic. In my numerical experiments, I could simply increase $n$ until I got the confidence intervals I desired. However, if I were running a clinical trial, I might have to fix $n$ in advance. In many machine learning papers, researchers report the mean and standard deviation without, I suspect, realizing that the standard deviation is simply the standard deviation of the sample (e.g. across the randomized trials), not the standard deviation of the estimated mean (e.g. of the average accuracy). I refer the reader again to the footnote.

  1. Can you tell that’s what I thought this meant?