Understanding Moments

Why are a distribution's moments called "moments"? How does the equation for a moment capture the shape of a distribution? Why do we typically only study four moments? I explore these and other questions in detail.

While drafting another post, I realized that I didn’t fully understand the idea of a distribution’s moments. I could write down the equation for the $k$th moment of a random variable $X$ with a density function $f(x)$,

$$ \mu_k = \mathbb{E}[X^k] = \int_{-\infty}^{\infty} x^k f(x) \text{d}x, \tag{1} $$

and I understood that the first moment is a random variable’s mean, the second (central moment) is its variance, and so forth. Yet the concept felt slippery because I had too many unanswered questions. Why is it called a “moment”? Is it related to the concept of a moment in physics? Why do the first four moments—mean, variance, skewness, and kurtosis—have commonly used names but higher-order moments do not? Is there a probabilistic interpretation of the seventy-second moment? Why does a moment-generating function uniquely specify a probability distribution? The goal of this post is to explore these questions in detail.

This post is long, and I spent many hours writing it. However, I have consistently found that the most time-consuming blog posts are typically those that I needed to write the most. My guess is that this is because foundational ideas often touch many different topics, and understanding them is an iterative process. By this metric, moments are truly foundational.

As an outline, I begin by discussing the etymology of the word “moment” and some terminology, such as the different types of moments. I then walk through the first five moments—total mass, mean, variance, skewness, and kurtosis—in detail. Next, I attempt to generalize what we’ve learned in those sections to higher moments. I conclude with moment-generating functions.

Movement and shape

My first question about moments was basic: why are they called “moments”? I am clearly not the first person to have thought about this. According to Wiktionary, the words “moment” and “momentum” are doublets or etymological twins, meaning that they are two words in a language with the same etymological root. Both come from the Latin word “movimentum,” meaning to move, set in motion, or change. Today, a “moment” often refers to an instant in time, but we can guess at the connection to movement. For example, we might say things like, “It was a momentous occasion,” or, “The game was her big moment.” In both cases, the notion of time and change are intertwined.

This probably explains the origin of “moment” in physics. In physics, the moment of inertia is a measure of rotational inertia or how difficult it is to change the rotational velocity of an object. A classic example of the moment of inertia is moving a pencil: spinning a pencil about its short axis, say by pushing on the lead point, requires a different amount of force than rolling it about the long axis. This is because the force required is related to how the mass of the pencil is distributed about the axis of rotation.

That last sentence about how mass is distributed about an axis starts to hint at how early probabilists might have landed on the word “moment” to describe distributions—or rather, why they thought “moment of inertia” was a good analogy. There appears to be a thread from the Latin word for setting in motion to the physicist’s notion of changing an object’s rotational velocity to finally the probabilist’s notion of how probability mass is distributed.

This actually makes a lot of sense because, as we will see, moments quantify three parameters of distributions: location, scale, and shape. By convention, we plot distributions with their support (values that do not have probability zero) on the $x$-axis and each supported value’s probability on the $y$-axis. A distribution’s location refers to where its center of mass is along the $x$-axis. By convention, a mean-centered distribution has a center of mass at zero. The scale refers to how spread out a distribution is. Scale stretches or compresses a distribution along the $x$-axis. Finally, the shape of a distribution refers to its overall geometry: is the distribution bimodal, asymmetric, heavy-tailed? As a preview of the next sections, the first moment describes a distribution’s location, the second moment describes its scale, and all higher moments describe its shape.

Many distributions have parameters that are called “location”, “scale”, or “shape” because they control their respective attributes, but some do not. For example, the Poisson distribution’s parameter is typically called “rate”, and increasing the rate increases the location and scale and changes the shape. In such cases, the terms “location”, “scale”, and “shape” still make sense as descriptions of the distribution, even though no single parameter controls each attribute.

Types of moments

With the hunch that “moment” refers to how probability mass is distributed, let’s explore the most common moments in more detail and then generalize to higher moments. However, first we need to modify (1) a bit. The $k$th moment of a function $f(x)$ about a non-random value $c$ is

$$ \mathbb{E}[(X - c)^k] = \int_{-\infty}^{\infty} (x - c)^k f(x) \text{d}x. \tag{2} $$

This generalization allows us to make an important distinction: a raw moment is a moment about the origin ($c = 0$), and a central moment is a moment about the distribution’s mean ($c = \mathbb{E}[X]$). If a random variable $X$ has mean $\mu_x$, then its $k$th central moment is

$$ m_k = \mathbb{E}[(X - \mu_x)^k] = \int_{-\infty}^{\infty} (x - \mu_x)^k f(x) \text{d}x. \tag{3} $$

Central moments are useful because they allow us to quantify properties of distributions in ways that are location-invariant. For example, we may be interested in comparing the variability in height of adults versus children. Obviously, adults are taller than children on average, but we want to measure which group has greater variability while disregarding the absolute heights of people in each group. Central moments allow us to perform such calculations.

Finally, the $k$th standardized moment is typically defined as the $k$th central moment normalized by the standard deviation raised to the $k$th power,

$$ \bar{m}_k = \frac{m_k}{\sigma_k} = \mathbb{E}\left[\left(\frac{X - \mu_x}{\sigma_x}\right)^k\right], \tag{4} $$

where $m_k$ is defined as in (3), and $\sigma_k$ is the $k$th power of the standard deviation of $X$,

$$ \sigma_k = \sigma_x^k = \left( \sqrt{\mathbb{E}[(X - \mu_x)^2]} \right)^k. \tag{5} $$

Note that (4) is only well-defined for distributions whose first two moments exist and whose second moment is non-zero. This holds for most distributions of interest, and we will assume it is true for the remainder of the post. Standardization makes the moment both location- and scale-invariant. Why might we care about scale invariance? As we will see, the third, fourth, and higher standardized moments quantify the relative and absolute tailedness of distributions. In such cases, we do not care about how spread out a distribution is, but rather how the mass is distributed along the tails.

Finally, a sample moment estimates its respective raw, central, or standardized moment from data. For example, the sample version of (2) is

$$ \tilde{m}_k = \frac{1}{N} \sum_{n=1}^{N} (x_n - c)^k. \tag{6} $$

I use uppercase $X$ to denote a random variable and lowercase $x_n$ to denote the $n$th realization from $N$ samples. Recall that, given a statistical model, parameters summarize data for an entire population, while statistics summarize data from a sample of the population. We compute the former exactly using a statistical model and estimate it from data using the latter. For example, if we assume that $X \sim \mathcal{N}(\mu_x, \sigma_x^2)$, then the first raw moment is $\mathbb{E}[X] = \mu_x$, and we estimate it with the sample mean.

There’s a lot to unpack here. To summarize nomenclature and my notation, we have:

$$
\begin{aligned}
\mu_k &= \mathbb{E}[X^k] && \text{$k$th raw moment} \\
m_k &= \mathbb{E}[(X - \mu_x)^k] && \text{$k$th central moment} \\
\bar{m}_k &= \mathbb{E}\left[\left(\frac{X - \mu_x}{\sigma_x}\right)^k\right] && \text{$k$th standardized moment} \\
\tilde{m}_k &= \frac{1}{N} \sum_{n=1}^{N} (x_n - c)^k && \text{$k$th sample moment about $c$}
\end{aligned} \tag{7}
$$
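To make these distinctions concrete, here is a minimal sketch that estimates the second raw, central, and standardized sample moments; the distribution, seed, and sample size are arbitrary choices for illustration:

import numpy as np
from numpy.random import RandomState

# Estimate the k-th raw, central, and standardized moments from samples of an
# arbitrary distribution (here, a normal with mean 2 and standard deviation 3).
rng = RandomState(seed=0)
x   = rng.normal(loc=2, scale=3, size=100_000)
k   = 2

raw          = (x**k).mean()                           # about the origin
central      = ((x - x.mean())**k).mean()              # about the sample mean
standardized = (((x - x.mean()) / x.std())**k).mean()  # centered and scaled

print(f'{raw:.2f}  {central:.2f}  {standardized:.2f}')  # roughly 13, 9, and exactly 1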

Now let’s discuss the first five moments in order: total mass, mean, variance, skewness, and kurtosis. Then I’ll attempt a synthesis before ending on moment-generating functions.

Total mass

Since $x^0 = 1$ for any number $x$, the zeroth raw, central, and standardized moments are all

$$ \mu_0 = m_0 = \bar{m}_0 = \int_{-\infty}^{\infty} (\dots)^0 f(x) \text{d}x = \int_{-\infty}^{\infty} f(x) \text{d}x = 1, \tag{8} $$

where $(\dots)$ denotes $x$, $(x - \mu_x)$, or $(x - \mu_x) / \sigma_x$. In other words, the zeroth moment captures the fact that probability distributions are normalized quantities: they always sum (or integrate) to one regardless of their location, scale, or shape (Figure 1).

Figure 1. Beta, normal, exponential, gamma, Bernoulli, and Poisson distributions, each with a total mass of one.

Another way to think about (8) is that the probability that at least one of the events in a sample space will occur is 1. Thus, the zeroth moment captures Kolmogorov’s second probability axiom.
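As a quick numerical sanity check, here is a sketch that integrates a couple of densities over their support; the particular distributions and parameters are arbitrary:

import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma, norm

# The zeroth moment: integrating (x - c)^0 f(x) = f(x) over the support gives
# one, regardless of location, scale, or shape.
mass_norm, _  = quad(norm(loc=3, scale=2).pdf, -np.inf, np.inf)
mass_gamma, _ = quad(gamma(a=2, scale=1.5).pdf, 0, np.inf)

print(f'{mass_norm:.6f}')   # 1.000000
print(f'{mass_gamma:.6f}')  # 1.000000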

Mean

The first raw moment, the expectation of $X$, is

$$ \mu_1 = \mu_x = \mathbb{E}[X] = \int_{-\infty}^{\infty} x f(x) \text{d}x. \tag{9} $$

We can think about the first moment in a few ways. Typically, the expectation of $X$ is introduced to students as the long-run average of $X$. For example, if $X$ is the outcome of a roll of a fair six-sided die, then its expectation is

$$ \mathbb{E}[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5. \tag{10} $$

If we roll a die many times, the running average of the rolls will slowly converge to the expected value (Figure 2, left), hence the name “expectation”. Another way to think about the first moment is that it is the center of mass of a probability distribution. In physics, the center of mass is the point at which all the torques due to gravity sum to zero. Since torque is a function of both the force (gravity) and the moment arm (distance from the fulcrum), it makes sense that the mean-as-center-of-mass interpretation suggests the probability mass is perfectly balanced on the mean (Figure 2, right).

Figure 2. (Left) Expectation as an average: the convergence of the average outcome from rolling a fair six-sided die. (Right) Expectation as a center of mass: the mass of the probability distribution is balanced upon the expected value; plotted is a beta distribution with parameters $\alpha = 2$ and $\beta = 5$. The fulcrum is placed at the mean, $\alpha / (\alpha + \beta) = 2 / 7$.
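Here is a small simulation in the spirit of Figure 2 (left); the seed and number of rolls are arbitrary:

import numpy as np
from numpy.random import RandomState

# Simulate rolls of a fair six-sided die and track the running average, which
# drifts toward the expectation of 3.5 as the number of rolls grows.
rng     = RandomState(seed=0)
rolls   = rng.randint(1, 7, size=100_000)
running = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

print(f'{running[9]:.3f}')   # after 10 rolls: typically still noisy
print(f'{running[-1]:.3f}')  # after 100,000 rolls: very close to 3.500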

However, there’s a third way to think about the first moment, and it’s the one that I think is most useful for understanding moments generally: the first moment tells us how far away from the origin the center of mass is. I like this interpretation because it is analogous to a moment arm in mechanics, which is the length of an arm between an axis of rotation and a force acting perpendicularly against that arm (Figure 3, left). For example, if you open a door by its handle, the moment arm is roughly the width of the door. The first moment of a distribution is a bit like a moment arm. It measures the distance between the distribution’s center of mass and the origin (Figure 3, right).

Figure 3. (Left) A moment arm is the length between a pivot point and a force. (Right) The first moment is the distance between the origin point and the center of mass.

This is an important interpretation because it will help justify central moments in the next sections. Subtracting $\mathbb{E}[X]$ from each value in the support of $X$ can be visualized as simply shifting the distribution such that its mean is now zero. This gives us a way to normalize distributions with different means so that we can compare their scales and shapes while disregarding the absolute values the random variables can take. Again, think about how you would compare variability in heights of children versus adults while ignoring the fact that adults are obviously taller.

When people refer to the first moment of a distribution, they almost always mean the first raw moment. The first central and standardized moments are less interesting because they are always zero:

$$ \mathbb{E}[X - \mu_x] = \mathbb{E}[X] - \mathbb{E}[\mu_x] = \mu_x - \mu_x = 0. \tag{11} $$

Variance

The second moment is where, in my mind, things start to get a bit more interesting. Typically, people are interested in the second central moment, rather than the second raw moment. This is called the variance of a random variable $X$, denoted $\mathbb{V}[X]$,

$$ m_2 = \mathbb{V}[X] = \int_{-\infty}^{\infty} (x - \mu_x)^2 f(x) \text{d}x. \tag{12} $$

The second central moment increases quadratically as mass gets further away from the distribution’s mean. In other words, variance captures how spread out a distribution is, i.e. its scale. Points that are further away from the mean than others are penalized disproportionately. High variance means a wide distribution (Figure 4), which can loosely be thought of as a “more random” random variable; conversely, a random sample from a distribution with a second central moment of zero always takes the same value, i.e. it is non-random. Again, the loose connection to “moment of inertia” seems clear in that the second central moment captures how wide a distribution is.

Figure 4. Three normal distributions with the same first moment but different variances. Bigger variance means a wider distribution, meaning higher variability in outcomes.

Why are we interested in the second central moment, rather than the second raw moment? Imagine we wanted to compare two random variables, $X$ and $Y$. If the values that $X$ can take are larger than the values that $Y$ can take, the second raw moment could be bigger, regardless of how far away the realizations are from the mean of the distribution. This idea maps onto my example of comparing the variability in height of adults versus children. If we first subtract the mean before calculating the second moment, we can compare each distribution’s relative spread while ignoring its location.

This distinction is easy to demonstrate in code:

import numpy.random as npr

# `X` has larger values on average, but `Y` has higher variability in values.
X = npr.normal(10, 1, size=1000)
Y = npr.normal(0, 2, size=1000)

# So `X` has a larger second raw moment.
mx = (X**2).mean()
my = (Y**2).mean()
print(f'{mx:.2f}\t{my:.2f}')  # 102.32  3.96

# But `Y` has a larger second central moment (variance).
mx = X.var()
my = Y.var()
print(f'{mx:.2f}\t{my:.2f}')  # 1.06    3.96

We are less interested in the second standardized moment because it is always one,

$$ \bar{m}_2 = \frac{\mathbb{E}[(X - \mu)^2]}{\left(\sqrt{\mathbb{E}[(X - \mu)^2]}\right)^2} = 1. \tag{13} $$

That said, there is an interesting connection between the variance or second central moment and the second raw moment:

$$
\begin{aligned}
\mathbb{V}[X] &= \mathbb{E}[(X - \mathbb{E}[X])^2] \\
&= \mathbb{E}[X^2 + \mathbb{E}[X]^2 - 2 X \mathbb{E}[X]] \\
&= \mathbb{E}[X^2] + \mathbb{E}[X]^2 - 2 \mathbb{E}[X] \mathbb{E}[X] \\
&= \mathbb{E}[X^2] - \mathbb{E}[X]^2.
\end{aligned} \tag{14}
$$

Since $\mathbb{E}[X]$ is non-random, $\mathbb{E}[X]^2$ is non-random. This implies that the second central moment is equivalent to the second raw moment up to a constant. In fact, since $\mathbb{E}[X]^2$ is nonnegative, the second central moment is never larger than the second raw moment. More generally, $\mathbb{E}[(X - c)^2] = \mathbb{V}[X] + (\mu_x - c)^2$, so the second moment attains its smallest possible value when taken about the first moment.
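Here is a quick numerical check of the identity in (14); the sampled distribution is an arbitrary choice:

from numpy.random import RandomState

# Check that V[X] = E[X^2] - E[X]^2 holds for sample moments as well.
rng = RandomState(seed=0)
x   = rng.gamma(shape=3, scale=2, size=100_000)

lhs = x.var()                      # second central moment
rhs = (x**2).mean() - x.mean()**2  # second raw moment minus squared mean
print(f'{lhs:.4f}  {rhs:.4f}')     # the two values agree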

Skewness

The third standardized moment, called skewness, measures the relative size of the two tails of a distribution,

$$ \bar{m}_3 = \mathbb{S}[X] = \mathbb{E}[Z^3] = \mathbb{E}\left[\left(\frac{X - \mu_x}{\sigma_x}\right)^3\right], \tag{15} $$

where $Z$ is the standard score or $z$-score:

$$ Z = \frac{X - \mu_x}{\sigma_x}. \tag{16} $$

To see how (15) quantifies the relative size of the two tails, consider this: any data point less than a standard deviation from the mean (i.e. data near the center) results in a standard score with absolute value less than 1; this is then raised to the third power, making the absolute value of the cubed standard score even smaller. In other words, data points less than a standard deviation from the mean contribute very little to the final calculation of skewness. Since the cubic function preserves sign, if both tails are balanced, the skewness is zero. Otherwise, the skewness is positive for longer right tails and negative for longer left tails.

You might be tempted to think that (15) quantifies symmetry, but this is a mistake. While a symmetric distribution always has a skewness of zero, the converse is not always true: a distribution with zero skewness may be asymmetric. We’ll see an example at the end of this section.

Figure 5. (Left) Negative or left skewness. (Right) Positive or right skewness.

The above reasoning makes the terminology around skewness fairly intuitive. A left skewed or negatively skewed distribution has a longer left tail, and (15) is negative (Figure 5, left). A right skewed or positively skewed distribution has a longer right tail, and (15) is positive (Figure 5, right).

To concretize this idea, let’s look at two examples of how the tails dominate the skewness calculation in (15). First, let’s take one thousand samples from a skew normal distribution and compute how much of the summed cubed absolute standard scores comes from data within versus beyond one standard deviation of the mean. As we can see below, roughly 90% of the skewness calculation comes from the tails, despite the tails holding less than half of the total mass (Figure 6, left).

from numpy.random import RandomState
from scipy.stats import skewnorm

rng     = RandomState(seed=0)
X       = skewnorm(a=5).rvs(size=1000, random_state=rng)
Z_abs   = abs((X - X.mean()) / X.std())
s_total = (Z_abs**3).sum()
s_peak  = (Z_abs[Z_abs <  1]**3).sum()
s_tails = (Z_abs[Z_abs >= 1]**3).sum()

print(f'Tails: {s_tails / s_total:.4f}')  # 0.9046
print(f'Peak : {s_peak  / s_total:.4f}')  # 0.0954
Figure 6. (Left) Histogram of one thousand samples of a skew normal distribution with parameter $\alpha = 5$. Light blue bins represent data within one standard deviation of the sample mean. (Right) Probability mass function of a Bernoulli distribution with parameter $p = 0.3$.

For a second example, consider a Bernoulli random variable, $X \sim \text{Bernoulli}(p)$ where $p < 1/2$. Since $p < 1/2$, clearly $1 - p > p$ (Figure 6, right). This means that more mass is to the left of the mean, $\mathbb{E}[X] = p$, but it is easy to show that this distribution is right skewed. Using the fact that $\mu_x = p$ and $\sigma_x = \sqrt{p(1-p)}$, let’s compute the third standardized moment:

$$
\begin{aligned}
\mathbb{E}\left[\left(\frac{X - p}{\sqrt{p(1-p)}}\right)^3\right] &= \frac{(1 - p)^3 p + (0 - p)^3 (1 - p)}{[p(1-p)]^{3/2}} \\
&= \frac{(1 - p)^3 p - p^3 (1 - p)}{[p(1-p)]^{3/2}} \\
&= \frac{p (1 - p) (1 - 2p)}{[p(1-p)]^{3/2}} \\
&= \frac{1 - 2p}{[p (1 - p)]^{1/2}}.
\end{aligned} \tag{17}
$$

This moment is positive—the distribution is right skewed—if and only if $p < 1/2$, which is true by assumption. Thus, we have a right skewed distribution with more mass to the left of the mean. Again, we see that the smaller mass in the tail dominates the calculation of skewness.
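As a sanity check on (17), here is a sketch comparing the closed-form Bernoulli skewness to SciPy’s value; the choice $p = 0.3$ mirrors Figure 6 (right):

import numpy as np
from scipy.stats import bernoulli

# Skewness of a Bernoulli(p) variable: (1 - 2p) / sqrt(p(1 - p)).
p        = 0.3
by_hand  = (1 - 2 * p) / np.sqrt(p * (1 - p))
by_scipy = float(bernoulli(p).stats(moments='s'))

print(f'{by_hand:.4f}  {by_scipy:.4f}')  # both roughly 0.8729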

Standardizing the moment in (15) is important because it makes skewness both location- and scale-invariant. This matters because two distributions can have the same mean and variance but different skewnesses. For example, consider the two random variables in (18),

$$
\begin{aligned}
X &\sim \mathcal{N}(\mu, \sigma^2) \\
Y &\sim \text{gamma}\left(\alpha = \frac{\mu}{\beta}, \beta = \frac{\sigma^2}{\mu}\right).
\end{aligned} \tag{18}
$$

Note that the gamma distribution’s second parameter $\beta$ depends on $\mu$ and $\sigma^2$, and its first parameter $\alpha$ depends on $\mu$ and $\beta$. (Credit to John Cook for this idea.) Using the definitions of mean and variance for these random variables, it is easy to see that

$$
\begin{aligned}
\mathbb{E}[X] &= \mu, & \mathbb{E}[Y] &= \alpha \beta = \frac{\mu}{\beta} \beta = \mu, \\
\mathbb{V}[X] &= \sigma^2, & \mathbb{V}[Y] &= \alpha \beta^2 = \mu \beta = \mu \frac{\sigma^2}{\mu} = \sigma^2.
\end{aligned} \tag{19}
$$

However, the normal distribution is always symmetric and has a skewness of zero, while this gamma distribution has a skewness of $2 / \sqrt{\alpha} = 2\sigma / \mu$ (Figure 7). Thus, we want a metric for skewness that ignores location and scale. This is one reason for taking skewness to be a standardized moment. As a fun aside, note that the gamma distribution’s skewness approaches zero as $\mu$ increases.

Figure 7. Two distributions with the same mean and variance but different skewnesses: a normal distribution with mean $\mu$ and variance $\sigma^2$ and a gamma distribution with parameters $\alpha = \mu/\beta$ and $\beta = \sigma^2/\mu$.
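Here is a small sampling check of the construction in (18) and (19); the values of $\mu$ and $\sigma$ are arbitrary, and note that NumPy parameterizes the gamma distribution by shape and scale, matching $\alpha$ and $\beta$ here:

from numpy.random import RandomState
from scipy.stats import skew

# Two distributions with matched mean and variance but different skewnesses.
rng   = RandomState(seed=0)
mu    = 4.0
sigma = 2.0
beta  = sigma**2 / mu  # gamma scale
alpha = mu / beta      # gamma shape

X = rng.normal(mu, sigma, size=100_000)
Y = rng.gamma(alpha, beta, size=100_000)

print(f'{X.mean():.2f}  {Y.mean():.2f}')  # both near 4.00
print(f'{X.var():.2f}  {Y.var():.2f}')    # both near 4.00
print(f'{skew(X):.2f}  {skew(Y):.2f}')    # near 0 versus near 2/sqrt(alpha) = 1.00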

To see why standardization works for (15), consider the skewness of two random variables, $X$ with mean $\mu_x$ and variance $\sigma_x^2$ and $Y = cX$ for some constant $c > 0$. Then

$$
\begin{aligned}
\mathbb{S}[X] &= \mathbb{E}\left[\left(\frac{X - \mu_x}{\sigma_x}\right)^3\right] \\
&= \frac{c^3}{c^3} \mathbb{E}\left[\left(\frac{X - \mu_x}{\sigma_x}\right)^3\right] \\
&= \mathbb{E}\left[\left(\frac{cX - c\mu_x}{c \sigma_x}\right)^3\right] \\
&= \mathbb{S}[Y].
\end{aligned} \tag{20}
$$

In words, skewness is invariant to sign-preserving scaling transformations. Note that if $c < 0$, the skewness’s magnitude would be preserved, but its sign would be flipped. And of course, the mean-centering makes the metric invariant under translation, i.e. if $X = Y + c$ then

$$ \mathbb{S}[X] = \mathbb{E}\left[\left(\frac{X - \mu_x}{\sigma_x}\right)^3\right] = \mathbb{E}\left[\left(\frac{Y + c - (\mu_y + c)}{\sigma_y}\right)^3\right] = \mathbb{S}[Y], \tag{21} $$

where $\sigma_x = \sigma_y$ because adding a constant does not change the spread.

Without this standardization, either the location or scale of the distribution could undesirably affect the calculation of skewness. For example, consider this Python code, which estimates the third raw moment of two exponential random variables that have the same skewness but different locations and the third central moment of two Gaussian random variables with the same skewness but different variances:

from numpy.random import RandomState
from scipy.stats import expon, norm

def raw_moment(X, k, c=0):
    return ((X - c)**k).mean()

def central_moment(X, k):
    return raw_moment(X=X, k=k, c=X.mean())

rng = RandomState(seed=0)

# `X` and `Y` have the same skewness but different locations.
# The un-centered moment does not capture this.
X = expon.rvs(0, 1, size=1000, random_state=rng)
Y = expon.rvs(1, 1, size=1000, random_state=rng)
print(f'Raw: {raw_moment(X, k=3):.2f}')  # 6.44
print(f'Raw: {raw_moment(Y, k=3):.2f}')  # 17.99

# `X` and `Y` have the same skewness but different scales.
# The central but not standardized moment does not capture this.
X = norm.rvs(0, 1, size=1000, random_state=rng)
Y = norm.rvs(0, 2, size=1000, random_state=rng)
print(f'Central: {central_moment(X, k=3):.2f}')  # 0.00
print(f'Central: {central_moment(Y, k=3):.2f}')  # 0.85

Clearly, in both cases the raw and central moments do not correctly quantify the relative skewness of the distributions, hence the standardization in (15).

Finally, let’s explore why “symmetry” is a bad analogy for skewness by looking at an example of an asymmetric distribution with a skewness of zero. This example borrows from an article by Donald Wheeler. Following the notation and results in (Dey et al., 2017), a Dagum distributed random variable $X$ has a probability density function

$$ f(x; \beta, \lambda, \delta) = \beta \lambda \delta x^{-(\delta + 1)} \left( 1 + \lambda x^{-\delta} \right)^{-(\beta + 1)}, \quad \beta > 0, \lambda > 0, \delta > 0, x > 0. \tag{22} $$

Using the results in section 2.4 of Dey et al., we see that the $k$th raw moment is

$$ \mathbb{E}[X^k] = \lambda^{k / \delta} \frac{\Gamma(1 - k/\delta)\Gamma(\beta + k/\delta)}{\Gamma(\beta)}. \tag{23} $$

With the raw moments up to $k = 3$, we can compute the skewness of the Dagum distribution (or any distribution) using just the raw moments,

$$ \bar{m}_3 = \frac{\mathbb{E}[X^3] - 3 \mathbb{E}[X] (\mathbb{E}[X^2] - \mathbb{E}[X]^2) - \mathbb{E}[X]^3}{(\mathbb{E}[X^2] - \mathbb{E}[X]^2)^{3/2}}. \tag{24} $$

See A1 for a derivation of this skewness decomposition. If we set $\lambda = 1$ and use SciPy’s excellent minimize function, we can find parameters that minimize the absolute value of the skewness (Figure 8):

$$
\begin{aligned}
\beta = 0.23624499839 &\quad \delta = 10.45787742085 \\
&\Downarrow \\
\mathbb{S}[X] &= 0.00000000047
\end{aligned} \tag{25}
$$

See A2 for Python code to reproduce the optimization.

Figure 8. An asymmetric Dagum distribution with zero skewness; parameters are $\beta = 0.23624499839$ and $\delta = 10.45787742085$.

This example is contrived, but it is not hard to imagine real data taking the shape of Figure 8. Thus, I think it is better to always say that skewness captures “relative tailedness” rather than “asymmetry”.

Kurtosis

While skewness is a measure of the relative size of the two tails and is positive or negative depending on which tail is larger, kurtosis is a measure of the combined size of the tails relative to the whole distribution. There are a few different ways of measuring this, but the typical metric is the fourth standardized moment,

$$ \bar{m}_4 = \mathbb{K}[X] = \frac{m_4}{\sigma_4} = \mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]. \tag{26} $$

Unlike skewness’s cubic term, which preserves sign, kurtosis’s even power means that the metric is always positive and that long tails on either side dominate the calculation. Just as we saw with skewness, kurtosis’s fourth power means that standard scores with absolute value less than 1—again, data near the peak of the distribution—only marginally contribute to the total calculation. In other words, kurtosis measures tailedness, not peakedness. Again, we standardize the moment because two distributions can have the same mean and variance but different kurtoses (see Figure 10 below for an example).

Since I approached kurtosis with a blank slate, I was never tempted to think of it as peakedness rather than tailedness. However, a number of resources caution against exactly this mistake, claiming it is a common misinterpretation. See (Darlington, 1970) and (Westfall, 2014) for detailed discussions. In particular, Westfall has a convincing example for how kurtosis is not peakedness, which I’ll replicate here. Consider one thousand random samples from the Cauchy distribution with location 0 and scale 1 (Figure 9).

Figure 9. Histogram of one thousand samples of a Cauchy distribution with location 0 and scale 1. Red dots highlight histogram bins that are more than a standard deviation from the mean.

Clearly, the empirical distribution is peaked or pointy, but it also has outliers. We can compute the sample fourth standardized moment (26) and see what percentage of that calculation comes from the peak versus the tails by splitting the kurtosis calculation into data within and beyond one standard deviation of the mean:

from numpy.random import RandomState
from scipy.stats import cauchy

rng     = RandomState(seed=0)
N       = 1000
X       = cauchy(0, 1).rvs(size=N, random_state=rng)
Z       = (X - X.mean()) / X.std()
k_total = (Z**4).mean()
k_peak  = (Z[abs(Z) <  1]**4).sum() / N
k_tails = (Z[abs(Z) >= 1]**4).sum() / N

print(f'Tails: {k_tails / k_total:.8f}')  # 0.99999669
print(f'Peak : {k_peak  / k_total:.8f}')  # 0.00000331

The result shows that the kurtosis calculation is absolutely dominated by the outliers. More than 99.99% of the total kurtosis comes from the tails.

With the intuition that kurtosis measures tailedness, let’s consider two alternative ways of interpreting it. First, note that the kurtosis of any univariate normal is 3:

$$ \mathbb{E}\left[ \left(\frac{X-\mu}{\sigma}\right)^4 \right] = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} a^4 \exp\Big\{-\frac{1}{2} a^2\Big\} \text{d}a = 3, \tag{27} $$

where $a = (x-\mu)/\sigma$ and $\text{d}a = \text{d}x / \sigma$. See A3 for a complete derivation. So one common re-framing of kurtosis is excess kurtosis,

$$ \mathbb{K}[X] - 3, \tag{28} $$

which can be thought of as the tailedness of $X$ relative to the normal distribution. For example, a Laplace distribution with scale parameter $1 / \sqrt{2}$—chosen so that the variance is 1—has an excess kurtosis of 3. This means it is “more tailed” than a normal distribution (Figure 10). Conversely, a uniform distribution with minimum and maximum values of $\pm \sqrt{3}$—again chosen so that the variance is 1—has an excess kurtosis of $-1.2$. This means it is “less tailed” than a normal distribution (Figure 10).

Figure 10. Laplace, normal, and uniform distributions with mean 0 and variance 1. Respectively, their excess kurtoses are 3, 0, and $-1.2$.
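The excess kurtoses above are easy to confirm with SciPy, whose stats method with moments='k' reports excess (Fisher) kurtosis; the parameters below match the unit-variance choices in Figure 10:

import numpy as np
from scipy.stats import laplace, norm, uniform

# Excess kurtosis (kurtosis minus three) of the three unit-variance
# distributions in Figure 10.
print(float(laplace(scale=1 / np.sqrt(2)).stats(moments='k')))                   #  3.0
print(float(norm(0, 1).stats(moments='k')))                                      #  0.0
print(float(uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)).stats(moments='k')))  # -1.2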

As a second interpretation, (Darlington, 1970) and (Moors, 1986) argue that kurtosis can be viewed as the variance of $Z^2$ around its mean of 1. To see this, consider the fact that

$$
\begin{aligned}
\mathbb{K}[X] &= \mathbb{E}[Z^4] \\
&= \mathbb{V}[Z^2] + \mathbb{E}[Z^2]^2 \\
&\stackrel{\star}{=} \mathbb{V}[Z^2] + \mathbb{V}[Z]^2 \\
&= \mathbb{V}[Z^2] + 1.
\end{aligned} \tag{29}
$$

Step $\star$ holds because

$$ \mathbb{V}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2 \tag{30} $$

and $\mathbb{V}[Z] = 1$ and $\mathbb{E}[Z] = 0$ by definition. See A4 for a derivation. In the context of (29), kurtosis can be interpreted as the variance or dispersion of $Z^2$. If the random variable $Z^2$ has low variance, then $X$ has low kurtosis. If $Z^2$ has high variance, then $X$ has high kurtosis.
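Here is a quick numerical check of (29); the sampled distribution is an arbitrary choice:

from numpy.random import RandomState
from scipy.stats import kurtosis

# The (non-excess) kurtosis equals the variance of Z^2 plus one.
rng = RandomState(seed=0)
X   = rng.gamma(shape=2, scale=1, size=100_000)
Z   = (X - X.mean()) / X.std()

lhs = kurtosis(X, fisher=False)  # fourth standardized moment
rhs = (Z**2).var() + 1
print(f'{lhs:.3f}  {rhs:.3f}')   # the two values agree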

Generalizing and higher moments

Let’s focus the discussion on standardized moments and synthesize what we have learned so far; then we will generalize to higher moments. I want to finally answer the question: “Is there a probabilistic interpretation of the seventy-second moment?”

To review, the zeroth standardized moment is always 1 because anything raised to the zeroth power is 1 and probability distributions are normalized. The first standardized moment is always 0 because we subtract the mean, while the second standardized moment is always 1 because we then divide by the variance. The third standardized moment, skewness, measures the relative size of the two tails, where the sign indicates which tail is larger and the magnitude is governed by the relative difference. The fourth standardized moment, kurtosis, measures the combined weight of the tails relative to the distribution. If either or both tails grow heavier, the kurtosis increases.

With higher moments, the logic is the same as with skewness and kurtosis. Odd-powered standardized moments quantify relative tailedness and even-powered standardized moments quantify total tailedness. This StackOverflow answer does a great job formalizing this logic, but I find the argument more complex than it needs to be unless you want a proof. Instead, consider Figure 11.

Figure 11. Odd and even functions $g(x; k) = x^k$ for $k = 3, 4$ (left), $k = 5, 6$ (middle), and $k = 7, 8$ (right) overlaid on a zero-mean skew normal distribution.

In that figure, I have plotted a skew normal distribution and overlaid the function $g(x; k) = x^k$ for odd and even powers of $k$. I have split the figure into three frames for legibility. Each function $g(x; k)$ represents how the $k$th moment (for $k \geq 3$) weights mass as that mass moves away from the zero mean of a centered distribution. We see that higher moments simply recapitulate the information captured by the third and fourth standardized moments. Thus, by convention, we only ever use skewness and kurtosis to describe a distribution because these are the lowest standardized moments that measure tailedness in their respective ways.
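As a small illustration of this recapitulation, the sketch below computes the third through eighth standardized moments of a right-skewed sample: the odd moments all share the sign of the skewness, and the even moments, like kurtosis, grow as weight sits farther in the tails. The parameter $\alpha = 5$ and the seed are arbitrary choices mirroring the earlier skew normal example:

from numpy.random import RandomState
from scipy.stats import skewnorm

# Higher standardized moments of a right-skewed sample.
rng = RandomState(seed=0)
X   = skewnorm(a=5).rvs(size=100_000, random_state=rng)
Z   = (X - X.mean()) / X.std()

for k in range(3, 9):
    print(k, round(float((Z**k).mean()), 2))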

Moment-generating functions

While writing this blog post and trying to understand how moments quantify the location, scale, and shape of a distribution, I realized that a related theoretical idea made a lot more sense now: a random variable’s moment-generating function is an alternative specification of its distribution. Let me explain my intuition for why this feels almost obvious now.

Consider the moment-generating function (MGF) of a random variable $X$,

$$ M_X(t) = \mathbb{E}[e^{t X}], \quad t \in \mathbb{R}. \tag{31} $$

To see why it is called a “moment-generating function”, note that

$$ \frac{d^k}{dt^k} M_X(t) = \mathbb{E}\left[\frac{d^k}{dt^k} e^{t X}\right] = \mathbb{E}[X^k e^{tX}] \tag{32} $$

and therefore

$$ \frac{d^k}{dt^k} M_X(t) \Bigg|_{t=0} = \mathbb{E}[X^k]. \tag{33} $$

In words, the $k$th derivative evaluated at the origin is the $k$th moment. This is nice. We can compute moments from the MGF by taking derivatives. This also means that the MGF’s Taylor series expansion,

$$ \mathbb{E}[e^{t X}] = \mathbb{E}\left[\sum_{k=0}^{\infty} \frac{1}{k!} t^k X^k \right] = \sum_{k=0}^{\infty} \frac{1}{k!} t^k \mathbb{E}[X^k], \tag{34} $$

is really an infinite sum of weighted raw moments. MGFs are important for a lot of reasons, but perhaps the biggest is a uniqueness theorem:

Uniqueness theorem: Let $X$ and $Y$ be two random variables with cumulative distribution functions $F_X(x)$ and $F_Y(y)$. If the MGFs exist for $X$ and $Y$ and if $M_X(t) = M_Y(t)$ for all $t$ near 0, then $F_X(z) = F_Y(z)$ for all $z \in \mathbb{R}$.

The big idea is that MGFs give us another way to uniquely characterize the distribution of $X$. This is especially helpful since probability density functions and cumulative distribution functions can be hard to work with, and many times it is easier to shift a calculation to the realm of MGFs. See A5 for an example of a complicated calculation that is trivialized by MGFs.
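To make (33) concrete, here is a minimal sketch that recovers the first two moments of an exponential distribution by numerically differentiating its MGF, $M(t) = \lambda / (\lambda - t)$ for $t < \lambda$, and comparing against the known moments $k! / \lambda^k$; the rate and step size are arbitrary choices:

# Recover moments of an Exponential(rate=lam) variable from its MGF,
# M(t) = lam / (lam - t), using central finite differences at t = 0.
lam = 2.0
M   = lambda t: lam / (lam - t)
h   = 1e-4

first  = (M(h) - M(-h)) / (2 * h)          # approximates E[X]   = 1/lam   = 0.5
second = (M(h) - 2 * M(0) + M(-h)) / h**2  # approximates E[X^2] = 2/lam^2 = 0.5
print(f'{first:.4f}  {second:.4f}')        # both close to 0.5000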

This property, the fact that MGFs are unique and therefore an alternative way of specifying probability distributions, is what feels almost obvious now. Given what we know about moments and how they describe a distribution’s center of mass, spread, and relative and absolute size of the tails, it makes sense that a function that can generate all of the moments of a random variable is one way to fully describe that random variable. This is obviously not a proof. See Theorem 1.6 in (Shao, 2003) for one. However, I suspect this intuition is how an early probabilist might have guessed that such a result was provable.

Summary

Moments describe how the probability mass of a random variable is distributed. The zeroth moment, total mass, quantifies the fact that all distributions have a total mass of one. The first moment, the mean, specifies the distribution’s location, shifting the center of mass left or right. The second moment, variance, specifies the scale or spread; loosely speaking, flatter or more spread out distributions are “more random”. The third moment, skewness, quantifies the relative size of the two tails of a distribution; the sign indicates which tail is bigger and the magnitude indicates by how much. The fourth moment, kurtosis, captures the absolute size of the two tails. Higher standardized moments simply recapitulate the information in skewness and kurtosis; by convention, we ignore these in favor of the third and fourth standardized moments. Finally, moments are important theoretically because they provide an alternative way to fully and uniquely specify a probability distribution, a fact that is intuitive if you understand how moments quantify a distribution’s location, spread, and shape.

   

Acknowledgements

I thank Issa Rice, Michael Dewar, and Martin Blais for catching various mistakes.

   

Appendix

A1. Skewness decomposition

By definition, skewness is

$$ \bar{m}_3 = \frac{m_3}{\sigma_3}, \tag{A1} $$

where $m_3$ is the third central moment and $\sigma_3$ is the standard deviation raised to the third power. We can decompose the numerator as

$$
\begin{aligned}
m_3 &= \mathbb{E}[(X - \mu_x)^3] \\
&= \mathbb{E}[(X^2 - 2 X \mu_x + \mu_x^2)(X - \mu_x)] \\
&= \mathbb{E}[X^3 - 3X^2 \mu_x + 3 X \mu_x^2 - \mu_x^3] \\
&= \mathbb{E}[X^3 - 3 \mu_x (X^2 - X \mu_x) - \mu_x^3] \\
&\stackrel{\star}{=} \mathbb{E}[X^3] - 3 \mu_x \sigma_x^2 - \mu_x^3.
\end{aligned} \tag{A2}
$$

In step \star, we use the fact that

$$ \mathbb{E}[X^2 - X \mu_x] = \mathbb{E}[X^2] - \mathbb{E}[X] \mu_x = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \sigma_x^2. \tag{A3} $$

Therefore, we can express skewness entirely in terms of the first three raw moments:

$$ \bar{m}_3 = \frac{m_3}{\sigma_3} = \frac{\mathbb{E}[X^3] - 3 \mu_x \sigma_x^2 - \mu_x^3}{\sigma_3} = \frac{\mathbb{E}[X^3] - 3 \mathbb{E}[X] (\mathbb{E}[X^2] - \mathbb{E}[X]^2) - \mathbb{E}[X]^3}{(\mathbb{E}[X^2] - \mathbb{E}[X]^2)^{3/2}}. \tag{A4} $$
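As a quick check of (A4), the sketch below compares the raw-moment decomposition against SciPy’s skewness estimate on samples from an arbitrary distribution:

from numpy.random import RandomState
from scipy.stats import skew

# Skewness from the first three raw (sample) moments versus scipy's estimate.
rng = RandomState(seed=0)
x   = rng.exponential(scale=1, size=100_000)

EX1, EX2, EX3 = x.mean(), (x**2).mean(), (x**3).mean()
var     = EX2 - EX1**2
by_hand = (EX3 - 3 * EX1 * var - EX1**3) / var**1.5

print(f'{by_hand:.4f}  {skew(x):.4f}')  # the two values agree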

A2. Minimizing the Dagum distribution’s skewness

import numpy as np
from   scipy.special import gamma
from   scipy.optimize import minimize


def pdf(x, beta, delta):
    """Compute probability density function of Dagum distribution. See:

    Dey 2017, "Dagum Distribution: Properties and Different Methods of 
    Estimation"
    """
    pos_x = x >  0
    neg_x = x <= 0
    pdf   = np.empty(x.size)
    A     = beta * delta * x[pos_x]**(-(delta+1))
    B     = (1 + x[pos_x]**(-delta))**(-(beta+1))
    pdf[pos_x] = A * B
    pdf[neg_x] = 0
    return pdf


def moment(k, beta, delta):
    """Compute k-th raw moment of Dagum distribution. See:

    Dey 2017, "Dagum Distribution: Properties and Different Methods of 
    Estimation"
    """
    A = gamma(1 - k/delta)
    B = gamma(beta + k/delta)
    C = gamma(beta)
    return (A * B) / C


def skewness(x, absolute=False):
    """Compute skewness using a decomposition of raw moments. If `absolute` is 
    `True`, return absolute value.
    """
    beta, delta = x
    EX1 = moment(1, beta, delta)
    EX2 = moment(2, beta, delta)
    EX3 = moment(3, beta, delta)
    var = EX2 - EX1**2
    mu3 = (EX3 - 3 * EX1 * var - EX1**3) / var**(3/2)
    if absolute:
        return abs(mu3)
    return mu3


SMALL = 1e-10
resp = minimize(
    fun=lambda x: skewness(x, absolute=True),
    x0=[0.5, 5],
    method='L-BFGS-B',
    bounds=[(SMALL, None), (SMALL, None)])
beta, delta = resp.x

print(f'Beta : {beta:.11f}')
print(f'Delta: {delta:.11f}')
print(f'Skew : {skewness([beta, delta]):.11f}')

A3. Kurtosis of any univariate normal is three

The density function of a univariate normal random variable with mean $\mu$ and variance $\sigma^2$ is

$$ f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\Big\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\Big\}. \tag{A5} $$

We can write the kurtosis as

$$
\begin{aligned}
\mathbb{E}\left[ \left(\frac{X-\mu}{\sigma}\right)^4 \right] &= \int_{-\infty}^{\infty} \left(\frac{x-\mu}{\sigma}\right)^4 \frac{1}{\sigma \sqrt{2 \pi}} \exp\Big\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\Big\} \text{d}x \\
&= \frac{1}{\sigma \sqrt{2 \pi}} \int_{-\infty}^{\infty} \left(\frac{x-\mu}{\sigma}\right)^4 \exp\Big\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\Big\} \text{d}x \\
&= \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} a^4 \exp\Big\{-\frac{1}{2} a^2\Big\} \text{d}a.
\end{aligned} \tag{A6}
$$

In the last step, we use integration by substitution, where $a = (x-\mu)/\sigma$ and $\text{d}a = \text{d}x / \sigma$. This Gaussian-type integral has an analytic solution:

$$ \int_{0}^{\infty} x^n e^{-b x^2} \text{d}x = \frac{(2k - 1)!!}{2^{k+1} b^k} \sqrt{\frac{\pi}{b}}, \qquad \text{$n = 2k$, $k$ integer, $b > 0$.} \tag{A7} $$

For us, this means that

$$ \int_{0}^{\infty} a^4 \exp\Big\{-\frac{1}{2} a^2\Big\} \text{d}a = \frac{(2(2) - 1)!!}{2^{2+1} \left(\frac{1}{2}\right)^2} \sqrt{\frac{\pi}{\frac{1}{2}}} = \frac{3}{2} \sqrt{2 \pi}. \tag{A8} $$

Since this is only half of the integral in (A6), whose integrand we know is symmetric, the full integral is $3 \sqrt{2 \pi}$. The normalizer in (A6) cancels, and we get

$$ \mathbb{E}\left[\left(\frac{X-\mu}{\sigma}\right)^4 \right] = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{\infty} a^4 \exp\Big\{-\frac{1}{2} a^2\Big\} \text{d}a = 3. \tag{A9} $$

A4. Mean and standard deviation of standard scores

The mean and standard deviation of the standard score ($z$-score) are zero and one respectively. Let $X$ be a random variable with mean $\mu_x$ and standard deviation $\sigma_x$. The standard score is defined as

$$ z = \frac{x - \mu_x}{\sigma_x}. \tag{A10} $$

The mean of the standard scores is then

$$ \mu_z = \frac{1}{N} \sum_{n=1}^{N} z_n = \frac{1}{N} \sum_{n=1}^{N} \frac{x_n - \mu_x}{\sigma_x} = 0. \tag{A11} $$

The last equality holds because

$$ \sum_{n=1}^{N} (x_n - \mu_x) = \sum_{n=1}^{N} x_n - \sum_{n=1}^{N} \mu_x = N \mu_x - N \mu_x = 0. \tag{A12} $$

The variance of the standard scores is then

$$
\begin{aligned}
\sigma_z^2 &= \frac{1}{N} \sum_{n=1}^{N} (z_n - \mu_z)^2 \\
&= \frac{1}{N} \sum_{n=1}^{N} z_n^2 \\
&= \frac{1}{N} \sum_{n=1}^{N} \left( \frac{x_n - \mu_x}{\sigma_x} \right)^2 \\
&= \frac{1}{\sigma_x^2} \underbrace{\frac{1}{N} \sum_{n=1}^{N} ( x_n - \mu_x)^2}_{\sigma_x^2} \\
&= 1,
\end{aligned} \tag{A13}
$$

and since the variance is one, the standard deviation $\sigma_z$ is one as well.

A5. Convolution of independent gamma random variables

Here is an example of a calculation that is made surprisingly easy by working with MGFs rather than distribution functions. Let $X_1 \sim \text{Gamma}(\alpha_1, \beta)$ and $X_2 \sim \text{Gamma}(\alpha_2, \beta)$ be independent random variables. What is the distribution of $Y = X_1 + X_2$? Well, the moment-generating function of $X_1$ is

$$ M_{X_1}(t) = \mathbb{E}[e^{tX_1}] = \Bigg(\frac{\beta}{\beta - t}\Bigg)^{\alpha_1}. \tag{A14} $$

Since $X_1$ and $X_2$ are independent, we know the MGF of $Y = X_1 + X_2$ decomposes as

$$ M_Y(t) = M_{X_1}(t) M_{X_2}(t). \tag{A15} $$

The fact that summing independent random variables reduces to multiplying their MGFs is an important property of MGFs. See (Shao, 2003), around equation 1.58, for a discussion. Thus, the MGF of $Y$ must be

$$ \Bigg(\frac{\beta}{\beta - t}\Bigg)^{\alpha_1} \Bigg(\frac{\beta}{\beta - t}\Bigg)^{\alpha_2} = \Bigg(\frac{\beta}{\beta - t}\Bigg)^{\alpha_1 + \alpha_2}, \tag{A16} $$

which means that $Y \sim \text{Gamma}(\alpha_1 + \alpha_2, \beta)$. I’m not sure how I would have approached this problem using the cumulative distribution or probability density functions.
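Here is a quick sampling check of this result; note that $\beta$ here is a rate parameter, so SciPy’s scale is $1/\beta$, and the parameter values are arbitrary:

from numpy.random import RandomState
from scipy.stats import gamma

# The sum X1 + X2 has the mean and variance of a Gamma(alpha1 + alpha2, beta)
# variable: (a1 + a2) / beta and (a1 + a2) / beta**2, respectively.
rng    = RandomState(seed=0)
a1, a2 = 2.0, 3.5
beta   = 1.5

X1 = gamma(a=a1, scale=1/beta).rvs(size=100_000, random_state=rng)
X2 = gamma(a=a2, scale=1/beta).rvs(size=100_000, random_state=rng)
Y  = X1 + X2

print(f'{Y.mean():.3f}  {(a1 + a2) / beta:.3f}')    # both near 3.667
print(f'{Y.var():.3f}  {(a1 + a2) / beta**2:.3f}')  # both near 2.444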

To derive the MGF of the gamma distribution ourselves, note that, for $t < \beta$,

$$
\begin{aligned}
M_{X}(t) &= \mathbb{E}[e^{tX}] \\
&= \int_{0}^{\infty} e^{tx} \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \text{d}x \\
&= \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} e^{tx} x^{\alpha - 1} e^{-\beta x} \text{d}x \\
&= \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} x^{\alpha - 1} e^{-x(\beta - t)} \text{d}x \\
&\stackrel{\star}{=} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \frac{\Gamma(\alpha)}{(\beta - t)^{\alpha}} \\
&= \Bigg(\frac{\beta}{\beta - t}\Bigg)^{\alpha}.
\end{aligned} \tag{A17}
$$

Step $\star$ holds because

$$ \int_{0}^{\infty} x^{a} e^{-bx} \text{d}x = \frac{\Gamma(a+1)}{b^{a+1}}. \tag{A18} $$

  1. Dey, S., Al-Zahrani, B., & Basloom, S. (2017). Dagum distribution: Properties and different methods of estimation. International Journal of Statistics and Probability, 6(2), 74–92.
  2. Darlington, R. B. (1970). Is kurtosis really “peakedness?” The American Statistician, 24(2), 19–22.
  3. Westfall, P. H. (2014). Kurtosis as peakedness, 1905–2014. RIP. The American Statistician, 68(3), 191–195.
  4. Moors, J. J. A. (1986). The meaning of kurtosis: Darlington reexamined. The American Statistician, 40(4), 283–284.
  5. Shao, J. (2003). Mathematical statistics.