Lagrangian and Hamiltonian Mechanics

The physics of Hamiltonian Monte Carlo, part 2: Building off the Euler–Lagrange equation, I discuss Lagrangian mechanics, the principle of stationary action, and Hamilton's equations.

This is the second of three posts in a series on Hamiltonian Monte Carlo (HMC). The series is motivated by the observation that the HMC tutorials I have read all assume that the reader already understands the underlying physics: the Euler–Lagrange equation, the calculus of variations, and Hamiltonian dynamics. However, this undercuts any explanation for the uninitiated. Thus, my ambition is to explain Hamiltonian dynamics in sufficient detail such that HMC makes sense.

In the previous post, I explained the Euler–Lagrange equation and the calculus of variations. I recommend reading that post first if you don’t already know what those terms mean. In this post, I will explain how Lagrangian and Hamiltonian mechanics use the Euler–Lagrange equation to reformulate of Newtonian mechanics. In the next and final post, I will present HMC.

Examples

Like all good abstractions, Lagrangian mechanics may feel squishy at first because it is more general than the thing with which you are comfortable, the thing it abstracts. In this case, Lagrangian mechanics models the energies in a system rather than the forces, but it is hard to immediately see why this is a good idea. Because of this, I’ll lead with two simple examples before backing out generalities.

Consider this equation, called the Lagrangian, where TT is kinetic energy and VV is potential energy:

LTV.(1) L \triangleq T - V. \tag{1}

I’m presenting this ex niliho now, but we’ll derive it in the next section. For now, just think of it as a scalar quantity representing the energy in a system. Next, let’s rewrite the Euler–Lagrange equation, which we derived in the previous post, as

Lx=ddt(Lx˙).(2) \frac{\partial L}{\partial x} = \frac{\text{d}}{\text{d}t} \Big(\frac{\partial L}{\partial \dot{x}}\Big). \tag{2}

Here, LL is the Lagrangian in (1)(1), tt is time, xx is position, and x˙\dot{x} indicates the first derivative of xx with respect to time tt. Later, we’ll see x¨\ddot{x}, which is the second derivative with respect to time. I’m embracing this dot notation from physics to be consistent with how that community presents this material

Now consider a classic problem from physics: modeling a frictionless spring with a mass mm and spring constant kk. Recall from high school physics that the kinetic energy is T=12mx˙2T = \frac{1}{2} m \dot{x}^2, and its potential energy is V=12kx2V = \frac{1}{2} k x^2. As a sanity check, the kinetic energy depends on the mass and the first derivative of position w.r.t. time or velocity; for a given spring constant, the potential energy is governed by the position. Then the Lagrangian is

L=12mx˙212kx2.(3) L = \frac{1}{2} m \dot{x}^2 - \frac{1}{2} k x^2. \tag{3}

Using the Euler–Lagrange equation (2)(2), we see that the function for which the Lagrangian is stationary is

Lx=x(12mx˙212kx2)=kxddt(Lx˙)=ddt(x˙[12mx˙212kx2])=mx¨mx¨=kx.(4) \begin{aligned} \frac{\partial L}{\partial x} &= \frac{\partial}{\partial x} \Big( \frac{1}{2} m \dot{x}^2 - \frac{1}{2} k x^2 \Big) = - kx \\ \frac{\text{d}}{\text{d}t} \Big(\frac{\partial L}{\partial \dot{x}}\Big) &= \frac{\text{d}}{\text{d}t} \Big(\frac{\partial}{\partial \dot{x}} \Big[ \frac{1}{2} m \dot{x}^2 - \frac{1}{2} k x^2 \Big] \Big) = m \ddot{x} \\ &\Downarrow \\ m\ddot{x} &= - kx. \end{aligned} \tag{4}

The last line of (4)(4) is just a reformulation of F=maF = ma. To see that, recall—again from high school physics—that the negative derivative w.r.t. position of potential energy is equal to force; therefore

F=ddxV=kx,F=ma=mx¨.(5) F = - \frac{\text{d}}{\text{d}x} V = -kx, \qquad F = ma = m\ddot{x}. \tag{5}

Was this just a coincidence? A cute mathematical trick? Let’s consider another example, a particle moving in three-dimensions with potential energy V(x1,x2,x3)V(x_1, x_2, x_3). The total kinetic energy decomposes as a sum of the kinetic energy for each dimension, giving us

L=12m(x1˙2+x2˙2+x3˙2)V(x1,x2,x3).(6) L = \frac{1}{2} m (\dot{x_1}^2 + \dot{x_2}^2 + \dot{x_3}^2) - V(x_1, x_2, x_3). \tag{6}

And this is equivalent to the vector formulation of F=mx¨\mathbf{F} = m \ddot{\mathbf{x}} because

tx˙L=[mx1¨mx2¨mx3¨]=mx¨,xL=[kx1kx2kx3]=kx,(7) \nabla_{t} \nabla_{\dot{\mathbf{x}}} L = \begin{bmatrix} m \ddot{x_1} \\ m \ddot{x_2} \\ m \ddot{x_3} \end{bmatrix} = m \ddot{\mathbf{x}}, \qquad \nabla_{\mathbf{x}} L = \begin{bmatrix} - k x_1 \\ - k x_2 \\ - k x_3 \end{bmatrix} = -k \mathbf{x}, \tag{7}

and again F=xV\mathbf{F} = - \nabla_{\mathbf{x}} V. So we have one again recovered Newton’s second law, F=ma\mathbf{F} = m \mathbf{a} but in three dimensions.

If this were a physics course, we would probably look at a couple more examples such as a pendulum or double pendulum, although the reader can easily find these in a physics textbook or online. The questions for us are: how did we know that (1)(1) would work for physics problems? Why is this formulation useful? Are there generalizable principles we want to have a good intuition for before learning HMC? Let’s explore these questions.

Deriving Lagrangian’s equation

We want to reformulate classical or Newtonian mechanics into a framework that models energies rather than forces. To show how Joseph–Louis Lagrange may have arrived at his famous equation (1),(1), let’s rederive it directly from Newton’s second law, F=maF = ma. For ease of notation, we’ll stick to one dimension.

We saw in (5)(5) that we can write F=maF = ma as

ddxV=mx,(8) -\frac{\text{d}}{\text{d}x} V = m x, \tag{8}

which we can further rearrange as

dVdxddtmx˙=0.(9) -\frac{\text{d} V}{\text{d}x} - \frac{\text{d}}{\text{d}t} m \dot{x} = 0. \tag{9}

We can then express the right-most term to be a derivative with respect to x˙\dot{x}:

dVdxddt[x˙mx˙22]=0.(10) -\frac{\text{d} V}{\text{d}x} - \frac{\text{d}}{\text{d}t} \Big[ \frac{\partial}{\partial\dot{x}} \frac{m \dot{x}^2}{2} \Big] = 0. \tag{10}

Now notice that the partial derivative of (mx˙2)/2(m \dot{x}^2) /2 w.r.t. xx is zero, and the partial derivative of VV w.r.t. to x˙\dot{x} is also zero. Therefore, we can write (10)(10) as

x[mx˙22V]ddt[x˙(mx˙22V)]=0.(11) \frac{\partial}{\partial x} \Big[ \frac{m\dot{x}^2}{2} - V \Big] -\frac{\text{d}}{\text{d}t} \Big[\frac{\partial}{\partial\dot{x}} \Big( \frac{m \dot{x}^2}{2} - V \Big) \Big] = 0. \tag{11}

We’ve just added zeros to both sides of the equation. Finally, note that 12mx˙2\frac{1}{2} m \dot{x}^2 on both sides of the equation is just TT, the kinetic energy. We can rewrite (11)(11) as

x(TV)ddt[x˙(TV)]=0.(12) \frac{\partial}{\partial x} (T-V) - \frac{\text{d}}{\text{d}t} \Big[ \frac{\partial}{\partial\dot{x}} (T-V) \Big] = 0. \tag{12}

And (12)(12) is just Euler–Lagrange’s equation where the function is equal to TVT - V. And this is how we arrive at the Lagrangian in (1)(1). The function L=TVL = T - V is a consequence of reformulating F=maF = ma in terms of the Euler–Lagrange equation in (2)(2). Thinking of L=TVL = T - V in this way leads to a deep and beautiful idea in physics called the principle of stationary action.

The principle of stationary action

Recall in the previous post that we integrated our function with respect to xx—in this post xx is now tt and y(x)y(x) is now x(t)x(t)—and solved for the stationary function that extremized that integral. Intuitively, think of the functional or integral as representing all “paths” between the endpoints of the integral, and the stationary function as the extremizing or “shortest” path.

The Euler–Lagrange equation is general; it is a solution for any such functional. However, let’s now work specifically with the Lagrangian LL. Then the functional or integral of LL with respect to time is

St1t2L(t,x,x˙)dt.(13) S \triangleq \int_{t_1}^{t_2} L(t, x, \dot{x}) \text{d}t. \tag{13}

This quantity is called the action because it has dimensions (Energy)×(Time)(\text{Energy}) \times (\text{Time}). We can produce an action SS for any function x(t)x(t)—think: any change in position across time. And we already know that the stationary function of this action is given by the Euler–Lagrange equations in (2)(2). Finally, we just saw through two examples and by rederiving L=TVL = T - V that the stationary function x(t)x(t) induced by (2)(2) is in fact F=maF = ma.

What does this mean? Speaking loosely, it means that Newton’s laws of motion can be viewed as objects taking the path of “least action” or least energy across time. This is what it meant by the principle of stationary action, also called Hamilton’s principle:

The path taken by a system between two time points is the one for which the action is stationary.

In my mind, the Euler–Lagrange equation isolated from physics is just a generalization of what we learned in calculus: a function’s minimum or maximum is analogous to a functional’s stationary function. However, the fact that F=maF = ma is the stationary function for a functional that represents energy across time suggests that when apples fall to the ground or when planets move through the solar system, they are taking the path of least action. This is a really cool idea.

Generalized coordinates and constraints

The Lagrangian reformulation does not introduce any new physics. So why is it useful? First, the principle of stationary action is a beautiful insight into the laws of physics and nature. That alone justifies the effort. However, a second benefit of using Lagrangian mechanics is that we can often solve complicated problems more simply than if we used Newtonian mechanics. This is because Lagrangian mechanics elegantly handles constraints. Understanding this point requires a bit of setup, but I think it’s worth it in the context of understanding HMC.

A constraint is a restriction in the freedom of movement in a particle or a system of particles. For example, suppose a particle is free to move in three dimensions. We can specify its position using different coordinate systems: Cartesian (x,y,z)(x,y,z), cylindrical (r,φ,z)(r,\varphi,z), spherical (r,θ,φ)(r,\theta,\varphi), and so forth. A holonomic constraint is a constraint that reduces the number of coordinates needed to specify the position of a particle. For example, if the particle can only move on a table top, we might model this as a holonomic constraint to the xyx–y plane, and the number of Cartesian coordinates needed to fully specify the particle’s position is reduced from three to two. A nonholonomic constraint restricts the range of the particle. For example, requiring that the particle stay within a box or only takes a positive zz value is a nonholonomic constraint.

The minimal set of required coordinates are called generalized coordinates. Since we can use any coordinate system we like, we denote these generalized coordinates by qkq_k with k=1,2,,Kk = 1, 2, \dots, K. Here, KK is the number of all coordinates for all particles in the system after subtracting constraints. If there are NN particles, CC coordinates which can be eliminated, and JJ coordinates per particle, K=JNCK = JN - C. Thus, qkq_k denotes a single particle’s position on a single coordinate.

We can even combine coordinate systems. For example, a two particle system with K=6K = 6 might have Cartesian coordinates (x,y,z)(x, y, z) for the first particle and polar coordinates (r,θ,φ)(r, \theta, \varphi) for the second. Each generalized coordinate has an associated generalized velocity q˙k=dqk/dt\dot{q}_k = \text{d}q_k / \text{d}t. This is called a velocity because it is a change in position w.r.t. to time, but the units are not necessarily (length/time)(\text{length} / \text{time}).

After we have chosen a set of generalized coordinates for our problem, the Lagrangian can be written as

L=L(t,qk,q˙k),(14) L = L(t, q_k, \dot{q}_k), \tag{14}

and the action or time integral over the Lagrangian is

S[qk(t)]=t1t2L(t,qk,q˙k)dt.(15) S[q_k(t)] = \int_{t_1}^{t_2} L(t, q_k, \dot{q}_k) \text{d}t. \tag{15}

We can trace the same logic as in the previous post to rewrite a generalized Euler–Lagrange equation as

LqkddtLq˙k=0,k=1,2,,K.(16) \frac{\partial L}{\partial q_k} - \frac{\text{d}}{\text{d}t} \frac{\partial L}{\partial \dot{q}_k} = 0, \qquad k = 1, 2,\dots, K. \tag{16}

Part of the power of this system is that the functional we extremize, the action, is a scalar quantity. If we reparameterize the coordinate system, we induce the same action. We simply “relabel” the stationary path with the new coordinate system.

Furthermore, since the Lagrangian (1)(1) is only in terms of the energy of the system, any force on the system that does no work, that does not have a net energetic contribution, can simply be dropped from the problem at setup. We can even be clever about choosing an appropriate coordinate system that suits the problem by minimizing the number of generalized coordinates in (16)(16).

The full benefit of the Lagrangian framework is, I admit, hard to grok without doing problems, but my ultimate goal is to provide sufficient intuition for setting up Hamiltonian mechanics and therefore HMC—which I think we have done at this point. However, if you want to see a problem that is simpler in Lagrangian mechanics, consider the double pendulum. It might look like both solutions are equally tedious, but notice that the Lagrangian formulation does not require worrying about constraints. They are implicitly handled through the problem setup. As a result, the Lagrangian solution is roughly two pages, while the Newtonian solution is roughly four.

Finally, we can (kind of) understand what physicists mean when they say that the Lagrangian formulation can exploit the “symmetry” in a problem. By symmetry, we mean that if we change the coordinates by some small values, another quantity is conserved, meaning it does not change. If we know a problem has such a symmetry, we can use appropriate generalized coordinates to exploit it. This gives some intuition for why the Lagrangian reformulation might make very complex dynamical systems easier to work with. Even when many particles are moving in the system, from the perspective of a conserved quantity, they are still well-behaved.

Hamiltonian mechanics

We are finally ready for Hamiltonian mechanics, which is a reformulation of Lagrangian mechanics. Compared to Newtonian mechanics, both Lagrangian and Hamiltonian mechanics seem similar. However, there are good reasons to prefer the latter reformulation, and I’ll focus on points that make sense in the context of Hamiltonian Monte Carlo.

While the Lagrangian was L=TVL = T - V where TT is kinetic energy and VV is potential energy, the Hamiltonian is

Hq˙kpkL=T+V.(17) H \triangleq \dot{q}_k p_k - L = T + V. \tag{17}

Here, pkp_k is the generalized momentum induced by the generalized coordinates. This is because pk=L/q˙kp_k = \partial L / \partial \dot{q}_k. Let’s derive this definition from the Lagrangian in (1)(1). This derivation will highlight a couple of key points about the Hamiltonian reformulation and show us how to get from Lagrangian to Hamiltonian mechanics. The big idea is that we can derive the Hamiltonian (17)(17) by taking the derivative of LL w.r.t to time and setting that expression equal to zero.

Note that L(t,qk,q˙k)L(t, q_k, \dot{q}_k) explicitly depends on tt, but that qkq_k and q˙k\dot{q}_k also implictly depend on tt. Therefore, the total derivative of LL w.r.t. tt requires the multivariable chain rule:

dLdt=Lqkqkt+Lq˙kq˙kt+Lt(18) \frac{\text{d} L}{\text{d}t} = \frac{\partial L}{\partial q_k} \frac{\partial q_k}{\partial t} + \frac{\partial L}{\partial \dot{q}_k} \frac{\partial \dot{q}_k}{\partial t} + \frac{\partial L}{\partial t} \tag{18}

Now let’s substitute the left-most term in (18)(18) with the right-hand side of the Euler–Lagrange equation while simplifying any derivative w.r.t. tt using our dot notation:

dLdt=ddtLq˙kq˙k+Lq˙kqk¨+Lt.(19) \frac{\text{d} L}{\text{d}t} = \underbrace{\frac{\text{d}}{\text{d}t} \frac{\partial L}{\partial \dot{q}_k} \dot{q}_k + \frac{\partial L}{\partial \dot{q}_k} \ddot{q_k}} + \frac{\partial L}{\partial t}. \tag{19}

Now note that everything in the underbrace above can be viewed as a product rule because

ddt(q˙kLq˙k)=(ddtLqk)q˙k+(ddtq˙k)Lq˙k=ddtLq˙kq˙k+Lq˙kq¨k(20) \begin{aligned} \frac{\text{d}}{\text{d} t} \Big( \dot{q}_k \frac{\partial L}{\partial \dot{q}_k} \Big) &= \Big(\frac{\text{d}}{\text{d}t}\frac{\partial L}{\partial q_k}\Big) \dot{q}_k + \Big(\frac{\text{d}}{\text{d} t} \dot{q}_k \Big) \frac{\partial L}{\partial \dot{q}_k} \\ &= \frac{\text{d}}{\text{d}t} \frac{\partial L}{\partial \dot{q}_k} \dot{q}_k + \frac{\partial L}{\partial \dot{q}_k} \ddot{q}_k \end{aligned} \tag{20}

Substituting (20)(20) into (19)(19), we get

dLdt=ddt(q˙kLq˙k)+Lt.(21) \frac{\text{d} L}{\text{d}t} = \frac{\text{d}}{\text{d} t} \Big( \dot{q}_k \frac{\partial L}{\partial \dot{q}_k} \Big) + \frac{\partial L}{\partial t}. \tag{21}

We can now move the left-hand side to the right-hand side and factor out the total time derivative,

0=ddt(q˙kLq˙kL)+Lt.(22) 0 = \frac{\text{d}}{\text{d} t} \Big( \dot{q}_k \frac{\partial L}{\partial \dot{q}_k} - L \Big) + \frac{\partial L}{\partial t}. \tag{22}

We’re almost done. Now notice that L/q˙k\partial L / \partial \dot{q}_k is the generalized momentum in (17)(17). So we can write (22)(22) as

0=ddt(q˙kpkL)Hamiltonian H+Lt.(23) 0 = \frac{\text{d}}{\text{d} t} \underbrace{\Big( \dot{q}_k p_k - L \Big)}_{\text{Hamiltonian $H$}} + \frac{\partial L}{\partial t}. \tag{23}

The underbraced term in (23)(23) is the Hamiltonian or HH in (17)(17). Finally, note that the generalized momentum, pk=L/q˙kp_k = \partial L / \partial \dot{q}_k can be written as mq˙km \dot{q}_k. This is by taking the partial derivative of the Lagrangian in (17)(17) w.r.t. q˙\dot{q}. Thus, we can write the Hamiltonian as

H=q˙kpkL=q˙k2m12q˙k2m+V(qk)=12mq˙k2+V(qk)=T+V.(24) \begin{aligned} H &= \dot{q}_k p_k - L \\ &= \dot{q}_k^2 m - \frac{1}{2} \dot{q}_k^2 m + V(q_k) \\ &= \frac{1}{2} m \dot{q}_k^2 + V(q_k) \\ &= T + V. \end{aligned} \tag{24}

Here, we use the fact that pk=mq˙kp_k = m \dot{q}_k and that LL can be written in terms of generalized coordinates, velocities, and momenta as

L=12mq˙k2+V(qk).(25) L = \frac{1}{2} m \dot{q}_k^2 + V(q_k). \tag{25}

Why is this Hamiltonian a nice object? Notice that in (23)(23), if L/t=0\partial L / \partial t = 0, if the Lagrangian is not an explicit function of time, the Hamiltonian is constant or conserved. And the interpretation of HH is fairly straightforward: it is simply the energy of the particle. In fact, one way physicists can define “energy” is simply as the conserved quantity when a system has time translational symmetry. Thus, the Hamiltonian reformulation emphasizes a particular invariance that is commonly found in physical systems.

Again, none of this breaks Newtonian physics, but hopefully it’s clear that this is a very elegant and useful way to model the world.

Hamilton’s equations

Remember how the real goal of these posts was to understand enough physics to understand Hamiltonian Monte Carlo? Well, now we are finally ready to present the very first set of equations in Radford Neal’s introduction to HMC (Neal, 2011): Hamilton’s equations. Let WW denote work, and note that we can write the potential energy VV as

V=W=Fqk=dpkdtqk=p˙kqk.(24) V = -W = -F q_k = -\frac{\text{d}p_k}{\text{d}t} q_k = -\dot{p}_k q_k. \tag{24}

Furthermore, note that we can write the kinetic energy as

T=12mq˙k2=pk22m(25) T = \frac{1}{2} m \dot{q}_k^2 = \frac{p_k^2}{2m} \tag{25}

because pk=mq˙kp_k = m \dot{q}_k. Taken together, this means

H=T+V=pk22mp˙kqk,(26) H = T + V = \frac{p_k^2}{2m} -\dot{p}_k q_k, \tag{26}

and therefore

Hqk=p˙k=dpkdtHpk=pkm=q˙k=dqkdt.(26) \begin{aligned} \frac{\partial H}{\partial q_k} &= -\dot{p}_k = - \frac{\text{d}p_k}{\text{d}t} \\ \frac{\partial H}{\partial p_k} &= \frac{p_k}{m} = \dot{q}_k = \frac{\text{d}q_k}{\text{d}t}. \end{aligned} \tag{26}

(26)(26) are called Hamilton’s equations. For some physical intuition, consider the one-dimensional case of a single particle. Since

dpdt=p˙=mq¨=F,(27) \frac{\text{d}p}{\text{d}t} = \dot{p} = m \ddot{q} = F, \tag{27}

the first of Hamilton’s equations says that Newtonian force of the particle is equal to the negative gradient of its potential energy. And since the time derivative of position is generalized velocity, the second equation says that the velocity of the particle is equal to the time derivative of its kinetic energy.

As we will see in the next post, we are going to leverage Hamilton’s equations and some of their properties to build a proposal distribution that reduces correlation between samples by conserving energy across Markov chain steps.

  1. Neal, R. M. (2011). MCMC using Hamiltonian dynamics.