Yay! Finally something more directly from physics to data science. We will also have a chance to see how Metropolis-Hastings algorithm works!

The Hamiltonian Monte Carlo method is a kind of Metropolis-Hastings method. One of the weak points of Monte Carlo sampling comes up with random walks. Hamiltonian Monte Carlo method (HMC) is an approach to reducing the randomizing in algorithm of the sampling.

The original name was hybrid Monte Carlo method. I prefer calling it Hamiltonian Monte Carlo method!. Sounds cooler and reminds me physics! Hamiltonian mechanics is one of the coolest topics you can learn in mechanics class. I would not have introduced about the real amazing part of the Hamiltonian mechanics, but I still strongly recommend you to look into it from other references. Graduate level of mechanics books must have the part.

We will approach the idea from Metropolis-Hastings method with very narrow target density.

Metropolis-Hastings method review

In Metropolis-Hastings sampling, we use a proposal density $Q (\boldsymbol{x}, \boldsymbol{x'})$, which is a probability to transit from $\boldsymbol{x}$ to $\boldsymbol{x'}$. $\boldsymbol{x'}$ is the probable next position.

  • If the proposal distribution is symmetric under the switch of position vectors, the proposal distribution becomes irrelevant at the procedures. What matters is relative proportion of $P^* (\boldsymbol{x})$.
  • If $P^* (\boldsymbol{x'}) / P^* (\boldsymbol{x})$ is bigger than 1, just draw the $\boldsymbol{x'}$ to the sample. If $\boldsymbol{x} = x^{(n)}$, we set $\boldsymbol{x'} = x^{(n+1)}$.
  • If $P^* (\boldsymbol{x'}) / P^* (\boldsymbol{x})$ is smaller than 1, draw the $\boldsymbol{x'}$ to the sample with the probability, $P^* (\boldsymbol{x'}) / P^* (\boldsymbol{x})$.

Intuitively, if we keep doing this procedure, we will obtain the sample of $P(\boldsymbol{x})$ up to some proportional constant.

import numpy as np
import random
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

def P_star(X1,X2):
    return np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2))


# initial point

x0, y0 = np.array([-1,1])

Listx , Listy = [x0], [y0]

# number of sample
N = 100


for i in range(N):
    xn, yn = random.uniform(-1,1),random.uniform(-1,1)
    a = P_star(xn,yn)/P_star(x0,y0)
    if a >=1:
        Listx.append(xn)
        Listy.append(yn)
        x0,y0 = xn, yn
    else:
        if random.random()<a:
            Listx.append(xn)
            Listy.append(yn)
            x0,y0 = xn, yn


You will see we define the proportion of the target densities.

a = P_star(xn,yn)/P_star(x0,y0)

And then accept with the above conditions. I did not make a script for rejection respectively. Not needed.

# the rate of the sample

len(Listx)/100.0
0.14
plt.plot(Listx, Listy,'o')

png

plt.figure()
plt.plot(Listx, Listy,'o-', alpha=0.4)
x = np.arange(-1, 1, 0.01)
y = np.arange(-1, 1, 0.01)
X, Y = np.meshgrid(x, y)
plt.contour(X,Y,P_star(X,Y))

png

When the target density is narrow-shaped like the figure, the random samples get out of the zone more probably, and the Metropolis-Hastings sampler performs very poorly. In the above example, only 14 % of samples are selected.

Hamiltonian Monte Carlo Method

In statistical physics, the probability distribution of particles is

$$ P^* (x) = \exp (- E(x)) $$

Earlier, I wrote that Hamiltonian Monte Carlo method reduces the randomness. How do we diminish it? We suggest the direction the sample move toward, by the leapfrog algorithm. The leapfrog algorithm chooses the direction by learning the gradient of the energy.1

In the above 2 dimensional contour plot, the center of the ellipse is more probable. It is very much similar to a map of mountain. Imagine that you stand at the edge of the mountain2, and want to conquer the top of the mountain as soon as possible. The sample run by leapfrog algorithm is in the similar situation. It has the conserved dynamical energy for each turn3, kinetic energy plus potential energy. It feels the slope of the mountain and tends to go along where the slope is steeper.

def P_Hamiltonian(X1,X2,p1,p2):
    return np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)-p1*p1/2.0-p2*p2/2.0)

def gradE(x):
    return np.array([500.50*x[0]-2*249.75*x[1],-2*249.75*x[0]+500.50*x[1]])




# initial point
x0 = np.array([-1,1])

Listx = [x0]

# number of samples
N = 100
T = 100
epsilon = 0.056



for i in range(N):
    p0 = np.random.normal(0,1,2)

    for t in range(T):
        ph = p0 - epsilon* gradE(x0)/2.0
        xn = x0 + epsilon* ph
        pn = ph - epsilon* gradE(xn)/2


    a = P_Hamiltonian(xn[0],xn[1],pn[0],pn[1])/P_Hamiltonian(x0[0],x0[1],p0[0],p0[1])

    if a >=1:
        Listx.append(xn)
        x0,p0 = xn, pn
    else:
        if random.random() < a:
            Listx.append(xn)
            x0,p0 = xn, pn



for t in range(T):
    ph = p0 - epsilon* gradE(x0)/2.0
    xn = x0 + epsilon* ph
    pn = ph - epsilon* gradE(xn)/2

This for loop is where the leapfrog algorithm works, and it is nothing but finite displacement of the momentum and positions in the continuous phase space consists of the position and the momentum vectors, $(\boldsymbol{x}, \boldsymbol{p})$ under the rule of the dynamics.

To understand how samples move in the dynamic system, we solve the equation of motion, Hamiltonian equation.

$$ \dot{\boldsymbol{x}} = \boldsymbol{p} $$

$$ \dot{\boldsymbol{p}} = - { \partial E(\boldsymbol{x}) \over \partial \boldsymbol{x}} $$

# the rate of sample

len(Listx)/100.0
0.61

Now 61 % of samples are accepted.

x11=[Listx[i][0] for i in range(len(Listx))]
x22=[Listx[i][1] for i in range(len(Listx))]
plt.figure()
plt.plot(x11,x22,'-',alpha=0.4)
plt.scatter(x11,x22,c='red', s=30)
x1 = np.arange(-1, 1, 0.01)
x2 = np.arange(-1, 1, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)))

png

See how the Markov chain is sailing, started at $(-1, ~1)$. It goes toward the center quickly, but passed the center because it was too fast, and gradually converges inside the target distribution density in few turns. The size (or length) of steps are controlled by the parameter, epsilon and T. If epsilon is too big, the differentiation won’t work approximately. If it is too tiny, it will take too much time to make a convergent Markov chain over the target density. In other words, the chain of sample will move too slowly by tiny steps.


  1. More exactly, it is a potential energy. ↩︎

  2. The initial momentum to run the leapfrog algorithm is given randomly by the normal distribution $\mathcal{N}(0,1)$ in both dimensions, in the example. ↩︎

  3. For more intuitive explanation, I gave an example of a mountain. Nevertheless, a pit in the ground could be better comparison, and then roll the ball in it. The ball would not dig in the deepest hole at once. It would roll back a few times and gradually converges to the hole. Don’t get confused the total dynamical energy, potential plus kinetic energy of the sample is not conserved. For each turn, the initial momentum is given randomly by the normal distribution $\mathcal{N}(0,1)$ in both dimensions, in the example. ↩︎