MCMC (6): Gibbs Sampling and Overrelaxation
Efficient Monte Carlo sampling
This post is an extension of the post about the Hamiltonian Monte Carlo method, so I assume the reader has already read that post. Like Hamiltonian Monte Carlo, overrelaxation reduces the random-walk behavior of Monte Carlo sampling and speeds up the convergence of the Markov chain.
Gibbs sampling
Before studying overrelaxation, we first look at Gibbs sampling. In the general case of a system with $K$ variables, a single iteration samples one variable at a time from its conditional distribution given the current values of all the others:
$$ x^{(t+1)}_1 \sim P(x_1 \mid x^{(t)}_2, x^{(t)}_3, x^{(t)}_4, \ldots, x^{(t)}_K), $$
$$ x^{(t+1)}_2 \sim P(x_2 \mid x^{(t+1)}_1, x^{(t)}_3, x^{(t)}_4, \ldots, x^{(t)}_K), $$
$$ \cdots $$
Gibbs sampling suffers from the same defect as simple Metropolis algorithms: the state space is explored by a slow random walk. Progress made by Gibbs sampling is slow when the largest length scale $L$ of the distribution is much greater than the smallest conditional width $\epsilon$, i.e. $L \gg \epsilon$. However, Gibbs sampling involves no adjustable parameters, so it is an attractive strategy when one wants to get a model running quickly.
We consider a simple 2-dimensional Gaussian distribution,
$$ P(x, y) = \frac{1}{Z} \exp\!\left(-\left(a\,x^2 + b\,x\,y + c\,y^2\right)\right). $$
When all the other coordinates $x_j$ with $j \ne i$ are held constant, the conditional distribution of $x_i$ is a 1-dimensional Gaussian whose mean $\mu_i$ and variance $\sigma_i^2$ are determined by these coefficients. Let us see the Python code and the plot.
import numpy as np
import random
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
def generalGibbsDensity(X1, X2):
    return np.exp(-(1.25*X1*X1 + 1.25*X2*X2))
#initial point
x0 = np.array([-1.3,-1.3])
x1 = x0
# standard deviation of the conditional Gaussian: sqrt(1/(2*1.25)) ~ 0.632
sigma = 0.6324555320336759
# the conditional mean mu = 0 is also read off from the density (no x-y coupling term)
N = 20
Listx = np.zeros(2*N+1)
Listy = np.zeros(2*N+1)
Listx[0], Listy[0] = x0
for i in range(N):
    # conditional gaussian for x
    x1[0] = np.random.normal(0, sigma)
    Listx[2*i+1] = x1[0]
    Listy[2*i+1] = x0[1]
    # conditional gaussian for y
    x1[1] = np.random.normal(0, sigma)
    Listx[2*i+2] = x1[0]
    Listy[2*i+2] = x1[1]
    x0 = x1
plt.figure()
x1 = np.arange(-2.0, 2.0, 0.01)
x2 = np.arange(-2.0, 2.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(1.25*X1*X1+1.25*X2*X2)))
plt.plot(Listx, Listy,'o-')
plt.axis([-1.5,1.5,-1.5,1.5])
pass
For this simple example, the target density is $e^{-1.25(x^2+y^2)}$. Therefore, when $y$ is held constant, the conditional distribution of $x$ is Gaussian with $\mu_x = 0$ and, comparing $e^{-1.25 x^2}$ with $e^{-x^2/(2\sigma_x^2)}$, $\sigma_x = \sqrt{1/(2 \times 1.25)} \approx 0.632$.
By x-y exchange symmetry, $\mu_y = \mu_x$ and $\sigma_y = \sigma_x$.
In Gibbs sampling, the even-numbered samples are drawn from the 1-dimensional Gaussian $P(y \mid x = \text{const}, \mu_y, \sigma_y)$ and the odd-numbered samples are drawn from the 1-dimensional Gaussian $P(x \mid y = \text{const}, \mu_x, \sigma_x)$. For a general target density in $n$ dimensions, we run an additional inner loop over the coordinates for this procedure. In other words, the algorithm is fundamentally a double-loop algorithm.
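As a rough illustration of that double loop, here is a minimal sketch. The helper sample_conditional(i, x), which draws coordinate i from its conditional distribution given the other coordinates, is a hypothetical placeholder and not part of the code in this post.
import numpy as np

def gibbs_chain(x0, sample_conditional, n_iter):
    # outer loop: Gibbs iterations
    chain = [np.asarray(x0, dtype=float)]
    for _ in range(n_iter):
        x = chain[-1].copy()
        # inner loop: update each coordinate from its conditional distribution
        for i in range(len(x)):
            x[i] = sample_conditional(i, x)
        chain.append(x)
    return np.array(chain)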
A narrow target distribution
Now we consider the same density function that we used in the Hamiltonian Monte Carlo post.
def density(X1,X2):
    return np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2))
#initial point
x0 = np.array([-1.3,-1.3])
x1 = x0
# standard deviation of the conditional Gaussian: sqrt(1/(2*250.25)) ~ 0.0447
sigma = 0.04469901562676742
# the conditional mean, 0.998 times the other coordinate, is also obtained from the density
N = 200
Listx = np.zeros(2*N+1)
Listy = np.zeros(2*N+1)
Listx[0], Listy[0] = x0
for i in range(N):
    # conditional gaussian for x
    x1[0] = np.random.normal(0.998*x0[1], sigma)
    Listx[2*i+1] = x1[0]
    Listy[2*i+1] = x0[1]
    # conditional gaussian for y
    x1[1] = np.random.normal(0.998*x1[0], sigma)
    Listx[2*i+2] = x1[0]
    Listy[2*i+2] = x1[1]
    x0 = x1
plt.figure()
x1 = np.arange(-2.0, 1.0, 0.01)
x2 = np.arange(-2.0, 1.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)))
plt.plot(Listx, Listy)
plt.axis([-1.5,0.5,-1.5,0.5])
pass
plt.figure()
x1 = np.arange(-2.0, 1.0, 0.01)
x2 = np.arange(-2.0, 1.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)))
plt.plot(Listx[:50], Listy[:50])
plt.axis([-1.5,0,-1.5,0])
pass
See that the samples are stuck around the starting point. The Hamiltonian Monte Carlo sampler had a similar problem: its samples tend to move where the gradient is large, that is, along the semi-minor axis of the ellipse. The remaining randomness moves them along the semi-major axis, but only slowly. Even though the acceptance rate is high, it takes a long time to obtain a convergent Markov chain.
Also note that the mean of each conditional probability depends on the current position because of the coupling term, $-2 \times 249.75\, x\, y$, in the joint probability density.
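To make this explicit, completing the square in the exponent gives the conditional distribution of $x$ for fixed $y$ (and, by symmetry, of $y$ for fixed $x$):
$$ P(x \mid y) \propto \exp\!\left(-250.25\left(x - \tfrac{249.75}{250.25}\,y\right)^{2}\right), $$
so the conditional mean is $\mu_{x|y} = \frac{249.75}{250.25}\,y \approx 0.998\,y$ and the conditional standard deviation is $\sigma = \sqrt{1/(2 \times 250.25)} \approx 0.0447$, which are exactly the numbers used in the code above.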
Simultaneous update
We can also try updating x and y at the same time, both drawn from their conditionals given the previous point.
# if you update x, y at the same time.
x0 = np.array([-1.3,-1.3])
x1 = x0
# standard dev from the density function
sigma = 0.04469901562676742
# mu is also obtained from the density
N = 100
Listx = np.zeros(N+1)
Listy = np.zeros(N+1)
Listx[0], Listy[0] = x0
for i in range(N):
    # draw x and y at the same time, both conditioned on the previous point x0
    x1 = np.array([np.random.normal(0.998*x0[1], sigma),
                   np.random.normal(0.998*x0[0], sigma)])
    Listx[i+1] = x1[0]
    Listy[i+1] = x1[1]
    x0 = x1
plt.figure()
x1 = np.arange(-2.0, 1.0, 0.01)
x2 = np.arange(-2.0, 1.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)))
plt.plot(Listx[:50], Listy[:50])
plt.axis([-1.5,0,-1.5,0])
pass
Overrelaxation
def density(X1,X2):
    return np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2))
#initial point
x0 = np.array([-1.3,-1.3])
x1 = x0
# standard dev from the density function
sigma = 0.04469901562676742
# mu is also obtained from the density
N = 200
Listx = np.zeros(2*N+1)
Listy = np.zeros(2*N+1)
Listx[0], Listy[0] = x0
for i in range(N):
    # conditional gaussian for x
    x1[0] = np.random.normal(0.998*x0[1], sigma)
    Listx[2*i+1] = x1[0]
    Listy[2*i+1] = x0[1]
    # conditional gaussian for y
    x1[1] = np.random.normal(0.998*x1[0], sigma)
    Listx[2*i+2] = x1[0]
    Listy[2*i+2] = x1[1]
    x0 = x1
plt.figure(figsize=(10,10))
plt.subplot(2,2,1)
x1 = np.arange(-2.0, 1.0, 0.01)
x2 = np.arange(-2.0, 1.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)))
plt.plot(Listx, Listy)
plt.axis([-1.5,0.5,-1.5,0.5])
plt.subplot(2,2,3)
x1 = np.arange(-2.0, 1.0, 0.01)
x2 = np.arange(-2.0, 1.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)))
plt.plot(Listx[:50], Listy[:50])
plt.axis([-1.5,0,-1.5,0])
# overrelaxed Gibbs sampling
x0 = np.array([-1.3,-1.3])
x1 = x0
# standard dev from the density function
sigma = 0.04469901562676742
# mu is also obtained from the density
N = 100
Listx = np.zeros(N+1)
Listy = np.zeros(N+1)
Listx[0], Listy[0] = x0
# overrelaxation constant alpha between -1 and 0
a = -0.98
for i in range(N):
    # overrelaxed update for x, conditioned on the current y
    x1[0] = 0.998*x0[1] + a*(x0[0]-0.998*x0[1]) + (1-a*a)**0.5*sigma*np.random.normal(0,1)
    # overrelaxed update for y, conditioned on the freshly updated x
    x1[1] = 0.998*x1[0] + a*(x0[1]-0.998*x1[0]) + (1-a*a)**0.5*sigma*np.random.normal(0,1)
    Listx[i+1] = x1[0]
    Listy[i+1] = x1[1]
    x0 = x1
plt.subplot(2,2,2)
x1 = np.arange(-2.0, 1.0, 0.01)
x2 = np.arange(-2.0, 1.0, 0.01)
X1, X2 = np.meshgrid(x1, x2)
plt.contour(X1,X2,np.exp(-(250.25*X1*X1-2*249.75*X1*X2+250.25*X2*X2)), alpha=0.3)
plt.plot(Listx[:50], Listy[:50])
plt.axis([-1.5,1,-1.5,1])
plt.subplot(2,2,4)
plt.plot(Listx[:50], Listy[:50])
plt.axis([-1.5,3,-1.5,3])
There are 200 Gibbs iterations in the top-left plot, but only 50 overrelaxed samples in the top-right plot. As the plots make clear, overrelaxation suppresses the random-walk behavior and achieves its purpose.
We still sweep over x and then y in each iteration; the only change in the algorithm is inside the for loop. We assume the conditional probability is Gaussian, $\mathcal{N}(\mu, \sigma^2)$. Given the $n$-th sample $x_i^{(n)}$, the $(n+1)$-th sample is chosen as follows.
$$ x_i^{(n+1)} = \mu +\alpha (x_i^{(n)} - \mu) + (1 - \alpha^2)^{1/2} \sigma \nu $$
$\alpha$ is the parameter that controls the extent of the overrelaxation; it lies between $-1$ and $1$. When $\alpha \approx -1$, the new sample lands roughly at the point opposite the previous sample with respect to the conditional mean. When $\alpha$ is positive, the update is an underrelaxation. The conditional probability distribution is invariant under this transformation. $\nu$ is the remaining randomness; here we chose $\nu \sim \mathcal{N}(0, 1)$. Other sources of randomness are allowed as long as the transformation leaves the conditional distribution invariant.
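As a minimal sketch, the update above can be packaged as a small helper for a Gaussian conditional; the function name and arguments here are just for illustration, not part of the code in this post.
import numpy as np

def overrelaxed_step(x_old, mu, sigma, alpha=-0.98):
    # overrelaxed draw from the Gaussian conditional N(mu, sigma^2);
    # alpha = 0 recovers ordinary Gibbs, alpha close to -1 reflects about mu
    nu = np.random.normal(0.0, 1.0)
    return mu + alpha * (x_old - mu) + np.sqrt(1.0 - alpha**2) * sigma * nu
In the example above this corresponds to mu = 0.998 times the other coordinate and sigma of about 0.0447.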
Other efficient Monte Carlo methods
Simulated annealing is another way to reach a convergent Markov chain faster. The probability distribution in statistical mechanics is
$$ P(x) = {e^{-E(x)}\over Z} $$
Mathematically, simulated annealing introduces a parameter that controls a perturbative energy term. You split the energy into a good part and a bad part, $E_g(x) + E_b(x)/T$, and you control the coefficient of the bad term, $1/T$. Note that the other efficient Monte Carlo methods also have parameters: Hamiltonian Monte Carlo has the leapfrog step parameters, and overrelaxed Gibbs sampling has the overrelaxation parameter. All of these parameters are used to suppress the random walk in Monte Carlo sampling. What I am really interested in is quantum annealing; I hope to have a chance to discuss it in detail.
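As a rough sketch of this idea (not code from this post), a Metropolis sampler with a simple temperature schedule could look like the following; the energy functions E_g and E_b and the geometric cooling schedule are hypothetical placeholders.
import numpy as np

def annealed_metropolis(E_g, E_b, x0, n_iter, T0=10.0, cooling=0.99, step=0.1):
    # sample from exp(-(E_g(x) + E_b(x)/T)) while gradually lowering T towards 1
    x = np.asarray(x0, dtype=float)
    T = T0
    samples = [x]
    for _ in range(n_iter):
        proposal = x + step * np.random.normal(size=x.shape)
        dE = (E_g(proposal) + E_b(proposal)/T) - (E_g(x) + E_b(x)/T)
        if np.random.rand() < np.exp(-dE):   # Metropolis acceptance rule
            x = proposal
        samples.append(x)
        T = max(1.0, cooling * T)            # cool down; T = 1 is the target distribution
    return np.array(samples)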