Neural Network (5) : Very Simple Boltzmann Machine
Stochastic Hopfield net
The Boltzmann machine is nothing but a stochastic Hopfield net[^1]. If you have not yet read the post on the Hopfield net in this blog, go read it first: I assume the reader is familiar with it, and I directly reuse many results from that post. The magic of deep learning we have discussed a couple of times works here too. As with the $\epsilon$-greedy off-policy algorithm[^2], the stochastic character of the binary units allows the machine to occasionally increase its energy and escape from poor local minima. Keen readers will also have realized that simulated annealing[^3], through a temperature factor in the energy function, can help the machine cross the potential walls around local minima and explore toward the global minimum. Thus I include simulated annealing in the algorithm. I did not fine-tune these settings carefully, but it seems to work anyway.
We still use the simplified energy function of the Ising model.
$$ E(\boldsymbol{x}) = - \frac{1}{2} \boldsymbol{x}^{T} \boldsymbol{W} \boldsymbol{x} $$
The activity rule of the Boltzmann machine is: after computing the activation $a_i$, we draw the state of unit $i$ by Gibbs sampling from the probability distribution.
$$
\begin{array}{l}
\text{set } x_i = +1 \text{ with probability } \dfrac{1}{1+e^{-2a_i}}, \\
\text{else set } x_i = -1.
\end{array}
$$
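For concreteness, here is a minimal sketch of this activity rule as a helper function of my own (assuming a symmetric weight matrix `W` and a state vector `x` of $\pm 1$ values); it is not the exact code used later in the post.

import random
import numpy as np

def gibbs_update(x, W, i):
    # resample unit i of the +/-1 state vector x under the weights W
    a_i = np.dot(W[i], x)
    p_plus = 1.0/(1.0 + np.exp(-2.0*a_i))
    x[i] = 1 if random.random() < p_plus else -1
    return x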
This can also be explained as follows. The derivative of the log likelihood with respect to a weight is
$$ \frac{\partial}{\partial w_{ij}} \ln P\!\left(\{\boldsymbol{x}^{(n)}\} \,\middle|\, \boldsymbol{W}\right) = \sum_{n} \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(\boldsymbol{x}|\boldsymbol{W})} \right]. $$
The first term is the Hebbian term we treated in the post on the Hopfield net. The second term is difficult to evaluate exactly, but it can be estimated by Monte Carlo, using the Gibbs sampling above.
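As a rough sketch of how this gradient becomes a learning rule (my own illustration with hypothetical names, not the training loop used below): the Hebbian term is an average over the data, and the second term is a Monte Carlo average over Gibbs samples drawn from the model.

def boltzmann_step(W, data, model_samples, eta=0.01):
    # data: training vectors in {-1,+1}; model_samples: Gibbs samples from the current model
    data_corr = np.mean([np.outer(x, x) for x in data], axis=0)            # Hebbian term
    model_corr = np.mean([np.outer(x, x) for x in model_samples], axis=0)  # Monte Carlo estimate
    return W + eta*(data_corr - model_corr)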
Plain Shifter ensemble[^4]
from __future__ import division

import random
import numpy as np
import matplotlib.pyplot as plt

%precision 12
%matplotlib inline
plt.style.use('ggplot')
def generator(N):
    X = np.zeros((N, 19))
    for j in range(N):
        firstline = np.random.randint(0, 2, 8)
        secline = np.zeros(8)
        translation = np.random.randint(0, 3)
        thirdline = np.zeros(3)
        if translation == 0:      # left shift
            for i in range(8):
                secline[i] = firstline[(i+1) % 8]
            thirdline[0] = 1
        elif translation == 1:    # no shift
            for i in range(8):
                secline[i] = firstline[i]
            thirdline[1] = 1
        else:                     # right shift
            for i in range(8):
                secline[i] = firstline[(i-1) % 8]
            thirdline[2] = 1
        X[j] = np.concatenate((firstline, secline, thirdline))
    X = (X - 0.5)*2    # map {0, 1} to {-1, +1}
    return X
def display(X):
    plt.imshow(X[:16].reshape((2, 8)))

XX2 = generator(5)
fig = plt.figure(figsize=(4, 8))
for j in range(5):
    ax = fig.add_subplot(5, 1, j+1)
    display(XX2[j])
Here are 5 samples of the shifter ensemble. The second line is one of 3 kinds of shifts of the first line. Left shift, stay, and right shift are expressed in the last 3 states of the inputs. Our goal is, from the 2 lines of image inputs, to recover the last 3 states, i.e. the direction of the shift. This algorithm uses the associative memory property of the Hopfield net.
However, we cannot make the machine learn the shifter ensemble with a plain Hopfield net or Boltzmann machine: they are limited to second-order statistics, and the shifter ensemble is simple but still third-order. The second-order correlation between any label neuron and any image neuron is zero over the total ensemble.
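To see this concretely, here is a quick sanity check (a sketch using the generator defined above, with unit indices of my choosing): the second-order correlation between a label unit and an image unit is close to zero, while a suitable third-order correlation is not.

X = generator(50000)                          # many samples from the shifter ensemble
print(np.mean(X[:, 16] * X[:, 0]))            # "left" label unit vs. an image unit: close to 0
print(np.mean(X[:, 16] * X[:, 1] * X[:, 8]))  # label x first-row unit x shifted second-row unit: clearly nonzero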
Hidden units are the solution suggested by G. E. Hinton et al., while D. MacKay proposes instead to create models that directly capture higher-order correlations.
Then we also need to consider the increment $\Delta \boldsymbol{V}$ for the third-order weights, and this higher-order term must be included in the energy and hence in the activation, as sketched below.
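To make this explicit, the following is my reading of the model implemented in the code below, with the factor conventions chosen to match it: the energy acquires a third-order term with weights $v_{ijl}$, and the activation of unit $i$ picks up the corresponding quadratic contribution.

$$
E(\boldsymbol{x}) = -\frac{1}{2}\sum_{i,j} w_{ij}\, x_i x_j - \frac{1}{6}\sum_{i,j,l} v_{ijl}\, x_i x_j x_l ,
\qquad
a_i = \sum_{j} w_{ij}\, x_j + \frac{1}{2}\sum_{j,l} v_{ijl}\, x_j x_l .
$$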
Note that the learning rates of $\boldsymbol{W}$ and $\boldsymbol{V}$ can be different. I would say the learning rate of $\boldsymbol{V}$ could be smaller, since the higher-order term only does the fine-tuning. These parameters, the learning rates and the annealing temperature, control the probability density used for Gibbs sampling: the temperature effectively rescales the activation, and to exploit the stochastic property the following probability should not saturate at the early iterations. It is also notable that the Boltzmann machine becomes the Hopfield net when $T \rightarrow 0$.
$$ {1 \over 1+e^{-2a_i}} $$
Let us consider 10000 samples, using the generator we made above. Temp implements the simulated annealing: the temperature decreases with the iterations and stops decreasing after half of them.
N = 10000
XX2 = generator(N)
T = 1
eta1 = 0.01    # learning rate of W
eta2 = 0.01    # learning rate of V

def Temp(T, N, n):
    # annealing schedule: linear decrease over the first half of the iterations, then frozen
    if n < N/2:
        return 2*T/N*(N/2 - n) + 0.01
    else:
        return 0.001

wake = np.zeros((19, 19))       # second-order weights W
v = np.zeros((19, 19, 19))      # third-order weights V

for k in range(N):
    for i in range(19):
        for j in range(19):
            wake[i, j] += eta1*XX2[k][i]*XX2[k][j]/Temp(T, N, k)
            for l in range(19):
                v[i, j, l] += eta2/2*XX2[k][i]*XX2[k][j]*XX2[k][l]/Temp(T, N, k)
        # stochastic update of unit i of the current sample (activity rule)
        a = np.dot(wake, XX2[k]) + 1/2*np.dot(np.dot(v, XX2[k]), XX2[k])
        if random.random() < 1/(1 + np.exp(-2*a[i])):
            XX2[k][i] = 1
        else:
            XX2[k][i] = -1
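As an optional check (my own addition, not part of the original run), the learned second-order weights can be visualized; the correlations between the first and second image rows should appear as diagonal bands, while the columns of the label units stay near zero.

plt.figure(figsize=(4, 4))
plt.imshow(wake)        # learned second-order weights W
plt.colorbar()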
Let us try a test sample. We fix the label neurons to the value +1; this yields a higher-energy state, and we want to see whether the learned network settles to the minimum with the correct label.
testX = generator(1)[0]
testX[16:] = 1
display(testX)
turns = 100
for j in range(turns):
    for i in range(19):
        a = np.dot(wake, testX) + 1/2*np.dot(np.dot(v, testX), testX)
        if random.random() < 1/(1 + np.exp(-2*a[i])):
            testX[i] = 1
        else:
            testX[i] = -1
display(testX)
testX[16:]
array([ 1., -1., -1.])
We see that the program remembers the pattern and finds the correct labeling: the label units $(+1, -1, -1)$ correspond to a left shift.
Try 1000 samples.
S = 1000
test = generator(S)
result = test[:, 16:].copy()    # keep a copy of the true labels (slicing alone gives a view)
test[:, 16:] = 1                # clamp the label units before sampling
for t in range(S):
    for i in range(19):
        a = np.dot(wake, test[t]) + 1/2*np.dot(np.dot(v, test[t]), test[t])
        if random.random() < 1/(1 + np.exp(-2*a[i])):
            test[t][i] = 1
        else:
            test[t][i] = -1
correct = np.sum(np.all(result == test[:, 16:], axis=1))
print("precision: {} %".format(correct/S*100))
precision: 100.0 %
All the samples find the correct labels. We have seen again that the Boltzmann machine has associative memory. This is what makes the Boltzmann machine so powerful with imbalanced data, noisy labels, and missing values.
Similarity with the handwritten digit recognition problem
Fundamentally, this example is the same as other famous Boltzmann machine examples such as handwritten digit recognition: train on image and label data, and from a test image predict its label. In the handwritten digit recognition problem the labels are 0 to 9, and in the shifter problem the labels are left, stay, and right; in both cases we vectorize the label as a binary vector.
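For instance (a hypothetical illustration of the analogy, not code from the digit example), both problems can use the same $\pm 1$ one-of-$K$ encoding of the label units:

def encode_label(index, n_labels):
    # +1 at the true label, -1 elsewhere, matching the shifter label units
    label = -np.ones(n_labels)
    label[index] = 1
    return label

encode_label(0, 3)     # shifter "left" -> array([ 1., -1., -1.])
encode_label(3, 10)    # digit "3"      -> ten label units, +1 at position 3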
[^1]: I recommend reading G. Hinton, "Boltzmann Machines".

[^2]: We studied it in detail in the post on deep reinforcement learning.

[^3]: We studied it roughly in the post on soft K-means, and also used it in the post on deep reinforcement learning.

[^4]: This example was introduced by G. Hinton and T. Sejnowski in Learning and relearning in Boltzmann machines. They solved it by introducing hidden units; I will consider hidden units in the post on the restricted Boltzmann machine.