Clustering (1), Hard K-means and its Failure
As I mentioned in the previous post, one of the purposes of this blog is to supplement the GitHub repository of my data science study. I will gradually post and explain all of the IPython and Mathematica notebooks.
When I searched, I could not find a good Python tutorial for spectral clustering. Anyone serious about machine learning can call scikit-learn, and the theory of spectral clustering is also easy to find. What was hard to find was a Python tutorial that walks through the spectral clustering algorithm itself.
The theory behind clustering and soft K-means can be found in David MacKay's book Information Theory, Inference, and Learning Algorithms. In particular, I read Chapters 20-22 and used the algorithms there to write Python code for various clustering methods.
Cluster
Today's topic is K-means clustering, the best-known clustering method and usually the first one you meet when you start learning about clustering.
What I want to show is when it does not work well. In later posts I will show how to avoid these troubles.
K-means
The K-means algorithm is an algorithm for putting N data points in an M-dimensional space into K clusters. Each cluster is parameterized by a vector ${\textbf m}^{(k)}$ called its mean.
First of all, set K means ${\textbf m}^{(k)}$ to random values. In the assignment step, each data point n is assigned to the nearest mean.
$$ \hat{k}^{(n)} = \operatorname*{argmin}_k \, d({\textbf m}^{(k)}, {\textbf x}^{(n)}) $$
That is, each data point joins the group whose mean is closest to it. In the update step, each mean is then moved to the centroid of the data points assigned to it, and the two steps are repeated until the assignments no longer change.
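Before the full listing below, here is a minimal NumPy sketch of one assignment-plus-update iteration. It is only an illustration, assuming X is an (N, M) array of points and means is a (K, M) array of current means; empty clusters are not handled here.

import numpy as np

def kmeans_step(X, means):
    # Assignment step: squared distance from every point to every mean, shape (N, K).
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)  # index of the nearest mean for each point
    # Update step: each mean becomes the centroid of the points assigned to it.
    new_means = np.array([X[assign == k].mean(axis=0) for k in range(len(means))])
    return assign, new_means

The full code below does the same thing with explicit loops, which is slower but easier to read.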
#!/usr/local/Cellar/python/2.7.6/bin/python
# -*- coding: utf-8 -*-
from numpy import *
from matplotlib import pyplot
import random
import numpy as np
import matplotlib.pyplot as plt

# Generate two clouds of N points each: one in the box [0.5, 1] x [0.5, 1]
# and one in the box [0, 0.5] x [0, 0.5].
N = 100
data1 = [[0.5 + 0.5 * random.random(), 0.5 + 0.5 * random.random()] for i in range(N)]
data2 = [[0.5 * random.random(), 0.5 * random.random()] for i in range(N)]
data1 = np.array(data1)
data2 = np.array(data2)
data = np.concatenate((data1, data2))
X = c_[data[:, 0], data[:, 1]]

def displaydata():
    # Scatter plot of the raw data points.
    pyplot.scatter(X[:, 0], X[:, 1])
    pyplot.show()
def findClosestCentroids(X, centroids):
    # Assignment step: give each data point the index of its nearest centroid.
    K = shape(centroids)[0]
    m = shape(X)[0]
    idx = zeros((m, 1))
    for i in range(0, m):
        lowest = inf
        lowest_index = 0
        for k in range(0, K):
            diff = X[i] - centroids[k]
            cost = diff.T.dot(diff)  # squared Euclidean distance
            if cost < lowest:
                lowest_index = k
                lowest = cost
        idx[i] = lowest_index
    return idx + 1  # add 1, since Python's index starts at 0 (clusters are numbered 1..K)
def computeCentroidsLoop(X, idx, K):
    # Update step (loop version): each centroid becomes the mean of its assigned points.
    m, n = shape(X)
    centroids = zeros((K, n))
    for k in range(1, K + 1):
        counter = 0
        cum_sum = zeros(n)
        for i in range(0, m):
            if idx[i] == k:
                cum_sum += X[i]
                counter += 1
        centroids[k - 1] = cum_sum / counter  # note: fails if a cluster ends up empty
    return centroids

def computeCentroids(X, idx, K):
    # Update step (faster version): same result as computeCentroidsLoop.
    m, n = shape(X)
    centroids = zeros((K, n))
    data = c_[X, idx]  # append the cluster index to X
    for k in range(1, K + 1):
        temp = data[data[:, n] == k]  # quickly extract the points that fall into cluster k
        count = shape(temp)[0]        # number of points in that cluster
        for j in range(0, n):
            centroids[k - 1, j] = sum(temp[:, j]) / count
    return centroids
def runkMeans(X, initial_centroids, max_iters, plot=False):
    # Alternate the assignment and update steps for max_iters iterations.
    K = shape(initial_centroids)[0]
    centroids = copy(initial_centroids)
    idx = None
    for iteration in range(0, max_iters):
        idx = findClosestCentroids(X, centroids)
        centroids = computeCentroids(X, idx, K)
        if plot:
            data = c_[X, idx]
            # Extract the data that falls into clusters 1-4 and plot each in its own color.
            data_1 = data[data[:, 2] == 1]
            pyplot.plot(data_1[:, 0], data_1[:, 1], 'ro', markersize=5)
            data_2 = data[data[:, 2] == 2]
            pyplot.plot(data_2[:, 0], data_2[:, 1], 'go', markersize=5)
            data_3 = data[data[:, 2] == 3]
            pyplot.plot(data_3[:, 0], data_3[:, 1], 'bo', markersize=5)
            data_4 = data[data[:, 2] == 4]
            pyplot.plot(data_4[:, 0], data_4[:, 1], 'yo', markersize=5)
            pyplot.plot(centroids[:, 0], centroids[:, 1], 'k*', markersize=14)
            pyplot.xlim([-0.1, 1.1])
            pyplot.ylim([-0.1, 1.1])
            pyplot.show(block=True)
    return centroids, idx
def kMeansInitCentroids(X, K):
    # Initialize the means by picking K random data points.
    return np.random.permutation(X)[:K]

def findcentroids(X, centroids):
    # One assignment step and one update step, printing the intermediate results.
    K = 4
    initial_centroids = centroids
    idx = findClosestCentroids(X, initial_centroids)
    print(idx[0:K])
    centroids = computeCentroids(X, idx, K)
    print(centroids)

def Kmeans(X, centroids):
    K = 4
    max_iters = 10
    runkMeans(X, centroids, max_iters, plot=True)
Here are the data points. I drew N random points from each of two boxes with edges of length 0.5 (N = 100 in the code above). We know the proper number of clusters for K-means is 2. However, when we set K > 2, K-means gives an ugly result, and the clustering depends strongly on the initial choice of the four means.
set_printoptions(precision=6, linewidth=200)
displaydata()
Now run K-means with K = 4. The proper K for these data points is 2, but what happens if K = 4? As the plots below show, the algorithm does not work well.
For this example $K = 2$ is obvious, but for most realistic problems we cannot always set the correct $K$. Our goal, then, is to make the machine find $K$ by itself.
centroids = array([[random.random(), random.random()],
                   [random.random(), random.random()],
                   [random.random(), random.random()],
                   [random.random(), random.random()]])
findcentroids(X,centroids)
[[ 4.]
[ 4.]
[ 2.]
[ 4.]]
[[ 0.199317 0.208257]
[ 0.79478 0.780224]
[ 0.53369 0.865798]
[ 0.54678 0.551297]]
Kmeans(X,centroids)
The K-means centroids converge to the wrong points in the figures above.
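As a quick sanity check (using the functions defined above), we can rerun the same routine with K = 2, initializing the means with two randomly chosen data points. With the correct K and a reasonable initialization, the two means usually settle near the centers of the two boxes, although a bad random start can still fail.

centroids2 = kMeansInitCentroids(X, 2)
runkMeans(X, centroids2, 10, plot=True)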
Appendix: how to convert IPython notebooks to Markdown
$ ipython nbconvert Clustering.ipynb --to markdown
or
$ jupyter nbconvert Clustering.ipynb --to markdown
This makes a simple Clustering.md file and a folder, Clustering_files, which contains the figures from the notebook. You might have to change the folder name (or the image paths) if you store the image files and notebooks in a specific folder rather than the _post folder.