As I mentioned in a previous post, one of the purposes of this blog is to supplement the GitHub repository of my data science study. I will gradually post and present all of the iPython and Mathematica notebooks.

I felt there was no good Python tutorial for spectral clustering (at least from my search). Anyone who is serious about machine learning can use scikit-learn, and the theory of spectral clustering is not hard to find either. However, it was hard to find a Python tutorial that walks through the spectral clustering algorithm itself.

The theory of clustering and soft K-means can be found in the book Information Theory, Inference, and Learning Algorithms by David MacKay. In particular, I have read Chapters 20-22 and used the algorithms there to write Python code for various clustering methods.

Cluster

Today's topic is K-means clustering, the well-known algorithm you will usually meet first when you start to learn about clustering.

What I want to show you is when it does not work well. In later posts I will also show how to avoid these troubles.

K-means

The K-means algorithm is an algorithm for putting N data points in an M-dimensional space into K clusters. Each cluster is parameterized by a vector ${\textbf m}^{(k)}$ called its mean.

First of all, set the K means ${\textbf m}^{(k)}$ to random values. In the assignment step, each data point $n$ is assigned to the nearest mean.

$$ \hat{k}^{(n)} = \mathrm{argmin}_k \, d({\textbf m}^{(k)}, {\textbf x}^{(n)}) $$

This splits the N data points into K groups, each point going to the group whose mean is closest.
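
In the update step, each mean is then moved to the mean of the data points currently assigned to it (this is exactly what the computeCentroids function below computes), and the assignment and update steps are repeated:

$$ {\textbf m}^{(k)} = \frac{\sum_{n:\, \hat{k}^{(n)} = k} {\textbf x}^{(n)}}{\sum_{n:\, \hat{k}^{(n)} = k} 1} $$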

#!/usr/local/Cellar/python/2.7.6/bin/python
# -*- coding: utf-8 -*-


from numpy import *
from matplotlib import pyplot

import random
import numpy as np
import matplotlib.pyplot as plt

N=100

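# Two well-separated clusters of N random points each: one in the [0.5, 1] x [0.5, 1] box, one in the [0, 0.5] x [0, 0.5] box.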
data1=[[0.5+0.5*random.random(),0.5+0.5*random.random()] for i in range(N)]
data2=[[0.5*random.random(),0.5*random.random()] for i in range(N)]

data1=np.array(data1)
data2=np.array(data2)

data = np.concatenate((data1,data2))

X=c_[data[:,0],data[:,1]]




def displaydata():
	pyplot.scatter(X[:,0],X[:,1])
	pyplot.show()




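# Assignment step: give each data point the index of the nearest centroid (squared Euclidean distance); labels run from 1 to K.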
def findClosestCentroids( X, centroids ):
	K = shape( centroids )[0]
	m   = shape( X )[0]
	idx = zeros( (m, 1) )

	for i in range(0, m):
		lowest = 999
		lowest_index = 0

		for k in range( 0, K ):
			cost = X[i] - centroids[k]
			cost = cost.T.dot( cost )
			if cost < lowest:
				lowest_index = k
				lowest  = cost

		idx[i] = lowest_index
	return idx + 1 # add 1, since python's index starts at 0

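# Update step (explicit-loop version): move each centroid to the mean of the points assigned to it.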
def computeCentroidsLoop(X, idx, K):
	m, n = shape( X )
	centroids = zeros((K, n))

	for k in range(1, K+1):
		counter = 0
		cum_sum = 0
		for i in range( 0, m ):
			if idx[i] == k:
				cum_sum += X[i]
				counter += 1
		centroids[k-1] = cum_sum / counter
	return centroids

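# Update step, same result as computeCentroidsLoop but using array indexing to pick out each cluster's points.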
def computeCentroids( X, idx, K ):
	m, n = shape( X )
	centroids = zeros((K, n))

	data = c_[X, idx] # append the cluster index to the X

	for k in range( 1, K+1 ):
		temp = data[data[:, n] == k] # quickly extract X that falls into the cluster
		count = shape( temp )[0]		# count number of entries for that cluster

		for j in range( 0, n ):
			centroids[k-1, j] = sum(temp[:, j]) / count

	return centroids

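# Alternate the assignment and update steps for max_iters iterations; optionally plot the clusters (up to 4) and centroids after every iteration.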
def runkMeans( X, initial_centroids, max_iters, plot=False ):
	K = shape( initial_centroids )[0]
	centroids = copy( initial_centroids )
	idx = None

	for iteration in range( 0, max_iters ):
		idx = findClosestCentroids( X, centroids )
		centroids = computeCentroids( X, idx, K )

		if plot is True:
			data = c_[X, idx]

			# Extract data that falls in to cluster 1, 2, and 3 respectively, and plot them out
			data_1 = data[data[:, 2] == 1]
			pyplot.plot( data_1[:, 0], data_1[:, 1], 'ro', markersize=5 )

			data_2 = data[data[:, 2] == 2]
			pyplot.plot( data_2[:, 0], data_2[:, 1], 'go', markersize=5 )

			data_3 = data[data[:, 2] == 3]
			pyplot.plot( data_3[:, 0], data_3[:, 1], 'bo', markersize=5 )

			data_4 = data[data[:, 2] == 4]
			pyplot.plot( data_4[:, 0], data_4[:, 1], 'yo', markersize=5 )

			pyplot.plot( centroids[:, 0], centroids[:, 1], 'k*', markersize=14 )

			pyplot.xlim([-0.1,1.1])
			pyplot.ylim([-0.1,1.1])

			pyplot.show( block=True )

	return centroids, idx

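# Initialize the centroids as K data points chosen at random (shuffle the rows and take the first K).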
def kMeansInitCentroids( X, K ):
	return np.random.permutation( X )[:K]

def findcentroids(X, centroids):

	K = 4

	initial_centroids = centroids

	idx = findClosestCentroids( X, initial_centroids )
	print idx[0:K]


	centroids = computeCentroids( X, idx, K )
	print centroids


def Kmeans(X,centroids):
	K = 4

	max_iters = 10

	runkMeans( X, centroids, max_iters, plot=True )

Here are the data points: 100 random points drawn from each of two boxes with edge length 0.5 (N = 100 in the code above), 200 points in total. We know the proper K for K-means clustering here is 2. However, when we set K > 2, K-means gives an ugly result, and the clustering strongly depends on the initial choice of the 4 starting centroids.

set_printoptions(precision=6, linewidth=200)
displaydata()

[Figure: scatter plot of the 200 generated data points]

Now choose K = 4 for K-means. The proper K for these data points should be 2, so what happens when K = 4? As the plots below show, the algorithm does not work well.

For this example, $K = 2$ is obvious, but for most realistic problems we cannot always set the correct $K$ in advance. Thus, our goal is to make the machine find $K$ by itself.

centroids = array([[random.random(), random.random()], [random.random(), random.random()], [random.random(), random.random()],[random.random(), random.random()]])
findcentroids(X,centroids)
[[ 4.]
 [ 4.]
 [ 2.]
 [ 4.]]
[[ 0.199317  0.208257]
 [ 0.79478   0.780224]
 [ 0.53369   0.865798]
 [ 0.54678   0.551297]]
Kmeans(X,centroids)

[Figures: the K = 4 clustering after each of the 10 iterations; points are colored by their assigned cluster and the centroids are marked with black stars]

The K-means centroids converge to the wrong points in the figures above.
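
To make this more concrete, one can compare the total distortion (the sum of squared distances from each point to its assigned centroid) over several random restarts; if the printed values differ from run to run, the algorithm has converged to different local minima. The following is only a minimal sketch built on the functions above (kMeansInitCentroids, runkMeans, findClosestCentroids); the number of restarts is arbitrary, and it inherits the assumption made by computeCentroids that no cluster ends up empty.

def distortion( X, centroids, idx ):
    # Sum of squared distances from each point to its assigned centroid.
    # idx stores labels 1..K (see findClosestCentroids), hence the -1.
    total = 0.0
    for i in range( 0, len(X) ):
        diff = X[i] - centroids[ int(idx[i]) - 1 ]
        total += diff.dot( diff )
    return total

# Compare the converged distortion over a few random restarts with K = 4.
for trial in range( 0, 5 ):
    initial_centroids = kMeansInitCentroids( X, 4 )
    final_centroids, idx = runkMeans( X, initial_centroids, 10, plot=False )
    print( distortion( X, final_centroids, idx ) )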

Appendix: how to convert iPython notebooks to markdown

$ ipython nbconvert Clustering.ipynb --to markdown

or

$ jupyter nbconvert Clustering.ipynb --to markdown

This makes a simple Clustering.md file and a folder, Clustering_files, which contains the figures from the notebook. You might have to change the folder name, and the image paths inside the markdown, if you store the image files and notebooks somewhere other than the _posts folder.
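
A minimal Python sketch of that path rewrite, assuming (only as an example) that the figures will be served from an /images/Clustering_files/ folder:

# Rewrite the image paths in the generated markdown file.
# "/images/Clustering_files/" is an example destination; adjust it to your own site layout.
with open( "Clustering.md" ) as f:
    text = f.read()

text = text.replace( "Clustering_files/", "/images/Clustering_files/" )

with open( "Clustering.md", "w" ) as f:
    f.write( text )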