Session 7: Distance Meas. - Module B: K-Means Program

1. Overview

What is clustering?
What is k-means?
Applications of clustering.
Play with k-means using a 3D dataset.

2. Explanation of K-Means

K-means is a clustering algorithm. Clustering is a way of grouping points together based on their distance from each other. Usually this distance measure relates to how similar they are: Birds of a feather flock together!
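To make "distance" concrete, here is a small sketch of the straight-line (Euclidean) distance between 3D points. It assumes Euclidean distance is the measure in use, which the text above does not state; the function name euclidean_distance and the three sample points are made up for illustration.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Three hypothetical 3D points: a and b are close together, c is far away.
a = (1.0, 2.0, 3.0)
b = (1.0, 2.0, 4.0)
c = (9.0, 8.0, 7.0)

print(euclidean_distance(a, b))  # small distance: likely the same group
print(euclidean_distance(a, c))  # large distance: likely a different group
```

Points with a small distance between them, like a and b, are the ones a clustering algorithm would tend to put in the same group.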

The k in k-means is the number of groups, or clusters, you ask the algorithm to find. The value of k is important: it influences the accuracy of your results. If k is too small, the algorithm will group things which don't belong together. If k is too large, it will start splitting up points which should be grouped together.
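The following is a minimal sketch of how a k-means loop can work, written with NumPy. It is not the code in kMeans.py, which may be organized differently; the function name k_means, its parameters, and the six sample points are assumptions made for illustration.

```python
import numpy as np

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: distance from every point to every center,
        # then give each point the label of its nearest center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return centers, labels

# Two tight, made-up groups of 3D points.
points = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 4.9], [5.0, 4.9, 5.0],
])

centers, labels = k_means(points, k=2)
print(labels)   # one label per point
print(centers)  # one 3D center per cluster
```

Calling k_means(points, k=1) on the same points forces both groups to share a single center, which is the "k too small" case described above.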

Clustering has many uses! Here are some which might interest you:

3. Running K-Means

Before we can play with clustering, we need a program and some data:

  1. Download the program kMeans.py to your programs directory.
  2. Download the dataset clean3cluster.csv to your programs directory.
  3. Download the dataset noisy8cluster.csv to your programs directory.

Open kMeans.py in gedit. Find the line where the dataset is loaded. Replace the filename with 'clean3cluster.csv'. Now find the line with numberOfClusters. Set the number to 1.

Open a terminal and start IPython. Now you can run your program with the command run kMeans.py. What do you notice about the figure? How many clusters do there appear to be? Did k-means successfully find the center of the cluster? Drag the figure to look at the points from a different angle!
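As a rough check on what to expect with a single cluster: when every point belongs to the same cluster, the k-means update step places the center at the average of all the points. The sketch below shows that calculation with NumPy on three made-up points; they are not taken from clean3cluster.csv.

```python
import numpy as np

# Hypothetical 3D points standing in for a dataset.
points = np.array([
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 5.0, 2.0],
])

# With one cluster, every point belongs to it, so the center is
# the average of each coordinate (x, y, and z separately).
center = points.mean(axis=0)
print(center)  # [2. 3. 2.]
```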

What happens if you increase numberOfClusters to 2? Where do you think the centers of the two clusters will be? Make the change and rerun the program. Was your guess about the centers of the clusters correct?

Now change numberOfClusters to 3. Where do you think the centers of the three clusters will be? Was your guess correct?

Don't forget to save the figure for your web page!

4. Running K-Means on a Noisy Dataset

Find the line where the dataset is loaded. Replace the filename with 'noisy8cluster.csv'. Now find the line with numberOfClusters and set the value to 8. Run the program again in IPython.

How does the noise (random data points) influence k-means' ability to cluster the data? Compare your result with the results of those around you.
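One way to think about this question is to look at what a single stray point does to a cluster's center. The sketch below uses made-up points, not noisy8cluster.csv, and assumes the stray point ends up assigned to the same cluster as the tight group.

```python
import numpy as np

# A tight, made-up cluster of 3D points around (1, 1, 1).
clean = np.array([
    [1.0, 1.0, 1.0],
    [1.2, 0.8, 1.0],
    [0.8, 1.2, 1.0],
    [1.0, 1.0, 1.0],
])
# One stray point, standing in for noise.
noise = np.array([[9.0, 9.0, 9.0]])

clean_center = clean.mean(axis=0)
noisy_center = np.vstack([clean, noise]).mean(axis=0)

print(clean_center)  # center of the tight group alone
print(noisy_center)  # center once the stray point is included
print(np.linalg.norm(noisy_center - clean_center))  # how far the center moved
```

Because a center is an average, a point far from the group pulls the center toward itself, even though most of the points have not moved.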

Don't forget to save the figure and answer these questions on your web page!