# Session 7: Distance Meas. - Module B: K-Means Program

## 1. Overview

What is clustering?

What is k-means?

Applications of clustering.

Play with k-means using a 3D dataset.

## 2. Explanation of K-Means

*K-means* is a *clustering* algorithm. Clustering is a way of grouping
points together based on their distance from each other. Usually this distance
measure relates to how similar they are: Birds of a feather flock together!
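To make "distance" concrete, here is a small sketch that assumes the ordinary straight-line (Euclidean) distance, a common choice for k-means. The three points and the `euclidean` helper are made up for illustration and are not taken from kMeans.py.

```python
import math

# Three made-up 3D points (x, y, z); the values are illustrative only.
a = (1.0, 2.0, 3.0)
b = (1.5, 2.5, 3.0)
c = (9.0, 8.0, 7.0)

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean(a, b))  # small: a and b are close, so they are "similar"
print(euclidean(a, c))  # large: a and c are far apart
```

Points separated by a small distance are the ones a clustering algorithm will tend to place in the same group.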

**K** represents the number of groups, or clusters, that the algorithm is asked to
find. The value of k is important: it influences the accuracy of your results. If k is
too small, the algorithm will group together things that don't belong together. And if
k is too large, it will start splitting up points that should be grouped together.
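If you are curious how an algorithm can find k groups, the following is a minimal sketch of the standard k-means procedure: repeatedly assign each point to its nearest center, then move each center to the average of its points. It is not the code in kMeans.py; the function names, the fixed number of rounds, and the six sample points are all assumptions made for this illustration.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

def k_means(points, k, iterations=20, seed=0):
    """Run a fixed number of assign/update rounds; return (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: assign each point to the index of its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Step 2: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = mean_point(members)
    return centers, labels

# Made-up data: two obvious groups of three points each.
points = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0),
          (10.0, 10.0, 10.0), (10.0, 11.0, 10.0), (11.0, 10.0, 10.0)]

centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Try calling `k_means(points, k=1)` or `k_means(points, k=3)` on the same data to see the "too small" and "too large" cases described above.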

Clustering has many uses! Here are some which might interest you:

- Amazon shopping recommendations
- Netflix movie recommendations
- Facebook friend suggestions
- Grocery store layout

## 3. Running K-Means

Before we can play with clustering, we need a program and some data:

- Download the program **kMeans.py** to your programs directory.
- Download the dataset **clean3cluster.csv** to your programs directory.
- Download the dataset **noisy8cluster.csv** to your programs directory.

Open kMeans.py in gedit. Find the line where the dataset is loaded. Replace the filename
with 'clean3cluster.csv'. Now find the line with **numberOfClusters**. Set the number to **1**.

Open a terminal and start IPython. Now you can run your program with the command
`run kMeans.py`. What do you notice about the figure? How many clusters do
there appear to be? Did k-means successfully find the center of the cluster? Drag the
figure to look at the points from a different angle!
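One way to check the center by hand: when there is only one cluster, the center that standard k-means settles on is the average of all the points. The sketch below uses four made-up points as a stand-in for a dataset; it does not read clean3cluster.csv.

```python
# Hypothetical stand-in for a few rows of a 3D dataset (x, y, z).
points = [(1.0, 2.0, 3.0), (2.0, 4.0, 1.0), (3.0, 0.0, 2.0), (6.0, 2.0, 6.0)]

# Average each coordinate separately to get the single cluster center.
n = len(points)
center = tuple(sum(axis) / n for axis in zip(*points))

print(center)  # (3.0, 2.0, 3.0)
```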

What happens if you increase **numberOfClusters** to **2**? Where do you think
the centers of the two clusters will be? Make the change and rerun the program.
Was your guess about the centers of the clusters correct?

Now change **numberOfClusters** to **3**. Where do you think the centers of the
three clusters will be? Was your guess correct?

Don't forget to save the figure for your web page!

## 4. Running K-Means on a Noisy Dataset

Find the line where the dataset is loaded. Replace the filename with 'noisy8cluster.csv'.
Now find the line with **numberOfClusters** and set the value to **8**. Run the program
again in IPython.

How does the *noise* (random data points) influence k-means' ability to cluster the data?
Compare your result with the results of those around you.
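As a hint for thinking about noise, here is a small sketch of how a single stray point can pull a cluster's center away from where most of its points sit. It assumes the center is the plain average of the cluster's points, as in standard k-means; the `centroid` helper and all of the numbers are made up for illustration.

```python
def centroid(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))

# A tight, made-up cluster and one made-up noise point far away from it.
cluster = [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 1.0, 1.0), (2.0, 2.0, 1.0)]
noise_point = (12.0, 12.0, 11.0)

print(centroid(cluster))                  # (1.5, 1.5, 1.0)
print(centroid(cluster + [noise_point]))  # (3.6, 3.6, 3.0)
```

With the noise point included, the center lands outside the tight group entirely. Look for the same effect in your own figure.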

Don't forget to save the figure for your web page and answer
these questions there!