In this project you will implement a higher-order k-means clustering algorithm. The algorithm takes as input a list of two-dimensional points and partitions them into k clusters. Each cluster has a central point, called a center. A center does not have to be one of the input points. A successful clustering must obey two properties:
Each center is the average (mean) of the points in its cluster.
Each point is in the cluster of the nearest center.
The basic algorithm is quite simple:
Start with n points.
Take k points as starting centers.
Assign each point to its nearest center, forming the initial clusters.
Repeatedly improve the clusters by:
Update each center to be the mean of its cluster.
Assign each point to the nearest updated center.
The algorithm terminates when the clusters stop changing. This video explains the clustering in slightly more detail. We will consider several different meanings of the word "nearest", using the following metrics.
Euclidean distance.
Manhattan distance: allows only horizontal and vertical travel (not diagonal), simulating travel on a street grid.
Chebyshev distance: allows arbitrary diagonal movement, and so counts only the longer of the horizontal and vertical distances.
Traffic distance: a version of Manhattan distance where moving east-west takes twice as long as north-south.
Township distance: a distance metric that prioritizes movement within a township.
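As a point of reference, here is a minimal sketch of the first four metrics in Haskell. It assumes the Point alias defined later in this handout; the function names here are illustrative, and the signatures you should actually use are given in Cluster.hs. Township distance is specific to this project, so it is omitted.

type Point = (Double, Double, Int)   -- as defined later in the handout

eucDist, manhatDist, chebyDist, trafficDist :: Point -> Point -> Double
eucDist     (x1, y1, _) (x2, y2, _) = sqrt ((x1 - x2)^2 + (y1 - y2)^2)
manhatDist  (x1, y1, _) (x2, y2, _) = abs (x1 - x2) + abs (y1 - y2)
chebyDist   (x1, y1, _) (x2, y2, _) = max (abs (x1 - x2)) (abs (y1 - y2))
-- east-west (x) movement counts double relative to north-south (y)
trafficDist (x1, y1, _) (x2, y2, _) = 2 * abs (x1 - x2) + abs (y1 - y2)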
Create your GitHub Classroom repository.
Monday 10/4 - Project Released.
Monday 10/11 - Milestone deadline.
Monday 10/18 - Project submission deadline.
Wednesday 10/20 - Late submission deadline.
Be sure to push your changes to the repository for each deadline. Late submissions are automatically accepted, but will be assessed a 5% (1/3rd letter grade) penalty.
Cluster.hs - This is the file you will edit. All type aliases are included. The functions are listed in roughly the suggested implementation order. Problem details are given in the file.
Main.hs - This file contains the I/O actions that turn your code into an executable. You do not need to understand anything in this file.
Coloring.hs - Contains ANSI color codes. Very boring; you do not need to understand it.
ClusterData.hs - Contains data and code to read in data from files.
README.md - This file contains a brief project description, as well as spots for your name and Trinity ID (e.g. sfogarty).
Be certain to add your name and ID.
Makefile - The makefile supports make, make setup, and make clean. The executable is output to cluster.
Be certain make runs before you submit!
The executable cluster can be built using make. If the executable does not build, use make setup to install any missing Haskell packages. The executable and compiled source files can be removed using make clean. By default cluster works on 10 points, and can be run with:
cluster [options] [k] [file]
Where k is the number of clusters, and file contains the points. Valid options are:
-h or --help Prints a help message and exits.
--test Runs unit tests on your code. All milestone tests must pass before the remaining tests will run.
-n n Uses only the first n points.
-q or --quiet Print only error messages when testing, or minimal output when solving.
-d Display clusters visually, instead of using the standard text-based output.
-m str or --metric=str Use an alternate distance metric: manhat, cheby, traffic, or town. If the township metric is used, point labels are aliases for townships, instead of indices.
-v Display all points in each cluster. Incompatible with -d, and possibly very verbose.
-i or --incremental Run the algorithm one "improve" step at a time, displaying the results in between, and relying on the user to terminate.
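For example, assuming a points file named points.txt (the file name here is hypothetical), the following invocation clusters the first 50 points into k = 3 clusters using Manhattan distance:

cluster -n 50 -m manhat 3 points.txt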
For the milestone, you must implement the basic mechanics of clustering. You are given the following type aliases.
type Point = (Double, Double, Int)
type Center = Point
type Cluster = (Center, [Point])
Points have an x coordinate, a y coordinate, and an integer label. It is not required that each point have a unique label. Center is just a type alias for Point. After centers move, they take the label of the nearest point. A cluster is a tuple of a center and the list of points in that cluster. You must implement the following functions:
Take the initial centers from a list of points (the getKElems function). For the milestone, just take the first k elements.
Implement each of the five distance metrics.
Implement the higher-order minimize and bucket functions.
Implement the average function.
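As a rough illustration, average and the two higher-order functions might look like the sketch below. The signatures are assumptions; the authoritative ones are in Cluster.hs.

-- average of a list of numbers (assumed signature; assumes the list
-- is non-empty)
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)

-- minimize returns the element on which f is smallest (assumes a
-- non-empty list)
minimize :: (a -> Double) -> [a] -> a
minimize f = foldr1 (\x y -> if f x <= f y then x else y)

-- bucket pairs each key with every element that maps to it
bucket :: Eq b => (a -> b) -> [b] -> [a] -> [(b, [a])]
bucket f keys xs = [(key, [x | x <- xs, f x == key]) | key <- keys]

For instance, minimize (eucDist p) centers picks the center nearest to p, and bucket is a natural fit for grouping points by their nearest center.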
For the full project you must implement the k-means algorithm. All of the k-means functions take a distance function as a parameter.
assignPoint and assignPoints assign points to the nearest center.
findMean and moveCenters move the center locations to be the mean of the points in a cluster.
improveClusters combines the above two steps into a single function.
kMeans implements the above algorithm. You will need to use an iteration count to ensure the algorithm terminates.
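Put together, the overall loop might look like the sketch below. The signatures, the iteration cap of 100, and the extra Int argument to improveClusters (anticipating the empty-cluster repair described later) are all assumptions; follow the types given in Cluster.hs.

-- A sketch only, assuming the helpers named above.
kMeans :: (Point -> Center -> Double) -> Int -> [Point] -> [Cluster]
kMeans d k points = go limit (assignPoints d initial points)
  where
    initial = getKElems k points       -- starting centers
    limit   = 100 :: Int               -- assumed iteration cap
    go 0 clusters = clusters           -- cap reached: stop anyway
    go n clusters
      | clusters' == clusters = clusters   -- clusters stable: done
      | otherwise             = go (n - 1) clusters'
      where clusters' = improveClusters d k clusters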
The initial centers, by default, are chosen as the first k elements. If the data is heavily sorted, this could cause problems. For full credit, select k elements spaced evenly throughout the list. There may be multiple valid even spacings. For instance, the 3 evenly spaced elements from [1..6] are 1,3,6. This problem has a minimal impact on this application, but is algorithmically interesting.
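One plausible scheme, sketched below, rounds k proportionally spaced indices into the list; on [1..6] with k = 3 it selects 1, 3, and 6. This is only one valid spacing, not the required one.

-- An assumed even-spacing scheme; assumes k <= length xs.
getKElems :: Int -> [a] -> [a]
getKElems k xs
  | k <= 1    = take k xs
  | otherwise = [xs !! idx i | i <- [0 .. k - 1]]
  where
    n     = length xs
    idx i = round (fromIntegral (i * (n - 1)) / fromIntegral (k - 1))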
It is possible for a cluster to become empty, when every point is closer to some other center. In this case the center is undefined and the cluster vanishes. This violates our notion of having k means. To fix this, when there are fewer than k centers, split one center into two. This is done between updating the centers and assigning points to their nearest centers. You can choose to split only one cluster per call to improveClusters, or to repeat this process and ensure there are k clusters. To split a cluster:
Find the biggest cluster.
Choose an arbitrary point in the cluster that is different from the center. It is sufficient to insist that the point has a non-zero distance from the center.
Make that point the center of a new cluster.
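As a rough sketch, one split step might look like the following, reusing the type aliases above. The function name and signature are assumptions; the new cluster starts empty because the next assignment step repopulates every cluster.

import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical helper: splits the biggest cluster by promoting one of
-- its points (any point at non-zero distance from the center) to be a
-- new center. The new cluster starts empty; the following assignment
-- step redistributes the points.
splitCluster :: (Point -> Center -> Double) -> [Cluster] -> [Cluster]
splitCluster d clusters =
  let (c, pts) = maximumBy (comparing (length . snd)) clusters
  in case [p | p <- pts, d p c /= 0] of
       []      -> clusters            -- no point differs from the center
       (p : _) -> (p, []) : clusters  -- promote p to a new center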
Grading (100 points total; subject to change):

Milestone                   20
Bucket and Minimize         10
Core Project                44
getKElems Implementation    10
Empty Clusters              10
Appropriate use of HOFs      6