This is a pair-programmed project. You are encouraged (but not required) to work with and submit alongside another person. You may collaborate on code with other teams, and you may use Copilot, ChatGPT, or other LLMs, so long as you cite them in the source code.
If you are submitting alone, create a team named p3-<userID> (e.g. p3-sfogarty).
If you are submitting in pairs, one person should create a team with their user ID, and the other should join it.
If you want to change teams (e.g. to work in a pair with someone), email Dr. Fogarty.
In this project you will implement a higher-order k-means clustering algorithm. The algorithm takes as input a list of two-dimensional points and partitions them into k clusters. Each cluster has a central point, called a center. A center does not have to be one of the input points. A successful clustering must obey two properties:
Each center is the average (mean) of the points in its cluster.
Each point is in the cluster of the nearest center.
The basic algorithm is quite simple:
Start with n points.
Take k points as starting centers.
Assign each point to its nearest center, forming the initial clusters.
Repeatedly improve the clusters by:
Update each center to be the mean of its cluster.
Assign each point to the nearest updated center.
The algorithm terminates when the clusters stop changing. This video explains the clustering in slightly more detail. We will consider several different meanings of the word "nearest", using the following metrics.
Euclidean distance.
Manhattan distance: allows only horizontal and vertical travel (no diagonals), simulating travel on a street grid.
Chebyshev distance: allows arbitrary diagonal movement, and so counts only the longer of the horizontal and vertical distances.
Traffic distance: a version of Manhattan distance where moving east-west takes twice as long as north-south.
Township distance: a distance metric that prioritizes movement within a township.
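As a concrete reference, the first four metrics might be written as follows. This is a minimal sketch: the Point type is the alias given later in this handout, and the function names are illustrative rather than the ones declared in Cluster.hs. The township metric depends on point labels and is omitted here.

type Point = (Double, Double, Int)

eucDist, manhatDist, chebyDist, trafficDist :: Point -> Point -> Double

-- Straight-line distance.
eucDist (x1, y1, _) (x2, y2, _) = sqrt ((x1 - x2)^2 + (y1 - y2)^2)

-- Horizontal plus vertical travel, as on a street grid.
manhatDist (x1, y1, _) (x2, y2, _) = abs (x1 - x2) + abs (y1 - y2)

-- Diagonal movement is free, so only the longer leg counts.
chebyDist (x1, y1, _) (x2, y2, _) = max (abs (x1 - x2)) (abs (y1 - y2))

-- East-west movement takes twice as long as north-south.
trafficDist (x1, y1, _) (x2, y2, _) = 2 * abs (x1 - x2) + abs (y1 - y2)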
Friday 10/4- Project Released.
Wednesday 10/9 - Milestone deadline.
Wednesday 10/16 - Project submission deadline.
Friday 10/18 - Late submission deadline.
Be sure to push your changes to the repository for each deadline. Late submissions are automatically accepted, but will be assessed a 5% (1/3rd letter grade) penalty.
Cluster.hs - This is the file you will edit. All type aliases are included. The functions are listed in roughly the suggested implementation order. Problem details are given in the file.
Main.hs - This file contains the I/O actions that turn your code into an executable. You do not need to understand anything in this file.
Coloring.hs - Contains ANSI color codes. Very boring; you do not need to understand it.
ClusterData.hs - Contains data and code to read in data from files.
README.md - This file contains a brief project description, as well as spots for your name and Trinity ID (e.g. sfogarty).
Be certain to add your name and ID.
Makefile - The makefile supports make, make setup, and make clean. The executable is output to cluster.
Be certain make runs before you submit!
If you are working on a non-departmental machine, you may have to run make setup once to install missing packages.
The executable cluster can be built using make, and removed using make clean. By default cluster works on 10 points, and can be run with:
cluster [options] [k] [file]
Where k is the number of clusters and file contains the points. If file is not provided, random points are used (10 by default). Valid options are:
-h or --help Prints a help message and exits.
--test Runs unit tests on your code. You must complete all milestone tests before the remaining tests will run.
-n n Uses only the first n points.
-q or --quiet Prints only error messages on tests, or minimal output when solving.
-d Displays clusters visually, instead of using the standard text-based output.
-m str or --metric=str Uses an alternate distance metric: manhat, cheby, traffic, or town. If the town metric is used, point labels are taken as aliases for townships.
-v Displays all points in each cluster. Incompatible with -d, and possibly very verbose.
-i or --incremental Runs the algorithm one "improve" step at a time, displaying the results in between, and relying on the user to terminate.
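For example, the following invocation (with a hypothetical data file points.txt) clusters the first 20 points of the file into 4 clusters under the Manhattan metric:

./cluster -n 20 --metric=manhat 4 points.txt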
For the milestone, you must implement the basic mechanics of clustering. You are given the following type aliases.
type Point = (Double, Double, Int)
type Center = Point
type Cluster = (Center, [Point])
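For instance, a concrete Cluster value might look like the following (a purely illustrative example; the binding name is hypothetical):

-- The center is the mean of the three points, and carries the label
-- of the nearest point (2), as described below.
exampleCluster :: Cluster
exampleCluster = ((1.0, 1.0, 2), [(0.0, 0.0, 1), (1.0, 0.0, 2), (2.0, 3.0, 3)])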
Points have an x coordinate, a y coordinate, and an integer label. It is not required that each point have a unique label. Center is just a type alias for Point. After Centers move, they take the label of the nearest Point. A Cluster is the tuple of a Center and the list of Points in that cluster. You must implement the following functions:
Taking the initial centers from a list of points. For the milestone, just take the first k elements.
Each of the five distance metrics.
The higher-order minimize and bucket functions; see the sketch after this list.
The average function.
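As a rough illustration, these three functions might look like the sketch below. The signatures are assumptions based on how the functions are used later; the authoritative types are in Cluster.hs.

-- Returns the element of a non-empty list on which f is smallest.
minimize :: (a -> Double) -> [a] -> a
minimize f = foldr1 (\x y -> if f x <= f y then x else y)

-- Pairs each key with every value that maps to it under f, e.g.
--   bucket snd [1,2] [('a',1),('b',2),('c',1)]
--     == [(1,[('a',1),('c',1)]), (2,[('b',2)])]
bucket :: Eq b => (a -> b) -> [b] -> [a] -> [(b, [a])]
bucket f keys vals = [ (k, filter (\v -> f v == k) vals) | k <- keys ]

-- The mean of a non-empty list.
average :: [Double] -> Double
average xs = sum xs / fromIntegral (length xs)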
For the full project you must implement the k-means algorithm. All of the k-means functions take a distance function as a parameter.
assignPoint and assignPoints assign points to the nearest center.
findMean and moveCenters move the center locations to be the mean of the points in a cluster.
improveClusters combines the above two steps into a single function.
kMeans implements the full algorithm. You will need to use an iteration count to ensure the algorithm terminates. A sketch of how these functions might fit together appears below.
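This is a sketch under assumed signatures (the authoritative ones are in Cluster.hs); it reuses the minimize, bucket, and average sketches above and the getKElems selection function described below, omits empty-cluster splitting (also described below), and caps the loop at an arbitrary 100 iterations.

-- Assign a single point to its nearest center.
assignPoint :: (Point -> Point -> Double) -> [Center] -> Point -> Center
assignPoint d centers p = minimize (\c -> d c p) centers

-- Group every point under its nearest center.
assignPoints :: (Point -> Point -> Double) -> [Center] -> [Point] -> [Cluster]
assignPoints d centers = bucket (assignPoint d centers) centers

-- Move a cluster's center to the mean of its points, taking the label
-- of the nearest point. Assumes the cluster is non-empty; the
-- placeholder label 0 is fine for the label-blind metrics.
findMean :: (Point -> Point -> Double) -> Cluster -> Center
findMean d (_, pts) = (mx, my, lbl)
  where
    mx = average [x | (x, _, _) <- pts]
    my = average [y | (_, y, _) <- pts]
    (_, _, lbl) = minimize (d (mx, my, 0)) pts

-- Update every center.
moveCenters :: (Point -> Point -> Double) -> [Cluster] -> [Center]
moveCenters d = map (findMean d)

-- One improvement step: move the centers, then reassign all points.
-- (Empty-cluster splitting, described below, is omitted here.)
improveClusters :: (Point -> Point -> Double) -> Int -> [Cluster] -> [Cluster]
improveClusters d _k clusters =
  assignPoints d (moveCenters d clusters) (concatMap snd clusters)

-- Iterate until the clusters stop changing, with an iteration cap in
-- case they never stabilize.
kMeans :: (Point -> Point -> Double) -> Int -> [Point] -> [Cluster]
kMeans d k pts = go (100 :: Int) (assignPoints d (getKElems k pts) pts)
  where
    go 0 clusters = clusters
    go n clusters
      | clusters' == clusters = clusters
      | otherwise             = go (n - 1) clusters'
      where clusters' = improveClusters d k clusters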
The initial centers, by default, are chosen as the first k elements. If the data is heavily sorted, this could cause problems. For full credit, select k elements spaced evenly throughout the list. There may be multiple valid even spacings. For instance, the 3 evenly spaced elements from [1..6] are 1,3,6. This problem has a minimal impact on this application, but is algorithmically interesting.
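One plausible approach, sketched below, is to round k evenly spaced fractional indices into the list; the index arithmetic here is an assumption rather than the required answer.

-- Pick k elements at evenly spaced (rounded) indices.
-- A sketch: assumes 0 <= k <= length xs.
getKElems :: Int -> [a] -> [a]
getKElems k xs
  | k <= 1    = take k xs
  | otherwise = [ xs !! round (fromIntegral i * step) | i <- [0 .. k - 1] ]
  where step = fromIntegral (length xs - 1) / fromIntegral (k - 1) :: Double

For example, getKElems 3 [1..6] evaluates to [1,3,6], matching the spacing above.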
It is possible for a cluster to become empty, when every point is closer to some other center. In this case the center is undefined and the cluster vanishes. This violates our notion of having k means. To fix this, when there are fewer than k centers, split one center into two. This is done in between updating the centers and assigning points to their nearest centers. You can choose to split only one cluster per call to improveClusters, or to repeat this process and ensure there are k clusters. To split a center:
Find the biggest cluster.
Choose an arbitrary point in the cluster that is different from the center. It is sufficient to insist that it has a non-zero distance from the center.
Make that point the center of a new cluster.
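Putting these steps together, one possible sketch (the helper name splitLargest and its signature are hypothetical; it reuses the minimize sketch above) produces the new list of centers:

-- Split the biggest cluster by promoting one of its off-center points
-- to a brand-new center. Assumes at least one point in the biggest
-- cluster sits at a non-zero distance from its center.
splitLargest :: (Point -> Point -> Double) -> [Cluster] -> [Center]
splitLargest d clusters = newCenter : map fst clusters
  where
    -- minimizing the negated size selects the biggest cluster
    (c, pts)  = minimize (negate . fromIntegral . length . snd) clusters
    newCenter = head (filter (\p -> d c p /= 0) pts)

Handing these centers back to assignPoints then restores k clusters.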
Grading breakdown (100 points total; subject to change):
Milestone - 19
Bucket and Minimize - 10
Core Project - 45
Implementation Details - 5
getKElems Implementation - 7
Empty Clusters - 8
Appropriate use of HOFs - 6