Homework‎ > ‎

HW3 - Minhash based LSH implementation

This homework is mainly based on the following web site:

1. Overview

For this assignment, you'll implement a locality based hashing (LSH) approach on a particular dataset.
Due date : June 10th

2. Data

The data you'll use for this assignment comes from the so-called "Bag of Words" dataset at the UCI
Machine Learning Repository. This dataset is actually four datasets; we'll use the one called enron.
If you missed all of the news excitement a few years ago regarding the collapse of the energy company
Enron, feel free to dig through some news archives to read about the scandal. This remarkable
dataset contains nearly 40,000 internal emails from the company that were released for the public

All four datasets are located at the  public directory:

The descriptions about data are provided at the web page:

You'll only need to look at the readme.txt, which explains the layout, and the .file docword.enron.txt which contains the actual data. Note
that this file has been compressed via gzip, and so you'll need to use the command gunzip to uncompress it.
Happily, this .has alread converted all words to integers. The words themselves are unnecessary for this exercise,
but if you're curious to see a list of them, they're in vocab.enron.txt.

This dataset contains counts for the number of times each word appears, but you'll ignore the count.
Since we're focusing on Jaccard similarity, you only need keep track of whether or not a word appears.

3. LSH

You can find the python version of LSH at the following web site.

As you can see, after one interation of MapReduce, you construct the band from the dataset.
Try to understand the python version first and then  implement your own LSH program in JAVA.

4. Finding nearest neighbors (Top-k processing)

You need to Implement an LSH approach for finding the k nearest neighbords for a given document ID by im-
plementing the banding technique. For homework set K as 20.

Please refer to the following web site also.