Project 1 - Document Similarity

Due Friday 2/17 - 5:00pm

The goal of this project is to give you more experience with file I/O, collections, and string manipulation. For this project, you will write a program that measures how similar two text files are. You will count the frequency with which each word occurs in each document and calculate the cosine similarity of the two documents. See http://wordhoard.northwestern.edu/userman/analysis-comparingtexts.html for additional information.

You are required to implement the design outlined in the javadoc found here: Project 1 Documentation of Required Design

First, you will process each text file and, for each, build a list of every word that appears in either document, along with a count of the number of times that word appears in the file being processed. You will modify the SortedWordList class you built for Lab 3 for this purpose. This will give you a score vector for document 1 and a score vector for document 2, where the score is the frequency with which the word appears in the document. Note that 0 is a valid score: a word that appears in only one document has a score of 0 in the other.
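
The counting step described above can be sketched as follows. This is only an illustration, not the required design: it uses java.util.TreeMap as a stand-in for SortedWordList, and the class name WordCountSketch, the method name countWords, and the rule for splitting text into words (lowercase everything, treat any run of non-letters as a separator) are all assumptions. Follow the javadoc for the actual class and method names.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class, not part of the required design.
class WordCountSketch {

    // Returns each word in the text mapped to the number of times it occurs.
    // Assumption: words are case-insensitive and made up of the letters a-z only.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {               // split can yield a leading empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc1 = countWords("The cat and the dog ran.");
        Map<String, Integer> doc2 = countWords("The white cat and the brown cat played.");

        // The vocabulary is every word found in either document, in sorted order.
        TreeSet<String> vocabulary = new TreeSet<>(doc1.keySet());
        vocabulary.addAll(doc2.keySet());

        // A word missing from one document gets a score of 0 for that document.
        for (String word : vocabulary) {
            System.out.println(word + " " + doc1.getOrDefault(word, 0)
                    + " " + doc2.getOrDefault(word, 0));
        }
    }
}
```

Running main prints one line per word in alphabetical order (and, brown, cat, dog, played, ran, the, white), each followed by its score in document 1 and its score in document 2.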

Once you have processed both documents, you will need to calculate the cosine similarity. According to the WordHoard web page referenced above, the cosine similarity is "the vector dot product of the score vectors for the two works divided by the square root of the product of the vector dot products of each score vector with itself". Following is an example:

Document 1: The cat and the dog ran.

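Document 2: The white cat and the brown cat played.

With the words in alphabetical order (and, brown, cat, dog, played, ran, the, white), the score vector for document 1 is (1, 0, 1, 1, 0, 1, 2, 0) and the score vector for document 2 is (1, 1, 2, 0, 1, 0, 2, 1).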

The vector dot product of the vectors is: (1*1)+(0*1)+(1*2)+(1*0)+(0*1)+(1*0)+(2*2)+(0*1) = 7

The square root of the dot product of vector 1 and itself is: sqrt((1*1)+(0*0)+(1*1)+(1*1)+(0*0)+(1*1)+(2*2)+(0*0)) = 2.83

The square root of the dot product of vector 2 and itself is: sqrt((1*1)+(1*1)+(2*2)+(0*0)+(1*1)+(0*0)+(2*2)+(1*1)) = 3.46

Cosine Similarity = (7/(2.83*3.46)) = .71
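
The arithmetic in this example can be checked with a short program. The following is a minimal sketch, assuming the two score vectors are already available as int arrays of equal length with the words in the same order. The class name CosineSketch and the method names dot and cosine are hypothetical, and returning 0 when either vector is all zeros is a choice made here, not something the assignment specifies. The required SimilarityCalculator should follow the javadoc instead.

```java
import java.util.Locale;

// Hypothetical helper class, not part of the required design.
class CosineSketch {

    // Vector dot product: the sum of the products of matching entries.
    static long dot(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("score vectors must have the same length");
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (long) a[i] * b[i];
        }
        return sum;
    }

    // Dot product of the two vectors divided by the square root of the
    // product of each vector's dot product with itself.
    static double cosine(int[] a, int[] b) {
        double denominator = Math.sqrt((double) dot(a, a) * dot(b, b));
        if (denominator == 0) {
            return 0.0; // assumption: an all-zero vector is similar to nothing
        }
        return dot(a, b) / denominator;
    }

    public static void main(String[] args) {
        // Word order: and, brown, cat, dog, played, ran, the, white
        int[] doc1 = {1, 0, 1, 1, 0, 1, 2, 0};
        int[] doc2 = {1, 1, 2, 0, 1, 0, 2, 1};

        System.out.println(dot(doc1, doc2));                                     // 7
        System.out.println(String.format(Locale.ROOT, "%.2f", cosine(doc1, doc2))); // 0.71
    }
}
```

Note that sqrt(8 * 12) is the same value as sqrt(8) * sqrt(12), so taking one square root of the product gives the same result as the 2.83 * 3.46 used in the example, without rounding the two intermediate values first.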

The design outlined in the javadoc will help you to break down the problem and implement the project a small piece at a time. It is recommended you begin by completing the implementation of the SortedWordList and testing it. Once you are convinced it works correctly, begin the implementation of the DocumentProcessor class. Finally, implement the SimilarityCalculator.

You should test your program on a variety of inputs, including the two texts attached at the bottom of the page.

Submission Instructions

Please submit your work to your SVN directory: https://www.cs.usfca.edu/svn/<username>/cs112/project1
