For this project, you will write a Java program that recursively processes all text files in a directory and builds an inverted index to store the mapping from words to the documents (and position within those documents) where those words were found. For example, suppose we have the following mapping stored in our inverted index:
elephant → { ( mammals.txt, [ 3, 8 ] ),
( endangered.txt, [ 2 ] ) }
This indicates that the word elephant is found in two files, mammals.txt and endangered.txt. In the file mammals.txt, it is found in two locations (the 3rd and 8th word). In file endangered.txt, it is found in one place as the 2nd word in the file.
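One way to picture the nesting implied by this example is a sorted map from each word to a sorted map from each file to a sorted set of positions. The sketch below is only an illustration under that assumption, not a required design; the class name IndexSketch and its methods add and positions are hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of a nested word -> file -> positions mapping. */
class IndexSketch {
    // word -> (file -> sorted positions); the tree-based collections keep everything sorted
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    /** Records that the word occurs at the given 1-based position in the file. */
    void add(String word, String file, int position) {
        index.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(file, f -> new TreeSet<>())
             .add(position);
    }

    /** Returns a read-only copy of the positions, or an empty set if there are none. */
    Set<Integer> positions(String word, String file) {
        TreeMap<String, TreeSet<Integer>> files = index.get(word);
        if (files == null || !files.containsKey(file)) {
            return Collections.emptySet();
        }
        // Copy and wrap so callers cannot modify the private data
        return Collections.unmodifiableSet(new TreeSet<>(files.get(file)));
    }

    public static void main(String[] args) {
        IndexSketch sketch = new IndexSketch();
        sketch.add("elephant", "mammals.txt", 8);
        sketch.add("elephant", "mammals.txt", 3);
        sketch.add("elephant", "endangered.txt", 2);
        System.out.println(sketch.positions("elephant", "mammals.txt")); // prints [3, 8]
    }
}
```

Returning a copy rather than the internal set is one simple way to keep private data from leaking out of the class, at the cost of some extra allocation.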
The suggested deadline for this project is Monday, February 18, 2012 at 11:59pm.
For this program, you must traverse the supplied directory and process all text files found in that directory (including all subdirectories). For each text file, you must parse each line into words. For each word, you must store, in a custom inverted index data structure, a mapping from the word to the file and the position where that word was found.
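The traversal step might be sketched as follows. This is a rough illustration, not the expected solution: it assumes that a "text file" is any regular file whose name ends in .txt (the handout does not define this), and the class name TraversalSketch and method findTextFiles are hypothetical. Files.walk requires Java 8 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that collects text files beneath a directory. */
class TraversalSketch {
    /** Returns every regular file ending in .txt under root, including subdirectories. */
    static List<Path> findTextFiles(Path root) throws IOException {
        // Files.walk visits the whole tree; the stream must be closed afterwards
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString()
                                      .toLowerCase(Locale.ROOT).endsWith(".txt"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```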
Specifically, the core functionality of your project must satisfy the following requirements:
Your inverted index must store a mapping from each word to the file(s) in which it was found and the position(s) in each file where it is located. This will require nesting multiple data structures. You should choose these data structures wisely.
The positions stored in your inverted index should start at 1. For example, if a file has words "apple banana carrot", then "apple" is in position 1, "banana" is in position 2, and "carrot" is in position 3.
Your program must separate a file into words by any whitespace, including spaces, tabs, and new line characters.
Your program must be case-insensitive. For example, the words APPLE, Apple, apple, and aPpLe should all be seen as the same word.
Your program must ignore all characters except letters and digits. For example, the word "age-long" should be seen as "agelong" (since the dash "-" should be ignored) and the word "Hello!" should be seen as "hello" without the exclamation mark "!". One way to accomplish this is to replace any special characters with the empty string before parsing.
Your program must support large text files. As a result, you should not read the entire file into memory at once. Instead, read a single line from the file into memory at a time.
Your program must be object-oriented. For example, you should separate directory traversing, file parsing, and data maintenance into different classes.
You must also protect the integrity of your inverted index, making sure you have proper encapsulation of any private data members. For example, do not return a reference to a private data member in a public method.
Your program must output the inverted index to a text file. See the Output section for specific output requirements.
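The parsing requirements above (1-based positions, whitespace splitting, case-insensitivity, ignoring special characters, and reading one line at a time) can be combined roughly as in the sketch below. It assumes that "letters and digits" means ASCII a-z and 0-9, and that a token made only of special characters disappears and takes no position; neither point is spelled out in the requirements. The names ParserSketch, parseLine, and parse are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical line-by-line word parser. */
class ParserSketch {
    /** Lowercases a line, removes everything except a-z, 0-9, and whitespace, then splits. */
    static String[] parseLine(String line) {
        String cleaned = line.toLowerCase(Locale.ROOT)
                             .replaceAll("[^a-z0-9\\s]", "")
                             .trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    /** Reads one line at a time and returns "position:word" pairs; positions start at 1. */
    static List<String> parse(Reader source) throws IOException {
        List<String> found = new ArrayList<>();
        int position = 0; // counts words across the whole file, not per line
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : parseLine(line)) {
                    position++;
                    found.add(position + ":" + word);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(parse(new StringReader("Hello! age-long\n\tAPPLE  banana")));
        // prints [1:hello, 2:agelong, 3:apple, 4:banana]
    }
}
```

In a real program the Reader would come from the file being processed, and each word would go into the inverted index rather than a list of strings.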
Every project must also satisfy the following design requirements:
Your program should conform to Java code conventions. See http://www.oracle.com/technetwork/java/codeconv-138413.html for details.
Your program should use Javadoc comments for all classes, members, and methods. See http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html for details.
You must create object-oriented code that is generalized and encapsulated. This includes protecting the data integrity of all private data members.
You must perform proper error and exception handling, and display a user-friendly error message when issues occur. The user should never see the Java exception stack trace.
Your code must be reasonably efficient (both from the execution time and memory usage perspectives).
You must satisfy the core functionality prior to submitting your project. See the Testing section for details. If there are issues with the design of your project that are found during the code review, you will be asked to refactor and resubmit your project.
The output of your program should be a file invertedindex.txt (or use the filename provided on the command-line) that contains the contents of your inverted index in sorted order using the following output format:
cow
"/home/sjengle/files/mammals.txt", 11

elephant
"/home/sjengle/files/endangered.txt", 2
"/home/sjengle/files/mammals.txt", 3, 8
where the word is listed alone on a single line, followed by lines with the absolute file path, and a comma-separated list of locations. An empty line should separate entries. The words should be output in sorted order, and the files should be sorted by the absolute path name.
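As a rough illustration of this format, the sketch below writes a nested sorted map to any Writer. It assumes the index is held as a TreeMap of TreeMaps of TreeSets, that the file keys are already absolute paths, and that no empty line follows the last entry; none of these choices is mandated here. The class name OutputSketch and method write are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical writer for the word / "path", positions output format. */
class OutputSketch {
    static void write(TreeMap<String, TreeMap<String, TreeSet<Integer>>> index, Writer out)
            throws IOException {
        String newline = System.lineSeparator();
        boolean first = true;
        // TreeMap iteration order is sorted, so words and paths come out in order
        for (Map.Entry<String, TreeMap<String, TreeSet<Integer>>> entry : index.entrySet()) {
            if (!first) {
                out.write(newline); // empty line between entries
            }
            first = false;
            out.write(entry.getKey() + newline);
            for (Map.Entry<String, TreeSet<Integer>> file : entry.getValue().entrySet()) {
                StringBuilder line = new StringBuilder("\"" + file.getKey() + "\"");
                for (int position : file.getValue()) {
                    line.append(", ").append(position);
                }
                out.write(line + newline);
            }
        }
    }
}
```

Passing a BufferedWriter opened on the output file keeps the writing efficient, while passing a StringWriter makes the format easy to check in a test.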
Your code must run on the lab computers. If you are developing your code on a home computer or laptop, be sure to check out your code on a lab computer and test it. Your main method must be placed in a class named Driver. This should be the only class that is specific to the project rather than generalized.
Your code will be tested using the following command:
java -cp project1.jar Driver <arguments>
where <arguments> will be the following command-line arguments (in any order):
-d <directory> where -d indicates the next argument is a directory, and <directory> is the directory of text files that must be processed
-i <filename> where -i is an optional flag that indicates the next argument is a file name/path, and <filename> is the name/path to use when saving the inverted index. If this flag is not provided, you should use invertedindex.txt as the default filename.
If the proper command-line arguments are not provided, your program should output a user-friendly error message to the console and exit gracefully.
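One possible way to handle these flags in any order is sketched below. It is an illustration only: the class name ArgumentSketch and its methods are hypothetical, and it assumes that a value never begins with "-" and that a -i flag with no value falls back to the default filename. The real entry point would be the Driver class.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parser for "-flag value" style arguments. */
class ArgumentSketch {
    /** Collects flag/value pairs in any order; a flag without a value maps to null. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
                flags.put(args[i], hasValue ? args[++i] : null);
            }
        }
        return flags;
    }

    /** Returns the -i value, or the default filename when it is absent or has no value. */
    static String outputName(Map<String, String> flags) {
        String name = flags.get("-i");
        return name == null ? "invertedindex.txt" : name;
    }

    public static void main(String[] args) {
        Map<String, String> flags = parse(args);
        if (flags.get("-d") == null) {
            // Friendly message instead of a stack trace, then exit normally
            System.err.println("Usage: java -cp project1.jar Driver -d <directory> [-i <filename>]");
            return;
        }
        System.out.println("Directory: " + flags.get("-d"));
        System.out.println("Output: " + outputName(flags));
    }
}
```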
You must submit your project to your SVN repository at:
https://www.cs.usfca.edu/svn/<username>/cs212/project1
where <username> should be replaced with your CS username. You should include the following files in this directory:
a jar file named project1.jar in all lowercase that includes all of the necessary *.class files to run your program
a src directory with all of the *.java files necessary to compile your program
a readme.txt file with your name, email address, student id, and brief description/justification of your approach
See this guide for how to properly set up Eclipse to include the required files and directories. (You will have to generate the jar file manually, however.) Once you have your project properly submitted, please fill out the Project Submission form. If there are any issues with your submission, you will be asked to resubmit the project and a code review will not be performed.
You should thoroughly test your own code. Make sure it meets the functionality requirements, performs proper exception and error handling, and produces the correct output. Your code must be fully functional before submitting it for code review.
PENDING