For this project, you will extend your previous project to support partial search. Your program will return a sorted list of results from your inverted index that start with the query word(s).
For example, suppose your inverted index contains the words: after, apple, application, and happen. If the query word is app, you should return results for both apple and application (but not happen).
Your search results must be sorted such that the most relevant search result is listed first, and the least relevant search result is listed last. You will determine relevancy based on both the frequency of the query word and position.
The suggested deadline for this project is Monday, March 11, 2013 at 11:59pm. You must still meet the functionality requirements of the previous project.
In addition to the requirements of the previous project, you must extend your inverted index to support the following functionality:
Efficiently return partial search results from your inverted index, such that any word in your inverted index that starts with a query word is returned.
Sort the search results using a simple metric for relevancy (details below).
Support multiple word search queries, where the search results from individual words are combined together intelligently for multiple-word queries. For example, a file should only appear once in the search results even if it contains multiple query words.
To rank your search results, use the following criteria:
Frequency: Files where the query word(s) are more frequent should be ranked above others.
Position: For files that have the same frequency, files where the words appear in earlier positions should be ranked above others.
Path: For files that have the same frequency and position, sort files alphabetically by the canonical path.
As such, a file that contains 15 occurrences of a query word should be ranked above a file containing 5 occurrences of that word. For multiple-word queries, you should add the count for each word together and use the earliest position any of the query words appear.
For example, suppose the query words are "apple banana". Let file1.txt have 15 results for apple and let file2.txt have 5 results for apple and 5 results for banana. Then, file1.txt will be ranked higher than file2.txt since the total count of 15 for file1.txt is greater than that of 10 for file2.txt (even though file2.txt has both of the search terms). If apple first appears in position 3 in file2.txt and banana first appears in position 14, the position that should be used for sorting is 3.
The details of how your program will be run and the expected output are below.
Your code must run on the lab computers. If you are developing your code on a home computer or laptop, be sure to check out your code on a lab computer and test it. Your main method must be placed in a class named Driver. This should be the only file that is not generalized and specific to the project.
Your code will be tested using the following commands:
svn export https://www.cs.usfca.edu/svn/<username>/cs212/project2
cd project2
java -cp project2.jar Driver <arguments>
where <arguments> will be the following command-line arguments (in any order):
-d <directory> where -d indicates the next argument is a directory, and <directory> is the directory of text files that must be processed for the inverted index
-q <queryfile> where -q indicates the next argument is a file path, and <file> is a text file containing search queries
-i <filename> where -i is an optional flag such that:
If present, you should output the inverted index to a file. If not present, do not output the inverted index.
If the <filename> is missing, you should use invertedindex.txt as the default filename.
-r <filename> where -r is an optional flag such that:
If present, you should output the search results to a file. If not present, do not output the search results.
If the <filename> is missing, you shoud use searchresults.txt as the default filename.
If the proper command-line arguments are not provided, your program should output a user-friendly error message to the console and exit gracefully.
The input query file will consist of one multi-word search query per line. The query file may contain symbols and mixed-case words. You will need to perform the necessary transformations to match the queries with your inverted index. A sample query file may look like:
1918
a-coming
you
irrelphant
WH
app ba
There are six different search queries in this example. The first query 1918 consists of only one search word. The second query a-coming should have the special character removed to be acoming instead. The query WH should be made all lowercase as wh instead. The last query should match all words that start with app or start with ba.
If a query, like irrelphant, does not have any search results, you should still include the query in the output (just without any file locations). See the next section for details on the output.
The output of your program should be a file named searchresults.txt that contains the search results for each query in the following format:
<query 1>
<file path 1>, <frequency>, <initial_position>
<file path 2>, <frequency>, <initial_position>
<query 2>
<file path 1>, <frequency>, <initial_position>
where the query (including all words if it is a multi-word query) is listed alone on a single line, followed by lines with the canonical file path, frequency the word appears in that file, and the first position that word appears. An empty line should separate entries. For example, a sample results file for the above queries might be:
1918
"/home/public/cs212/input/index/simple/pg22014c.txt", 1, 51
"/home/public/cs212/input/index/simple/pg22014b.txt", 1, 62
"/home/public/cs212/input/index/simple/pg22014a.txt", 1, 93
a-coming
"/home/public/cs212/input/index/simple/pg22014a.txt", 1, 36
you
"/home/public/cs212/input/index/simple/pg22014c.txt", 2, 11
irrelphant
WH
"/home/public/cs212/input/index/simple/pg22014a.txt", 3, 13
"/home/public/cs212/input/index/simple/pg22014b.txt", 1, 5
(Note above that irrelphant has no search results.)
Additionally, you should output a file with the contents of your inverted index in the same format as before if the appropriate flag is provided.
You must submit your project to your SVN repository at:
https://www.cs.usfca.edu/svn/<username>/cs212/project2
where <username> should be replaced with your CS username. You should include the following files in this directory:
a jar file named project2.jar in all lowercase that includes all of the necessary *.class files to run your program
a src directory with all of the *.java files necessary to compile your program
a readme.txt file with your name, email address, student id, and brief description/justification of your approach
If there are any issues with your submission, you will be asked to resubmit the project and a code review will not be performed.
You should thoroughly test your own code. Make sure it meets the functionality requirements, performs proper exception and error handling, and produces the correct output. Your code must be fully functional before submitting it for code review.
Several test input, output, and JUnit files have been provided to help you test your code. You may access these files on the lab computers at:
/home/public/cs212
Alternatively, you can download the files below and place them in the appropriate local directories.