Lab 2 - Multithreaded Indexing and Search

Part 1 - Due Monday 2/21/2011 - 11:59PM

Part 2 - Due Friday 3/4/2011 - 11:59PM

The goal of this assignment is to give you practice with concurrent programming and Java threads. You will also gain experience with code redesign. You will extend Lab 1 in two ways: (1) you will add search functionality to your inverted index and (2) you will use threads to process several text files in parallel when building your inverted index.

For this assignment, you may not use any of the java.util.concurrent classes.

Requirements - Part 1

    1. Search - Your search mechanism will take as input a multi-word query and will return a list of documents where the given words appear.
      1. Your score for this portion of the assignment will be based on the intelligence of your algorithm. The easiest solution is to simply return a list of documents where any of the words appear. Another approach is to give greater weight to (by placing earlier in the result set) documents where more than one of the query words appears. Yet another option is to give greater weight to the documents where the words specified appear closer to one another. For example, if you search for "computer science" and document 1 has computer at position 1 and science at position 2 while document 2 has computer at position 1 and science at position 100, document 1 would be given higher weight.
      2. Your program will take as input the directory as before along with a file containing a set of queries. Each line of the file will contain a multi-word query. The output of your program will be a text file results.txt that contains the result of running each query.
    2. Redesign - You will redesign and reimplement your code based upon in-class discussion and the comments you received for Lab 1. Some of the elements I expect you to address are the following:
      1. Use Generics when appropriate.
      2. Use the java.util data structures where appropriate.
      3. Separate data maintenance from parsing logic.
      4. Use iteration over recursion where appropriate.
      5. Use regular expressions where appropriate.
      6. Keep in mind that Strings are immutable.
      7. Do not initialize variables by creating "empty" Objects that are never used.

Requirements - Part 2

    1. Thread Pool - You will use a thread pool or work queue and process up to 10 text files in parallel. As your program traverses the directory specified by the user, for each txt file found insert a new job into the queue.
      1. To accomplish this, you will need to implement a work queue and a locking mechanism for your inverted index data structure.
      2. Your locking mechanism will ensure that only 1 thread may change the inverted index at a time. Multiple threads may read data in the inverted index simultaneously.
      3. Once all files have been processed, all threads will gracefully exit and the program will gracefully complete.

Submission

    1. Your submission must be checked in by the deadline.
    2. You will be graded on the code in https://www.cs.usfca.edu/svn/your-user-name/cs212/lab2.
    3. You must provide a jar file containing your code and class files. The name of this file must be invertedindex.jar. To verify that you have correctly submitted this element of the assignment, you may use the following command:
    4. svn list https://www.cs.usfca.edu/svn/your-user-name/cs212/lab2
      1. You should see invertedindex.jar in the result.
  1. invertedindex.jar must contain all java files and all class files you are submitting. From the command line, you can use the command jar xvf invertedindex.jar to inflate the jar file. The contents should include all java and class files.
    1. Your program will be run as follows:

java -cp invertedindex.jar Driver -d /My/Directory -q /Query/file.txt

Grading - Part 1 (still subject to change...)

  1. (20 points) Functional search mechanism.
  2. (10 points) Intelligence of search algorithm.
  3. (20 points) Lab 1 code has been integrated and redesigned where appropriate.

Grading - Part 2 (still subject to change...)

  1. (25 points) Work queue is implemented and works correctly.
  2. (25 points) Lock class is implemented and correctly provides read/write access to inverted index.

Submission Instructions