Assignment IV: Inverted Index

Submission

This is one of 4 options for Assignment IV: Application.

Please zip your entire directory and submit it via moodle by Monday 4/30 at 11:59pm.

Inverted index

from Prof. Andrews' assignment from Fall '11

An inverted index is a data structure that maps occurrences of data (in this case, words) back to locations in original content. This is frequently used to support full text support for information stored in databases. It is typically used to return records or documents that contain the queried word. You will use it to return lines of text.

There are a number of ways to implement an inverted index -- it basically just requires an associative structure. In this assignment, you will use a hash table. The keys in this case will be words. The values will be a String containing the lines in which that word occurred. When a user queries the index with a word, they will be able to see all of the lines containing the queried word.

Reminders

  • Include a README.txt file that clearly states:
    • what your application does
    • how to run your program from the command line
  • Be sure to javadoc all your code!

Building the index (60 points)

  • The constructor should take one parameter: a String[] of the content to store; each entry in the array holds a line of text.
  • Each line of text (i.e., entry of the array) should be broken down into individual words, so that you can put the line number into the String associated with the word (e.g., use the word as the key)
    • convert the words to lowercase before you store them so that you can find all instances of a word with a search
    • you will have to test if the word has already been put into the table already
      • if not, create an empty String as its value
      • if so, append the line number to the String (your Strings will look like "1,3,4,10,22")
  • Implement a getLines( String word ) method that returns a list of the lines containing that word, formatted as:
<lineNumber>: <line text>
    • Hint: You will need to parse the String to find each line number and use it to get the line text from the original content
  • Bonus: surround each occurrence of the word with asterisks (*)

Reading a file to create an index (20 points)

Create and implement a new class called InvertedIndexUtility.

  • Implement a method that returns an instance of an InvertedIndex and takes one parameter: a String with the path to a file containing the text for the inverted index
    • The file should be read in and stored, line by line into a List<String> (e.g., an ArrayList<String>), which in turn should be converted into a String[]
    • The array should be used to create an instance of an InvertedIndex to return

Interacting with an index (20 points)

Create and implement a new class called InvertedIndexApplication.

  • This should have a main method that reads in one parameter from the command line: the path to the file containing the content for your inverted index.
  • Using the InvertedIndexUtility class, you should create an InvertedIndex instance based on the file. Then your application should enter an interactive mode. You may want to refer to this sample code for interactively reading input from the command line. When the user enters a word, your index should look it up and print the lines that contain it.

If the user uses the RETURN key without entering any text, the application should quit.

You can use Project Gutenberg to obtain testing content.

An example

Given the text from this random page:

  1. The Banana Story
  2. Start with a cage containing five monkeys. Inside the cage, hang a banana on a string and place a set of
  3. stairs under it. Before long, a monkey will go to the stairs and start to climb towards the banana. As soon
  4. as he touches the stairs, spray all of the other monkeys with cold water. After a while, another monkey
  5. makes an attempt with the same result - all the other monkeys are sprayed with cold water. Pretty soon,
  6. when another monkey tries to climb the stairs, the other monkeys will try to prevent it.
  7. Now, put away the cold water. Remove one monkey from the cage and replace it with a new one. The new
  8. monkey sees the banana and wants to climb the stairs. To his surprise and horror, all of the other monkeys
  9. attack him. After another attempt and attack, he knows that if he tries to climb the stairs, he will be assaulted.
  10. Next, remove another of the original five monkeys and replace it with a new one. The newcomer goes to the
  11. stairs and is attacked. The previous newcomer takes part in the punishment with enthusiasm! Likewise, replace
  12. a third original monkey with a new one, then a fourth, then the fifth.
  13. Every time the newest monkey takes to the stairs, he is attacked. Most of the monkeys that are beating him have
  14. no idea why they were not permitted to climb the stairs or why they are participating in the beating of the newest
  15. monkey.
  16. After replacing all the original monkeys, none of the remaining monkeys have ever been sprayed with cold water.
  17. Nevertheless, no monkey ever again approaches the stairs to try for the banana.
  18. Why not?
  19. Because as far as they know that's the way it's always been done around here.
  20. And that, my friends, is how 95% of agents have learned how to market homes.

Then, interactively entering banana should print out:

banana was found at
line 1: The *Banana* Story
line 3: Start with a cage containing five monkeys. Inside the cage, hang a *banana* on a string and place a set of
line 4: stairs under it. Before long, a monkey will go to the stairs and start to climb towards the *banana*. As soon 
line 10: monkey sees the *banana* and wants to climb the stairs. To his surprise and horror, all of the other monkeys 
line 22: Nevertheless, no monkey ever again approaches the stairs to try for the *banana*.

The * are only if you do the bonus!