Lab 4 - Crawler

Due - Friday 11/5 - 11:59pm

The goal of this assignment is to give you practice with Sockets and HTTP by writing a multithreaded web crawler that you will integrate with Labs 2 and 3.

Your program will take as input a seed URL to begin the crawl. Your program will create a WorkQueue and insert into the queue the 'job' associated with crawling the given page. For each URL you crawl, your program will open a Socket, download the page, strip all of the tags, insert the remaining words into the InvertedIndex, and insert any links found in the page as jobs into the WorkQueue.
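
The per-page steps described above could be sketched roughly as follows. This is a minimal sketch, not the required design: it assumes plain http:// URLs, UTF-8 pages, and double-quoted href attributes. The names PageProcessor, buildRequest, fetch, stripTags, and extractLinks are hypothetical and not part of any provided lab code. The stripTags helper is only a stand-in for whatever tag stripper you already have.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative assumptions.
final class PageProcessor {
    // Simplified: only matches double-quoted href values inside <a ...> tags.
    private static final Pattern HREF =
            Pattern.compile("(?i)<a\\s+[^>]*?href\\s*=\\s*\"([^\"]*)\"");

    // Builds a minimal GET request. HTTP/1.0 is used so the server
    // will not reply with chunked transfer encoding.
    static String buildRequest(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + uri.getHost() + "\r\n\r\n";
    }

    // Opens a Socket, sends the request, and returns the response body
    // (everything after the blank line that ends the headers).
    static String fetch(URI uri) throws IOException {
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        try (Socket socket = new Socket(uri.getHost(), port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(uri).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            String response = new String(
                    socket.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            int bodyStart = response.indexOf("\r\n\r\n");
            return bodyStart < 0 ? "" : response.substring(bodyStart + 4);
        }
    }

    // Replaces every tag with a space and collapses runs of whitespace.
    static String stripTags(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Returns absolute http links found in the page, with fragments removed.
    static List<String> extractLinks(URI base, String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String href = matcher.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                if ("http".equalsIgnoreCase(resolved.getScheme())) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it rather than abort the page.
            }
        }
        return links;
    }
}
```

In this sketch the text returned by stripTags would be handed to the InvertedIndex, and each string returned by extractLinks would become a new job in the WorkQueue.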

For testing purposes, you can restrict the number of total pages your crawler processes. You can also simply wait some fixed amount of time, or wait until a fixed number of pages have been traversed, before executing some test queries.

Your program will be run as follows:

java -cp invertedindex.jar Driver -w URLofSeedPage -q /Query/file.txt
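
As an illustration only, the -w and -q flags in the command above could be read with a small helper like the one below. The class name ArgumentParser and its parse method are hypothetical; your Driver may already handle arguments differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps each flag (e.g. "-w") to the value that follows it.
final class ArgumentParser {
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                // A flag has a value only if the next token is not another flag.
                boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
                flags.put(args[i], hasValue ? args[++i] : null);
            }
        }
        return flags;
    }
}
```

With the command line shown above, parse(args).get("-w") would give the seed URL and parse(args).get("-q") the query file path.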

Requirements Clarifications:

    1. Your program must be capable of crawling 10 pages concurrently. The easiest way to achieve this is to create a Runnable class whose run method downloads and processes a single page. For each link found, it will create a new instance of the Runnable class and insert the new object into a WorkQueue.
    2. The output of your program must be saved in a file called results.txt. This file will look identical to the results.txt file generated by Lab 2 except that it will output URLs instead of file paths.
    3. You may restrict the total number of pages crawled by your program to 30.
    4. For the purposes of this assignment, you will not crawl the same page multiple times. If page A links to page B and page B links back to page A, you will only crawl page A once. The easiest way to achieve this is to keep a HashMap of the URLs you have already crawled and check it for each link found.
    5. You may ignore HTML escape sequences such as &nbsp;.
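
Requirements 1, 3, and 4 can be combined in one small coordinator, sketched below under several assumptions. The WorkQueue API is not shown in this handout, so the sketch substitutes a java.util.concurrent fixed thread pool for it. It also uses a HashSet guarded by synchronized methods in place of the HashMap mentioned above, since only the keys are needed. The names CrawlCoordinator, CrawlTask, and processPage are hypothetical; processPage stands for "download and index one page, then return its links".

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical coordinator; a thread pool stands in for the lab's WorkQueue.
final class CrawlCoordinator {
    private static final int THREADS = 10;   // requirement 1
    private static final int MAX_PAGES = 30; // requirement 3

    private final Set<String> seen = new HashSet<>(); // guarded by this
    private int pending = 0;                           // guarded by this
    private final ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    private final Function<String, List<String>> processPage;

    CrawlCoordinator(Function<String, List<String>> processPage) {
        this.processPage = processPage;
    }

    // Schedules a URL only if it is new and the page limit is not reached.
    // The check and the insert happen under one lock, so no URL is queued twice.
    private synchronized void submit(String url) {
        if (seen.size() >= MAX_PAGES || !seen.add(url)) {
            return;
        }
        pending++;
        pool.execute(new CrawlTask(url));
    }

    private synchronized void finished() {
        if (--pending == 0) {
            notifyAll(); // wake the thread waiting in crawl()
        }
    }

    // Crawls from the seed and returns the set of URLs that were processed.
    synchronized Set<String> crawl(String seed) {
        submit(seed);
        try {
            while (pending > 0) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return new HashSet<>(seen);
    }

    // One job: process a single page, then enqueue every link it contains.
    private final class CrawlTask implements Runnable {
        private final String url;

        CrawlTask(String url) {
            this.url = url;
        }

        @Override
        public void run() {
            try {
                for (String link : processPage.apply(url)) {
                    submit(link);
                }
            } catch (RuntimeException e) {
                // A page that fails to load should not stop the whole crawl.
            } finally {
                finished();
            }
        }
    }
}
```

The pending counter is what lets the caller know when the crawl is complete: it is incremented before a job is queued and decremented only after that job has queued its own links, so it reaches zero only when no work remains. The same bookkeeping would apply if the pool were replaced with your WorkQueue.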

Grading

  1. (25 points) Crawler functionality.
  2. (25 points) Tag stripper integration.
  3. (25 points) InvertedIndex integration.
  4. (10 points) WorkQueue operation.
  5. (15 points) Design and refactoring.

Submission Instructions