For this project, you will create a web crawler that takes as input a seed URL to crawl and a query file. Your program must then crawl all links found on the seed web page and resulting pages until all links have been crawled or you have reached a maximum of 50 unique links. For each webpage crawled, you must remove all HTML tags and populate an inverted index from the resulting text. Finally, your program must return partial search results for each query in the supplied query file. You must still use multithreading for building and searching your index.
You will find your HTML link parser and HTML tag stripper from previous homework assignments essential for this project.
Please output the files invertedindex.txt and searchresults.txt in the same format as the previous projects, except that instead of full file paths you will output full URLs.
The suggested deadline for this project is Monday, April 29, 2013 at 11:59pm. You must still meet the functionality requirements of the previous projects.
In addition to the requirements of the previous project, you must extend your inverted index to support the following functionality:
Add support to build the inverted index from a seed URL instead of a directory. Specifically, build a web crawler that does the following:
Open a socket and download the webpage specified by the URL using HTTP (see the download sketch after this list).
Collect (but do not immediately crawl) all of the HTML links from the webpage. See Homework 3: HTML Link Parser for the functionality requirements for this part.
Strip all of the HTML tags from the webpage. See Homework 8: HTML Cleaner for the functionality requirements for this part.
Parse the resulting text into words to populate the inverted index.
AFTER parsing the entire page, add a new job to a WorkQueue to crawl each link found on that page (see the crawl-job sketch after this list). Track the set of crawled URLs and their count carefully so that you never parse the same link twice and never exceed the maximum number of pages.
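For the socket step, a minimal sketch might look like the following. It assumes an HTTP/1.1 request with a Connection: close header, and it ignores details a full crawler would need to handle, such as redirects and chunked transfer encoding. The class and method names here are only suggestions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.URL;

public class HTTPFetcher {
    // Downloads the body of the webpage at the given URL over a raw socket.
    public static String download(URL url) throws Exception {
        int port = (url.getPort() < 0) ? 80 : url.getPort();
        String file = url.getFile().isEmpty() ? "/" : url.getFile();

        try (Socket socket = new Socket(url.getHost(), port);
             PrintWriter writer = new PrintWriter(socket.getOutputStream());
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {

            // Send a minimal GET request; Connection: close lets us read
            // until the server closes the connection.
            writer.printf("GET %s HTTP/1.1\r\n", file);
            writer.printf("Host: %s\r\n", url.getHost());
            writer.printf("Connection: close\r\n\r\n");
            writer.flush();

            // Skip the response headers (everything up to the blank line).
            String line;
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                // header line ignored in this sketch
            }

            // Everything after the blank line is the HTML body.
            StringBuilder html = new StringBuilder();
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
            return html.toString();
        }
    }
}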
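For the last step, one possible shape for the crawl job is sketched below. It assumes a WorkQueue class with an execute(Runnable) method like the one developed in earlier homework; the crawlPage helper, the field names, and the normalization scheme are all hypothetical.

import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class WebCrawler {
    private static final int MAX_PAGES = 50;

    private final Set<String> visited = new HashSet<String>(); // non-fragment URLs seen so far
    private final WorkQueue queue; // assumed from earlier homework

    public WebCrawler(WorkQueue queue) {
        this.queue = queue;
    }

    // Queues a crawl job for the link unless it was already seen or the
    // page limit was reached. The check-then-add must be atomic, hence
    // the synchronized block.
    private void queueIfNew(final URL link) {
        synchronized (visited) {
            // Key on everything except the fragment so that page.html and
            // page.html#section count as the same page.
            String key = link.getHost() + link.getFile();
            if (visited.size() >= MAX_PAGES || !visited.add(key)) {
                return;
            }
        }

        queue.execute(new Runnable() {
            @Override
            public void run() {
                crawlPage(link);
            }
        });
    }

    // Hypothetical worker: each work-queue job handles a single webpage.
    private void crawlPage(URL url) {
        // 1. Download the page (for example, with HTTPFetcher.download above).
        // 2. Collect its links, resolve them against url, and call queueIfNew.
        // 3. Strip the HTML tags, parse the text into words, and add the
        //    words to the thread-safe inverted index.
    }
}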
The building of the index should still be multithreaded, whether from a directory or a seed URL. In the case of using a seed URL, each worker thread should parse a single webpage.
The inverted index must still be thread-safe (one possible approach is sketched below), and must support placing URLs instead of file paths into the index.
Your web crawler should support crawling relative links within a webpage, and should only consider the non-fragment portions of the URL when crawling.
The partial search functionality should remain unchanged.
Your program should still support building your index from a directory if provided. Below are some additional considerations for this project.
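For the thread-safety requirement, one possible approach (a sketch only; InvertedIndex and its add(String, String, int) method stand in for whatever your existing index class actually provides) is to extend your index and guard every public method with a read/write lock, so that many searches can run concurrently while additions are exclusive:

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ThreadSafeInvertedIndex extends InvertedIndex {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    @Override
    public void add(String word, String url, int position) {
        lock.writeLock().lock();
        try {
            super.add(word, url, position);
        }
        finally {
            lock.writeLock().unlock();
        }
    }

    // ... wrap the search and output methods with the read lock similarly
}

A custom read/write lock from an earlier assignment would work just as well as the built-in ReentrantReadWriteLock.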
The majority of links on webpages are relative (i.e. specified relative to the current webpage URL). You will need to convert each relative link into an absolute link. For this, you may use the java.net.URL class. For example, consider the following:
URL base = new URL("http://www.cs.usfca.edu/~sjengle/cs212/");
URL absolute = new URL(base, "../index.html");
// outputs http://www.cs.usfca.edu/~sjengle/index.html
System.out.println(absolute);
You must still use sockets in your web crawler. Do NOT use the getContent() or openConnection() methods in the URL class.
You should store a set of unique links to make sure you do not crawl the same link twice. The fragment portion should be disregarded in this comparison: you can use the sameFile() method in the URL class, or make sure you store only the non-fragment getFile() portion of the URL in your set, as demonstrated below.
For example, the link http://docs.python.org/2/library/string.html is equivalent to any link that differs from it only in its fragment, such as http://docs.python.org/2/library/string.html#module-string or http://docs.python.org/2/library/string.html#string-formatting.
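A quick way to check this equivalence in code, using the methods mentioned above (the expected output is noted in the comments):

URL a = new URL("http://docs.python.org/2/library/string.html");
URL b = new URL("http://docs.python.org/2/library/string.html#string-formatting");

// outputs true, since sameFile() excludes the fragment
System.out.println(a.sameFile(b));

// outputs /2/library/string.html, with no fragment
System.out.println(b.getFile());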
Your code must run on the lab computers. If you are developing your code on a home computer or laptop, be sure to check out your code on a lab computer and test it there. Your main method must be placed in a class named Driver. This should be the only class that is specific to this project; everything else should be generalized.
Your code will be tested using the following commands:
svn export https://www.cs.usfca.edu/svn/<username>/cs212/project4
cd project4
java -cp project4.jar Driver <arguments>
where <arguments> will be the following command-line arguments (in any order):
-u <seed> where -u is an optional flag indicating the next argument is a URL, and <seed> is the seed URL that must be initially processed for the inverted index.
If missing, the -d flag must be provided instead.
-d <directory> where -d is an optional flag indicating the next argument is a directory, and <directory> is the directory of text files that must be processed for the inverted index.
If missing, the -u flag must be provided instead.
-q <queryfile> where -q is an optional flag indicating the next argument is a file path, and <queryfile> is a text file containing search queries.
If missing, do not perform any searching.
-i <filename> where -i is an optional flag such that:
If present, you should output the inverted index to a file. If not present, do not output the inverted index.
If the <filename> is missing, you should use invertedindex.txt as the default filename.
-r <filename> where -r is an optional flag such that:
If present, you should output the search results to a file. If not present, do not output the search results.
If the <filename> is missing, you should use searchresults.txt as the default filename.
-t <threads> where -t is an optional flag such that:
If present, the next argument <threads> is the number of threads to use in the work queue/thread pool.
If missing, default to using 2 threads in the work queue/thread pool.
If the proper command-line arguments are not provided, your program should output a user-friendly error message to the console and exit gracefully.
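Since the -i and -r flags may appear with or without a value, the parsing needs a little care. Below is one rough sketch, assuming every flag begins with a dash; the class and variable names are arbitrary.

import java.util.HashMap;
import java.util.Map;

public class ArgumentParser {
    // Maps each flag to its value, or to null if the flag appeared
    // without a value.
    public static Map<String, String> parse(String[] args) {
        Map<String, String> flags = new HashMap<String, String>();

        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                String flag = args[i];
                String value = null;

                // Treat the next token as this flag's value only if it
                // is not itself a flag.
                if (i + 1 < args.length && !args[i + 1].startsWith("-")) {
                    value = args[++i];
                }

                flags.put(flag, value);
            }
        }

        return flags;
    }
}

With this approach, flags.containsKey("-i") tells you whether to output the inverted index at all, while a null value for that key means you should fall back to the default invertedindex.txt.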
The format of the query file will be identical to previous projects. The output files invertedindex.txt and searchresults.txt must also be in the same format as previous projects, except instead of full file paths you will output URLs.
You must submit your project to your SVN repository at:
https://www.cs.usfca.edu/svn/<username>/cs212/project4
where <username> should be replaced with your CS username. You should include the following files in this directory:
a jar file named project4.jar in all lowercase that includes all of the necessary *.class files to run your program
a src directory with all of the *.java files necessary to compile your program
a readme.txt file with your name, email address, student id, and brief description/justification of your approach
If there are any issues with your submission, you will be asked to resubmit the project and a code review will not be performed.
Seed URLs, expected output, and unit tests have been provided on the lab computers and below. Your project should pass the Project 4, Project 3, Project 2, and Project 1 unit tests!