For this final project, you will create a customizable search engine that builds off all of your previous projects. Your project will use multi-threading, an inverted index, servlets, sockets, cookies, Jetty, HTTP, HTML, JDBC, and SQL. Examples will be shown in class.
You should still meet the functionality requirements of the previous projects, except that you should only output the files invertedindex.txt and searchresults.txt if the -i and -q flags are provided. Instead, your primary output will be HTML pages returned by your Jetty web server.
You must create a Java application that crawls a specified seed web page, builds an inverted index, and starts a search engine for a website that provides weighted partial search capability, authenticates users, and maintains a history of each user's prior searches. See your previous projects for details on the required web crawler functionality, weighted partial search functionality, and inverted index functionality.
Regarding the search engine functionality of this project, you must implement several core features worth 60 points, plus several additional features worth 40 points, for a total of 100 points plus possible extra credit. The core and additional features are discussed next.
You must implement several core features worth a total of 60 points. The core features are:
Search (20 points): Your web application should display a webpage with a text box where users may enter a search query, and a button that submits the query to your search engine. Your search engine should perform a partial search from an inverted index generated by your web crawler, and return an HTML page with sorted links to the search results.
User Registration (10 points): Your web application should allow users to register an account with your search engine. Your application should store, at a minimum, the username and password of that user in a MySQL database.
User Login/Logout (10 points): Your web application should allow a user to log in using a username and password. Once a user has logged in to your web application, your application should track the user's session using cookies.
Search History (10 points): Once a user has logged in to your search engine, your web application should store a history of all search queries entered by that user in a MySQL database. Your application should allow a user to view his/her search history.
Account Maintenance (10 points): Your web application should allow a user to change his/her password, and clear his/her search history.
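The registration, login, and password-change features all touch the stored password. The requirements above only say that the password is stored in the database; a common precaution, assumed here and not mandated by the assignment, is to store a salted hash in place of the plain text. The sketch below uses PBKDF2 from the Java standard library. The class name PasswordHasher, its method names, and the iteration count are illustrative choices, not part of the project specification.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.security.spec.KeySpec;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

/** Hypothetical helper for salted password hashing (names are illustrative). */
class PasswordHasher {
    private static final int ITERATIONS = 100_000; // assumed work factor
    private static final int KEY_BITS = 256;
    private static final int SALT_BYTES = 16;

    /** Returns a fresh random salt, Base64-encoded so it fits in a text column. */
    static String newSalt() {
        byte[] salt = new byte[SALT_BYTES];
        new SecureRandom().nextBytes(salt);
        return Base64.getEncoder().encodeToString(salt);
    }

    /** Derives a Base64-encoded PBKDF2 hash from the password and salt. */
    static String hash(String password, String salt) {
        try {
            KeySpec spec = new PBEKeySpec(password.toCharArray(),
                    Base64.getDecoder().decode(salt), ITERATIONS, KEY_BITS);
            SecretKeyFactory factory =
                    SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            byte[] derived = factory.generateSecret(spec).getEncoded();
            return Base64.getEncoder().encodeToString(derived);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is unavailable", e);
        }
    }

    /** Checks a login attempt against the stored salt and hash. */
    static boolean matches(String attempt, String salt, String storedHash) {
        byte[] expected = Base64.getDecoder().decode(storedHash);
        byte[] actual = Base64.getDecoder().decode(hash(attempt, salt));
        // MessageDigest.isEqual compares in constant time.
        return MessageDigest.isEqual(expected, actual);
    }
}
```

At registration, the application would save the salt and hash alongside the username. At login or password change, it would fetch both and call matches with the submitted password.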
You should not begin working on any additional features until you have the core functionality working properly.
You must implement at a minimum 40 points worth of additional features. These features include:
Page Snippet (15 points): In the search results returned by your search engine, your application could display a snippet of the search result page. A snippet usually consists of 2-3 lines from the web page. For performance reasons, these snippets should be stored in a MySQL database when the page is crawled.
Visited Pages (15 points): In addition to tracking a user's search history, you may also store which result pages have been visited by that user. This requires your web application to provide result links that direct back to your application, so that you may first store that the link was visited and then redirect the user to that link.
Administrator Interface: In addition to account maintenance, provide an administrator interface. In this interface, implement one (or both) of the following:
New Crawl (10 points): Allow the administrator to enter a new seed URL to crawl. The results should be added to your inverted index (not replace the already existing results).
Server Shutdown (10 points): Allow the administrator to gracefully shut down the web server.
Suggested Queries (10 points): Provide five suggested queries for the user based on how many results a particular word has in your inverted index. For example, if your seed is a news website, it is likely the suggested queries will relate to the major news of that day.
Sophisticated Search Sorting (10 points): Implement a more sophisticated method of sorting search results. To earn full points for this feature, your method of weighting search results must be more sophisticated than the one required by the partial search project. For example, weight pages that have all query words higher, or that have the words appearing consecutively higher.
Result Comments (10 points): Allow users to add a comment to search results, and display these comments whenever the result is displayed.
Favorite Searches (10 points): Allow users to save favorite searches or favorite links.
Password Reset (10 points): Save the answers to different "security questions" when a user registers, and allow the user to answer those questions to reset the password.
StringTemplate (5 points): Use StringTemplate to generate your HTML instead of several println() statements. See http://www.cs.usfca.edu/~parrt/course/601/lectures/stringtemplate.html for more information on StringTemplate.
Advanced Search Options: In addition to providing weighted partial search capabilities, implement advanced search options that a user may set with the following features:
Excluded Words (5 points): Allow the user to enter words that should NOT appear in the search results.
Consecutive Words (5 points): Allow the user to enter several words in quotation marks "like so" that should appear consecutively in the search results.
Partial Search Toggle (5 points): Allow the user to toggle on/off partial search.
Results Per Page (5 points): Allow the user to select the number of results that should be displayed on each page. As a result, some search results may span multiple pages.
Logged In Users (5 points): Display a footer on each page that shows which users are currently logged in to the search engine.
Search Brand (5 points): Design a search engine with a distinct brand. This includes creating a logo, and creating a distinct and consistent style for your web pages.
Time Stamps (5 points): Add timestamps to the user's search history, and store when the user last logged in successfully.
Private Search (5 points): Allow users to set an option that turns off the search history feature.
Search Statistics (5 points): Display the total number of results along with the time it took to calculate and fetch those results.
Change Theme (5 points): Allow users to change the visual theme used throughout your website.
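As an illustration of the Suggested Queries feature listed above, the following sketch ranks words by how many result pages they have. It assumes the inverted index can be viewed as a map from each word to the set of URLs containing it; your own index from the previous projects may be shaped differently. The class and method names are hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper that picks suggested queries from an inverted index. */
class SuggestedQueries {
    /**
     * Returns up to {@code limit} words ordered by how many pages contain them
     * (most pages first), breaking ties alphabetically so the order is stable.
     */
    static List<String> suggest(Map<String, Set<String>> index, int limit) {
        Comparator<Map.Entry<String, Set<String>>> byPageCount =
                Comparator.comparingInt(entry -> entry.getValue().size());
        Comparator<Map.Entry<String, Set<String>>> order =
                byPageCount.reversed().thenComparing(Map.Entry.comparingByKey());

        return index.entrySet().stream()
                .sorted(order)
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Calling suggest(index, 5) would give the five suggestions the feature asks for. A real implementation would probably also want to skip very common words such as "the", which this sketch does not attempt.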
You may implement more than 40 points of these additional features for extra credit. See the next section for details.
You may implement additional features to receive extra credit on this project. You may implement as many additional features as you like, and may even suggest some features to the instructor.
You can earn an overall project grade of up to 110%. The extra credit can help offset low midterm and final exam grades.
Your code must run on the lab computers. If you are developing your code on a home computer or laptop, be sure to check out your code on a lab computer and test it. Your main method must be placed in a class named Driver. This should be the only file that is specific to this project; all other classes should be generalized.
Your code will be executed using the following commands:
svn export https://www.cs.usfca.edu/svn/<username>/cs212/project5
cd project5
java -cp project5.jar Driver <arguments>
where <arguments> will be the following command-line arguments (in any order):
-u <seed> where -u indicates the next argument is a URL, and <seed> is the seed URL that must be initially processed for the inverted index
-p <port> where -p indicates you should start a web server, and <port> is the port the web server should accept connections on
-t <threads> where -t indicates the next argument <threads> is the number of threads to use in the work queue/thread pool
Your code should still support the functionality and command-line arguments from the previous projects. If the proper command-line arguments are not provided, your program should output a user-friendly error message to the console and exit gracefully.
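Because the arguments may arrive in any order, one possible approach is to collect the flag/value pairs into a map before acting on any of them. The sketch below is a minimal version of that idea; the class name ArgumentMap and its methods are hypothetical, and it assumes that anything starting with a dash is a flag.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parser that stores flag/value pairs regardless of order. */
class ArgumentMap {
    private final Map<String, String> flags = new HashMap<>();

    ArgumentMap(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (isFlag(args[i])) {
                // The value is the next argument, unless that is also a flag.
                boolean hasValue = i + 1 < args.length && !isFlag(args[i + 1]);
                flags.put(args[i], hasValue ? args[++i] : null);
            }
        }
    }

    /** Assumption: a flag is a dash followed by at least one character. */
    static boolean isFlag(String arg) {
        return arg.length() > 1 && arg.startsWith("-");
    }

    boolean hasFlag(String flag) {
        return flags.containsKey(flag);
    }

    /** Returns the value for the flag, or null if missing or given no value. */
    String getString(String flag) {
        return flags.get(flag);
    }

    /** Returns the value as an int, or the fallback if missing or not a number. */
    int getInt(String flag, int fallback) {
        try {
            return Integer.parseInt(flags.get(flag));
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

A Driver could then check, for example, whether getString("-u") is null and print a user-friendly error message before exiting.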
Your search engine will output dynamic web pages based on the query entered via an HTML form. There are no specific output requirements beyond that.
You may use the test results from the previous programs to test your code. There are no specific tests for this project.
This project will be graded interactively. Interactive grading for this project will take place during finals week. You will demonstrate your functioning search engine from your own laptop or on a lab computer.
You must submit your project to your SVN repository at:
https://www.cs.usfca.edu/svn/<username>/cs212/project5
where <username> should be replaced with your CS username. You should include the following files in this directory:
a jar file named project5.jar with the *.class files to run your program
a src directory with all of the *.java files necessary to compile your program
a readme.txt file with your name, email address, student id, and brief description/justification of your approach
You must also provide the following files in the root of your project directory:
a database.properties file that allows the user to change the database configuration
a create.sql file that contains the CREATE statements necessary to set up the tables required by your search engine
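The assignment does not specify the contents of database.properties, so the following is only a sketch of how such a file might be read. The key names (hostname, database, username, password) and the DatabaseConfig class are assumptions; use whatever keys your own file defines.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for settings read from a properties file. */
class DatabaseConfig {
    final String url;
    final String username;
    final String password;

    private DatabaseConfig(Properties props) {
        // Key names are assumptions, not mandated by the assignment.
        String host = require(props, "hostname");
        String database = require(props, "database");
        this.username = require(props, "username");
        this.password = require(props, "password");
        this.url = "jdbc:mysql://" + host + "/" + database;
    }

    /** Reads key=value pairs from the reader and validates required keys. */
    static DatabaseConfig load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("Unable to read configuration", e);
        }
        return new DatabaseConfig(props);
    }

    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return value.trim();
    }
}
```

With the file opened through a reader (for example, a FileReader on database.properties), the resulting url, username, and password could be passed to DriverManager.getConnection, provided a MySQL JDBC driver is on the classpath.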
You should double-check that you are able to use your submitted code to run your search engine from the lab computers!