This page is dedicated to the use of concurrency in more advanced applications, such as this multithreaded webcrawler. It is aimed at intermediate to advanced Java developers, and its approach differs from the rest of the series: instead of a step-by-step build with clarifications, it discusses the design decisions and implementation details needed for correct concurrency.
The original motivation for this article was the lack of complete, real-world applications in most concurrency books. Java Concurrency in Practice is an excellent source on the subject, but most of its examples demonstrate a single concurrency topic, leaving readers to rely on their own imagination to bring the pieces together into a working application.
The solution is to have the crawler process the links in a breadth-first fashion: process all sublinks of a given link before descending to the next level. This yields the widest spread of new links and allows for their faster discovery and indexing. Since parsing a web page for followable links is an identical process for every page involved, the breadth-first algorithm can be implemented concurrently, processing several links at once with a fixed number of worker threads.
The following UML shows the heart of the crawler:
The concurrent breadth-first algorithm itself is short: for each of the n levels specified, the Crawler object iterates through its allCurrentLevelLinks list and submits every link to its executor, as such:
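A minimal, self-contained sketch of this submission loop follows. The names allCurrentLevelLinks and LinkObject come from the article; the LinkObject body, the URLs, and the pool size of 4 are illustrative assumptions, not the article's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the article's LinkObject: a Runnable whose
// run() would parse the page behind the link.
class LinkObject implements Runnable {
    private final String url;
    LinkObject(String url) { this.url = url; }
    @Override
    public void run() {
        System.out.println("parsing " + url);
    }
}

public class SubmitLevel {
    public static void main(String[] args) throws InterruptedException {
        List<LinkObject> allCurrentLevelLinks = new ArrayList<>();
        allCurrentLevelLinks.add(new LinkObject("http://example.com/a"));
        allCurrentLevelLinks.add(new LinkObject("http://example.com/b"));

        ExecutorService executor = Executors.newFixedThreadPool(4);
        // Submit every link on the current level to the thread pool.
        for (LinkObject link : allCurrentLevelLinks) {
            executor.execute(link);
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```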
This invokes the run() method of each submitted LinkObject, allowing the link to parse its page.
There are two considerations for concurrency here:
1. All of the current level links need to complete their full tree of sublinks before the crawler can proceed to the next level.
2. When each link writes its sublinks back to the crawler to build the nodes of the next level, the links must not overwrite each other's sublinks. If the Link objects do not write their sublinks back in a synchronized manner and two or more writes happen at the same time, a race condition results in which only the last write to the list survives.
The workhorses of the Link class are the scrapeYourself() and addSublinksToCrawler() methods, which parse the URL this Link object points to and add all of the links found to this Link's list of sublinks. The process does not take constant time, since pages differ in size, so in a breadth-first search the Link objects on the same level can take different amounts of time to complete. The algorithm and concurrency policy, then, is to let a fixed number of threads work through the list of links on level n, proceeding to level n+1 only once all of the sublinks for that level have been formed.
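To illustrate the kind of work scrapeYourself() performs, here is a hypothetical re-creation of its core idea: extracting href targets from a page's HTML. The article's actual parsing code is not shown, so the method name extractSublinks, the regex, and the sample HTML are all assumptions for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScrapeSketch {
    // Sketch of link extraction: pull absolute href targets out of a
    // page's HTML with a simple regex (a real crawler would use an
    // HTML parser instead).
    static List<String> extractSublinks(String html) {
        List<String> sublinks = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(http[^\"]+)\"").matcher(html);
        while (m.find()) {
            sublinks.add(m.group(1));
        }
        return sublinks;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a>"
                    + "<a href=\"http://example.com/b\">B</a>";
        System.out.println(extractSublinks(html));
    }
}
```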
The ExecutorService interface, combined with a fixed thread pool from the standard java.util.concurrent package, is perfect for the task. The Executor framework was introduced in Java 5 to decouple task submission from task execution, and since manually creating and managing individual Thread objects to process the Links would be unwieldy, an ExecutorService backed by Executors.newFixedThreadPool() simplifies the submission of a large number of Links for scraping.
The answer to the first concurrency consideration is to let the Link objects report back on the status of their execution, and to proceed to the next level only once all of the current level's links report their status as complete. In Java, this pattern is implemented with the Callable and Future interfaces.
Callable is implemented by classes whose instances are intended to be executed by a thread and need to return a result; it is essentially a beefed-up version of Runnable. The Future interface represents the result of an asynchronous computation and has methods to check whether the computation is complete, to wait for its completion, and to retrieve its result. Both fit into the Executor framework.
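A short sketch of the Callable/Future pattern described above; the task here returns a hard-coded list of sublinks for illustration rather than parsing a real page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        // A Callable, unlike Runnable, can return a value: here, the
        // sublinks "found" on a page (hard-coded for illustration).
        Callable<List<String>> task = () ->
                Arrays.asList("http://example.com/x", "http://example.com/y");
        Future<List<String>> future = executor.submit(task);
        // get() blocks until the computation completes, satisfying the
        // "report back when done" requirement.
        List<String> sublinks = future.get();
        System.out.println(sublinks.size() + " sublinks found");
        executor.shutdown();
    }
}
```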
Implementing the Link objects as Callable instead of Runnable would enable functionality such as live crawl status and the ability to cancel the parsing of individual links, but it is not essential in this example. Instead, we can call executor.shutdown(), which stops accepting new tasks and lets all of the current tasks in the queue complete (that is, crawl through the remaining links for the level; awaitTermination() can then block until they finish), and once that is done, reinitialize the executor to prepare it for the next level of links.
For the simple implementation given here, the cost of reinitializing the executor with four fixed threads is smaller than the cost of managing and checking all of the Future objects.
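The shutdown-and-reinitialize pattern can be sketched as follows. The number of levels, pool size of 4, and the placeholder task are illustrative assumptions; the point is that awaitTermination() acts as the barrier between levels.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LevelBarrier {
    public static void main(String[] args) throws InterruptedException {
        int levels = 3;
        for (int level = 0; level < levels; level++) {
            // A fresh pool per level, reinitialized each iteration.
            ExecutorService executor = Executors.newFixedThreadPool(4);
            final int l = level;
            // Placeholder for submitting all Link tasks on this level.
            executor.execute(() -> System.out.println("crawling level " + l));
            // shutdown() stops accepting new tasks; awaitTermination()
            // blocks until the queued tasks for this level have finished.
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.MINUTES);
        }
        System.out.println("all levels done");
    }
}
```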
To address the second consideration: as the Link objects parse their pages, they create new Link objects and submit them back to the Crawler for processing. However, if multiple links write back their sublinks at about the same time without proper synchronization, some of them will certainly overwrite the others. The good news is that Java provides out-of-the-box wrappers that return a synchronized collection, so this issue can be handled automatically.
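The synchronized-wrapper approach can be sketched as below. The field name nextLevelLinks and the use of plain strings instead of Link objects are assumptions for the sake of a self-contained example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SyncListDemo {
    public static void main(String[] args) throws InterruptedException {
        // Collections.synchronizedList wraps a plain ArrayList so that
        // concurrent add() calls from many worker threads cannot be lost.
        List<String> nextLevelLinks =
                Collections.synchronizedList(new ArrayList<>());
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 100; i++) {
            final int n = i;
            executor.execute(() -> nextLevelLinks.add("link-" + n));
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);
        // With the synchronized wrapper, all 100 writes survive.
        System.out.println(nextLevelLinks.size());
    }
}
```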
The crawled results can later be exposed through a search front end (for example, a servlet searching the database index and passing the output to a JSP to display to the user). A simple index can be a hash table where the searched keywords are the keys and the links themselves are the values. For an introduction to indexing methods, take a look here.
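The keyword-to-links hash table just described might look like the following minimal sketch; the keywords and URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleIndex {
    public static void main(String[] args) {
        // A minimal keyword -> links index: the searched keyword is the
        // key, and the list of matching links is the value.
        Map<String, List<String>> index = new HashMap<>();
        index.computeIfAbsent("java", k -> new ArrayList<>())
             .add("http://example.com/java-intro");
        index.computeIfAbsent("java", k -> new ArrayList<>())
             .add("http://example.com/concurrency");
        System.out.println(index.get("java").size() + " links for 'java'");
    }
}
```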
In the ConcurrentCrawler implementation as of July 2011, indexing is not yet implemented. Instead, a graph library, JGraphT, is currently used to chart the relationships between the links and present them to the user.