Multithreaded Webcrawler

This page is dedicated to the use of concurrency in more advanced applications, such as this multithreaded web crawler. It is aimed at intermediate to advanced Java developers, and its approach differs from the rest of the articles here: design decisions and the implementation of proper concurrency are discussed, rather than a step-by-step build with explanations.

The original motivation for this article was the lack of complete real-world applications in most concurrency books. Java Concurrency in Practice is an excellent source on the matter, but most of its examples demonstrate a single concurrency topic, and the reader is left to imagine how they come together into a working application.

Build path tip

If the attached Eclipse project gives errors when you try to run it, add jgraph.jar and jgrapht-jdk.1.6.jar to the project's build path. These jars are used for graphing the links and can be found in the /lib directory. The ConcurrentCrawler project is attached at the end of the article.


Why a Web Crawler?

The World Wide Web contains a virtually unlimited number of pages, so to have a shot at indexing even a small part of it, a web crawler (also called a web spider) needs not only to find and retrieve links as quickly as possible, but also to search in multiple directions simultaneously. The task is complicated by the fact that many links are cyclical in nature - that is, a link deep inside the tree eventually brings the crawler back to a level higher up the tree and is thus not useful for discovering and indexing new pages.

The solution is to have the crawler process the links in a breadth-first fashion (that is, process all sublinks of a given link before moving on to the deeper level). This yields the widest spread of new links and allows for their faster discovery and indexing. Since parsing a page for followable links is the same process for every page involved, the breadth-first algorithm can be implemented concurrently - by having a fixed number of worker threads process several links at once.

Design Considerations

The design of a multithreaded crawler is relatively simple - the only two objects you really need to be concerned with are the representation of a Link and the Crawler itself. A Link object encapsulates a URL and must be able to parse it to obtain the URL's sublinks. A Crawler object coordinates the parsing activity among multiple links, enforcing proper concurrency and ultimately indexing them (or, for now, graphically displaying their tree to the user).
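As a rough sketch, the two classes could look like the skeleton below. The field and method names are taken from the code excerpts later in this article; the constructors and anything not shown in those excerpts are assumptions, not the attached project's actual code.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /** One URL to be scraped; Runnable so the Crawler's executor can process it. */
    class Link implements Runnable {
        private final String url;
        private final Crawler theCrawler;            // sublinks are written back here
        private final List<Link> sublinks = new ArrayList<Link>();

        Link(String url, Crawler theCrawler) {
            this.url = url;
            this.theCrawler = theCrawler;
        }

        /** Download the page at this.url and fill this.sublinks. */
        void scrapeYourself() throws IOException { /* ... */ }

        /** Hand the parsed sublinks back to the crawler for the next level. */
        void addSublinksToCrawler() { /* ... */ }

        @Override
        public void run() { /* shown later in the article */ }
    }

    /** Coordinates the breadth-first crawl across a fixed pool of worker threads. */
    class Crawler {
        public final int levelsToCrawl;
        public final int threadNumber;
        public ExecutorService executor;

        public final List<Link> allCurrentLevelLinks =
                Collections.synchronizedList(new ArrayList<Link>());
        public final List<Link> allTempCurrentLevelLinks =
                Collections.synchronizedList(new ArrayList<Link>());
        public final List<Link> allTraversedLinks = new ArrayList<Link>();

        Crawler(int levelsToCrawl, int threadNumber) {
            this.levelsToCrawl = levelsToCrawl;
            this.threadNumber = threadNumber;
            this.executor = Executors.newFixedThreadPool(threadNumber);
        }

        /** Create a new executor to process the next level's links. */
        public void reinitializeExecutor() {
            this.executor = Executors.newFixedThreadPool(this.threadNumber);
        }
    }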

The following UML shows the heart of the crawler:

The concurrent breadth-first algorithm itself is short - for each of the n levels specified, the Crawler object iterates through its allCurrentLevelLinks list and submits all of the links to its executor, like so:

Submitting Link objects to the Executor

    for (int levels = 0; levels < theCrawler.levelsToCrawl; levels++) {

        for (Link currentLink : theCrawler.allCurrentLevelLinks) {
            theCrawler.executor.execute(currentLink);
        }

...

This invokes the run() method of the submitted Link object, allowing the link to parse its page.

Runnable Link

    /**
     * This method is run by the Crawler's executor, enabling concurrent
     * processing.
     */
    @Override
    public void run() {
        try {
            this.scrapeYourself();
            this.addSublinksToCrawler();
        } catch (IOException e) {
            // the page could not be fetched; skip this link
        }
    }
There are two considerations for concurrency here:
1. All of the current-level links need to complete their full tree of sublinks before the crawler can proceed to the next level.
2. When each link writes its sublinks back to the crawler to build the nodes of the subsequent level, multiple links must not overwrite each other's sublinks. If the Link objects do not write their sublinks back in a synchronized manner and two or more write at the same time, a race condition results in which only the last write to the list is kept.

Concurrency Policy - First Consideration

The workhorses of the Link class are the scrapeYourself() and addSublinksToCrawler() methods, which parse the URL this Link object points to and add all of the links found to the list of this Link's sublinks. The process does not take constant time, as pages differ in size, so in a breadth-first search the Link objects on the same level can take different amounts of time to complete. The concurrency policy, then, is to allow a fixed number of threads to work through the list of links on level n, proceeding to level n+1 only when all of the sublinks for that level have been formed.
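The excerpts in this article do not show scrapeYourself() itself. Purely as an illustration, a version that downloads the page and pulls href values out of the raw HTML with a regular expression could look like the sketch below; the url, sublinks and theCrawler fields, and the use of a regex instead of a real HTML parser, are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Inside the Link class (hypothetical sketch):
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public void scrapeYourself() throws IOException {
        // Download the raw HTML of the page this Link points to.
        StringBuilder page = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(this.url).openStream()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line);
            }
        } finally {
            in.close();
        }

        // Every absolute link found on the page becomes a sublink of this Link.
        Matcher m = HREF.matcher(page);
        while (m.find()) {
            this.sublinks.add(new Link(m.group(1), this.theCrawler));
        }
    }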

The Executor framework, coupled with a fixed thread pool, available in the standard java.util.concurrent package, is perfect for the task. The Executor interface was introduced in Java 1.5 as a way to decouple task submission from task execution, and since it would be unwieldy to manually create and allocate individual Thread objects to process the Links, an ExecutorService backed by a fixed thread pool simplifies submitting the large number of Links for scraping.

FixedThreadPool vs CachedThreadPool

Since the number of links increases exponentially with every subsequent level in the tree, it might seem that allocating a fixed number of threads to process the Links on a level is bad for scalability, as it takes the existing threads longer and longer to parse through each level. However, allocating additional threads to process more links in parallel is not the right approach. In a problem such as crawling the World Wide Web, where the number of links is effectively unlimited, if more and more threads are created it is only a (very short) matter of time until the machine runs out of memory for the threads' stacks.
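In code, the difference between the two pools is a single factory call. The threadNumber field is the one used by the attached project; everything else here is illustrative.

    // requires java.util.concurrent.ExecutorService and Executors

    // Bounded: at most threadNumber links are parsed at the same time;
    // the rest wait in the executor's work queue.
    ExecutorService fixed = Executors.newFixedThreadPool(threadNumber);

    // Unbounded: a new thread is created whenever no idle one is available,
    // which with an effectively infinite supply of links soon exhausts memory.
    ExecutorService cached = Executors.newCachedThreadPool();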

The answer to the first concurrency consideration is to let the Link objects report back on the status of their execution, and to proceed to the next level only once all of the current-level links report their status as complete. In Java, this pattern is implemented with the Callable and Future interfaces. Callable is implemented by classes whose instances are intended to be executed by a thread and need to return a result; essentially, it is a beefed-up version of Runnable. The Future interface represents the result of an asynchronous computation and has methods to check whether the computation is complete, to wait for completion, and to retrieve the result. Both fit into the Executor framework.

Having Link implement Callable instead of Runnable could provide functionality such as live crawl status and the ability to cancel the parsing of individual links, but that is not essential in this example. Instead we can call executor.shutdown() to let all of the tasks already in the queue complete - that is, crawl through the remaining links for the level - and, once that is done, reinitialize the executor to prepare it for the subsequent level of links.
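For reference, the Callable/Future route could look roughly like the sketch below. It assumes Link implemented Callable<List<Link>> and returned its sublinks from call(), and that the executor field is an ExecutorService; this is not the approach taken in the attached project.

    // Hypothetical alternative: Link implements Callable<List<Link>>.
    // Submit every link on the level, then block on each Future so the next
    // level only starts once the whole level has been parsed.
    // (requires java.util.concurrent.Future and ExecutionException)
    List<Future<List<Link>>> results = new ArrayList<Future<List<Link>>>();
    for (Link currentLink : theCrawler.allCurrentLevelLinks) {
        results.add(theCrawler.executor.submit(currentLink));
    }

    List<Link> nextLevelLinks = new ArrayList<Link>();
    for (Future<List<Link>> result : results) {
        try {
            nextLevelLinks.addAll(result.get());   // waits for that link to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            // this link failed to parse; skip it
        }
    }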

Controlling the breadth-first search with the Executor interface

            for (int levels = 0; levels < theCrawler.levelsToCrawl; levels++) {

                /* Process each link on the current level using the executor */
                for (Link currentLink : theCrawler.allCurrentLevelLinks) {
                    theCrawler.executor.execute(currentLink);
                }

                theCrawler.executor.shutdown();

                while (!theCrawler.executor.isTerminated()) {
                    // just wait until the threads finish before proceeding in
                    // the main method.
                }

                /*
                 * Processing has finished. Make note of which links were
                 * already traversed, then clear the list of
                 * allCurrentLevelLinks and add the next level's links to it.
                 * Start again for the next level.
                 */
                theCrawler.allTraversedLinks
                        .addAll(theCrawler.allCurrentLevelLinks);

                theCrawler.allCurrentLevelLinks.clear();
                theCrawler.allCurrentLevelLinks
                        .addAll(theCrawler.allTempCurrentLevelLinks);

                /*
                 * Once the executor has terminated, restart it. Callable and
                 * Future could be used instead of Runnable, checking that all
                 * tasks have completed before proceeding to the next level -
                 * a better solution, but only if additional functionality
                 * needs to be added.
                 */
                if (theCrawler.executor.isTerminated()) {
                    theCrawler.reinitializeExecutor();
                }

...

    /**
     * Create a new executor to process the next level's links.
     */
    public void reinitializeExecutor() {
        this.executor = Executors.newFixedThreadPool(this.threadNumber);
    }


For the simple implementation given here, the cost of reinitializing the executor with 4 fixed threads is smaller than the cost of managing and checking all of the Future objects.
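As a minor aside, the empty while loop that spins on isTerminated() could also be replaced by the executor's own blocking wait, awaitTermination(), without changing the rest of the design; a sketch:

    theCrawler.executor.shutdown();
    try {
        // Block until every Link submitted for this level has finished
        // (requires java.util.concurrent.TimeUnit).
        theCrawler.executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }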

Concurrency Policy - Second Consideration

Once a Link object scrapes its URL and saves the sublinks, the sublinks need to be prepared for further processing. The easiest way to do that is to create Link objects for them and submit those back to the Crawler. However, if multiple links write back their sublinks at about the same time without proper synchronization, some of them will overwrite the others. The good news is that Java provides out-of-the-box wrappers that return a synchronized collection, so this issue can be handled automatically.

Creating a thread-safe list by using a static wrapper

    public List<Link> allCurrentLevelLinks = Collections
            .synchronizedList(new ArrayList<Link>());
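With the list wrapped this way, the write-back on the Link side can be a single call. A minimal sketch, assuming the crawler gathers next-level links in the allTempCurrentLevelLinks field seen in the loop above:

    /**
     * Hand this link's freshly parsed sublinks to the crawler. addAll on a
     * synchronized list is atomic, so parallel Links cannot overwrite
     * each other's writes.
     */
    public void addSublinksToCrawler() {
        this.theCrawler.allTempCurrentLevelLinks.addAll(this.sublinks);
    }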

Output Methods

A full-fledged web crawler would index the URLs it scraped and save them in a database for later retrieval by a search engine (which could be a servlet searching the database index and passing the output to a JSP to display to the user). A simple index can be a hash table where the searched keywords are the keys and the links themselves are the values. For an introduction to indexing methods, take a look here.
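A minimal in-memory version of such an index could be a concurrent map from keyword to the links whose pages contain it. The indexKeyword() and search() methods below are hypothetical and not part of the attached project.

    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    // keyword -> links whose pages contained that keyword
    private final ConcurrentMap<String, List<Link>> index =
            new ConcurrentHashMap<String, List<Link>>();

    /** Record that the page behind the given link contains the keyword. */
    public void indexKeyword(String keyword, Link link) {
        List<Link> links = index.get(keyword);
        if (links == null) {
            index.putIfAbsent(keyword, new CopyOnWriteArrayList<Link>());
            links = index.get(keyword);
        }
        links.add(link);
    }

    /** Return the links indexed under the keyword, or null if none were found. */
    public List<Link> search(String keyword) {
        return index.get(keyword);
    }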

In the ConcurrentCrawler implementation as it stands in July 2011, indexing is not yet implemented. Instead, for now the graphing library JGraphT is used to graph the relationships between the links and present them to the user.



A graph of link relationships.

Next Steps

The project is still in its infancy; for it to become a full-fledged web crawler, indexing, a GUI, and a J2EE frontend for the search itself need to be implemented. All of the above are planned for subsequent versions and will be added in stages.






Attachment: ConcurrentCrawlerJuly14V1.zip (1877k), Evgheni C, Jul 14, 2011, 9:38 PM