a. Evaluating a Crawl Report

Once a test crawl has been completed (either because all hosts were crawled or because the time limit elapsed), the test crawl's report should be evaluated to determine whether the seed's scope is appropriate or will need to be modified. Seed scopes may need to be modified if the crawl report reveals that a large amount of unnecessary content would be captured, or that content vital to faithfully capturing the seed would be missed due to crawler time-outs, robots.txt exclusions, and so on.

To view a crawl report, select Crawls from the Archive-It navigation menu and visit Current Reports for a list of recent crawls. Note that this list contains saved crawls as well as test and deleted test crawls. In the Crawl Reports tab, click the six-digit crawl ID next to the relevant crawl; the hyperlinked title of the crawl leads to the collection home page, not to the crawl report itself.

Crawls that have completed within the last 24 to 48 hours may not have associated crawl reports or content indexed in the Wayback Machine. If no option to view crawl reports exists, it will be necessary to wait until the reports have been generated.

The first tab of the crawl report, Crawl Overview, provides a high-level overview of the crawl, including the total number of pages and amount of data that would be captured if the seed scope(s) were left as-is.

While this data is helpful for getting a quick, high-level understanding of a crawl, most of the relevant information for determining whether crawl scopes need to be modified is found in the other, more detailed tabs of the crawl report.

The Hosts tab includes information about the specific hosts captured during the crawl: how much data was captured from each host, how many URLs from each host were captured within the crawl's time limit, how many were queued (still waiting to be captured when the crawl reached its time limit), and how many URLs were excluded from the crawl because they were blocked by a host's robots.txt file or fell outside the crawl scope.

Clicking on the number of URLs under each column will, depending on how many there are, either present a list of URLs in the browser or prompt you to download a .txt file containing the list for evaluation in an external tool. While it would not be feasible to check every single URL when evaluating the scope of the crawl, a quick browse through the captured and queued URLs can often reveal issues in the crawl scope, such as the existence of crawler traps.
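
For lists too long to browse comfortably, a short script can summarize where captured or queued URLs are concentrated. The sketch below assumes the list has been downloaded to a file named queued-urls.txt (a hypothetical filename), one URL per line, and tallies URLs by host and top-level directory:

```python
from collections import Counter
from urllib.parse import urlparse

counts = Counter()
with open("queued-urls.txt", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        top_dir = segments[0] if segments else "(root)"
        counts[f"{parsed.netloc}/{top_dir}"] += 1

# Print the 20 most common host/top-level-directory combinations.
for prefix, n in counts.most_common(20):
    print(f"{n:8d}  {prefix}")
```

A prefix that accounts for a disproportionate share of the queued URLs is a good candidate for closer inspection or a scoping rule.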

The repeating directories shown in the image above are evidence of a crawler trap, in which the crawler gets caught requesting endlessly nested directory paths that do not actually exist.
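
Such traps often surface as paths in which the same directory segment repeats over and over. A minimal sketch for flagging them in a downloaded URL list, again assuming a hypothetical queued-urls.txt file and an arbitrary repetition threshold:

```python
from collections import Counter
from urllib.parse import urlparse

REPEAT_THRESHOLD = 3  # flag a URL if any single path segment appears this many times

with open("queued-urls.txt", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        url = line.strip()
        if not url:
            continue
        segments = [s for s in urlparse(url).path.split("/") if s]
        most_common = Counter(segments).most_common(1)
        if most_common and most_common[0][1] >= REPEAT_THRESHOLD:
            print(url)
```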

In addition to helping identify crawler traps and other undesirable content being captured, the lists of queued, robots.txt-blocked, and out-of-scope URLs can be informative for determining whether a host contained content that ought to have been captured but was not, whether because of the way the seed URL was formed, the scope of the crawl, the host's robots.txt provisions, and so on.
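
When a URL appears in the robots.txt-blocked list, the host's rules can be checked directly with Python's standard urllib.robotparser. In the sketch below, the example URLs and the crawler token are placeholders, not confirmed Archive-It settings:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.org/robots.txt")  # placeholder host
rp.read()

# "archive.org_bot" is an assumed crawler token; adjust it to match the
# user agent the host's robots.txt rules actually target.
print(rp.can_fetch("archive.org_bot", "https://www.example.org/reports/2021.pdf"))
```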

If the test crawl reveals anything unsatisfactory about the current crawl scope, the archivist will need to modify the crawl scope for one or more hosts.

The following tab, "Seed Status," provides a listing of the seeds that were included in the crawl and indicates whether those seeds were crawled successfully, were redirected, or were not crawled due to robots.txt exclusions or HTTP errors.

If the seed status is Redirected or has an HTTP Error, diagnose the issue and modify the seed URL accordingly. If the seed status is Blocked by Robots.txt, consult with the Archivist for Metadata and Digital Projects about next steps.
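
To diagnose a redirected seed, the redirect chain can be followed outside of Archive-It and the final destination used as the corrected seed URL. A minimal sketch using the third-party requests library, with a placeholder seed URL:

```python
import requests

# Placeholder seed URL; substitute the seed reported as Redirected.
response = requests.get("http://example.org/collection", allow_redirects=True, timeout=30)

# Print each hop in the redirect chain, then the final status and destination.
for hop in response.history:
    print(hop.status_code, hop.url)
print(response.status_code, response.url)
```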