05. Conducting Quality Assurance On Completed Crawls

Reviewing Crawl Reports

Once a crawl has completed, Archive-It has generated crawl reports, and the crawl has been indexed for playback in the Wayback Machine, it is possible to conduct quality assurance on the archived content. Quality assurance ensures that seeds are captured in Archive-It and rendered successfully in the Wayback Machine. A number of different factors can impact capture and display, including robots.txt exclusions, poorly defined scoping rules, crawler traps, and pre-existing site anomalies (e.g., broken links, missing content). Quality assurance identifies issues with crawled content and provides opportunities for mitigating these issues through patch crawls and other methods. It is particularly useful to conduct quality assurance on test crawl results to avoid wasting data on ineffective crawls, but quality assurance should also be performed on regularly scheduled recurring crawls, as seeds that were previously crawled effectively may present new issues.

To begin performing quality assurance on a completed crawl, navigate to that crawl's crawl report. The crawl overview, seed, host, and file type reports provide important information about the finished crawl. For extensive instructions on reading and interpreting these reports, visit the Archive-It Help Center. When examining the seed overview portion of a crawl report, it is important to note how much new data was captured and how many new documents were crawled for each site. Abnormally high or low data/document numbers may reflect potential problems meriting further correction. The seed overview section will also identify any seeds that had crawling issues, such as the site no longer existing at the given seed URL.
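
As a rough illustration of what "abnormally high or low" might mean in practice, the sketch below compares each seed's new-document count against the median for the crawl. It assumes the seed report figures have been copied into a list of dictionaries; the keys "seed" and "new_docs", the function name, and the ratio thresholds are all hypothetical choices made for this example, not Archive-It field names or recommended values.

```python
from statistics import median


def flag_unusual_seeds(rows, low_ratio=0.1, high_ratio=10.0):
    """Return (seed, reason) pairs for seeds far from the median new-document count.

    `rows` is a list of dicts with the hypothetical keys "seed" and "new_docs".
    The ratios are arbitrary starting points, not Archive-It guidance.
    """
    counts = [row["new_docs"] for row in rows]
    if not counts:
        return []
    mid = median(counts)
    flagged = []
    for row in rows:
        docs = row["new_docs"]
        if docs == 0:
            flagged.append((row["seed"], "no new documents"))
        elif mid > 0 and docs < mid * low_ratio:
            flagged.append((row["seed"], "unusually low"))
        elif mid > 0 and docs > mid * high_ratio:
            flagged.append((row["seed"], "unusually high"))
    return flagged


# Made-up figures for illustration only.
sample_rows = [
    {"seed": "https://example.org/", "new_docs": 1200},
    {"seed": "https://example.com/news/", "new_docs": 950},
    {"seed": "https://example.net/", "new_docs": 0},
    {"seed": "https://example.edu/calendar/", "new_docs": 48000},
    {"seed": "https://example.org/blog/", "new_docs": 1100},
]

for seed, reason in flag_unusual_seeds(sample_rows):
    print(f"{seed}: {reason}")
```

A flagged seed is only a prompt for closer review in the crawl report and in Wayback; a large site may legitimately produce far more documents than its neighbors.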

From the seed report, it is also important to view each seed individually by clicking the Wayback link on the right side of the page. From the Wayback calendar page, choose the most recent crawl date, which should fall within the time period of the crawl that you are examining. When reviewing a URL in Wayback, briefly browse the page for obvious instances of content that was not captured or that renders incorrectly. Test interactive features, database-driven content, videos, and other kinds of content that may present issues for the crawler. Do not feel obligated to click on every link, to thoroughly browse the site, or to note the presence of small cosmetic issues. The primary purpose of viewing a seed URL in Wayback during quality assurance is to ensure, at a very high level, that the site was captured and renders successfully.

After reviewing the seed overview, check the host report section of the crawl report. Within the host report, check whether any hosts were blocked by robots.txt exclusions and whether any hosts have a large number of URLs in the queued column. Abnormally large numbers of URLs in either column may indicate that an insufficient amount of required content was captured or that the web crawler may have encountered a crawler trap, greatly expanding the amount of data captured by the crawl and drawing the crawler's resources away from capturing essential content. The Archive-It Help Center provides specific information about how to identify, avoid, and mitigate the effects of crawler traps.
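
The host report check described above could be sketched as follows. The sketch assumes the host report rows have been transcribed into dictionaries; the keys "host", "blocked", and "queued", the function name, and the numeric limits are hypothetical and would need to be adjusted to the size of the crawl being reviewed.

```python
def hosts_needing_review(rows, blocked_limit=100, queued_limit=1000):
    """Return (host, reasons) pairs for hosts whose counts exceed the limits.

    `rows` is a list of dicts with the hypothetical keys "host", "blocked",
    and "queued". The default limits are illustrative, not recommended values.
    """
    findings = []
    for row in rows:
        reasons = []
        if row["blocked"] > blocked_limit:
            reasons.append("many URLs blocked by robots.txt")
        if row["queued"] > queued_limit:
            reasons.append("many URLs left queued (possible crawler trap)")
        if reasons:
            findings.append((row["host"], reasons))
    return findings


# Made-up figures for illustration only.
sample_hosts = [
    {"host": "example.org", "blocked": 3, "queued": 12},
    {"host": "calendar.example.org", "blocked": 0, "queued": 250000},
    {"host": "cdn.example.com", "blocked": 640, "queued": 0},
]

for host, reasons in hosts_needing_review(sample_hosts):
    print(host, "->", "; ".join(reasons))
```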

Enable QA Tool

In addition to reviewing crawl reports and manually reviewing crawled seeds, it is also possible to enhance quality assurance through a built-in function called Enable QA. Enable QA generates a list of URLs that were not captured during a crawl. These often include font files, JavaScript, and stylesheets from external/out of scope domains. Enable QA can help to identify meaningful content that is missing from a crawl.

While Enable QA can be useful, it can also slow down the process of conducting quality assurance and can generate a long list of URLs that may not be worth capturing (e.g., Google Analytics files). Enable QA should only be performed in cases where one of the following conditions applies:

If it seems appropriate to Enable QA for a given seed, visit the capture for that seed in Wayback and click Enable QA on the banner at the top of the page. Once Enable QA is activated, Archive-It will begin assembling missing URLs. Click View Missing Documents at the top of the page to be redirected to the list of missing URLs in Archive-It. Once there, it is possible to review the list of missing URLs and to select URLs to capture in a patch crawl, adding the missing content to Archive-It.
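
When the list of missing URLs is long, it may help to set aside hosts that are rarely worth patching before reviewing the rest. The sketch below assumes the missing URLs have been copied out of Archive-It into a plain list of strings; the function name and the contents of the ignore list are assumptions for this example, with Google Analytics hosts included only because they are mentioned above.

```python
from urllib.parse import urlsplit

# Hypothetical ignore list; extend or shrink it to suit local practice.
IGNORED_HOSTS = {"www.google-analytics.com", "google-analytics.com"}


def split_missing_urls(urls, ignored_hosts=IGNORED_HOSTS):
    """Split URLs into (candidates, skipped) based on their host name."""
    candidates, skipped = [], []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in ignored_hosts:
            skipped.append(url)
        else:
            candidates.append(url)
    return candidates, skipped


# Made-up URLs for illustration only.
missing = [
    "https://www.google-analytics.com/analytics.js",
    "https://fonts.example.com/serif.woff2",
    "https://example.org/css/site.css",
]

candidates, skipped = split_missing_urls(missing)
print("Review for patch crawl:", candidates)
print("Set aside:", skipped)
```

The candidates still need human review in Archive-It; the filter only shortens the list.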

Possible Quality Assurance Actions

Quality assurance may result in a number of different actions being taken to improve future captures for a given seed. Several of the possible actions, along with scenarios for which they might be most appropriate, are detailed below.

Proxy Mode

Proxy mode is a quality assurance tool in Archive-It that enables users to view replays “offline” to ensure that content was captured completely. This tool can be useful when performing quality assurance because replayed captures will sometimes pull data from live sites, displaying content that may not have been truly harvested. It is not necessary to use proxy mode for every completed crawl, however, as it is time-consuming. Proxy mode is best used while performing quality assurance on completed crawls that contain specific content of particular importance (e.g., distinct PDFs, streaming audio/video, and so on) rather than captures intended to take a general ‘snapshot’ of a website. Communication should be maintained between the nominating archivist and the web archivist conducting the crawl so that the correct content is captured.

To use proxy mode, first install the Proxy Mode Toggle add-on in the Firefox browser. It is possible to manually set up proxy mode in Firefox settings, but this method is not recommended as it is more time-consuming and difficult to use. Specific instructions for installing proxy mode both manually and with the extension can be found in the Archive-It Help Center documentation at the following URL: https://support.archive-it.org/hc/en-us/articles/208002206-Access-to-your-archives-in-Proxy-Mode

Once the Toggle add-on is installed, there should be a triangular Archive-It logo button in the address bar. Clicking this icon toggles proxy mode on and off and can only be used while viewing capture replays in the Wayback Machine. Proxy mode is easily discernible by the green address bar when the add-on is toggled on.

When the icon is toggled “on”, it is possible to begin viewing captures with proxy mode. First, navigate to a replay in the Wayback Machine in Archive-It. Examine the capture as it is without proxy mode, noting any functioning dynamic content (e.g., video or audio files). Next, while in the Wayback Machine, toggle the proxy mode icon on. The address bar will turn green and the data that was actually captured will be displayed. Compare the two views and note any differences. The replay viewed in proxy mode is the true capture of data from the crawl. If the replay is not satisfactory, use the appropriate quality assurance methods described above to attempt a more complete capture.

Documenting Quality Assurance Actions

In order to maintain an administrative history of the actions taken for a given seed, document significant quality assurance actions in a Note associated with the relevant seed. To do so, navigate to the Metadata tab for a seed, click edit, and add a custom field with a Field Name of Note and a Value of the form "[your initials] [date]: [action]" (e.g., "DJP 2017-06-30: seed deactivated. Seed redirects to new URL, [URL]. A new seed has been created for [URL]." or "DJP 2017-06-30: adding scoping rules to block repeating directory crawler traps.")
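
To keep Note values consistent across staff, the "[your initials] [date]: [action]" pattern could be generated rather than typed by hand. The helper below is a minimal sketch; the function name is hypothetical, and the result still has to be pasted into the Value field manually.

```python
from datetime import date


def format_qa_note(initials, action, when=None):
    """Build a Note value of the form "[initials] [YYYY-MM-DD]: [action]"."""
    when = when or date.today()
    return f"{initials.strip()} {when.isoformat()}: {action.strip()}"


note = format_qa_note(
    "DJP",
    "adding scoping rules to block repeating directory crawler traps.",
    date(2017, 6, 30),
)
print(note)
```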