02. Configuration of Crawler Settings

[Please note: All Archive-It screenshots and instructions are intended for use with Archive-It 5.0. If you are logged into Archive-It and the interface looks very different from the screenshots included here, look for a "Switch to Archive-It 5.0" button near the bottom of the Archive-It partner home page.]

Once the target of the crawl has been identified and defined, the archivist adds the seed URL(s) to the relevant Archive-It collection.

The Bentley Historical Library currently has 9 collections in Archive-It:

Once the relevant collection for the target of a crawl has been identified, select the collection from the Archive-It Home page.

Once you have reached the Overview page for the appropriate collection, select the Seeds tab. Seed URLs can then be added by clicking the Add Seeds button, located at the top right of the seed list.

Clicking the Add Seeds button opens a pop-up box with fields for configuring the seed URL, visibility, frequency, and seed type.

Adding the seed URL(s): Multiple seeds can be added to the text box, one URL per line, although this is only appropriate when the seed URLs belong to the same collection and share the same visibility, crawl frequency, and seed type. The format of the seed URL determines the scope of the crawl (how much of the site will be captured). The archivist may elect to capture the entire host (e.g., http://bentley.umich.edu/), a specific directory (e.g., http://bentley.umich.edu/exhibits/), or a single page (e.g., a letter written by Abbie Hoffman to John Sinclair, featured at http://bentley.umich.edu/exhibits/sinclair/ahletter.php). To thoroughly capture target websites, the Bentley Historical Library generally captures the entire host unless the target is a specific page or a single directory located on a more extensive host. To capture an entire host, the seed URL should end with a forward slash (/); it may be necessary to append one manually to URLs that lack it. Archive-It displays red warning icons when it detects HTTP status errors (e.g., 404, 500) and yellow warning icons for poorly formed seeds (e.g., mistyped web addresses). The Archive-It help documentation includes further information about crawl scope and seed URLs.
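The seed formatting conventions above can be checked before a list of URLs is pasted into the Add Seeds box. The following Python sketch is only an illustration and is not part of Archive-It: the function names normalize_seed and seed_scope are hypothetical, and the rule that a final path segment containing a dot is a single page (rather than a directory) is a simplifying assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url: str) -> str:
    """Append a trailing slash to host and directory seeds that lack one.

    Assumption: a final path segment containing a dot (e.g. ahletter.php)
    is treated as a single page and left unchanged.
    """
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def seed_scope(url: str) -> str:
    """Classify a seed as 'host', 'directory', or 'page' from its path."""
    path = urlsplit(normalize_seed(url)).path
    if path == "/":
        return "host"
    if path.endswith("/"):
        return "directory"
    return "page"


if __name__ == "__main__":
    for seed in (
        "http://bentley.umich.edu",
        "http://bentley.umich.edu/exhibits",
        "http://bentley.umich.edu/exhibits/sinclair/ahletter.php",
    ):
        print(normalize_seed(seed), "->", seed_scope(seed))
```

Because the dot heuristic can misclassify directories whose names contain a dot, the output should be reviewed by the archivist rather than trusted blindly.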

Visibility: The visibility setting determines whether a given seed will be Public or Private. In almost all cases, the visibility should be set to Public.

Frequency: The frequency setting establishes how often a seed URL will be crawled. Most seeds are crawled annually, though seed URLs for especially high-priority or frequently changing sites may be set to a semiannual, quarterly, or monthly crawl frequency.

Seed Type: The seed type can be set to Standard, Standard Plus External Links (Standard+), One Page, or One Page Plus External Links (One Page+). Most seeds should be set to the Standard seed type. A Standard crawl captures the seed URL, its associated styling documents, and its embedded content (such as images and videos), and does the same for all other pages on the same domain or in the same directory as the seed URL (depending on how the seed URL was formed per the instructions above). A One Page crawl captures only the given page and the styling documents and embedded content associated with that page, not other pages located on the same host or in the same directory. The Standard+ and One Page+ options function like their respective base seed types, but also crawl pages one hop off from the seed (that is, pages that are linked rather than embedded). Standard+ and One Page+ should be used only for sites that function as aggregators of content located elsewhere, or for sites that link to especially important content, because these seed types can quickly consume document and data budgets.
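The differences between the four seed types can be summarized as a small decision function. The Python sketch below is a simplified model of the description above, not Archive-It's actual scoping logic: the function names and the linked_from_captured_page flag are hypothetical, embedded content and styling documents are ignored, and "same domain" is approximated by comparing hostnames.

```python
from urllib.parse import urlsplit

SEED_TYPES = ("Standard", "Standard+", "One Page", "One Page+")


def in_seed_scope(seed: str, url: str) -> bool:
    """True when url is on the seed's host and at or under the seed's path."""
    s, u = urlsplit(seed), urlsplit(url)
    same_host = (s.hostname or "") == (u.hostname or "")
    return same_host and (u.path or "/").startswith(s.path or "/")


def would_capture(seed_type: str, seed: str, url: str,
                  linked_from_captured_page: bool = False) -> bool:
    """Rough model of whether a page falls within a seed's crawl.

    linked_from_captured_page is a hypothetical flag meaning the page is
    one hop (a link) away from a page that is already being captured.
    """
    if seed_type not in SEED_TYPES:
        raise ValueError(f"unknown seed type: {seed_type}")
    plus = seed_type.endswith("+")
    if seed_type.rstrip("+") == "One Page":
        in_scope = url == seed  # only the seed page itself
    else:
        in_scope = in_seed_scope(seed, url)  # whole host or directory
    return in_scope or (plus and linked_from_captured_page)


if __name__ == "__main__":
    seed = "http://bentley.umich.edu/exhibits/"
    print(would_capture("Standard", seed, "http://bentley.umich.edu/exhibits/sinclair/ahletter.php"))
    print(would_capture("Standard", seed, "http://example.org/", linked_from_captured_page=True))
    print(would_capture("Standard+", seed, "http://example.org/", linked_from_captured_page=True))
```

The model makes the budget concern visible: with a Plus seed type, every linked page one hop away is captured regardless of where it is hosted.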

As the foregoing discussion shows, the accurate and effective configuration of crawl settings must be based on the archivist’s appraisal of content and understanding of the target site’s structure. Failure to consider these factors may lead to a capture that is either narrowly circumscribed and incomplete or unnecessarily broad and filled with superfluous information.

After the seed URL(s) have been added and the capture frequency set, the archivist should navigate to the individual pages to add metadata to contextualize the content.