01. Identification of Crawl Target

The appraisal and selection of content for the Bentley Historical Library Web Archives are guided by the Library’s Web Archives Collection Development Policy (http://hdl.handle.net/2027.42/94163), which is reviewed regularly.

The Bentley Historical Library will contact individuals (including U-M faculty and student groups), organizations, and voluntary associations to inform them of any relevant web archiving activity and their right to opt out of the crawls or have content suppressed from public view.

The Bentley Historical Library uses Archive-It’s implementation of the Heritrix web crawler (also known as a spider or robot) to copy and preserve websites. A web crawler is an application that starts at a specified URL and then methodically follows hyperlinks to copy HTML pages and associated files (images, audio files, style sheets, etc.) as well as the website’s underlying structure.
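The crawling behavior described above can be illustrated with a minimal sketch. It is not Heritrix or Archive-It code: it assumes a small, hypothetical in-memory site (the example.org pages below) in place of real HTTP requests, and a simple breadth-first traversal that records each address once.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Hypothetical in-memory "website": URL -> HTML body. A real crawler would
# fetch these over HTTP; this stand-in keeps the sketch self-contained.
PAGES = {
    "http://example.org/": '<a href="/about.html">About</a> <img src="/logo.png">',
    "http://example.org/about.html": '<a href="/">Home</a> <a href="staff.html">Staff</a>',
    "http://example.org/staff.html": '<a href="/about.html#top">Back</a>',
    "http://example.org/logo.png": "",
}


class LinkExtractor(HTMLParser):
    """Collects href and src attribute values found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seed, pages):
    """Breadth-first traversal from the seed; returns URLs in capture order."""
    seen = {seed}
    queue = deque([seed])
    captured = []
    while queue:
        url = queue.popleft()
        body = pages.get(url)
        if body is None:
            continue  # nothing exists at this address, so nothing to copy
        captured.append(url)
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links against the current page and drop #fragments
            absolute, _fragment = urldefrag(urljoin(url, link))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


captured = crawl("http://example.org/", PAGES)
```

In this sketch the crawler reaches the staff page only because the about page links to it, which is why the choice of starting URL determines what is captured.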

To initiate a web capture, the archivist specifies one or more “seed” URLs that the web crawler will use to preserve the target site. Accurate and thorough website preservation requires the archivist to become familiar with a site’s content and architecture in order to define the exact nature of the target. This attention to detail is important because content may be hosted on multiple domains. For example, the University of Michigan’s Horace H. Rackham School of Graduate Studies hosts the majority of its content at http://www.rackham.umich.edu/ but maintains information on academic programs at https://secure.rackham.umich.edu/academic_information/programs/. To capture the Rackham School’s online presence completely, archivists needed to identify both domains as seed URLs.
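The role of seed URLs in defining a capture can be sketched as a simple scope check. The rule used here (same host as a seed, and a path at or below the seed’s path) is an assumption made for illustration and is not a description of Archive-It’s actual scoping rules. The two seeds are those named above; the URLs being tested are hypothetical.

```python
from urllib.parse import urlsplit

# The two seed URLs named in the text for the Rackham School.
SEEDS = [
    "http://www.rackham.umich.edu/",
    "https://secure.rackham.umich.edu/academic_information/programs/",
]


def in_scope(url, seeds):
    """Return True if url is on a seed's host and at or below the seed's path.

    Assumed rule for illustration only; the scheme (http/https) is ignored.
    """
    target = urlsplit(url)
    target_path = target.path or "/"
    for seed in seeds:
        seed_parts = urlsplit(seed)
        seed_path = seed_parts.path or "/"
        if target.hostname == seed_parts.hostname and target_path.startswith(seed_path):
            return True
    return False


# Hypothetical URLs, used only to exercise the check.
examples = {
    "http://www.rackham.umich.edu/some/page.html": in_scope(
        "http://www.rackham.umich.edu/some/page.html", SEEDS
    ),
    "https://secure.rackham.umich.edu/academic_information/programs/x": in_scope(
        "https://secure.rackham.umich.edu/academic_information/programs/x", SEEDS
    ),
    "https://secure.rackham.umich.edu/elsewhere/": in_scope(
        "https://secure.rackham.umich.edu/elsewhere/", SEEDS
    ),
}
```

With only the first seed, the academic programs pages on the secure host would fall outside the capture under this rule, which mirrors the reason both addresses had to be identified.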

At the same time, multiple domains associated with a single site may merit preservation as separate websites. For example, the University of Michigan’s Office of the Vice President of Research (http://research.umich.edu/) maintains a large body of information related to research administration (http://www.drda.umich.edu/) and human research compliance (http://www.ohrcr.umich.edu/). Although these latter sites could have been included as secondary seeds for the Vice President of Research’s site, their scope and informational value led archivists to preserve them separately.
