I. Introduction

Quality assurance (QA) principles for web archives

Definition of QA

Quality Assurance (hereafter “QA”) is the process of verifying and/or making appropriate interventions to improve the accuracy and integrity of an archived web instance.

Capture, behavior, and appearance

While many different schemas are used to describe characteristics of the web known to affect quality in an archival environment, all map in general terms to a model in which quality is evaluated and assured in terms of 1) capture, 2) behavior, and 3) appearance, wherein:

  • Capture describes the extent of content accessed and logged by a web crawler; the relative completeness of that content to the extent accessible from the live web

  • Behavior describes the accuracy of actions and responses facilitated by the archived web instance to its live analog; the degree to which the same functions (navigational, document retrieval, etc.) are supported in both environments 

  • Appearance describes the aesthetic similarity between the archived web instance and its live analog; the effectiveness of replicating the “look and feel” of a web instance from its live environment

QA priorities for NYARC collections

Visual culture and visual literacy

In its stewardship of a uniquely visual medium used for the dissemination of largely visual information, NYARC places especially high priority on behavior and appearance metrics in the assurance of web archival quality. Comprehension of web-based art historical resources requires both text and context; the textual information disseminated among exhibition sites, artists’ websites, and auction catalogs, to name a few, must be viewed in the style and sequence, and with the necessary illustrations, intended by their creators.

Definition of content

For the above reason, the following QA procedures and guidelines define archivable web “content” more broadly than the strictly textual elements that chiefly concern other large web archiving operations. Content must in NYARC’s context be understood to also include embedded and dynamic imagery, responsive web applications and scripts, and externally hosted downloadable documents.

Problematic content types

Each of the above and still more content types can present significant barriers to web archiving, and especially to achieving a high quality rendition of a live site within a new archival environment. For an inventory of the specific issues known to NYARC that may negatively affect quality, how to recognize them, and what if any strategies exist to ameliorate them, refer to the known quality problems and improvement strategies.

Social media

Social media encompasses a suite of problematic content types, and most especially client-side scripts, that have motivated much capture-related web crawling development and subsequent improvement. Since the development of the browser-based crawling technology Umbra, Archive-It partners can more reliably capture and render embedded and user interface elements from social media services like Facebook, Twitter, Instagram, and others, though this process is not without its own new problems and mitigation strategies.

NYARC’s current collecting scope does not include content from services of the above kind, with the exception of The Frick Collection’s WEBSTA account. Accordingly, these procedures and guidelines include no specific directions for archiving problematic content from popular social media platforms. Issues related specifically to popular blog (e.g., WordPress) and video (e.g., YouTube, Vimeo) hosting services, however, are addressed insofar as they have affected web presences within NYARC’s scope.

Significant properties

Ultimately, compromises in total archival quality must be struck, lest the QA backlog permanently prevent archived material from reaching its end-users. Lacking explicit prior knowledge of these future end-users’ needs, determining “how good is good enough” is a fraught, subjective proposition. It is therefore incumbent upon web archiving staff to articulate and mutually agree upon significant properties--content, functionality, and presentation elements that define the purpose and/or value--of web instances within NYARC’s scope. Popular content types and presentation styles that are known to impede the QA process and are also insignificant to NYARC’s broadest collecting scope are documented among known quality problems and improvement strategies. The degree to which other problematic content types are necessary to the behavior and appearance of a web instance must be determined on at least a collection-specific and preferably a seed-specific basis.

Archive-It capabilities and constraints

The Archive-It software suite is designed to maximize capture completeness, which has significant positive downstream effects on a web instance’s behavior and appearance in an archival environment. NYARC, in turn, maintains a large measure of control over issues of quality encountered at the scoping and crawling phases. Issues and respective mitigation strategies specific to rendering and managing web archival content known to NYARC are documented herein, but most typically require the intervention of a contracted Archive-It engineer and/or developer.

Capture tool capabilities and limitations

Archive-It’s default tools for discovering web content, writing that content to WARC format, and logging/reporting the activity are the Heritrix web crawler and its browser-based crawling complement, Umbra. Together, these tools crawl the web as would a search engine, and provide a modicum of the client-side generated information necessary to activate the responsive scripts on websites that enable further retrieval.

Capture completeness is, then, a reflection of the Heritrix crawler’s success at discovering, accessing, and navigating all filepaths necessary to constructing a high quality archival reflection of a given seed’s live web instance. Only those very rare and narrowest of crawls--the product of very deliberate scoping and/or limitations of live content extent--will completely capture a live web instance on a first pass. For the vast majority of cases, a QA technician must initiate patch crawls in order to complete the archival record copy.
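As a rough illustration of the idea above, capture completeness can be thought of as the share of a live web instance's URLs that also appear in the archived capture. The sketch below is hypothetical: the function name, the example URLs, and the notion of a ready-made list of live URLs are assumptions for illustration, not Archive-It reports or features.

```python
def capture_completeness(live_urls, captured_urls):
    """Return the fraction of live URLs that also appear in the capture.

    Both arguments are plain iterables of URL strings (an assumption made
    for this sketch; real crawl reports are structured differently).
    """
    live = set(live_urls)
    if not live:
        # Nothing to capture, so nothing is missing.
        return 1.0
    return len(live & set(captured_urls)) / len(live)


# Hypothetical example data: one image was not captured.
live = [
    "https://example.org/",
    "https://example.org/a.jpg",
    "https://example.org/b.css",
    "https://example.org/c.js",
]
captured = [
    "https://example.org/",
    "https://example.org/b.css",
    "https://example.org/c.js",
]

print(f"{capture_completeness(live, captured):.0%}")
```

A ratio below 1.0 on a first pass is the expected case described above, and is the signal that a patch crawl may be needed.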

Umbra transcends some traditional obstacles to web content accessibility by providing the information required of a human browser to scripts that otherwise stand between the Heritrix crawler and a filepath or URL. Umbra, it must however be noted, was developed specifically to enhance crawl and capture completeness among several common social media services. It is not a universal solution to access and capture problems instigated by client-server information transfer.

Opportunities and directions for mitigating issues of access in the scoping and crawling stages are provided herein, including the most effective use of Umbra when it is known to be helpful.

Patch crawling

Patch crawling--the process of discovering and incorporating web content erroneously omitted from a seed’s initial crawl--is a distinguishing feature of NYARC's QA process. It is the most effective strategy to mitigate issues of archival capture completeness that have downstream effects upon the ways in which the archived web instances also behave and appear. To date, Archive-It is the only web archiving technology suite known to provide significant automation of this process. It is a process, however, that still requires human selection and frequently tedious management to overcome its own limitations and inefficiencies. To maximize its efficiency, specific directions for conducting and guidelines for managing this process are provided.
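The human selection step mentioned above can be pictured as sorting the URLs missing from a capture into those worth patching and those to leave out. The following is a minimal sketch under stated assumptions: the function name, the host-based scope rule, and the example URLs are all hypothetical, and this is not how Archive-It itself selects patch crawl candidates.

```python
from urllib.parse import urlsplit


def patch_candidates(missing_urls, in_scope_hosts):
    """Split missing URLs into (to_patch, skipped) by host.

    in_scope_hosts is a set of hostnames that staff have agreed are
    significant to the seed (a simplifying assumption for this sketch).
    """
    to_patch, skipped = [], []
    for url in missing_urls:
        host = urlsplit(url).hostname or ""
        if host in in_scope_hosts:
            to_patch.append(url)
        else:
            skipped.append(url)
    return to_patch, skipped


# Hypothetical example: an embedded image is in scope, an ad script is not.
missing = [
    "https://example.org/images/installation-view.jpg",
    "https://ads.example.net/tracker.js",
]
to_patch, skipped = patch_candidates(missing, {"example.org"})
print(to_patch)
print(skipped)
```

In practice the selection rule would follow the significant properties agreed for the collection or seed, rather than a simple host list.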

Deduplication and continuous improvement

Deduplication is the process of automatically omitting live web content unchanged between archival crawl periods from successive captures, replacing it in these successive archival iterations with the pre-existing archived content. Another service offered at present solely by Archive-It, deduplication enables great efficiency in managing data budgets and furthermore frees the QA technician from reviewing and/or enhancing aspects of archived web instances that have previously been assured for quality. Unless and until a web instance in NYARC’s scope introduces altogether new kinds of features or redesigns its entire site, comprehensive QA performed at the time of its initial capture can preclude the need for significant devotion to QA in the future.
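This document does not specify how unchanged content is detected. One common approach, assumed here purely for illustration, is to compare a digest of each fetched payload with the digest recorded for the same URL in an earlier crawl. The function names and the in-memory dictionary below are hypothetical and do not describe Archive-It's implementation.

```python
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a SHA-1 hex digest of the fetched content (assumed scheme)."""
    return hashlib.sha1(payload).hexdigest()


def classify_capture(url, payload, previous_digests):
    """Return "duplicate" if the payload matches the last stored digest
    for this URL; otherwise record the new digest and return "new".

    previous_digests is a dict mapping URL -> digest from earlier crawls
    (a stand-in for whatever index a real system would keep).
    """
    digest = payload_digest(payload)
    if previous_digests.get(url) == digest:
        # Unchanged since the earlier crawl: reuse the archived copy.
        return "duplicate"
    previous_digests[url] = digest
    return "new"


# Hypothetical example: the same page is crawled three times.
seen = {}
url = "https://example.org/exhibitions.html"
print(classify_capture(url, b"<html>v1</html>", seen))  # first capture
print(classify_capture(url, b"<html>v1</html>", seen))  # unchanged
print(classify_capture(url, b"<html>v2</html>", seen))  # content changed
```

Only captures classified as "new" would count against a data budget or need fresh QA review in this model, which mirrors the efficiency described above.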