Format characterization is the programmatic assessment of files (bitstreams) using applications that identify data signatures common to a given file specification. Format characterization software identifies known patterns in the binary data of a file that indicate what kind of file it is. This allows a file to be identified even if it has a missing or incorrect file extension, or an extension that is used for more than one kind of file. Characterization tools may also extract technical metadata embedded within the file. Many tools also validate a file against the published specification for its format and may calculate fixity information such as checksums. Many format characterization tools reference established registries of file format information, such as PRONOM, which provide information about creating applications, format history, and technical requirements for using the file.
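The signature matching and fixity calculation described above can be illustrated with a minimal Python sketch. It is not how any of the tools listed on this page is implemented: the signature table holds only a handful of well-known leading byte sequences, and the names SIGNATURES, identify, and characterize are hypothetical and chosen for illustration. Registries such as PRONOM describe signatures in much more detail.

```python
import hashlib
from pathlib import Path

# Illustrative signature table: (leading bytes, format label).
# A real registry covers far more formats and more complex patterns.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (bytes.fromhex("0000000C6A5020200D0A870A"), "JPEG 2000 (JP2)"),
]


def identify(header: bytes) -> str:
    """Return the label of the first signature that matches the header."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def characterize(path) -> dict:
    """Identify a file by its content and compute a SHA-256 checksum."""
    path = Path(path)
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        # The first bytes are enough for the signatures in this sketch.
        header = stream.read(16)
        digest.update(header)
        # Hash the rest of the file in chunks to keep memory use low.
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "extension": path.suffix.lower(),
        "format": identify(header),
        "size_bytes": path.stat().st_size,
        "sha256": digest.hexdigest(),
    }
```

Because identification relies on the file's content rather than its name, a PDF saved with a .txt extension would still be reported as PDF, with the mismatched extension recorded alongside it.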
Due to the number and complexity of different file format specifications, there are many different specialized characterization tools, although there is a great degree of overlap in what kinds of files they can assess and what kinds of information they produce. No single standardized output format has emerged. To mitigate these difficulties, format characterization tools are often embedded in larger applications that perform digital preservation actions or are contained within a wrapper application that runs multiple tools and standardizes their output. File Information Tool Set (FITS) is one example of a wrapper application.
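As a rough illustration of what a wrapper does when it standardizes output, the Python sketch below maps the differently named fields of two tools onto one common vocabulary and flags disagreement between them. The tool names (tool_a, tool_b), their field names, and the functions normalize and consolidate are all hypothetical; this is not the output schema or logic of FITS or of any tool listed on this page.

```python
# Hypothetical mapping from each tool's own field names to shared ones.
FIELD_MAP = {
    "tool_a": {"mime": "mimetype", "fmt": "format_name"},
    "tool_b": {"Content-Type": "mimetype", "FormatName": "format_name"},
}


def normalize(tool: str, raw: dict) -> dict:
    """Rename one tool's raw output fields to the common vocabulary."""
    record = {"tool": tool}
    for raw_key, common_key in FIELD_MAP[tool].items():
        if raw_key in raw:
            record[common_key] = raw[raw_key]
    return record


def consolidate(records: list) -> dict:
    """Combine normalized records and flag disagreement between tools."""
    names = {r["format_name"] for r in records if "format_name" in r}
    return {
        "format_names": sorted(names),
        "conflict": len(names) > 1,
        "reports": records,
    }


# Usage: two tools describe the same file with different field names.
reports = [
    normalize("tool_a", {"mime": "image/tiff", "fmt": "TIFF"}),
    normalize("tool_b", {"Content-Type": "image/tiff", "FormatName": "TIFF"}),
]
summary = consolidate(reports)
```

Keeping the individual reports next to the consolidated result lets a reviewer see which tool said what when the conflict flag is set.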
Format characterization is a fundamental aspect of digital preservation because it enables the planning of preservation actions that ensure a file is still usable, even if the applications that were originally used to open the file are no longer available.
These tools have all been used or evaluated at the UGA Libraries. Many of these tools parse multiple types of files. See linked documentation for a complete list of features.
Apache Tika [1] https://tika.apache.org/
Tika is a file identification and parsing platform written in Java. It can identify more than 1,000 file types and can extract metadata and OCR-created text from a subset of those.
DROID (Digital Record Object Identification) is a file identification tool developed by The National Archives of the United Kingdom. There is a GUI version available if you wish to run it as a stand-alone application. It is also bundled into many digital preservation tools, such as FITS.
ExifTool [1] https://exiftool.org/
ExifTool is capable of reading and editing metadata embedded in image files. It can be used to extract technical metadata recorded when an image is captured, such as resolution, time stamp, location, color profile, bit depth, and camera information.
JHOVE [1] https://github.com/openpreserve/jhove
Identifies and validates many types of files
Jpylyzer https://jpylyzer.openpreservation.org/
Identifies and validates JPEG 2000 (JP2) files
MediaInfo [1] https://mediaarea.net/en/MediaInfo
Extracts and displays technical metadata for audio and video files
National Library of New Zealand Metadata Extractor [1] https://github.com/DIA-NZ/Metadata-Extraction-Tool/tree/master/metadata-extractor
Extracts metadata from BMP, GIF, TIF, and JPEG image formats, many types of word processing documents, and WAV, MP3, BWF, and FLAC A/V files. It can also parse HTML and XML documents and extract metadata from ARC (web archive) files. Additional documentation can be found at https://meta-extractor.sourceforge.net/
VeraPDF (PDF/A validator) https://github.com/verapdf
This tool identifies and extracts metadata from PDF files and validates their adherence to the PDF/A specification. Additional documentation: https://verapdf.org/
[1] Included in File Information Tool Set (FITS)
“A Question of Character: How do we automatically recharacterize data at cloud scales?” by Jack O’Sullivan et al., iPRES 2023. https://drive.google.com/file/d/1TLENSLRzE3w7WxEO5sGLzHz91tz1TBoj/view
How format characterization changes over time and why recharacterization is necessary.
“Creating a Holdings Format Profile and Format Risk and Digital Preservation Prioritization Matrix at the National Archives and Records Administration” by Leslie Johnston, iPRES, 2018. https://osf.io/zd7vx
Using format characterization information to evaluate the risk of NARA’s holdings.
“File Formats – Characterization and Validation” by Lavërim Shala and Ahmet Shala, IFAC Conference, 2016. https://www.sciencedirect.com/science/article/pii/S2405896316324880
Overview of how format identification, characterization and validation fit into digital preservation, along with a summary of available tools.
“Generating File Format Identification and Checksums with DROID” (Module #ERCM001) by Brandon Hirsch, SAA Congressional Papers Section Electronic Records Committee, 2016. https://cprerc.files.wordpress.com/2016/07/ercm001_generating-file-format-identification-and-checksums-with-droid.pdf
Example format characterization workflow.
Community Owned digital Preservation Tool Registry (COPTR)
Library of Congress Sustainability of Digital Formats site
Library of Congress Recommended Formats Statement: Summary of Digital Format Preferences
National Archives Digital Preservation: format risk