The life of a herbarium sheet during the imaging process at the ALA
In More Detail...
2012 update: The details below have evolved slightly to maximize efficiency, but the basics remain. This project is complete; about 185,000 specimens were scanned. Imaging continues as part of normal daily data entry. Along with a viewable JPG, OCR text is now automatically extracted from images. With four part-time students, a long-term average of about 1000 specimens per day can be maintained, with 2000 per day possible where folders contain many specimens. A more complicated imaging project (paleontology, which involves multiple images of each specimen) is well underway, having evolved from this project.
Images and specimen data are now stored on a 5-petabyte redundant disc array at TACC, and backed up to tape on two tectonic plates.
First, a barcode is applied to a folder containing one or more herbarium sheets. Then, through the imaging interface, the Folder Barcode is scanned into Arctos and an Identification is entered. (Herbarium folders are organized by taxonomy.) A barcode is applied to a herbarium sheet in the folder, then scanned into the application using a laser scanner, which acts as an input device. A unique identifier (generally the ALA Accession Number) is entered for the sheet. After all sheets have been scanned, a button on the form saves the data to a dedicated table within Arctos.
Data Checking and Loading
Each evening, a script runs to confirm taxonomy (from the Folder Identification), check for potential duplicates in Arctos, and verify that pre-determined data quality checks have been met. Those records that do not pass the tests are flagged, and an email is automatically sent to the imaging Google group. Arctos is then queried to find possible preexisting records (based on the identifier entered at initial entry). Those records' container locations are updated.
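The nightly sorting of records can be pictured with a minimal Python sketch. This is not the Arctos script itself: the `EntryRecord` fields, the specific checks, and the sample values are assumptions chosen for illustration, and the real script queries Arctos rather than in-memory sets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntryRecord:
    """One row from the data-entry table (field names are hypothetical)."""
    sheet_barcode: str
    folder_identification: str
    identifier: str
    problems: list[str] = field(default_factory=list)


def run_nightly_checks(records, known_taxa, existing_identifiers):
    """Sort records into (to_load, flagged, already_present).

    known_taxa and existing_identifiers stand in for lookups against Arctos.
    """
    to_load, flagged, already_present = [], [], []
    seen_barcodes = set()
    for rec in records:
        if rec.folder_identification not in known_taxa:
            rec.problems.append("unknown taxon")
        if not rec.identifier.strip():
            rec.problems.append("missing identifier")
        if rec.sheet_barcode in seen_barcodes:
            rec.problems.append("duplicate barcode in batch")
        seen_barcodes.add(rec.sheet_barcode)

        if rec.problems:
            flagged.append(rec)            # reported by email for review
        elif rec.identifier in existing_identifiers:
            already_present.append(rec)    # only the container location changes
        else:
            to_load.append(rec)            # new record, eligible for loading
    return to_load, flagged, already_present
```

Keeping the reasons for failure on each flagged record means the notification email can say what needs fixing, not just which records failed.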
The records that do pass all checks and are not found in Arctos are moved into the Arctos Bulkloader and subsequently processed into Arctos. Arctos uses a hierarchical container model to track physical object locations. The specimens themselves are collection objects, herbarium sheets contain specimens, and folders contain sheets. Folder containers are labeled according to the Folder Identification created at the initial data entry step.
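The container hierarchy described above (folders contain sheets, sheets contain specimens) can be illustrated with a small tree structure. This is a sketch only: the `Container` class, its field names, and the sample barcodes are hypothetical and do not reflect the actual Arctos schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Container:
    """A node in a physical-location tree (hypothetical model, not Arctos's)."""
    barcode: str
    kind: str                      # e.g. "folder", "herbarium sheet", "collection object"
    label: str = ""
    # parent is excluded from repr/compare to avoid infinite recursion
    parent: Optional[Container] = field(default=None, repr=False, compare=False)
    children: list[Container] = field(default_factory=list)

    def add(self, child: Container) -> Container:
        """Place child inside this container and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def location_path(self) -> list[str]:
        """Labels (or barcodes) from the outermost container down to this one."""
        node: Optional[Container] = self
        path = []
        while node is not None:
            path.append(node.label or node.barcode)
            node = node.parent
        return list(reversed(path))


# Usage: a folder labeled by its identification, holding one sheet and its specimen.
folder = Container("F100", "folder", label="Salix alaxensis")
sheet = folder.add(Container("S200", "herbarium sheet"))
specimen = sheet.add(Container("C300", "collection object", label="ALA 12345"))
```

With this shape, moving a sheet to a different folder only changes one parent link, and every specimen on the sheet follows it.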
Sheets that have a barcode affixed are ready for photography. Sheets may be photographed at any point after the initial entry step; it is not necessary for data to enter Arctos first.
A Canon EOS-1Ds Mark II, several external hard drives, and a laser barcode reader are attached to an iMac. The camera is mounted in a fixed position over a fenced bed, which also contains a scale and a color standard, and is triggered through the computer. Images are automatically transferred to the computer and saved to a folder upon which an ActionScript is enabled. When a file enters the folder, the user is prompted for a file name. The name is supplied by scanning the barcode previously attached to the herbarium sheet. Another ActionScript then moves the image to a temporary directory. A shell script runs every 30 minutes to move images from the temporary directory to two other physical hard drives.
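The periodic move to two drives might look something like the following Python sketch. The original is a shell script whose details are not given here, so the function name, the size check before deletion, and the directory arguments are all assumptions.

```python
import shutil
from pathlib import Path


def mirror_and_clear(temp_dir, destinations):
    """Copy each file in temp_dir to every destination, then delete the source.

    The source file is removed only after every copy exists with the same
    size, so an interrupted run leaves the original in place for the next run.
    Returns the names of the files that were moved.
    """
    moved = []
    for image in sorted(Path(temp_dir).iterdir()):
        if not image.is_file():
            continue
        copies = []
        for dest in destinations:
            dest_dir = Path(dest)
            dest_dir.mkdir(parents=True, exist_ok=True)
            target = dest_dir / image.name
            shutil.copy2(image, target)   # copy2 also preserves timestamps
            copies.append(target)
        source_size = image.stat().st_size
        if all(copy.stat().st_size == source_size for copy in copies):
            image.unlink()
            moved.append(image.name)
    return moved
```

A function like this could be run from a scheduler every 30 minutes; writing to two separate physical drives means a single drive failure does not lose the day's images.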
Images can be taken at a rate of approximately 1500 per day depending on folder sizes and available staff.
Images from the camera are in Canon's proprietary .cr2 format. Adobe DNG Converter is used via a shell script to transform the RAW images into DNG, a lossless open image standard.
A script runs constantly on the Imager to transfer converted DNG files to the Ranger supercomputer at the Texas Advanced Computing Center (TACC) via SCP. SSH/HPN is used to maximize throughput. Transferred images are logged, and duplicates are not transferred.
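One way to skip already-transferred files is to consult the transfer log before each copy. The sketch below is an assumption about how such a step could work, not the project's script; the function names, host, and paths are hypothetical, and the command is only built, not executed.

```python
def select_untransferred(local_names, transfer_log):
    """Return the file names that do not yet appear in the transfer log."""
    already_sent = set(transfer_log)
    return sorted(name for name in set(local_names) if name not in already_sent)


def build_scp_command(local_path, remote_host, remote_dir):
    """Build the argument list for copying one file with scp.

    The result is suitable for subprocess.run; host and directory are
    placeholders supplied by the caller.
    """
    return ["scp", local_path, f"{remote_host}:{remote_dir}/"]
```

For example, `select_untransferred(["b.dng", "a.dng"], ["a.dng"])` leaves only `b.dng` to send, and its name would be appended to the log once the copy succeeds.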
Once per day, yet another script compares local and transferred files by name and filesize. Any images that did not transfer correctly are removed for another attempt. Once local and remote files match by name, filesize, and count, the images are moved to a "confirmed" directory.
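The daily comparison can be sketched as a reconciliation of two manifests that map file name to size in bytes. The function name and the manifest representation are assumptions; the logic follows the name, size, and count criteria described above.

```python
def reconcile(local, remote):
    """Compare local and remote manifests of {file_name: size_in_bytes}.

    Returns (confirmed, retry, complete):
      confirmed: names present on both sides with identical sizes
      retry:     local names that are missing remotely or differ in size
      complete:  True only when nothing needs a retry and the counts agree
    """
    confirmed = sorted(name for name, size in local.items() if remote.get(name) == size)
    confirmed_set = set(confirmed)
    retry = sorted(name for name in local if name not in confirmed_set)
    complete = not retry and len(local) == len(remote)
    return confirmed, retry, complete
```

A truncated transfer shows up as a size mismatch, so `reconcile({"a.dng": 10, "b.dng": 20}, {"a.dng": 10, "b.dng": 19})` would confirm `a.dng` and queue `b.dng` for another attempt.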
Meanwhile, at TACC....
Images in "confirmed" directories are periodically purged to the iRODS system. iRODS generates checksums and transfers images to Ranch at TACC and to the San Diego Supercomputer Center. This is automated and handled by the TACC staff.
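Checksums are what let replicas at different sites be verified against each other. iRODS handles this internally; the sketch below only illustrates the idea with Python's standard `hashlib`, and the choice of SHA-256 and the function names are assumptions rather than a description of the iRODS configuration.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large images need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_match(paths):
    """True when every listed copy has the same checksum (False for an empty list)."""
    return len({file_checksum(path) for path in paths}) == 1
```

Unlike the name and size comparison used during transfer, a checksum also catches corruption that leaves the file length unchanged.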
High-resolution and thumbnail JPGs are created for every DNG on Ranger using ImageMagick. These will be transferred to a new computer at TACC beginning approximately 1 November 2008.
Linking Images to Specimens
Arctos queries TACC's iRODS directory structure nightly. Any new images are located and processed into Media. JPGs, when available, are also linked to TACC, and a Media Relationship is created between the high resolution JPG and the original DNG. Thumbnail JPGs are utilized as a preview for both the DNG and the JPG.
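Because each image is named by its sheet barcode, linking can be driven by file names alone. The sketch below assumes a hypothetical naming convention (`<barcode>.dng`, `<barcode>.jpg`, and `tn_<barcode>.jpg` for thumbnails) and hypothetical relationship labels; neither is taken from Arctos.

```python
from pathlib import PurePosixPath


def build_media_links(paths):
    """Group image paths by barcode and emit (subject, relationship, object) links."""
    groups = {}
    for p in paths:
        path = PurePosixPath(p)
        ext, stem = path.suffix.lower(), path.stem
        if ext == ".dng":
            role, barcode = "dng", stem
        elif ext == ".jpg" and stem.startswith("tn_"):
            role, barcode = "thumb", stem[len("tn_"):]
        elif ext == ".jpg":
            role, barcode = "jpg", stem
        else:
            continue  # ignore files that are not part of the convention
        groups.setdefault(barcode, {})[role] = p

    links = []
    for barcode in sorted(groups):
        files = groups[barcode]
        if "dng" not in files:
            continue  # wait until the original has arrived
        links.append((files["dng"], "shows specimen", barcode))
        if "jpg" in files:
            links.append((files["jpg"], "derived from", files["dng"]))
        if "thumb" in files:
            for target in ("dng", "jpg"):
                if target in files:
                    links.append((files["thumb"], "preview for", files[target]))
    return links
```

A DNG that arrives before its JPGs still gets linked to its specimen, and the derived files can be attached on a later nightly run.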