Here are the technical details of how we digitized the Epstein Files.
Optical character recognition (OCR) is the process by which computers attempt to recognize the text in a scanned document image. In their raw form, scanned images are not searchable, and they are much larger files than the equivalent plain text.
We wrote computer code to perform OCR with the help of cloud-based APIs.
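A minimal sketch of what such a batch job might look like is shown here. It is an illustration under stated assumptions, not the code that was actually run: the function name digitize_folder, the ".jpg" extension, and the folder layout are all hypothetical, and the recognize argument stands in for whichever cloud OCR API is used (it is expected to take image bytes and return text).

```python
from pathlib import Path
from typing import Callable


def digitize_folder(
    image_dir: Path,
    text_dir: Path,
    recognize: Callable[[bytes], str],
) -> int:
    """Run OCR on every page image in image_dir and save one .txt file per page.

    `recognize` is a placeholder for a call to a cloud OCR service.
    Returns the number of pages processed during this run.
    """
    text_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    # Assumption: page images are stored as .jpg files in a single folder.
    for image_path in sorted(image_dir.glob("*.jpg")):
        out_path = text_dir / (image_path.stem + ".txt")
        if out_path.exists():
            # Skip pages that already have output, so an interrupted run can resume.
            continue
        text = recognize(image_path.read_bytes())
        out_path.write_text(text, encoding="utf-8")
        processed += 1
    return processed
```

Skipping pages that already have output matters for a job that runs for many hours, because a network error partway through does not force a restart from the first page.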
Digitization alone is not enough. We needed a way to search for keywords across multiple files, so we imported the 35,000+ text segments into a single database.
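To illustrate the idea of a single searchable database, here is a small sketch using SQLite from the Python standard library. The choice of SQLite, the table name segments, its columns, and the sample rows are all assumptions made up for this example; they are not the actual database or data.

```python
import sqlite3


def build_index(segments):
    """Load (source_file, page, text) tuples into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segments (source_file TEXT, page INTEGER, text TEXT)"
    )
    conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", segments)
    conn.commit()
    return conn


def search(conn, keyword):
    """Return (source_file, page) for every segment whose text contains keyword."""
    # Escape LIKE wildcards so the keyword is matched literally.
    escaped = (
        keyword.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT source_file, page FROM segments "
        "WHERE text LIKE ? ESCAPE '\\' "
        "ORDER BY source_file, page",
        (f"%{escaped}%",),
    )
    return rows.fetchall()


# Invented placeholder rows, used only to show the search in action.
sample_segments = [
    ("file_a.txt", 1, "Placeholder text about a flight schedule."),
    ("file_a.txt", 2, "Placeholder text about an invoice."),
    ("file_b.txt", 1, "Another placeholder mentioning a FLIGHT manifest."),
]
conn = build_index(sample_segments)
results = search(conn, "flight")
print(results)
```

SQLite's LIKE operator ignores case for ASCII letters by default, so a search for "flight" also finds "FLIGHT". Each hit points back to a file and page, which is what lets a reader go from a keyword to the original scan.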
On average, it takes 8 seconds to extract the text from a single page image. Processing the 35,000 images consumed approximately 80 hours of computing time ($$$).
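The estimate can be checked with simple arithmetic. The short sketch below uses only the two figures quoted above; the variable and function names are just for illustration.

```python
SECONDS_PER_PAGE = 8   # average OCR time per page image, from the text
PAGE_COUNT = 35_000    # number of page images, from the text


def total_hours(pages: int, seconds_per_page: float) -> float:
    """Convert a per-page processing time into total hours for the whole batch."""
    return pages * seconds_per_page / 3600


hours = total_hours(PAGE_COUNT, SECONDS_PER_PAGE)
print(f"{hours:.1f} hours")  # 77.8 hours
```

That works out to about 78 hours if the pages are processed one at a time, which is consistent with the figure of approximately 80 hours.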
No OCR is 100% accurate. Poor scan quality, ripped or faded pages, and redactions can all cause ambiguity in the recognized text. However, compared with other historical documents, the scan quality of these images is quite good, so we estimate that the accuracy exceeds 99%. We encourage investigators to use the search feature as a starting point to locate items of interest, but to always examine the original scans before drawing any conclusions.
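For readers who want to spot-check a page themselves, one rough approach is to type out a short passage by hand from the original scan and compare it with the OCR text. The sketch below uses Python's standard difflib module. It is only an illustration of how such a check could be done, not the method behind the estimate above, and the similarity score it produces is a rough measure rather than a formal character error rate.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, reference_text: str) -> float:
    """Return a score between 0 and 1; 1.0 means the two strings are identical."""
    # autojunk=False keeps difflib from ignoring very common characters in long texts.
    return SequenceMatcher(None, ocr_text, reference_text, autojunk=False).ratio()


# Invented example: the OCR output misreads the letter "o" as the digit "0".
score = similarity("hello w0rld", "hello world")
print(f"similarity: {score:.3f}")
```

A score well below 1.0 on a hand-checked passage is a sign that the page deserves a closer look at the original scan.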