Epstein Files Digitization Process

Here are the technical details of how we digitized the Epstein Files.

What is OCR?

How did you digitize them?

What does searchable mean?

How much computing power did this take?

How accurate is it?

What is OCR?

Optical character recognition (OCR) is the process by which computers attempt to recognize the text in a scanned document image. In their raw form, scanned images are not searchable, and they are much larger files than the same content stored as plain text.

Raw scans are just images: large files with no embedded text data.

OCR detects characters and allows us to extract a text file.

How did you digitize them?

We wrote computer code to perform optical character recognition (OCR) with the help of cloud-based APIs.
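
As a rough illustration of what such code can look like, here is a minimal Python sketch. It is not our actual code: the function name `digitize_folder`, the `.png` file extension, and the `ocr_page` callable are all assumptions made for the example. `ocr_page` stands in for whatever cloud OCR service is used; it takes the raw bytes of one page image and returns the recognized text.

```python
from pathlib import Path
from typing import Callable


def digitize_folder(image_dir: str, text_dir: str,
                    ocr_page: Callable[[bytes], str]) -> int:
    """Run OCR on every .png in image_dir and write one .txt file per page.

    ocr_page is a hypothetical wrapper around a cloud OCR API: it receives
    the image bytes and returns the recognized text.
    Returns the number of pages processed.
    """
    out = Path(text_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    # Sort so pages are processed in a stable, repeatable order.
    for image_path in sorted(Path(image_dir).glob("*.png")):
        text = ocr_page(image_path.read_bytes())
        # Keep the page name so each text file maps back to its scan.
        (out / f"{image_path.stem}.txt").write_text(text, encoding="utf-8")
        count += 1
    return count
```

Passing the OCR step in as a function keeps the file handling separate from the choice of service, so the same loop works with any provider.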

What does searchable mean?

Digitization alone is not enough. We needed a way to search for keywords across multiple files, so we imported the 35,000+ text segments into a single database.

Searching for names or places across thousands of files requires a database.
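
To show the idea, here is a minimal sketch that uses SQLite from the Python standard library. The choice of SQLite, the table name `segments`, the column names, and the sample rows are all assumptions for illustration, not a description of the database behind this site.

```python
import sqlite3


def build_index(segments):
    """Load (source_file, page_text) pairs into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE segments (source_file TEXT, page_text TEXT)")
    conn.executemany("INSERT INTO segments VALUES (?, ?)", segments)
    conn.commit()
    return conn


def search(conn, keyword):
    """Return the files whose text contains the keyword.

    SQLite's LIKE is case-insensitive for ASCII letters by default.
    Note that % and _ inside the keyword act as wildcards.
    """
    rows = conn.execute(
        "SELECT source_file FROM segments "
        "WHERE page_text LIKE ? ORDER BY source_file",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


# Placeholder rows, not real document text.
conn = build_index([
    ("page_0001.txt", "Meeting in New York on Tuesday"),
    ("page_0002.txt", "Nothing relevant on this page"),
    ("page_0003.txt", "Travel to new york confirmed"),
])
print(search(conn, "new york"))
```

One query covers every page at once, which is the point of loading the text into a database instead of opening thousands of files one by one.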

How much computing power did this take?

On average, it takes 8 seconds to extract the text from a single page image. Processing the 35,000 images consumed approximately 80 hours of computing time ($$$).
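
The arithmetic behind that estimate can be checked in a few lines of Python. The function name `compute_hours` is just for this example, and the calculation assumes pages are processed one after another.

```python
def compute_hours(pages: int, seconds_per_page: float) -> float:
    """Total processing time in hours, assuming pages run one at a time."""
    return pages * seconds_per_page / 3600


# 35,000 pages * 8 s = 280,000 s
hours = compute_hours(35_000, 8)
print(f"{hours:.1f} hours")  # prints "77.8 hours", which rounds to roughly 80
```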

How accurate is it?

No OCR is 100% accurate. Poor scan quality, ripped or faded pages, and redactions can all cause ambiguity in the recognized text. However, compared to other historical documents, the scan quality of these images is quite good, so we estimate that the accuracy exceeds 99%. We encourage investigators to use the search feature as a starting point to locate items of interest, but always to examine the original scans before drawing any conclusions.
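
For readers who want to check accuracy on a page themselves, one common approach is to type out a page by hand and compare it to the OCR output character by character. The sketch below does this with the Levenshtein edit distance. It illustrates the general technique only; it is not necessarily how the estimate above was produced, and the function names are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Share of reference characters recovered correctly, between 0 and 1."""
    if not reference:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(reference, ocr_text)
    return max(0.0, 1 - errors / len(reference))
```

At 99% character accuracy, a page of 2,000 characters would still contain about 20 wrong characters, which is why checking the original scan matters.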
