Here are the technical details of how AI was used to correct the optical character recognition (OCR) errors in the scanned Project Blue Book UFO / UAP files. You can also download our code repository below.
Optical character recognition is the process by which computers attempt to recognize the text in a scanned document image. This is difficult for documents with:
A mixture of fonts, especially a mix of handwritten and typed text
Physically degraded pages (e.g. faded, creased, or torn)
Poor scanner settings (e.g. skew, exposure)
When working with collections of thousands of documents, search capability is vital. Yet a single character error is enough to cause a keyword search to fail. For example, searching for “Annapolis” will not find “Ann polis” or “Annapo1js”. We estimate there are over 9 million character errors in the National Archives OCR text for Project Blue Book.
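As a trivial illustration of why this matters, an exact-match search misses a page as soon as one character is wrong (the OCR output below is a made-up example, not an actual page):

```python
# A single mis-recognized character is enough to defeat an exact keyword search.
page_text = "NAVAL ACADEMY, ANN POLIS, MARYLAND"  # hypothetical OCR output for a page
print("ANNAPOLIS" in page_text)                   # False: the page never shows up in results
```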
Older OCR software also struggles with any text that defies the standard "left to right, row by row" reading order assumed for English-language text. In multi-column newspaper articles, this makes it difficult to search for multi-word phrases that span multiple lines within a column. It also renders forms, a potentially high-quality source of structured information that appears throughout the Project Blue Book files, essentially useless, because the form field titles do not appear adjacent to the form data in the extracted text. You can see a few examples of these issues below.
Our approach looks only at the OCR text - not the document image - and makes corrections. It acts like a very smart spell checker. Traditional approaches might just use a fixed set of substitutions for common mistakes (e.g. $ --> S), but our AI approach uses the probability of letters appearing together in a group. For example, "ink" is much more likely than "jnk", even though the letters "i" and "j" may look similar. An alternative approach would be to re-scan the documents or re-OCR the images, but that is very time consuming and costly.
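The contrast between the two ideas can be sketched in a few lines of Python. This is only an illustration, not our production code: the substitution table and the tiny bigram "corpus" are made up for the example.

```python
# Sketch: blind character substitutions vs. judging how plausible letter pairs are.
from collections import Counter

# Traditional approach: fixed substitutions for common OCR confusions.
SUBSTITUTIONS = {"$": "S", "0": "O", "1": "l"}

def substitution_fix(text: str) -> str:
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)

# Probability-style approach: score candidate words by how common their
# letter pairs are in a reference corpus.
def bigram_counts(corpus: str) -> Counter:
    return Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

def plausibility(word: str, counts: Counter) -> int:
    return sum(counts[word[i:i + 2]] for i in range(len(word) - 1))

counts = bigram_counts("the quick brown fox drinks ink and thinks in links")
print(plausibility("ink", counts) > plausibility("jnk", counts))  # True: "ink" is more plausible
```

A blind substitution table will also "fix" characters that were correct to begin with (e.g. turning a legitimate "1" into "l"), which is exactly why letter-context probabilities help.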
We have Python scripts to:
download the extracted OCR text for all the documents in a particular collection from the National Archives
send each file's OCR text to OpenAI via their API, along with a custom system prompt that has been tuned for post-OCR correction
post-process the results (trim whitespace, estimate accuracy, etc.), as sketched below
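The sketch below shows the shape of the correction step only, assuming the official openai Python package; the file name and the system prompt are placeholders rather than the tuned prompt used in the project, and the download step is omitted.

```python
# Minimal sketch of the correction step (not the project's actual script).
# Assumes OPENAI_API_KEY is set in the environment and the page's OCR text
# has already been downloaded to a local file.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Illustrative stand-in for the project's tuned post-OCR system prompt.
SYSTEM_PROMPT = (
    "You are an expert at correcting OCR errors. Return only the corrected text, "
    "preserving the original wording and layout as closely as possible."
)

def correct_page(ocr_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ocr_text},
        ],
    )
    # Post-processing: trim surrounding whitespace from the model output.
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    raw = Path("page_0001.txt").read_text()  # hypothetical downloaded OCR text
    Path("page_0001.corrected.txt").write_text(correct_page(raw))
```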
Of the 112,000+ pages digitally available, 57,434 have extracted OCR text. Each page was downloaded from the National Archives and then, along with a custom prompt, processed using OpenAI's gpt-4o-mini model. This consumed 42,424,927 input tokens (23,662,808 for the prompts plus 18,762,119 for the original OCR text) and returned 17,289,505 output tokens, which make up our corrected version of Project Blue Book. All told, it required 57,000+ API calls and ~80 hours of compute time ($$$).
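For readers who want the per-page averages implied by those totals, the arithmetic works out as follows (numbers taken directly from the figures above):

```python
# Back-of-the-envelope check of the token accounting reported above.
prompt_tokens = 23_662_808
ocr_tokens = 18_762_119
output_tokens = 17_289_505
pages = 57_434

total_input = prompt_tokens + ocr_tokens
print(f"{total_input:,} input tokens")                    # 42,424,927
print(f"{total_input / pages:.0f} input tokens/page")     # ~739
print(f"{output_tokens / pages:.0f} output tokens/page")  # ~301
```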
All relevant code can be found at: https://github.com/johnesposito17/Project-BlueBook-AI-OCR-Correction/tree/main/OCR%20Evaluation
You will need your own API key(s) to run the scripts on our GitHub. You can request a National Archives API key by emailing catalog_api@nara.gov. To run one of the OCR correction scripts, simply create a developer API key for the LLM of your choice (e.g. Gemini, GPT, DeepSeek, or Llama).
The total number of characters in the original set is 62,042,527 (?) and the total number of AI edits is 9,824,970 (a 15.8% correction rate). However, it is impossible to know exactly how many of those corrections are valid and how many errors were missed.
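The 15.8% figure is simply edited characters divided by total characters:

```python
total_chars = 62_042_527
edited_chars = 9_824_970
print(f"{edited_chars / total_chars:.1%}")  # 15.8%
```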
Based on our small sample of files with human transcriptions, we estimate that the correction process substantially reduces the character error rate.