Here are the technical details of how AI was used to correct the optical character recognition (OCR) errors in the scanned Project Blue Book UFO / UAP files. You can also download our code repository below.
Optical character recognition is the process by which computers attempt to recognize the text in a scanned document image. This is difficult for documents with:
A mixture of fonts, especially a mix of handwritten and typed text
Physically degraded pages (e.g. faded, creased, or torn)
Poor scanner settings (e.g. skew, exposure)
When working with collections of thousands of documents, search capability is vital. Yet a single character error is enough to cause a keyword search to fail. For example, searching for “Annapolis” will not find “Ann polis” or “Annapo1js”. We estimate there are over 9 million character errors in the National Archives OCR text for Project Blue Book.
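As a trivial illustration of why this matters, an exact-match search misses a page as soon as one character is wrong (the OCR output below is a made-up example, not an actual page):

```python
# A single mis-recognized character is enough to defeat an exact keyword search.
page_text = "NAVAL ACADEMY, ANN POLIS, MARYLAND"  # hypothetical OCR output for a page
print("ANNAPOLIS" in page_text)                   # False: the page never shows up in results
```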
Older OCR software also struggles with any text that defies the standard "left to right, row by row" reading order assumed for English-language text. In multi-column newspaper articles, this makes it difficult to search for multi-word phrases that span multiple lines within a column. It also renders forms, a potentially high-quality source of structured information that appears throughout the Project Blue Book files, essentially useless, because the form field titles do not appear adjacent to the form data in the extracted text. You can see a few examples of these issues below.
Our approach looks only at the OCR text - not the document image - and makes corrections. It acts like a very smart spell checker. Traditional approaches might just use a fixed set of substitutions for common mistakes (e.g. $ --> S), but our AI approach uses the probability of letters appearing together in a group. For example, "ink" is much more likely than "jnk", even though the letters "i" and "j" may look similar. An alternative approach would be to re-scan the documents or re-OCR the images, but that is very time consuming and costly.
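The contrast between the two ideas can be sketched in a few lines of Python. This is only an illustration, not our production code: the substitution table and the tiny bigram "corpus" are made up for the example.

```python
# Sketch: blind character substitutions vs. judging how plausible letter pairs are.
from collections import Counter

# Traditional approach: fixed substitutions for common OCR confusions.
SUBSTITUTIONS = {"$": "S", "0": "O", "1": "l"}

def substitution_fix(text: str) -> str:
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in text)

# Probability-style approach: score candidate words by how common their
# letter pairs are in a reference corpus.
def bigram_counts(corpus: str) -> Counter:
    return Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

def plausibility(word: str, counts: Counter) -> int:
    return sum(counts[word[i:i + 2]] for i in range(len(word) - 1))

counts = bigram_counts("the quick brown fox drinks ink and thinks in links")
print(plausibility("ink", counts) > plausibility("jnk", counts))  # True: "ink" is more plausible
```

A blind substitution table will also "fix" characters that were correct to begin with (e.g. turning a legitimate "1" into "l"), which is exactly why letter-context probabilities help.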
We have Python scripts to:
download the extracted OCR text for all the documents in a particular collection from the National Archives
send each file's OCR text to OpenAI via their API, along with a custom system prompt that has been tuned for post-OCR correction
post-process the results (trim whitespace, estimate accuracy, etc.), as sketched below
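The sketch below shows the shape of the correction step only, assuming the official openai Python package; the file name and the system prompt are placeholders rather than the tuned prompt used in the project, and the download step is omitted.

```python
# Minimal sketch of the correction step (not the project's actual script).
# Assumes OPENAI_API_KEY is set in the environment and the page's OCR text
# has already been downloaded to a local file.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Illustrative stand-in for the project's tuned post-OCR system prompt.
SYSTEM_PROMPT = (
    "You are an expert at correcting OCR errors. Return only the corrected text, "
    "preserving the original wording and layout as closely as possible."
)

def correct_page(ocr_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ocr_text},
        ],
    )
    # Post-processing: trim surrounding whitespace from the model output.
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    raw = Path("page_0001.txt").read_text()  # hypothetical downloaded OCR text
    Path("page_0001.corrected.txt").write_text(correct_page(raw))
```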
Of the 112,000+ pages digitally available, 57,434 have extracted OCR text. Each page was downloaded from the National Archives and then, along with a custom prompt, processed using OpenAI's gpt-4o-mini model. This consumed 42,424,927 input tokens (23,662,808 for the prompts plus 18,762,119 for the original OCR text) and returned 17,289,505 output tokens, which make up our corrected version of Project Blue Book. All told, it required 57,000+ API calls and ~80 hours of compute time ($$$).
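For readers who want the per-page averages implied by those totals, the arithmetic works out as follows (numbers taken directly from the figures above):

```python
# Back-of-the-envelope check of the token accounting reported above.
prompt_tokens = 23_662_808
ocr_tokens = 18_762_119
output_tokens = 17_289_505
pages = 57_434

total_input = prompt_tokens + ocr_tokens
print(f"{total_input:,} input tokens")                    # 42,424,927
print(f"{total_input / pages:.0f} input tokens/page")     # ~739
print(f"{output_tokens / pages:.0f} output tokens/page")  # ~301
```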
All relevant code can be found at: https://github.com/johnesposito17/Project-BlueBook-AI-OCR-Correction/tree/main/OCR%20Evaluation
You will need your own API key(s) to run the scripts on our GitHub. You can request a National Archives API key by emailing catalog_api@nara.gov. To run one of the OCR correction scripts, simply create a developer API key for the LLM of your choice (e.g. Gemini, GPT, DeepSeek, or Llama).
The total number of characters in the original set is 62,042,527 (?) and the total number of AI edits is 9,824,970 (a 15.8% correction rate). However, it is impossible to know exactly how many of those corrections are valid and how many errors were missed.
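The 15.8% figure is simply edited characters divided by total characters:

```python
total_chars = 62_042_527
edited_chars = 9_824_970
print(f"{edited_chars / total_chars:.1%}")  # 15.8%
```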
Based on our small sample of files with human transcriptions, we estimate that the correction process substantially reduces the character error rate.