MLK Assasination Records Restoration Process

Here are the technical details of how AI was used to correct the optical character recognition (OCR) errors in the scanned Martin Luther King Jr. files.

What causes OCR errors?

What is post-OCR correction?

Why does that matter?

How does our AI pipeline work?

How much computing power did this take?

How accurate is it?

Can I see your code?

What causes OCR errors?

Optical character recognition is the term for how computers attempted to recognize the text in a scanned document image. This is difficult for documents with:

A mixture of fonts especially handwritten and typed
Physically degraded pages (e.g. fading, creased, torn)
Poor scanner settings (e.g. skew, exposure, etc.)

Handwritten & Typed Fonts

Poor exposure settings

Hand written notes

What is post-OCR correction?

Our approach only looks at the OCR text - not the document image - and makes corrections. It acts like a very smart spell checker. Traditional approaches might just use a set of substitutions for common mistakes (e.g. $ --> S) but our AI approach uses the probability of letters appearing in a group. For example "ink" is much more likely than "jnk" even though the letters "i" and "j" may look similar. An alternative approach would be to attempt to re-scan documents or re-OCR the images but that approach is very time consuming and costly. This example shows a memo about James Earl Ray's weapon (a Remington rifle) before and after post-OCR correction.

Input: the error-prone OCR text from the National Archives

Output: AI's best guess of how the text should read

Why does that matter?

The document collection contains 243,496 pages, grouped across 6,301 PDF files. When working with collections of thousands of documents, search capability is vital. While there is a search feature on the National Archives site, the search functionality is limited to keywords in the PDF file names. For example, a search for the word "Memphis" only returns a single result: 00459961_lawyer_arrives_in_memphis_104-10129-10400.pdf

Even if the National Archives supported search capability within the text of the document a single character error is enough to cause a keyword search to fail. For example, searching for “Memphis” will not find “Me mphis” or “Memph1s”. We estimate, there are millions of character errors in the National Archives OCR text for MLK Assassination Records.

By storing the full AI corrected text in a single database a search for "Memphis" now turns up 2,330 hits.

How does our AI pipeline work?

Automatically download all 6,200+ PDFs from the National Archives
Extract each file's OCR text from the PDFs
Send to Open AI 's gpt-4o-mini model via their API along with a custom system prompt that has been tuned for post-OCR correction
Post-process the results (trim white space, fix hyphenation patterns, etc.)

How much computing power did this take?

Processing the 243,496 pages consumed 66,111,336 input tokens (the prompts plus the original OCR text) and returned the 14,986,540 output tokens which make up our corrected version of the MLK Assassination Records. All told, it required ~70 hours of compute time ($$$).

How accurate is it?

The total number of characters is in the original set is 62,042,527 (?) and the total AI edits are 9,824,970 (a 15.8% correction rate). However its impossible to know how many of those corrections are valid and how many were missed.

Based on our small sample of files with human transcription, we estimate that we reduce the error rate substantially.

Can I see your code?

All relevant code can be found at: https://github.com/johnesposito17/Project-BlueBook-AI-OCR-Correction.git

You will need your own API key(s) to run the scripts on our github. You can request a National Archives API key by emailing catalog_api@nara.gov. To run one of the OCR Correction scripts simply create a developer API key for the LLM your choice (i.e. Gemini, GPT, Deepseek, or Llama).

Page updated

Google Sites

Report abuse