Here are the technical details of how AI was used to correct the optical character recognition (OCR) errors in the scanned Martin Luther King Jr. files.
Optical character recognition is the term for how computers attempted to recognize the text in a scanned document image. This is difficult for documents with:
A mixture of fonts especially handwritten and typed
Physically degraded pages (e.g. fading, creased, torn)
Poor scanner settings (e.g. skew, exposure, etc.)
Our approach only looks at the OCR text - not the document image - and makes corrections. It acts like a very smart spell checker. Traditional approaches might just use a set of substitutions for common mistakes (e.g. $ --> S) but our AI approach uses the probability of letters appearing in a group. For example "ink" is much more likely than "jnk" even though the letters "i" and "j" may look similar. An alternative approach would be to attempt to re-scan documents or re-OCR the images but that approach is very time consuming and costly. This example shows a memo about James Earl Ray's weapon (a Remington rifle) before and after post-OCR correction.
The document collection contains 243,496 pages, grouped across 6,301 PDF files. When working with collections of thousands of documents, search capability is vital. While there is a search feature on the National Archives site, the search functionality is limited to keywords in the PDF file names. For example, a search for the word "Memphis" only returns a single result: 00459961_lawyer_arrives_in_memphis_104-10129-10400.pdf
Even if the National Archives supported search capability within the text of the document a single character error is enough to cause a keyword search to fail. For example, searching for “Memphis” will not find “Me mphis” or “Memph1s”. We estimate, there are millions of character errors in the National Archives OCR text for MLK Assassination Records.
By storing the full AI corrected text in a single database a search for "Memphis" now turns up 2,330 hits.
Automatically download all 6,200+ PDFs from the National Archives
Extract each file's OCR text from the PDFs
Send to Open AI 's gpt-4o-mini model via their API along with a custom system prompt that has been tuned for post-OCR correction
Post-process the results (trim white space, fix hyphenation patterns, etc.)
Processing the 243,496 pages consumed 66,111,336 input tokens (the prompts plus the original OCR text) and returned the 14,986,540 output tokens which make up our corrected version of the MLK Assassination Records. All told, it required ~70 hours of compute time ($$$).
The total number of characters is in the original set is 62,042,527 (?) and the total AI edits are 9,824,970 (a 15.8% correction rate). However its impossible to know how many of those corrections are valid and how many were missed.
Based on our small sample of files with human transcription, we estimate that we reduce the error rate substantially.
All relevant code can be found at: https://github.com/johnesposito17/Project-BlueBook-AI-OCR-Correction.git
You will need your own API key(s) to run the scripts on our github. You can request a National Archives API key by emailing catalog_api@nara.gov. To run one of the OCR Correction scripts simply create a developer API key for the LLM your choice (i.e. Gemini, GPT, Deepseek, or Llama).