Suppose you have a large PDF that was scanned from a printed document (so the PDF contains only pictures of the letters, not the letters themselves) and you'd like to extract the actual text. You essentially want to run OCR on the images in the PDF. Here's how to do it with free tools.
- Explode the PDF into one file per page. You need to do this because Google Docs limits the size of a PDF it will OCR (roughly 2 MB, if I recall).
- Upload the entire set of pages into Google Docs, letting it do OCR on each one.
- Download the entire set of Google Docs as a .zip archive.
- Expand the .zip and extract the data you need, or recombine the pages into a single PDF for convenience.
Details -- How to explode a pdf into one file per page
- The filenames of the pages should be xxx_0001, xxx_0002 rather than xxx_1, xxx_2 -- otherwise, you'll have problems with the pages being out-of-order. Fortunately, many tools will do this by default.
- There are several closed-source programs to split PDFs, but I prefer pdftk (http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) since it's freely available for both Windows and Linux. You can download it directly from the above website, or install it with the usual package tools on Ubuntu or Cygwin (http://www.cygwin.com).
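To see why the zero-padded names matter, here is a small sketch that sorts padded and unpadded page names the way most tools list files. The xxx_ names are just the placeholders from the bullet above, not real files.

```shell
# Illustration only: these are strings, not files on disk.
# LC_ALL=C forces plain byte-order sorting, which is how many tools list files.

# Unpadded names: page 10 sorts between page 1 and page 2.
unpadded=$(printf '%s\n' xxx_1 xxx_2 xxx_10 | LC_ALL=C sort)

# Zero-padded names (4 digits, like pdftk's default pg_%04d): order is preserved.
padded=$(printf 'xxx_%04d\n' 1 2 10 | LC_ALL=C sort)

echo "unpadded order:"
echo "$unpadded"    # xxx_1, xxx_10, xxx_2
echo "padded order:"
echo "$padded"      # xxx_0001, xxx_0002, xxx_0010
```

The same thing happens with a shell glob like *.pdf, so unpadded names would scramble the pages when they are merged back together.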
How to split into pages (pdftk):
    pdftk yourfile.pdf burst
(where yourfile.pdf stands for your scanned document) will create pg_0001.pdf, pg_0002.pdf, etc.
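If you'd rather keep the split pages out of your working directory, the splitting step can be wrapped in a small helper. This is only a sketch: the function name split_pages and its two arguments are made up for illustration, and it assumes pdftk is on your PATH. It relies on burst accepting a printf-style output pattern, with pg_%04d.pdf matching the default naming.

```shell
# split_pages INPUT_PDF OUT_DIR
# Hypothetical helper: bursts INPUT_PDF into OUT_DIR/pg_0001.pdf, pg_0002.pdf, ...
split_pages() {
    input=$1
    outdir=$2

    # Bail out early if pdftk is not installed.
    if ! command -v pdftk >/dev/null 2>&1; then
        echo "pdftk not found on PATH" >&2
        return 1
    fi

    # Create the output directory if it does not exist yet.
    mkdir -p "$outdir" || return 1

    # burst writes one PDF per page; the output name is a printf-style pattern.
    pdftk "$input" burst output "$outdir/pg_%04d.pdf"
}
```

Usage would look something like: split_pages scanned.pdf pages (with scanned.pdf being your own file). That leaves a directory containing the page files, ready for upload.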
Details -- How to upload
- Put the split pg_*.pdf files in a directory containing nothing else. If you're uploading with the Chrome browser, go to https://docs.google.com and upload that directory (which has the advantage of automatically putting the pages into a Google Docs collection); otherwise, just select all of the files and upload them in a single batch. It may take a while, but don't sweat.
- Note that each resulting page in Google Docs has the original scanned image on top and the OCR'd text on the bottom. This way, if there's confusion about the bottom half, a sighted person can look at the top half to see what's going on. Most useful!
Details -- How to download
- Download the collection twice: once as a set of .txt files and once as a set of .pdf files.
- Unzip the PDF archive into an empty directory and then use "pdftk *.pdf cat output merged.pdf" to merge 'em back into a single file. (The result isn't suitable for printing, since it is twice as long as the original document, but it is otherwise quite useful since it contains the page images -- which screenreaders ignore -- alternating with the OCR'd text.)
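The .txt download can be recombined in a similar way. Here is a minimal sketch, assuming the unzipped text files end in .txt and kept their zero-padded page names (I haven't checked exactly how Google Docs names them in the archive). The function name merge_text is made up for illustration.

```shell
# merge_text DIR OUT_FILE
# Hypothetical helper: concatenates every .txt file in DIR, in glob (sorted)
# order, into OUT_FILE, with a blank line after each page.
# Keep OUT_FILE outside DIR so the glob does not pick it up.
merge_text() {
    dir=$1
    out=$2

    # Start with an empty output file.
    : > "$out"

    for f in "$dir"/*.txt; do
        # Skip the literal pattern if the directory has no .txt files.
        [ -f "$f" ] || continue
        cat "$f" >> "$out"
        printf '\n' >> "$out"
    done
}
```

For example, merge_text unzipped_txt merged.txt would give one text file for the whole document. The page order is only right because of the zero-padded names.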
Details -- How to clean up
- Since pdftk's "burst" feature names its output files pg_####.pdf by default, the uploaded pages are easy to spot. Once you've extracted the data you need, just delete the individual pg_####.pdf documents from Google Docs.
- Just open the collection, click the top selection checkbox, and trash 'em.