Assignment 2
This assignment can be done alone or in pairs (due 30 Nov)
Build a corpus of 5 PDFs of articles, texts, or books on a similar subject of your own choice. The PDFs must not already include a text layer. Good places to look are library catalogs and archive.org.
Using the tutorial we explored in class for working with tesseract (ocrmypdf), perform OCR on these PDFs (command: ocrmypdf) and then extract the text layer from the 5 articles (command: pdftotext).
If you are unable to install ocrmypdf on your own computer, contact me and I can have this done relatively quickly.
This should result in 5 txt files you can study.
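The OCR and extraction step can also be scripted so that all 5 PDFs are processed the same way. The following is a minimal sketch, not the tutorial's method. It assumes that ocrmypdf and pdftotext (the Poppler command-line tool that extracts a PDF's text layer) are installed and on your PATH. The folder names, the output file suffixes, and the function names build_commands and run_corpus are hypothetical choices made for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf: Path, out_dir: Path) -> list[list[str]]:
    """Return the two commands for one PDF: OCR it, then dump its text layer.

    The "_ocr.pdf" and "_tesseract.txt" suffixes are illustrative only.
    """
    ocr_pdf = out_dir / f"{pdf.stem}_ocr.pdf"
    txt = out_dir / f"{pdf.stem}_tesseract.txt"
    return [
        # ocrmypdf <input.pdf> <output.pdf> adds a text layer to the scan
        ["ocrmypdf", str(pdf), str(ocr_pdf)],
        # pdftotext <input.pdf> <output.txt> writes that text layer to a file
        ["pdftotext", str(ocr_pdf), str(txt)],
    ]


def run_corpus(corpus_dir: str, out_dir: str) -> None:
    """Run both commands on every PDF found in corpus_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Stop early with a clear message if either tool is missing
    for tool in ("ocrmypdf", "pdftotext"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    for pdf in sorted(Path(corpus_dir).glob("*.pdf")):
        for cmd in build_commands(pdf, out):
            subprocess.run(cmd, check=True)
```

Calling run_corpus("corpus", "ocr_output") would then leave one OCRed PDF and one txt file per article in the ocr_output folder, assuming your scans are stored in a folder named corpus.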
With Acrobat Pro, run OCR on the very same texts. If you do not have Acrobat Pro on your own computer, you can use the computers in the library or at the Center for Digital Scholarship to complete this step.
This should also result in 5 txt files you can study.
Alternatives: put the PDFs in Google Drive and then open them with Google Docs, which also performs OCR on them. You can also spot-check parts of the text with the new version of Telegram, which lets you scan text and send it as a text message instead of typing it in.
Use any of the methods from the course to provide an analysis of the contents.
Guiding questions:
Do you notice differences in OCR quality among the tools you used (tesseract, Acrobat Pro, and/or Google Docs)? If so, what are the differences? One way to show them is to take portions of the files and visualize them in diffchecker.com.
Why do you think different OCR systems would produce different results?
Does the OCR performance change over successive attempts?
Does the OCR quality make a difference for the kind of analysis you carry out, for example for different kinds of distant reading?
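As a complement to a visual diff, the comparison between two OCR outputs of the same text can be quantified. The sketch below is one possible approach, not a required method. It uses Python's standard difflib module to score how similar two outputs are word by word, and a simple word count to show which words a frequency-based analysis would see differently. The function names and the sample strings are hypothetical.

```python
import difflib
import re
from collections import Counter


def similarity(text_a: str, text_b: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


def word_counts(text: str) -> Counter:
    """Count lowercase runs of letters (digits and punctuation are ignored)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def count_differences(text_a: str, text_b: str) -> dict:
    """Map each word whose frequency differs to (count in a, count in b)."""
    a, b = word_counts(text_a), word_counts(text_b)
    return {w: (a[w], b[w]) for w in sorted(set(a) | set(b)) if a[w] != b[w]}


# Illustrative strings only: "m" misread as "rn" is a common OCR confusion
tesseract_text = "the modern reader"
acrobat_text = "the rnodern reader"
print(round(similarity(tesseract_text, acrobat_text), 2))
print(count_differences(tesseract_text, acrobat_text))
```

In practice you would read the two txt files for the same article into tesseract_text and acrobat_text. A word that appears in only one output, such as a misread term, is exactly the kind of difference that can shift word-frequency results in distant reading.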