Read PDF text in Python
Some PDF files contain a readable text layer, so to extract their content you can simply use a Python library to read it instead of running computer vision / OCR on an image of the PDF.
#https://pymupdf.readthedocs.io/en/latest/installation.html
#pip install pymupdf
import fitz
doc = fitz.open(r'C:\temp\xxx.pdf')
page = doc.load_page(0) # loads the pdf page by page number (0-based)
# Equivalent shorthand: index the document to get the page object
#page = doc[0]
# Pass one of the following strings as the first argument (opt) of page.get_text() to obtain different formats [2]:
# text: (default) plain text with line breaks. No formatting, no text position details, no images.
# blocks: generate a list of text blocks (= paragraphs).
# words: generate a list of words (strings not containing spaces).
# html: creates a full visual version of the page including any images. This can be displayed with your internet browser.
# dict / json: same information level as HTML, but provided as a Python dictionary or a JSON string, respectively. See TextPage.extractDICT() for details of its structure.
# rawdict / rawjson: a super-set of dict / json. It additionally provides per-character detail, like the XML output. See TextPage.extractRAWDICT() for details of its structure.
# xhtml: same text information level as the text version, but includes images. Can also be displayed by internet browsers.
# xml: contains no images, but full position and font information down to each single text character. Use an XML module to interpret.
text = page.get_text()
print(text)
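
The "words" option listed above returns position-tagged tuples rather than a single string, so it needs a little post-processing before it reads like text. The following is only a minimal sketch: it assumes each tuple has the layout (x0, y0, x1, y1, word, block_no, line_no, word_no), and the helper names words_to_lines and extract_lines are hypothetical, introduced here for illustration.

```python
from collections import defaultdict


def words_to_lines(words):
    """Group "words" tuples into lines of text.

    Assumes each tuple looks like
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        # Words sharing a block and line number belong to the same line.
        lines[(block_no, line_no)].append((word_no, word))
    # Order lines by (block_no, line_no) and words by word_no.
    return [
        " ".join(w for _, w in sorted(items))
        for _, items in sorted(lines.items())
    ]


def extract_lines(pdf_path, page_number=0):
    """Hypothetical convenience wrapper: read one page and return its lines."""
    import fitz  # PyMuPDF, imported lazily so words_to_lines works without it

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        return words_to_lines(page.get_text("words"))
```

For example, extract_lines(r'C:\temp\xxx.pdf') would return the first page as a list of strings, one per line. Keeping the grouping logic separate from the file access makes it easy to try on hand-written tuples first.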