Read PDF text in Python

Some PDF files are readable (they contain a text layer), so their content can be extracted directly with a Python library instead of running computer vision / OCR on an image of the PDF.
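
Whether a given PDF is readable in this sense can be checked by attempting extraction and seeing whether any text comes back. The following is a minimal sketch, assuming PyMuPDF is installed (`pip install pymupdf`). The names `looks_scanned` and `pdf_needs_ocr` and the 20-character threshold are hypothetical choices for illustration, not part of the library.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True if the extracted text is too short to be a real text layer.

    page_texts: iterable of strings, one per page.
    min_chars: assumed threshold; tune it for your own documents.
    """
    total = sum(len(t.strip()) for t in page_texts)
    return total < min_chars


def pdf_needs_ocr(pdf_path, min_chars=20):
    """Open a PDF with PyMuPDF and report whether it likely needs OCR."""
    import fitz  # imported lazily so looks_scanned() works without PyMuPDF

    with fitz.open(pdf_path) as doc:
        return looks_scanned((page.get_text() for page in doc), min_chars)
```

If `pdf_needs_ocr(path)` returns True, the file is probably a scanned image and direct text extraction will not help.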


#https://pymupdf.readthedocs.io/en/latest/installation.html

#pip install pymupdf

import fitz


doc = fitz.open(r'C:\temp\xxx.pdf') 


page = doc.load_page(0)  # loads the pdf page by page number (0-based)

# page = doc[0]  # equivalent shorthand for doc.load_page(0)
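
To read the whole file rather than a single page, the document object can be iterated page by page. This is a small sketch; `extract_all_text` and its form-feed separator are assumptions made for illustration. It accepts any iterable of objects that expose `get_text()`, which is what `fitz.open()` returns.

```python
def extract_all_text(doc, separator="\f"):
    """Concatenate the plain text of every page in a document.

    doc: any iterable of page objects exposing get_text(), such as the
         object returned by fitz.open().
    separator: string placed between pages (form feed is an assumed default).
    """
    return separator.join(page.get_text() for page in doc)
```

Usage with the document opened above would be `full_text = extract_all_text(doc)`.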



# Pass one of the following strings as the first argument of page.get_text() to obtain different formats [2]:

# text: (default) plain text with line breaks. No formatting, no text position details, no images.

# blocks: generate a list of text blocks (= paragraphs).

# words: generate a list of words (strings not containing spaces).

# html: creates a full visual version of the page including any images. This can be displayed with your internet browser.

# dict / json: same information level as HTML, but provided as a Python dictionary or a JSON string, respectively. See TextPage.extractDICT() for details of its structure.

# rawdict / rawjson: a super-set of dict / json. It additionally provides character detail information like XML. See TextPage.extractRAWDICT() for details of its structure.

# xhtml: same text information level as the text version, but includes images. Can also be displayed by internet browsers.

# xml: contains no images, but full position and font information down to each single text character. Use an XML module to interpret.

text = page.get_text()

print(text)
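
The structured options return Python objects instead of a single string. As an illustration, `page.get_text("words")` returns a list of tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below rebuilds text lines from such tuples; `words_to_lines` is a hypothetical helper, not a PyMuPDF function.

```python
from itertools import groupby


def words_to_lines(words):
    """Rebuild text lines from the tuples returned by page.get_text("words").

    Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    Words are ordered by block, line, and word number, then joined per line.
    """
    ordered = sorted(words, key=lambda w: (w[5], w[6], w[7]))
    lines = []
    # Group consecutive words that share the same (block_no, line_no).
    for _, group in groupby(ordered, key=lambda w: (w[5], w[6])):
        lines.append(" ".join(w[4] for w in group))
    return lines
```

With the page loaded above, `lines = words_to_lines(page.get_text("words"))` would give one string per text line.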