BEN CHEN's Homepage

Read PDF text in Python

Some PDF files are readable (they contain embedded text), so to extract their content one can simply use a Python library to read it instead of running computer vision / OCR on an image of the PDF.
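Whether a given PDF is "readable" can be checked by trying to extract text and seeing if anything comes back. The following is a minimal sketch, not a definitive test: the helper names `has_text_layer` and `pdf_has_text_layer` and the `min_chars` threshold of 20 are assumptions made for illustration, and it assumes PyMuPDF is installed (`pip install pymupdf`).

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if the extracted page texts hold at least
    min_chars non-whitespace characters in total.

    min_chars is an arbitrary illustrative threshold, not a PyMuPDF setting.
    """
    total = sum(len("".join(text.split())) for text in page_texts)
    return total >= min_chars


def pdf_has_text_layer(path, min_chars=20):
    """Open a PDF with PyMuPDF and check whether any text can be extracted."""
    import fitz  # PyMuPDF; imported here so has_text_layer works without it

    with fitz.open(path) as doc:
        # Iterating over a document yields its pages in order.
        return has_text_layer((page.get_text() for page in doc), min_chars)
```

If `pdf_has_text_layer(path)` returns False, the file is probably a scanned image and OCR is likely still needed.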

#https://pymupdf.readthedocs.io/en/latest/installation.html

#pip install pymupdf

import fitz

doc = fitz.open(r'C:\temp\xxx.pdf')

page = doc.load_page(0) # loads the pdf page by page number (0-based)

# page = doc[0] # equivalent shorthand: index the document to get the page object

# Pass one of the following strings as the opt argument of page.get_text() to obtain different formats [2]:

# text: (default) plain text with line breaks. No formatting, no text position details, no images.

# blocks: generate a list of text blocks (= paragraphs).

# words: generate a list of words (strings not containing spaces).

# html: creates a full visual version of the page including any images. This can be displayed with your internet browser.

# dict / json: same information level as HTML, but provided as a Python dictionary or a JSON string, respectively. See TextPage.extractDICT() for details of its structure.

# rawdict / rawjson: a super-set of dict / json. It additionally provides character detail information like XML. See TextPage.extractRAWDICT() for details of its structure.

# xhtml: same text information level as the text version, but includes images. Can also be displayed by internet browsers.

# xml: contains no images, but full position and font information down to each single text character. Use an XML module to interpret.

text = page.get_text()

print(text)
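The "words" option listed above returns position data along with each word, which is handy when the plain-text output loses the layout. Here is a small sketch of regrouping those words into lines. It assumes each item is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no), which is what page.get_text("words") returns; the helper names `words_to_lines` and `page_lines` are made up for this example.

```python
from collections import defaultdict


def words_to_lines(words):
    """Group word tuples into text lines, keyed by (block_no, line_no).

    Each item is expected to look like
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, word))
    # Order lines by block and line number, and words by their word number.
    return [
        " ".join(word for _, word in sorted(items))
        for _, items in sorted(lines.items())
    ]


def page_lines(path, page_number=0):
    """Read one page of a PDF with PyMuPDF and return its text lines."""
    import fitz  # PyMuPDF; imported here so words_to_lines works without it

    with fitz.open(path) as doc:
        page = doc.load_page(page_number)
        return words_to_lines(page.get_text("words"))
```

For the file used above, `page_lines(r'C:\temp\xxx.pdf')` would return the lines of the first page as a list of strings.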
