The unstructured library uses the unstructured_inference library to run OCR and layout detection models that detect elements (headings, tables, text, etc.) in a PDF or image file.
For PDF files, the default parser uses fitz / pymupdf to read the text and text locations, but it doesn't know the layout element types, nor does it group or order the text results properly.
A better way is to use a computer vision model that reads the PDF as an image and detects the layout and content. unstructured_inference even seems to merge the computer vision detection results with the PDF parser results to achieve a better outcome.
Typically you just specify the model name and the 'hi_res' strategy to enable layout detection, and the library automatically downloads the models from the Hugging Face model repository into a local .cache to run. If the SSL port is blocked by a firewall, you may need to search for and download the models manually from the Hugging Face website, then point to the model via parameters / environment variables.
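For example, a minimal hi_res call looks like this (a sketch; example.pdf is a placeholder, and yolox is the library default so hi_res_model_name is optional):

    from unstructured.partition.pdf import partition_pdf

    # 'hi_res' switches from the plain text parser to the layout detection pipeline
    elements = partition_pdf(
        filename="example.pdf",        # placeholder input file
        strategy="hi_res",             # enable the computer vision models
        hi_res_model_name="yolox",     # or e.g. "detectron2_onnx"
    )
    for el in elements:
        print(el.category, el.text[:60])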
The params and variables are messy... but here is a note about using the detectron2 model with a local cache.
By the way, the default yolox seems better at recognizing table structure.
Here are the code snippets and their relations.
Model: detectron2_faster_rcnn_R_50_FPN_3x
get_model() doesn't accept any parameter other than the model name, so the init params have to come from elsewhere.
These two environment variables are accepted, so you can point to e.g. a detectron2 model and provide its init params.
The call chain below shows how and where the two variables are picked up.
UNSTRUCTURED_DEFAULT_MODEL_NAME
UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH
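A sketch of setting both, before the call chain below picks them up (the paths are hypothetical placeholders; it assumes "detectron2_onnx" is the registry name for the ONNX detectron2 model, the JSON keys mirror the initialize() signature shown further down, the label values assume the PubLayNet classes, and the label_map keys are strings because JSON objects cannot have int keys; get_model() appears to cast them back to int):

    import json
    import os

    # init params for a locally downloaded detectron2 ONNX model (hypothetical paths)
    params = {
        "model_path": "/data/models/detectron2/model.onnx",
        "label_map": {"0": "Text", "1": "Title", "2": "List", "3": "Table", "4": "Figure"},
        "confidence_threshold": 0.5,
    }
    with open("/data/models/detectron2/init_params.json", "w") as fp:
        json.dump(params, fp)

    os.environ["UNSTRUCTURED_DEFAULT_MODEL_NAME"] = "detectron2_onnx"
    os.environ["UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH"] = (
        "/data/models/detectron2/init_params.json"
    )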
The object relations, from the high-level partition_pdf down to the model initialization:
== unstructured/unstructured/partition/pdf.py ==
partition_pdf (hi_res_model_name)
--> partition_pdf_or_image(hi_res_model_name)
--> _partition_pdf_or_image_local(hi_res_model_name)
== unstructured-inference/unstructured_inference/inference/layout.py ==
--> process_file_with_model(model_name)
== unstructured-inference/unstructured_inference/models/base.py ==
--> get_model(model_name)
model_name falls back to os.environ.get("UNSTRUCTURED_DEFAULT_MODEL_NAME"), then to DEFAULT_MODEL = 'yolox'
initialize_param_json = os.environ.get("UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH")
initialize_params = json.load(fp), with fp opened from that JSON path; the JSON can pass on model_path
model.initialize(**initialize_params)
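Paraphrased as a runnable sketch (not the library's verbatim code; StubModel stands in for the class the registry would pick):

    import json
    import os
    from typing import Any, Dict, Optional

    DEFAULT_MODEL = "yolox"

    class StubModel:
        # stands in for the model class that model_name selects from the registry
        def __init__(self, name: str) -> None:
            self.name = name

        def initialize(self, **kwargs: Any) -> None:
            print(f"{self.name}: initialize called with {kwargs}")

    def get_model_sketch(model_name: Optional[str] = None) -> StubModel:
        # fallback order: explicit argument -> env var -> library default
        model_name = (
            model_name
            or os.environ.get("UNSTRUCTURED_DEFAULT_MODEL_NAME")
            or DEFAULT_MODEL
        )
        model = StubModel(model_name)
        init_json = os.environ.get("UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH")
        if init_json is not None:
            with open(init_json) as fp:
                initialize_params: Dict[str, Any] = json.load(fp)
            # JSON object keys are strings, so label_map is rebuilt with int keys
            if "label_map" in initialize_params:
                initialize_params["label_map"] = {
                    int(k): v for k, v in initialize_params["label_map"].items()
                }
            model.initialize(**initialize_params)  # model_path flows through here
        else:
            model.initialize()  # the real code falls back to per-model defaults
        return model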
== unstructured-inference/unstructured_inference/models/detectron2onnx.py ==
--> class UnstructuredDetectronONNXModel(UnstructuredObjectDetectionModel):
    def initialize(
        self,
        model_path: str,
        label_map: Dict[int, str],
        confidence_threshold: Optional[float] = None,
    ) ...
the model_path can be passed in via the init params JSON
the label_map can be the DEFAULT_LABEL_MAP
the confidence_threshold defaults to 0.5 when None is passed
--> supplement_page_layout_with_ocr()
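You can also call it directly and bypass get_model() entirely (a sketch; the .onnx path is a placeholder, and it assumes DEFAULT_LABEL_MAP is importable from the same module):

    from unstructured_inference.models.detectron2onnx import (
        DEFAULT_LABEL_MAP,
        UnstructuredDetectronONNXModel,
    )

    model = UnstructuredDetectronONNXModel()
    model.initialize(
        model_path="/data/models/detectron2/model.onnx",  # placeholder local path
        label_map=DEFAULT_LABEL_MAP,
        confidence_threshold=0.5,  # same value used when None is passed
    )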
== unstructured-inference/unstructured_inference/models/tables.py ==
tables_agent: UnstructuredTableTransformerModel = UnstructuredTableTransformerModel()
tables_agent.initialize("microsoft/table-transformer-structure-recognition")
then the table structure result is merged into the page layout
Note: Hugging Face uses that model ID to construct a unique local folder name in your cache directory:
"microsoft/table-transformer-structure-recognition" → models--microsoft--table-transformer-structure-recognition
If you downloaded the model using Hugging Face's transformers library (from_pretrained()), it will be under the transformers/ subfolder within .cache/huggingface/; depending on the library versions, the same models--... layout may live under the hub/ subfolder instead.
C:\Users\<YourUsername>\.cache\huggingface\transformers\models--microsoft--table-transformer-structure-recognition
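To run offline against a pre-populated cache in a non-standard location, the cache root can be redirected before anything downloads (a sketch; the cache path is a placeholder, while HF_HOME and HF_HUB_OFFLINE are standard Hugging Face environment variables):

    import os

    # point Hugging Face at a pre-populated cache and stay offline
    # (useful when the SSL port is blocked, as noted above)
    os.environ["HF_HOME"] = r"D:\hf-cache"  # placeholder root holding the models--... folders
    os.environ["HF_HUB_OFFLINE"] = "1"      # never attempt a download

    from unstructured_inference.models.tables import UnstructuredTableTransformerModel

    tables_agent = UnstructuredTableTransformerModel()
    tables_agent.initialize("microsoft/table-transformer-structure-recognition")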