Text Extraction

This section explains how the 3GPP TS 38.331 PDF is converted into structured, machine-readable text that supports the later retrieval and question-answering pipeline.

Libraries Used

pdfplumber → preferred for text + tables + structured page information
PyMuPDF (fitz) → fast extraction with text blocks and image info
PyPDF2 → fallback extractor for basic page text and metadata
json → stores structured extraction results
pathlib.Path → handles file paths cleanly

Workflow

Input PDF

Start with the source standard document: 3GPP TS 38.331.pdf
The script first checks whether the PDF exists in the same folder as the extractor script. If it is missing, the program stops immediately.
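The existence check could look roughly like the sketch below. The helper name find_pdf and the choice of FileNotFoundError are assumptions made for illustration; the original script may stop in a different way.

```python
from pathlib import Path

PDF_NAME = "3GPP TS 38.331.pdf"  # file name as given in this section


def find_pdf(folder: Path, name: str = PDF_NAME) -> Path:
    """Return the path to the PDF, or raise if it is not in `folder`.

    Hypothetical helper: the name and the exception type are assumptions.
    """
    pdf_path = folder / name
    if not pdf_path.is_file():
        # Stop immediately: nothing downstream can run without the source PDF.
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    return pdf_path
```

In a script, the folder argument would typically be Path(__file__).resolve().parent, which is the folder that contains the extractor script.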

Page-Level Extraction

Once one of the extraction methods succeeds, the PDF is processed page by page.
For each page, the script extracts the text content.
Depending on the library, it may also extract:
- page dimensions
- tables
- image count
- block level structure
The extraction is therefore more than a raw text dump: it also preserves useful structural information from the document.
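A minimal sketch of page-level extraction with fallback is shown below. It assumes the order pdfplumber, then PyMuPDF, then PyPDF2, which is inferred from the "preferred" and "fallback" labels in the library list. The function names are hypothetical and are not taken from the original script.

```python
from typing import Callable, Dict, List, Optional, Sequence

PageRecord = Dict[str, object]
Extractor = Callable[[str], List[PageRecord]]


def extract_with_pdfplumber(path: str) -> List[PageRecord]:
    import pdfplumber  # lazy import: a missing library only fails this method

    pages = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page_number": number,
                "text": page.extract_text() or "",
                "width": page.width,
                "height": page.height,
                "tables": page.extract_tables(),
            })
    return pages


def extract_with_pymupdf(path: str) -> List[PageRecord]:
    import fitz  # PyMuPDF

    pages = []
    with fitz.open(path) as doc:
        for number, page in enumerate(doc, start=1):
            pages.append({
                "page_number": number,
                "text": page.get_text(),
                "num_images": len(page.get_images()),
                "blocks": page.get_text("blocks"),
            })
    return pages


def extract_with_pypdf2(path: str) -> List[PageRecord]:
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [
        {"page_number": number, "text": page.extract_text() or ""}
        for number, page in enumerate(reader.pages, start=1)
    ]


# Assumed order: preferred method first, basic fallback last.
DEFAULT_EXTRACTORS: Sequence[Extractor] = (
    extract_with_pdfplumber,
    extract_with_pymupdf,
    extract_with_pypdf2,
)


def extract_pages(
    path: str, extractors: Sequence[Extractor] = DEFAULT_EXTRACTORS
) -> Optional[List[PageRecord]]:
    """Return the page records from the first extractor that succeeds."""
    for extractor in extractors:
        try:
            pages = extractor(path)
        except Exception:
            continue  # library missing or parsing failed: try the next method
        if pages:
            return pages
    return None  # every method failed
```

Each extractor returns only the fields its library can provide, which is why the keys of a page record depend on the method that succeeded.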

Metadata Capture

The script stores document-level metadata such as:
- number of pages
- PDF metadata from the file itself
It also stores page-level metadata such as:
- page number
- text
- depending on the method: width, height, tables, num_images, and blocks
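One way to combine the two metadata levels into a single record is sketched below. The top-level key names num_pages, metadata, and pages are assumptions chosen for illustration; only the page-level field names come from the list above.

```python
from typing import Any, Dict, List


def build_result(
    pdf_metadata: Dict[str, Any], pages: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Combine document-level and page-level metadata into one record.

    Hypothetical helper: the top-level key names are assumptions.
    """
    return {
        "num_pages": len(pages),
        # PDF metadata values can be library-specific objects, so convert
        # them to strings to keep the record JSON-serializable.
        "metadata": {
            str(key): (None if value is None else str(value))
            for key, value in pdf_metadata.items()
        },
        "pages": pages,
    }
```

Converting metadata values to strings is a design choice made here so that the record can be written to JSON regardless of which library produced it.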

Storage

The extracted output is saved into two files:
- extracted_data.json for structured machine-readable storage
- extracted_text.txt for plain readable text output
The JSON file is what later feeds the retrieval pipeline.
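The storage step can be sketched as follows. The two file names come from this section. The helper name save_outputs, the pages key, and the page separator in the text file are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def save_outputs(result: Dict[str, Any], out_dir: Path) -> Tuple[Path, Path]:
    """Write the structured JSON file and the plain-text file.

    Hypothetical helper: assumes `result["pages"]` is a list of dicts
    that each carry "page_number" and "text".
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / "extracted_data.json"
    text_path = out_dir / "extracted_text.txt"

    # default=str is a safety net for values json cannot encode natively.
    json_path.write_text(
        json.dumps(result, indent=2, ensure_ascii=False, default=str),
        encoding="utf-8",
    )

    # The page separator is an assumption chosen for readability.
    text = "\n\n".join(
        f"--- Page {page['page_number']} ---\n{page['text']}"
        for page in result["pages"]
    )
    text_path.write_text(text, encoding="utf-8")
    return json_path, text_path
```

Writing both files from the same record keeps the plain-text output consistent with the JSON that the retrieval pipeline reads.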

Code: https://github.com/abhayshashidhara/evidence-aware-3gpp-ran-qa/blob/main/text_extraction.py
