This section explains how the 3GPP TS 38.331 PDF is converted into structured, machine-readable text that supports the later retrieval and question-answering pipeline.
The extractor uses the following Python libraries and standard-library modules:
pdfplumber → preferred extractor for text, tables, and structured page information
PyMuPDF (fitz) → fast extraction with text blocks and image information
PyPDF2 → fallback extractor for basic page text and metadata
json → stores the structured extraction results
pathlib.Path → handles file paths cleanly
Input PDF
The input is the source standard document, 3GPP TS 38.331.pdf.
The script first checks whether the PDF exists in the same folder as the extractor script. If it is missing, the program stops immediately.
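The existence check described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name locate_pdf and the choice to raise FileNotFoundError are assumptions, while the file name comes from the text.

```python
from pathlib import Path

# File name taken from the text; the constant name is hypothetical.
PDF_NAME = "3GPP TS 38.331.pdf"


def locate_pdf(script_dir: Path, name: str = PDF_NAME) -> Path:
    """Return the path to the input PDF, or fail fast if it is missing."""
    pdf_path = script_dir / name
    if not pdf_path.is_file():
        # Stopping here avoids running any extractor on a missing input.
        raise FileNotFoundError(f"Input PDF not found: {pdf_path}")
    return pdf_path
```

In a script, the folder would typically be obtained with Path(__file__).resolve().parent, and an uncaught FileNotFoundError ends the program at once, which matches the "stops immediately" behaviour.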
Page-Level Extraction
Once one of the extraction libraries opens the PDF successfully, the document is processed page by page.
For each page, the script extracts the text content.
Depending on the library, it may also extract:
page dimensions
tables
image count
block-level structure
The extraction is therefore more than a raw text dump: it also preserves useful structural information from the document.
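One way to arrange the page-level extraction with a fallback chain is sketched below. The function names, the returned dictionary keys, and the order in which the libraries are tried (pdfplumber, then PyMuPDF, then PyPDF2) are assumptions for illustration; the source only states that pdfplumber is preferred and PyPDF2 is the fallback.

```python
from pathlib import Path
from typing import Callable, Dict, List


def extract_with_pdfplumber(pdf_path: Path) -> Dict:
    import pdfplumber  # lazy import: a missing library only fails this method

    pages = []
    with pdfplumber.open(str(pdf_path)) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            pages.append({
                "page_number": number,
                "text": page.extract_text() or "",  # guard against None
                "width": float(page.width),
                "height": float(page.height),
                "tables": page.extract_tables(),
            })
        metadata = dict(pdf.metadata or {})
    return {"method": "pdfplumber", "metadata": metadata, "pages": pages}


def extract_with_pymupdf(pdf_path: Path) -> Dict:
    import fitz  # PyMuPDF

    pages = []
    with fitz.open(str(pdf_path)) as doc:
        for number, page in enumerate(doc, start=1):
            pages.append({
                "page_number": number,
                "text": page.get_text(),
                "width": page.rect.width,
                "height": page.rect.height,
                "num_images": len(page.get_images()),
                # each block is (x0, y0, x1, y1, text, block_no, block_type)
                "blocks": [list(block) for block in page.get_text("blocks")],
            })
        metadata = dict(doc.metadata or {})
    return {"method": "pymupdf", "metadata": metadata, "pages": pages}


def extract_with_pypdf2(pdf_path: Path) -> Dict:
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    pages = [
        {"page_number": number, "text": page.extract_text() or ""}
        for number, page in enumerate(reader.pages, start=1)
    ]
    metadata = {key: str(value) for key, value in (reader.metadata or {}).items()}
    return {"method": "pypdf2", "metadata": metadata, "pages": pages}


def extract_with_fallback(pdf_path: Path, extractors: List[Callable[[Path], Dict]]) -> Dict:
    """Try each extractor in order and return the first successful result."""
    errors = []
    for extractor in extractors:
        try:
            return extractor(pdf_path)
        except Exception as err:  # missing library, parse failure, etc.
            errors.append(f"{extractor.__name__}: {err}")
    raise RuntimeError("All extraction methods failed:\n" + "\n".join(errors))
```

A call such as extract_with_fallback(pdf_path, [extract_with_pdfplumber, extract_with_pymupdf, extract_with_pypdf2]) returns the result of the first method that works. Importing each library inside its own function means that a library that is not installed simply causes the next method to be tried.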
Metadata Capture
The script stores document-level metadata such as:
number of pages
PDF metadata from the file itself
It also stores page-level metadata such as:
page number
text
and, depending on the extraction method, width, height, tables, num_images, and blocks.
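The record layout described above might look like the following sketch. The field names width, height, tables, num_images, and blocks come from the text; the keys num_pages, metadata, pages, and page_number, as well as the helper build_document_record, are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List

# Every page record carries these keys, whichever library produced it.
REQUIRED_PAGE_KEYS = {"page_number", "text"}
# These keys appear only when the extraction method supports them.
OPTIONAL_PAGE_KEYS = {"width", "height", "tables", "num_images", "blocks"}


def build_document_record(pdf_metadata: Dict[str, Any], pages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Combine document-level and page-level metadata into one record."""
    for page in pages:
        missing = REQUIRED_PAGE_KEYS - page.keys()
        if missing:
            raise ValueError(f"Page record is missing keys: {sorted(missing)}")
        unknown = page.keys() - REQUIRED_PAGE_KEYS - OPTIONAL_PAGE_KEYS
        if unknown:
            raise ValueError(f"Page record has unexpected keys: {sorted(unknown)}")
    return {
        "num_pages": len(pages),
        # PDF metadata values are converted to strings so they serialize cleanly.
        "metadata": {key: str(value) for key, value in pdf_metadata.items()},
        "pages": pages,
    }
```

Validating the keys in one place keeps the records consistent across methods, so later pipeline stages can rely on page_number and text always being present.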
Storage
The extracted output is saved into two files:
extracted_data.json for structured machine-readable storage
extracted_text.txt for plain readable text output
The JSON file is what later feeds the retrieval pipeline.
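Writing the two output files could be done along these lines. The file names are those given in the text; the function save_outputs, the record keys pages, page_number, and text, and the page separator used in the text file are assumptions.

```python
import json
from pathlib import Path
from typing import Dict, Tuple


def save_outputs(record: Dict, out_dir: Path) -> Tuple[Path, Path]:
    """Write the structured JSON file and the plain-text file."""
    json_path = out_dir / "extracted_data.json"
    text_path = out_dir / "extracted_text.txt"

    # default=str keeps serialization from failing on non-JSON types
    # such as dates or library-specific objects in the PDF metadata.
    json_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False, default=str),
        encoding="utf-8",
    )

    # One labelled section per page keeps the text file easy to skim.
    sections = [
        f"--- Page {page['page_number']} ---\n{page['text']}"
        for page in record["pages"]
    ]
    text_path.write_text("\n\n".join(sections), encoding="utf-8")
    return json_path, text_path
```

Setting ensure_ascii=False preserves non-ASCII characters from the specification as they are, rather than as escape sequences, which keeps the JSON readable when inspected by hand.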