Explanatory Note: How Unstructured Claim Documents are Transformed into Structured, Auditable Intelligence

Medical malpractice claim letters contain dense, nuanced clinical and legal information. To convert these narratives into structured, analysable data—while preserving their meaning, chronology and medico‑legal significance—the TA‑MedMal methodology applies a multi‑stage processing pipeline. Each stage enhances semantic clarity, reduces noise, and supports reproducible downstream analysis.

1. Text Preparation and Normalisation

Tokenisation

Documents are broken into ordered tokens—words, phrases and semantic units—while preserving sequence so that clinical and legal meaning is retained.
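
As an illustration only, order-preserving tokenisation can be sketched as below. The regular expression and the function name `tokenise` are assumptions made for this example, not the TA‑MedMal implementation; the point is simply that each token keeps its position in the document.

```python
import re

# Hypothetical token pattern: runs of letters/digits, optionally joined
# by an apostrophe or hyphen (e.g. "follow-up").
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['’-][A-Za-z0-9]+)*")

def tokenise(text: str) -> list[tuple[int, str]]:
    """Return (position, token) pairs in document order."""
    return list(enumerate(TOKEN_PATTERN.findall(text)))

print(tokenise("Failure to diagnose breast cancer."))
```

Keeping the position alongside each token is what later allows proximity rules to reason about which terms appear near each other.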

Synonym Handling

Equivalent expressions (“doctor”, “physician”, “dr”) are mapped to a unified form. This reduces vocabulary noise and stabilises downstream scoring.
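
A minimal sketch of this mapping, assuming a simple lookup table. The choice of "physician" as the unified form and the name `unify` are illustrative assumptions; the note does not say which form is canonical.

```python
# Hypothetical synonym table: every variant maps to one unified form.
SYNONYMS = {
    "doctor": "physician",
    "dr": "physician",
    "physician": "physician",
}

def unify(tokens: list[str]) -> list[str]:
    """Replace known synonyms with their unified form; leave other tokens as-is."""
    return [SYNONYMS.get(t.lower().rstrip("."), t) for t in tokens]

print(unify(["Dr.", "Smith", "and", "the", "doctor"]))
```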

Spell Checking and Normalisation

A medico‑legal dictionary corrects OCR and typographical errors without altering domain‑specific terminology. Case, punctuation and spacing are standardised to produce a clean token stream.
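
The idea can be sketched with a small lexicon and fuzzy matching. This example uses Python's standard `difflib` as a stand-in technique; the lexicon contents, the 0.85 cutoff and the name `normalise` are assumptions for illustration and do not describe the actual medico‑legal dictionary.

```python
import difflib

# Hypothetical lexicon; a real medico-legal dictionary would be far larger.
LEXICON = ["peritonitis", "sepsis", "paraplegia", "diagnosis", "negligence"]

def normalise(token: str) -> str:
    """Lower-case and trim a token, then correct it to the closest lexicon
    entry if one is similar enough; otherwise return it unchanged."""
    t = token.strip().lower()
    if t in LEXICON:
        return t  # already a valid domain term, never altered
    match = difflib.get_close_matches(t, LEXICON, n=1, cutoff=0.85)
    return match[0] if match else t

print(normalise("Peritonltis"))  # OCR-style error: "l" read in place of "i"
```

A high cutoff is a deliberately conservative choice here: it is safer to leave an unknown token untouched than to rewrite a legitimate term into a different one.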

From Words to Concepts 

After preparing the text, the system shifts from analysing words to analysing meaning. Clinically and legally important expressions are converted into stable, numerical concepts that behave consistently across all documents. 

Concept‑Level Representation  

Earlier versions of TA‑MedMal analysed text directly, which meant that small wording differences (“missed diagnosis”, “failure to diagnose”, “diagnostic delay”) produced fragmented signals. The current methodology converts all clinically and legally meaningful expressions into numerically defined canonical concepts.

This shift from raw text to stable concept identifiers makes the system faster, more consistent, and more interpretable. It ensures that similar ideas are treated the same way across all documents, strengthens downstream scoring, and enables clearer visual summaries such as the colour‑coded keyword treemap.
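
To make the idea concrete, the sketch below maps the three wordings mentioned above to a single concept identifier. The identifier `1001`, the decision to treat all three phrases as one concept, and the name `to_concepts` are assumptions for illustration.

```python
# Hypothetical concept table: different wordings share one numeric identifier.
CONCEPT_IDS = {
    "missed diagnosis": 1001,
    "failure to diagnose": 1001,
    "diagnostic delay": 1001,
}

def to_concepts(text: str) -> list[int]:
    """Return the sorted set of concept identifiers whose phrases occur in the text."""
    lowered = text.lower()
    return sorted({cid for phrase, cid in CONCEPT_IDS.items() if phrase in lowered})

# Two differently worded letters yield the same concept signal.
print(to_concepts("Alleged failure to diagnose the fracture."))
print(to_concepts("The claim concerns a missed diagnosis."))
```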

2. Detection of Compound Concepts

Many medico‑legal ideas are expressed as multi‑word phrases. These are identified using curated lists and statistical co‑occurrence analysis, then converted into semantic n‑grams such as:

This ensures they behave as single conceptual units in later analysis.
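
The curated-list part of this step can be sketched as a greedy longest-match merge. The phrase list below reuses expressions quoted elsewhere in this note and is not the system's actual list; the underscore-joined token format and the name `merge_phrases` are likewise assumptions. Statistical co‑occurrence analysis is not shown.

```python
# Hypothetical curated phrases, stored as token tuples.
PHRASES = {
    ("failure", "to", "diagnose"),
    ("missed", "diagnosis"),
    ("breast", "cancer"),
}
MAX_LEN = max(len(p) for p in PHRASES)

def merge_phrases(tokens: list[str]) -> list[str]:
    """Merge known multi-word phrases into single tokens, longest match first."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate window first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in PHRASES:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases(["alleged", "failure", "to", "diagnose", "breast", "cancer"]))
```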

Negation Detection

Negation phrases (“no evidence of…”, “not documented…”) are extracted as explicit tokens to prevent false positives in disease or misadventure detection.
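
As a rough sketch, negation cues can be rewritten into single marked tokens before concept matching. The cue list contains only the two examples given above; the `NEG_` prefix and the name `mark_negations` are assumptions for illustration.

```python
import re

# Cue phrases taken from the examples above; a real list would be longer.
NEGATION_CUES = ["no evidence of", "not documented"]
_NEGATION_RE = re.compile(
    "|".join(re.escape(cue) for cue in NEGATION_CUES), re.IGNORECASE
)

def mark_negations(text: str) -> str:
    """Replace each negation cue with one explicit token such as NEG_no_evidence_of."""
    return _NEGATION_RE.sub(
        lambda m: "NEG_" + m.group(0).lower().replace(" ", "_"), text
    )

print(mark_negations("No evidence of sepsis was found."))
```

Once the cue is a token in its own right, a later rule can check whether a disease or misadventure term is immediately preceded by a negation token and discount it.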

3. Contextual Disambiguation

Abbreviation Expansion

Ambiguous abbreviations (e.g., “CA”, “MS”, “PE”) are expanded using a context‑aware medical dictionary and surrounding cues. For example, “CA” near “oncology” is interpreted as cancer, whereas near “bone scan” it is interpreted as calcium.
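
The "CA" example can be sketched as a cue-counting lookup. The sense table below encodes only that example; its structure and the name `expand` are assumptions, and a production dictionary would cover many more abbreviations and cues.

```python
# Hypothetical sense table: abbreviation -> {expansion: context cue words}.
SENSES = {
    "CA": {
        "cancer": {"oncology"},
        "calcium": {"bone", "scan"},
    },
}

def expand(abbrev: str, context_tokens: list[str]) -> str:
    """Pick the expansion whose cue words overlap most with the context.
    If no cue matches, the abbreviation is returned unchanged."""
    context = {t.lower() for t in context_tokens}
    best, best_hits = abbrev, 0
    for sense, cues in SENSES.get(abbrev, {}).items():
        hits = len(cues & context)
        if hits > best_hits:
            best, best_hits = sense, hits
    return best

print(expand("CA", ["referred", "to", "oncology"]))
print(expand("CA", ["bone", "scan", "ordered"]))
```

Leaving the abbreviation unexpanded when no cue is present avoids guessing, which matters where a wrong expansion would change the clinical meaning.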

Outcome vs Cause Differentiation

Terms such as “peritonitis”, “sepsis”, and “paraplegia” may describe the outcome of an incident rather than the alleged negligent cause. Local proximity rules classify each occurrence as outcome or cause from its surrounding tokens, ensuring that downstream scoring emphasises the alleged mechanism of harm rather than its consequences.
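
One way such a proximity rule might look is sketched below. The note does not specify the actual rules, so the cue words, the three-token window, the role labels and the name `tag_roles` are all hypothetical.

```python
# Terms from the examples above, plus hypothetical cue words that suggest
# the term is being described as a consequence.
OUTCOME_TERMS = {"peritonitis", "sepsis", "paraplegia"}
OUTCOME_CUES = {"developed", "resulting", "suffered"}

def tag_roles(tokens: list[str], window: int = 3) -> list[tuple[str, str]]:
    """Tag each token as 'outcome', 'unclassified' or 'other', using cue
    words found within `window` tokens before an outcome-type term."""
    tagged = []
    for i, tok in enumerate(tokens):
        role = "other"
        if tok.lower() in OUTCOME_TERMS:
            before = {t.lower() for t in tokens[max(0, i - window):i]}
            role = "outcome" if before & OUTCOME_CUES else "unclassified"
        tagged.append((tok, role))
    return tagged

print(tag_roles(["patient", "subsequently", "developed", "sepsis"]))
```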

4. Concept Recognition and Semantic Tagging

Tokens are mapped to a curated medico‑legal vocabulary spanning:

Each concept is assigned a saliency weight reflecting its clinical or legal importance. This weighting supports prioritisation and interpretability in downstream analysis.
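
A tagged concept might be represented as below. The category names are borrowed from terms used elsewhere in this note (misadventure, disease); the identifiers, the numeric weights and the names `Concept`, `VOCAB` and `tag` are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    concept_id: int   # stable numeric identifier
    label: str        # human-readable form, useful for audit
    category: str     # vocabulary group the concept belongs to
    saliency: float   # relative clinical or legal importance

# Hypothetical vocabulary entries with made-up weights.
VOCAB = {
    "failure_to_diagnose": Concept(1001, "failure to diagnose", "misadventure", 3.0),
    "breast_cancer": Concept(2001, "breast cancer", "disease", 2.0),
}

def tag(tokens: list[str]) -> list[Concept]:
    """Return the concepts recognised in the token stream, in document order."""
    return [VOCAB[t] for t in tokens if t in VOCAB]

for concept in tag(["alleged", "failure_to_diagnose", "breast_cancer"]):
    print(concept.concept_id, concept.category, concept.saliency)
```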

5. Positional and Relational Analysis

Because meaning depends on structure, token order is preserved throughout.

Proximity Rules

Related terms appearing near each other (e.g., “failure” + “diagnose” + “breast cancer”) are combined into higher‑order themes. These inferred concepts form the building blocks of mechanism‑of‑harm fingerprints used in similarity scoring and clustering.
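
A proximity rule of this kind can be sketched as a sliding-window check. The window size of six tokens and the name `find_theme` are assumptions; the example assumes compound concepts have already been merged into single tokens.

```python
def find_theme(tokens: list[str], required: set[str], window: int = 6) -> bool:
    """Return True if every required term occurs within some run of
    `window` consecutive tokens."""
    lowered = [t.lower() for t in tokens]
    for start in range(len(lowered)):
        if required <= set(lowered[start:start + window]):
            return True
    return False

tokens = "there was a failure to diagnose breast_cancer promptly".split()
print(find_theme(tokens, {"failure", "diagnose", "breast_cancer"}))
```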

6. mTf‑Idf: Modified Term Frequency–Inverse Document Frequency

The system uses a domain‑adapted version of Tf‑Idf to identify the most informative terms in each document. The modifications include:

The result is a sparse vector representation of each document that captures its medico‑legal signature. These vectors form the basis of the system’s similarity engine, which measures how closely cases align in terms of misadventure patterns, disease and procedure concepts, and narrative structure.

Because the geometry is built on canonical concepts, saliency weighting and vector normalisation, distances between cases change little when the same idea is worded differently. This stability underpins the system’s clustering algorithm, which groups cases into non‑overlapping segments intended to reflect genuine clinical and medico‑legal themes rather than statistical artefacts.
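
The sketch below shows a plain Tf‑Idf weighting extended with a per-concept saliency multiplier and L2 normalisation, followed by cosine similarity. It is not the mTf‑Idf formula itself, whose modifications are not reproduced here; the smoothed inverse document frequency, the function names and the sample weights are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs, saliency=None):
    """Build sparse, unit-length vectors (dicts) from lists of concept tokens."""
    saliency = saliency or {}
    n = len(docs)
    # Document frequency: in how many documents each concept appears.
    df = Counter(c for doc in docs for c in set(doc))
    vectors = []
    for doc in docs:
        vec = {
            c: (count / len(doc))                           # term frequency
               * (math.log((1 + n) / (1 + df[c])) + 1.0)    # smoothed idf
               * saliency.get(c, 1.0)                       # saliency weight
            for c, count in Counter(doc).items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({c: w / norm for c, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(c, 0.0) for c, w in u.items())

docs = [
    ["failure_to_diagnose", "breast_cancer"],
    ["failure_to_diagnose", "breast_cancer", "breast_cancer"],
    ["medication_error"],
]
vecs = tfidf_vectors(docs, saliency={"failure_to_diagnose": 3.0})
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because every vector is scaled to unit length, a long letter and a short letter describing the same mechanism of harm remain close to each other.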

7. Topic Modelling for Latent Theme Discovery

Latent Dirichlet Allocation (LDA) can be incorporated to identify broad, cross‑cutting themes across the corpus. It is particularly useful for highlighting diffuse or unexpected patterns that span multiple specialties or misadventure types.

Primary Role of Clustering

The system’s similarity‑based clustering is the principal segmentation method. It uses the mTf‑Idf vector space to create high‑precision, non‑overlapping clusters with clear clinical and medico‑legal coherence. Cluster cores remain stable under threshold variation, and boundaries are reproducible because they are driven by canonical concepts and saliency‑weighted mechanisms of harm. This makes clustering the preferred tool for thematic reporting, portfolio segmentation and visualisation, with LDA available as a complementary technique where broader thematic exploration is required.
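
The note does not name the clustering algorithm, so the sketch below uses simple threshold-based "leader" clustering as a stand-in to show how a similarity threshold yields non‑overlapping groups. The function name, the 0.5 threshold and the sample vectors are assumptions; the input is assumed to be unit-length sparse vectors.

```python
def leader_cluster(vectors, threshold=0.5):
    """Assign each vector to the first cluster whose leader is at least
    `threshold` similar; otherwise start a new cluster. Each document
    lands in exactly one cluster, so the clusters never overlap."""
    def cosine(u, v):
        # Dot product of unit-length sparse vectors.
        return sum(w * v.get(c, 0.0) for c, w in u.items())

    leaders, clusters = [], []
    for idx, vec in enumerate(vectors):
        for k, leader in enumerate(leaders):
            if cosine(vec, vectors[leader]) >= threshold:
                clusters[k].append(idx)
                break
        else:
            leaders.append(idx)
            clusters.append([idx])
    return clusters

vectors = [
    {"failure_to_diagnose": 1.0},
    {"failure_to_diagnose": 1.0},
    {"medication_error": 1.0},
]
print(leader_cluster(vectors))
```

Re-running the function with several threshold values and comparing which documents stay together is one simple way to check the kind of cluster-core stability described above.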

8. Output and Application

The final output is a structured, multi‑layered representation of each claim, supporting:

This transforms complex medico‑legal narratives into auditable, reproducible intelligence that supports learning, prevention, underwriting precision and safer clinical practice.


MBB June 2026