pyslums

GSA 2025: Extracting Cognitive SLUMS Scores from Unstructured Clinical Notes in the National Veterans Affairs Database

NLP Methods

Patients and Methods

The Veterans Health Administration (VHA) maintains records from more than 140 hospitals and community centers across the United States. Any veteran who has ever incurred a service-related injury is eligible for VHA care. As of 2024, the VHA served approximately 9 million veterans annually and averaged over 300,000 healthcare appointments every day.

Reference Data Set

To develop the natural language processing method, we identified clinical documents containing a SLUMS cognitive exam score using a Structured Query Language (SQL) query. Each document contained a single instance of the keyword “SLUMS” and a number within a 500-character window. No ICD-9 or ICD-10 codes were used. Our data set consisted of a sample of 1,275 notes.
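The selection criterion can be illustrated in Python, although the original selection was done in SQL. The sketch below is only an analogue: the function name, the case-insensitive match, and the reading of the window as 500 characters on either side of the keyword are assumptions, not details taken from the original query.

```python
import re

WINDOW = 500  # characters on each side of the keyword (assumed interpretation)


def has_slums_with_nearby_number(text, window=WINDOW):
    """Return True if the text has exactly one 'SLUMS' and a digit near it."""
    matches = list(re.finditer(r"slums", text, flags=re.IGNORECASE))
    if len(matches) != 1:
        # The reference notes contained a single instance of the keyword.
        return False
    keyword = matches[0]
    start = max(0, keyword.start() - window)
    end = min(len(text), keyword.end() + window)
    # Look only at the text around the keyword, not the keyword itself.
    context = text[start:keyword.start()] + " " + text[keyword.end():end]
    return re.search(r"\d", context) is not None
```

A filter like this only finds candidate notes. It does not decide whether the nearby number is actually a SLUMS score, which is the job of the annotation and the rules described later.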

Two researchers independently annotated the dataset, reaching a Cohen’s kappa of 0.82. A third researcher adjudicated discrepancies to create the final reference labels.
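For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Cohen's kappa is computed from two annotators' labels. The function name and the toy labels in the usage note are illustrative assumptions and are not drawn from the study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa(["22", "N/A", "N/A", "16"], ["22", "N/A", "16", "16"])` gives 7/11, or about 0.64: the annotators agree on three of four notes, but some of that agreement is expected by chance.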

Label Design

The desired output labels were developed iteratively in close collaboration with clinicians and other subject matter experts.

The following edge cases were considered.

Notes often contain multiple SLUMS scores from tests taken over time

The note may mention SLUMS but not contain a score, e.g. in a blank note template or to record a patient refusal

Scores may not be out of 30, e.g. if an exam is not finished, or if patients have auditory or visual impairments that prevent administering the full exam

Scores may be given qualitatively (e.g. “low”)

Scores may be given as a range (e.g. “low teens”)

For downstream analysis, interpretation of the SLUMS score depends on the patient’s educational level, which shifts the cut-off for normal scores from 26 to 27.

For our development, we did not consider educational history. Annotators were asked to mark all edge cases above as N/A.

Reference Data Set

Of the 1,275 notes in the reference dataset, 899 contained a single SLUMS score and 376 contained missing, multiple, invalid (typo), or qualitative scores.

Metrics

For simplicity, we looked only at the score itself and not the denominator. To evaluate model performance, we considered the following errors.

We predict a score, but there is no score (more precisely, the reference label is “N/A”)

We predict a score, but the prediction is wrong

We do not predict a score, even though there is a score

From these, we derive the following definitions of true positive (TP), false positive (FP), true negative (TN), and false negative (FN), which are used in the rest of the metrics.

True negative: Predicted N/A, and the reference label was N/A

True positive: Predicted a score, and it was the right (non-null) score

False negative: Predicted N/A, but there was a real score

False positive: Predicted a score, but the reference label was N/A
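The definitions above can be sketched as a small classification function. The names `classify` and `tally` are hypothetical. The four definitions do not say where a wrong non-null prediction is counted, so this sketch keeps it in a separate `WRONG_SCORE` bucket; that choice is an assumption, not the study's method.

```python
from collections import Counter

NA = "N/A"


def classify(predicted, reference):
    """Map one (prediction, reference label) pair to an outcome."""
    if predicted == NA and reference == NA:
        return "TN"
    if predicted == NA:
        return "FN"  # predicted N/A, but there was a real score
    if reference == NA:
        return "FP"  # predicted a score, but the reference was N/A
    if predicted == reference:
        return "TP"  # predicted the right (non-null) score
    # Assumption: a wrong score is tracked separately, since the
    # definitions above do not assign it to TP, FP, TN, or FN.
    return "WRONG_SCORE"


def tally(pairs):
    """Count outcomes over an iterable of (predicted, reference) pairs."""
    return Counter(classify(p, r) for p, r in pairs)
```

Calling `tally` on the list of (prediction, reference) pairs gives the counts from which precision and recall can be computed, once a convention for the wrong-score case has been chosen.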

Algorithm

We developed a rule-based system incorporating regular expression–based pattern matching and data cleaning procedures.

We first standardized the text to lowercase. We then incorporated multiple rules to catch the N/A conditions, retaining rule IDs for reference, debugging, and future rule development.

Figure 1. Flowchart of regex algorithm. Bolded text is an example of the text captured by the given regex rule.

An approximate English description of the rules (for actual details, refer to the source code):

N/A rule: No number between SLUMS and /30, to capture blank templates e.g. “SLUMS _/30”

N/A rule: “Perform” and “SLUMS”, to capture recommendations e.g. “Perform SLUMS recommended”

N/A rule: “What day of the…” to capture notes containing the scores for each individual question e.g. “SLUM […] What day of”

Score rule: after SLUMS, two digits out of 30, but not if MOCA or MMSE is mentioned, e.g. “SLUMS 12/9/10: 22/30”

Score rule: before SLUMS, two digits out of 30 e.g. “16/30 on the SLUMS”

Score rule: after SLUMS, two digits, but NOT if “age” is between the keyword and digits e.g. “SLUMS 16/24” or “Last SLUMS 23 on 12/20”

Score rule: before SLUMS, two digits “on/on the” e.g. “scored 23 on the SLUMS”

N/A rule: SLUMS followed by “if applicable” or similar phrases
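To make the rule descriptions concrete, here is a minimal sketch of an ordered rule table covering a subset of the rules above. It is not the pyslums source: every pattern, rule ID, and the 40-character gap limit are illustrative guesses. It also assumes that a MOCA or MMSE mention makes the note N/A, and that a note matching no rule is N/A.

```python
import re

NA = "N/A"

# (rule_id, kind, pattern), tried in order on lowercased text.
# All patterns are hypothetical approximations of the English descriptions.
RULES = [
    ("na_blank_template", "na",
     re.compile(r"slums\s*_*\s*/\s*30")),
    ("na_recommendation", "na",
     re.compile(r"perform.*slums|slums.*perform", re.DOTALL)),
    ("na_item_level", "na",
     re.compile(r"what day of")),
    ("score_after_out_of_30", "score",
     re.compile(r"slums.{0,40}?\b(\d{2})\s*/\s*30\b", re.DOTALL)),
    ("score_before_out_of_30", "score",
     re.compile(r"\b(\d{2})\s*/\s*30\b.{0,40}?slums", re.DOTALL)),
    ("score_before_on", "score",
     re.compile(r"\b(\d{2})\s+on(?:\s+the)?\s+slums")),
]

OTHER_EXAMS = re.compile(r"moca|mmse")


def extract_slums(note):
    """Return (score or 'N/A', rule_id) for the first rule that fires."""
    text = note.lower()
    for rule_id, kind, pattern in RULES:
        match = pattern.search(text)
        if match is None:
            continue
        if kind == "na":
            return NA, rule_id
        if rule_id == "score_after_out_of_30" and OTHER_EXAMS.search(text):
            # Assumption: abstain when another exam's score may be present.
            return NA, "na_other_exam_mentioned"
        return int(match.group(1)), rule_id
    return NA, "na_no_rule_matched"
```

For instance, `extract_slums("SLUMS 12/9/10: 22/30")` returns `(22, "score_after_out_of_30")`, skipping the date, while `extract_slums("SLUMS _/30")` returns `("N/A", "na_blank_template")`. Returning the rule ID with every result makes it possible to trace each prediction back to the rule that produced it.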

After reaching the desired precision, we added no further rules, to avoid overfitting to the reference data set.

Results

The algorithm was developed on all notes and optimized for precision (a.k.a. positive predictive value) over recall (a.k.a. sensitivity). In other words, the algorithm tends to produce N/As and only makes predictions when confident. We report the standard metrics of accuracy, precision, recall, and F1 score (the harmonic mean of precision and recall, often used to account for class imbalance). On the reference dataset, it achieved 83.0% accuracy, 99.6% precision, and 64.4% recall, with an F1 score of 78.2%.
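As a quick check on how the F1 score relates to the other two figures, the harmonic mean can be computed directly from the reported precision and recall. The function name is an illustrative choice.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall on the reference dataset.
print(round(100 * f1_score(0.996, 0.644), 1))  # 78.2
```

Because the harmonic mean is pulled toward the smaller value, the F1 score sits much closer to the 64.4% recall than a simple average of the two figures would.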

When run on a further 2.96 million unannotated notes, the algorithm processed more than 20,000 notes per minute running on a single laptop CPU that was also running other programs.
