NLP Methods
Patients and Methods
The Veterans Health Administration (VHA) maintains health records from more than 140 hospitals and community centers across the United States. Any veteran who has ever incurred a service-related injury is eligible for VHA care. In 2024, the VHA served approximately 9 million veterans and averaged more than 300,000 healthcare appointments per day.
Reference Data Set
To develop the natural language processing method, we used a Structured Query Language (SQL) program to identify clinical documents containing a Saint Louis University Mental Status (SLUMS) cognitive exam score. Each document contained a single instance of the keyword “SLUMS” and a number within a 500-character window. No ICD-9 or ICD-10 codes were used. Our data set consisted of a sample of 1,275 notes.
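The selection criterion can be illustrated with a short Python sketch. The study itself used an SQL program whose details are not given here, so this is only an approximation: the function name `is_candidate_note`, the case-insensitive keyword match, and the reading of "within a 500-character window" as up to 500 characters on either side of the keyword are all assumptions.

```python
import re

WINDOW = 500  # window size in characters, as stated in the text


def is_candidate_note(text: str, window: int = WINDOW) -> bool:
    """Return True if the note has exactly one 'SLUMS' keyword with a number nearby.

    Assumption: the window extends `window` characters before and after
    the keyword; the original SQL logic may differ.
    """
    hits = list(re.finditer(r"slums", text, flags=re.IGNORECASE))
    if len(hits) != 1:
        return False
    start = max(0, hits[0].start() - window)
    end = min(len(text), hits[0].end() + window)
    # Any run of digits inside the window counts as "a number".
    return re.search(r"\d+", text[start:end]) is not None
```

For example, `is_candidate_note("Patient scored 22/30 on the SLUMS.")` returns `True`, while a note that mentions SLUMS twice, or has no digits near the keyword, is rejected.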
Two researchers independently annotated the dataset, reaching a Cohen’s kappa of 0.82. A third researcher adjudicated discrepancies to create the final reference labels.
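For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's kappa for two annotators. It assumes each note's label (a score or "N/A") is treated as a categorical value; how labels were encoded for the reported value of 0.82 is not specified in the text, and the function name `cohens_kappa` is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of notes.")
    n = len(labels_a)
    # Fraction of notes on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

With toy labels `["22", "N/A", "18", "N/A"]` and `["22", "N/A", "18", "20"]`, observed agreement is 0.75 and chance agreement is 0.25, so kappa is about 0.67.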
Label Design
The desired output labels were developed iteratively in close collaboration with clinicians and other subject matter experts.
The following edge cases were considered.
Notes often contain multiple SLUMS scores from tests taken over time
The note may mention SLUMS but not contain a score, e.g. a blank note template or in order to note patient refusal
Scores may not be out of 30, e.g. if an exam is not finished, or if patients have auditory or visual impairments that prevent administering the full exam
Scores may be given qualitatively (e.g. “low”)
Scores may be given as a range (e.g. “low teens”)
For downstream analysis, interpretation of the SLUMS score depends on the patient’s educational level, with the cut-off for normal scores shifting from 26 to 27.
For our development, we did not consider educational history. Annotators were asked to mark all edge cases above as N/A.
Reference Data Set
Of the 1,275 notes in the reference dataset, 899 contained a single SLUMS score and 376 contained missing, multiple, invalid (typo), or qualitative scores.
Metrics
For simplicity, we looked only at the score itself and not the denominator. In order to evaluate model performance, the following errors were considered.
We predict a score, but there is no score (more precisely, the reference label is “N/A”)
We predict a score, but the prediction is wrong
We do not predict a score, even though there is a score
From there, we draw the following definitions of TP, FP, TN, and FN used in the rest of the metrics.
True negative: Predicted N/A, and true was N/A
True positive: Predicted score and it was the right (non-null) score
False negative: Predicted N/A, but there was a real score
False positive: Predicted a score, but the real score was N/A
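The four definitions above can be sketched as a small Python function. `None` stands in for N/A, and the function name `classify_outcome` is hypothetical. The definitions do not say how a wrong (non-matching) score prediction is counted when a real score exists, so this sketch reports that case as a separate category rather than guessing.

```python
from typing import Optional


def classify_outcome(predicted: Optional[int], reference: Optional[int]) -> str:
    """Map one (prediction, reference) pair to an outcome label.

    None represents N/A. "WRONG_SCORE" is an assumption of this sketch:
    the text lists this error type but does not assign it to TP/FP/TN/FN.
    """
    if predicted is None:
        return "TN" if reference is None else "FN"
    if reference is None:
        return "FP"
    if predicted == reference:
        return "TP"
    return "WRONG_SCORE"
```

For example, `classify_outcome(22, 22)` gives `"TP"`, `classify_outcome(None, 22)` gives `"FN"`, and `classify_outcome(22, None)` gives `"FP"`.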
Algorithm
We developed a rule-based system incorporating regular expression–based pattern matching and data cleaning procedures.
We standardized the text to lowercase. We incorporated multiple rules to catch the N/A conditions, retaining rule IDs for reference, debugging, and future rule development.
Figure 1. Flowchart of regex algorithm. Bolded text is an example of the text captured by the given regex rule.
An approximate English description of the rules (for actual details, refer to the source code):
N/A rule: No number between SLUMS and /30, to capture blank templates e.g. “SLUMS _/30”
N/A rule: “Perform” and “SLUMS”, to capture recommendations e.g. “Perform SLUMS recommended”
N/A rule: “What day of the…” to capture notes containing the scores for each individual question e.g. “SLUM […] What day of”
Score rule: after SLUMS, two digits out of 30, but not if MOCA or MMSE mentioned e.g. “SLUMS 12/9/10: 22/30”
Score rule: before SLUMS, two digits out of 30 e.g. “16/30 on the SLUMS”
Score rule: after SLUMS, two digits, but NOT if “age” is between the keyword and digits e.g. “SLUMS 16/24” or “Last SLUMS 23 on 12/20”
Score rule: before SLUMS, two digits “on/on the” e.g. “scored 23 on the SLUMS”
N/A rule: SLUMS followed by “if applicable” or similar phrases
To avoid overfitting to the reference data set, we added no further rules once the desired precision was reached.
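The general shape of such a rule cascade can be sketched in Python. This is not the study's source code: the rule IDs, the regular expressions, the character limits between keyword and score, and the choice to accept one- or two-digit scores are all hypothetical, and only four of the rules described above are approximated. N/A rules are checked first, and the ID of the rule that fired is returned alongside the result.

```python
import re
from typing import Optional, Tuple

# Hypothetical N/A rules: blank template ("slums _/30") and recommendations.
NA_RULES = [
    ("na_blank_template", re.compile(r"slums\D{0,20}/\s*30")),
    ("na_perform", re.compile(r"perform\b.{0,40}slums")),
]

# Hypothetical score rules: a score out of 30 after or before the keyword.
SCORE_RULES = [
    ("score_after_slums", re.compile(r"slums\b.{0,40}?\b(\d{1,2})\s*/\s*30\b")),
    ("score_before_slums", re.compile(r"\b(\d{1,2})\s*/\s*30\b.{0,20}slums")),
]


def extract_score(text: str) -> Tuple[Optional[int], Optional[str]]:
    """Return (score, rule_id); score is None when the result is N/A."""
    lowered = text.lower()  # standardize to lowercase, as in the text
    for rule_id, pattern in NA_RULES:
        if pattern.search(lowered):
            return None, rule_id
    for rule_id, pattern in SCORE_RULES:
        match = pattern.search(lowered)
        if match:
            return int(match.group(1)), rule_id
    return None, None  # no rule fired: default to N/A
```

On the examples quoted above, `extract_score("SLUMS 12/9/10: 22/30")` returns `(22, "score_after_slums")` and `extract_score("SLUMS _/30")` returns `(None, "na_blank_template")`. Defaulting to N/A when no rule fires reflects the preference for precision over recall.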
Results
The algorithm was developed on all notes and optimized for precision (a.k.a. positive predictive value) over recall (a.k.a. sensitivity). In other words, the algorithm tends to produce N/As and makes a prediction only when confident. We report the standard metrics of accuracy, precision, recall, and F1 score (the harmonic mean of precision and recall, often used to account for class imbalance). On the reference dataset, the algorithm achieved 83.0% accuracy, 99.6% precision, 64.4% recall, and an F1 score of 78.2%.
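As a reminder of how these metrics relate to the TP, FP, TN, and FN counts, the following sketch gives the standard formulas. The function names are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something stated in the text.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all notes that were handled correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """Of the scores predicted, the fraction that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Of the notes with a real score, the fraction that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Applied to the reported precision and recall, `f1_score(0.996, 0.644)` gives roughly 0.782, consistent with the reported F1 score of 78.2%.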
When run on a further 2.96 million unannotated notes, the algorithm processed more than 20,000 notes per minute running on a single laptop CPU that was also running other programs.