Towards Automated Prediction of Goals of Care Conversations in Clinical Notes

Patients articulate their goals, values, and preferences about end-of-life care during goals of care conversations (GOCCs), so that their care can be guided by these goals should they become incapacitated and unable to express these preferences themselves. The problem is that goals of care information can become buried and difficult to find within clinical notes in the medical record. The overall purpose of our project is to create a training and testing dataset for a Natural Language Processing (NLP) algorithm that will automatically identify clinical notes documenting GOCCs between patients and their providers.

These conversations are essential for health care providers making informed decisions about their patients' end-of-life care. Automatically identifying the source notes would support health care providers in acting on patient preferences. In the future, we envision building a dashboard or tool in the electronic health record that gathers all the information about patients' preferences regarding end-of-life care in a single place to support clinician decision-making. This could decrease the time and effort clinicians spend finding this essential information and support patients as the central decision-makers in their own care.


Methods

This retrospective cohort study examines patients admitted for ischemic stroke to four Indiana health systems in 2016-2018. The project uses two datasets. The first dataset, qualitatively coded by Comer et al. [1], identifies excerpts of clinical notes that described GOCCs. The second dataset contains all clinical notes for the index hospitalization of patients in the Comer dataset, retrieved through the Indiana Network for Patient Care (INPC) and one hospital's data warehouse.

The two datasets were merged to create a corpus of clinical notes in which GOCCs were annotated. Significant data preparation was needed to merge the datasets, including the following steps: 1) transforming the level of analysis from patient-level to GOCC-level, 2) extracting all clinical note content from a repository of text files, 3) normalizing the note content for white space variability, XML tags, and non-standardized character encodings, and 4) mapping GOCC excerpts to the full content of their clinical notes.
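As an illustration of the normalization in step 3, the sketch below shows one plausible cleaning function; the specific rules (UTF-8 coercion, Unicode normalization, tag stripping, whitespace collapsing) and the function name are assumptions for illustration, not the project's exact script.

import re
import unicodedata

def normalize_note_text(raw_bytes: bytes) -> str:
    """Normalize a clinical note for matching: decode, strip XML tags,
    collapse whitespace, and standardize character encodings."""
    # Coerce non-standardized character encodings to UTF-8, replacing bad bytes
    text = raw_bytes.decode("utf-8", errors="replace")

    # Normalize Unicode variants (e.g., smart quotes, non-breaking spaces)
    text = unicodedata.normalize("NFKC", text)

    # Strip XML tags left over from upstream systems
    text = re.sub(r"<[^>]+>", " ", text)

    # Collapse runs of whitespace so spacing differences do not block matching
    text = re.sub(r"\s+", " ", text).strip()
    return text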

Transforming the level of analysis consisted of operationalizing consecutive instances of GOCCs per patient as unique variables using a Python script, enabling each GOCC to be mapped to its clinical source note on a one-to-one basis. The content of clinical notes retrieved from the INPC was stored in a repository of individual text files, with a file name reference listed in the dataset. Using Python, the note texts were extracted from this repository via the column of file name references and placed into a new column. After normalization, two research assistants manually matched GOCC excerpts to their clinical source notes. If a GOCC was found within a clinical note, this was indicated with a 1 and the GOCC excerpt was added to a separate column in the dataset. The resulting dataset records the presence or absence of a GOCC within each clinical note and, when present, the GOCC text itself.
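A minimal sketch of the extraction and matching steps described above, assuming a GOCC-level table with hypothetical file_name and gocc_excerpt columns, a hypothetical CSV export, and a local folder of INPC text files; in the actual project, manual review was still required where automated string matching failed.

import re
from pathlib import Path
import pandas as pd

NOTES_DIR = Path("inpc_note_files")           # hypothetical text-file repository
df = pd.read_csv("gocc_level_dataset.csv")    # hypothetical: one row per GOCC

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences
    do not block excerpt-to-note matching (simplified version of step 3)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def load_note(file_name: str) -> str:
    """Read a clinical note from the repository by its file-name reference."""
    return normalize((NOTES_DIR / file_name).read_text(encoding="utf-8", errors="replace"))

# Pull each note's content into the dataset using the file-name reference column
df["note_text"] = df["file_name"].apply(load_note)

# 1 if the coded GOCC excerpt appears within its note, 0 otherwise
df["gocc_present"] = [
    int(normalize(excerpt) in note)
    for excerpt, note in zip(df["gocc_excerpt"], df["note_text"])
]
df["gocc_excerpt_matched"] = df["gocc_excerpt"].where(df["gocc_present"] == 1, "")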

Using CLAMP, a natural language processing toolkit [2], the text files of clinical notes containing GOCCs were converted into XMI format and annotated to indicate where each GOCC lay within the clinical note. This step is essential so that an NLP model for automated prediction of GOCCs can be trained from the annotated XMI files.
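CLAMP's actual output follows the UIMA-based XMI standard; the sketch below is a deliberately simplified, hypothetical stand-off representation intended only to show what such an annotation must capture, namely the character offsets of each GOCC span within its note. It is not CLAMP's schema.

import json

def annotate_gocc(note_text: str, excerpt: str) -> dict:
    """Record a GOCC as a stand-off annotation: the character offsets of the
    excerpt within the full note, plus a label an NLP model can train on."""
    start = note_text.find(excerpt)
    if start == -1:
        raise ValueError("Excerpt not found in note; check normalization.")
    return {
        "label": "GOCC",
        "begin": start,
        "end": start + len(excerpt),
        "covered_text": note_text[start:start + len(excerpt)],
    }

# Toy example (invented text, for illustration only)
note = "Family meeting held at bedside. Patient would want comfort measures only."
excerpt = "Patient would want comfort measures only."
print(json.dumps(annotate_gocc(note, excerpt), indent=2))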

Results

Chart review was completed by Comer et al. [1] for 1613 patients admitted for ischemic stroke across four health systems, of whom 617 were found to have documented GOCCs. An exclusion criterion was then applied to eliminate patients who had documentation of discharge planning conversations only, which had falsely inflated the count of patients with GOCCs under the original data collection tool. Applying this criterion reduced the total number of patients for analysis to 1309.


Clinical notes from the index hospitalizations in the Comer dataset originated from four hospitals, henceforth referred to as Hospitals A, B, C, and D. All clinical notes for the index hospitalizations were retrieved for Hospitals A-D from the INPC (n=25,375); notes for Hospital D were also retrieved through that hospital's data warehouse (n=3016). Once duplicate Hospital D notes were removed, 3016 notes were included in the corpus, representing admissions for the 1309 patients. The entirety of this process is depicted in Figure 1.


Of the 1309 patients, 309 were found to have documented GOCCs, with a range of 1-17 GOCCs per patient and an average of 4.25 GOCCs per patient with at least one GOCC. One note was found to contain three separate instances of a GOCC and was therefore split, bringing the total to 1440 documented GOCCs.


Of the 1440 GOCCs, 118 (8.33%) were matched to their clinical note. Table 1 shows the frequency of GOCCs matched to their source note, separated by hospital site and data source. Hospital D had the highest match rate, with 75% (n=54) of its documented GOCCs matched to a source note.

Figure 1. Visual Progression of GOCC Matching

Table 1. Frequency of GOCCs matched to their clinical note

Unigram and bigram analyses were conducted on the subset of unique GOCCs after removing punctuation and Natural Language Toolkit (NLTK) stop words. These analyses were used to visualize the frequency of individual words and of adjacent word pairs within the GOCCs. In Figure 2, the repeated "patient" keywords (patient, pt, patients) show that these conversations are patient-centered in nature, and the word "bedside" indicates where GOCCs take place. The word "family," the most frequently occurring term in Figure 2, indicates the importance of the patient's family in these conversations. The bigram analysis in Figure 3 provides evidence that these conversations center on patient preferences, with the indicative phrases "would like" and "would want." The phrases "comfort measures" and "comfort care" may also indicate that GOCCs are often aimed at transitioning patients to comfort care.

Figure 2. Unigram Analysis of GOCC content

Figure 3. Bigram Analysis of GOCC content.
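A minimal sketch of how such unigram and bigram counts can be produced, assuming the unique GOCC excerpts are available as a list of strings; here punctuation is dropped with a simple regex, NLTK supplies the stop-word list, and the example excerpts are invented for illustration.

import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_tokens(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens only (drops punctuation),
    and remove NLTK English stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

# Stand-in for the de-duplicated GOCC excerpts from the corpus (invented examples)
unique_goccs = [
    "Family meeting held at bedside; patient would want comfort measures only.",
    "Pt's family states the patient would like to pursue comfort care.",
]

unigram_counts: Counter = Counter()
bigram_counts: Counter = Counter()
for excerpt in unique_goccs:
    toks = clean_tokens(excerpt)
    unigram_counts.update(toks)              # single-word frequencies (cf. Figure 2)
    bigram_counts.update(ngrams(toks, 2))    # adjacent word-pair frequencies (cf. Figure 3)

print(unigram_counts.most_common(10))
print(bigram_counts.most_common(10))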

Conclusion

From the research methods used in this project, we learned that reusing retrospective Electronic Health Record (EHR) data for secondary purposes it was not intended for is often difficult and riddled with problems. The dataset was transformed and re-encoded multiple times: from the user interface in which clinical notes were coded for GOCCs, to the source EHR databases, and then to the INPC. These transfers often distorted how the text of the clinical notes was represented, requiring extensive data normalization. While we expected the project to be simple in structure, with the two datasets matched by an automated program, manual matching was necessary because of these data inconsistencies and abnormalities. This manual effort took approximately 31 hours between two research assistants. Although the overall goal of creating the training and testing dataset was accomplished, the process would have been easier had the project started by coding GOCCs directly in the dataset retrieved from the INPC, rather than coding in an external user interface and then translating those notes to the INPC.

The high match rate for Hospital D in Table 1 is expected because Hospital D's data warehouse contains a higher percentage of unstructured, free-text notes than the data pulled from the INPC. Each site that participates in the INPC chooses which data it submits to the health data sharing exchange for its patients, and these data are most often structured note types. Of the GOCCs whose clinical notes were retrieved through the INPC, 8.77% (n=66) were matched to their source note. The low GOCC match rates for Hospitals A-C suggest they may not be submitting the note type(s) that contain GOCCs to the INPC. Given these data, it would be unwise to assume that a provider could discern a patient's end-of-life care preferences using the information available in the Indiana Health Information Exchange (IHIE). Even when these clinical notes are sent to the INPC, it is unclear whether this ethically valuable information can be found for use in clinical settings. Considering how GOCCs are buried within clinical notes and regularly not indexed properly within an EHR, providers cannot be expected to find these preferences. Nor is it safe to assume that a GOCC documented at one hospital will be available at another, since GOCCs are not reliably accessible across sites. It is dangerous for patients to trust that their preferences are being respected when providers are unable to obtain the documentation needed to act on them. For this reason, physical copies of GOCC documentation, such as Physician Orders for Life Sustaining Treatment (POLST) forms, remain important. Even then, paper forms are insufficient in a digital age in which EHRs are embedded in patient care.

Future Work

The resulting NLP systems might support clinical decision support tools that automatically identify and extract GOCC documentation, potentially increasing the discoverability of this information and decreasing clinicians' time and cognitive burden. In the future, we envision building a dashboard or tool in an EHR that gathers all the information about patients' preferences regarding end-of-life care in a single place to support clinician decision-making.


We are currently working with NLP experts to determine whether this relatively small training dataset is sufficient to power a machine learning algorithm. If a model is not feasible due to the low match rate, the current progress could inform a downstream project. The project's Principal Investigator, Dr. Umberfield, is moving from the Regenstrief Institute to Mayo Clinic's Department of Artificial Intelligence and Informatics. With the AI/Tech + Aging Pilot Award Grant [3] at the Mayo Clinic, this project could be expanded into a cross-site collaboration in tandem with Dr. Comer at the Regenstrief Institute. In such a collaboration, the annotation plan of Comer et al. [1] could be applied within the Mayo Clinic's database, removing the INPC as an intermediary between datasets. This would reduce the effort required for data normalization and ensure that all data between a patient's admission and discharge is available in one centralized location.

Acknowledgements

I would like to thank my mentor, Dr. Elizabeth Umberfield, for her time throughout this past year in taking me on as her intern. It has been a great experience collaborating with her and supporting her work with my contributions. Dr. Umberfield is part of the Indiana Training Program in Public & Population Health (PPH) Informatics at the Fairbanks School of Public Health and Regenstrief Institute, supported by the National Library of Medicine of the National Institutes of Health under award number T15LM012502. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, Indiana University, or Regenstrief Institute. Dr. Umberfield and I would like to thank these institutes for their funding of the project and subsequent publication under award number T15LM012501. I would also like to recognize Dr. Amber Comer and the use of her dataset in our secondary analysis to create an NLP predictive model.

References

1. Comer AR, Williams LS, Creuzfield C, Holloway R, Torke AM. Goals of Care Conversations after Severe Stroke. Journal of Pain and Symptom Management. Under Review.

2. Soysal E, Wang J, Jiang M, Wu Y, Pakhomov S, Liu H, Xu H. CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines. Journal of the American Medical Informatics Association. 2017;ocx132. https://doi.org/10.1093/jamia/ocx132

3. New Funding & Support Offered for Tech Development. LeadingAge CAST. Accessed March 25, 2022. https://leadingage.org/cast/new-funding-support-offered-tech-development