Disease detection at the speed of life: Near real-time disease surveillance at population scale
Abstract:
Large amounts of health and environmental data across heterogeneous populations are needed to rapidly identify vulnerable populations and provide near real-time situational readiness for public health threats. However, the effective development of near real-time population health surveillance remains hindered by numerous challenges. Data complexity and regulatory hurdles related to health data privacy prevent pooling of data across health care institutions in the US. Integration of diverse types of social and environmental determinants of health data across space and time requires advanced analytical methods and computational workflows. Computational limitations prevent scaling algorithms to the population level and have hindered the development and deployment of population health research tools. In her presentation, Dr. Hanson will critically examine some of these obstacles, drawing on current projects to illustrate innovative solutions. She will also propose new strategies to expedite progress in near real-time population health surveillance, emphasizing the need for interdisciplinary collaboration. This discussion aims to provide insights for leveraging these complex datasets effectively, thereby enhancing their impact on population health.
Bio:
Dr. Heidi Hanson is the Group Lead of the Biostatistics and Biomedical Informatics Group in the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory. Her training and experience in the fields of demography, statistics, biomedical informatics, and life course epidemiology allow her to bring a unique set of expertise to building computational tools to identify populations at high risk for disease. She is currently the lead on the DOE-National Cancer Institute (NCI) Modeling Outcomes using Surveillance Data and Scalable Artificial Intelligence (MOSSAIC) program, focused on advancing computing, predictive machine learning/deep learning (ML/DL) models, and large language models for near real-time extraction of information from health records for NCI-supported cancer research. The MOSSAIC team was awarded the NCI Director’s Award for Data Science and an R&D 100 Award for the products they have developed to automate cancer surveillance. She also leads the "Data-Driven Population Health Surveillance at Scale for Pandemic Readiness" project, which is advancing a suite of innovative computational tools designed to enhance biopreparedness through the efficient integration of real-world data.
Focus:
Near real-time analytics of disease and its spread at scale
Data-driven clinical decision tools
National health security
Individual health
MOSSAIC Project: https://www.olcf.ornl.gov/tag/mossaic/
Collaboration between ORNL and NIH
ADMIRRAL: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/admirral
IMPROVE: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/improve
ATOM: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/atom
CANDLE: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/candle
Real-world surveillance
Surveillance, Epidemiology, and End Results (SEER) Registries: https://seer.cancer.gov
850k/year cancer diagnoses collected
Many regional registries collecting data
Lots of manual extraction
2-year lag in reporting
MOSSAIC Challenge
Using AI to bring cancer surveillance to near-real time
>90% of cancers histologically reported (pathology reports from tumor slide observations; reports are thousands of words, require domain expertise to understand)
Can use ML models to read pathology reports and code them into tabular records:
Collect data from multiple SEER registries: 6 million pathology reports
Mostly text but starting a pilot on images
Combination of multiple models that extract different features and use different documents
Cancer categorization:
Malignancy
Phenotype
Pediatric cancer classification
System deployed at SEER sites, covering 48% of US population
Auto-extraction:
Model predicts own confidence, presents uncertain predictions to human experts
Where confidence is very high, the model auto-codes the report. This covers 23-27% of pathology reports, with >98% accuracy.
Collaboration with Veterans Affairs (VA) registry to adapt model to VA’s own data to make predictions out of sample with good accuracy
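The auto-extraction step above (auto-code when confidence is very high, otherwise send to a human expert) can be sketched as a simple routing rule. This is a minimal illustration, not the MOSSAIC implementation: the `Prediction` record, the `route_predictions` and `coverage` helpers, and the 0.98 cutoff are all hypothetical. In particular, a confidence cutoff is not the same thing as the reported >98% accuracy; in practice a cutoff would be tuned on held-out data to reach a target accuracy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    report_id: str
    label: str         # predicted code for the report (hypothetical values)
    confidence: float  # model's own confidence estimate in [0, 1]


def route_predictions(
    predictions: List[Prediction], threshold: float = 0.98
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into an auto-coded queue and a human-review queue.

    The threshold is an assumed value for illustration only.
    """
    auto_coded = [p for p in predictions if p.confidence >= threshold]
    needs_review = [p for p in predictions if p.confidence < threshold]
    return auto_coded, needs_review


def coverage(auto_coded: List[Prediction], predictions: List[Prediction]) -> float:
    """Fraction of all reports that were auto-coded without human review."""
    return len(auto_coded) / len(predictions) if predictions else 0.0
```

Raising the threshold sends more reports to human review and lowers coverage; lowering it does the opposite, at the cost of more auto-coding errors.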
Prevailing challenges:
Computational limitations
Hospitals produce 50 PB of data, 97% of which goes unused
DOE compute facilities provide ample compute power; the CITADEL secure facility allows the team to work with health data: https://www.olcf.ornl.gov/tag/citadel/
Data complexity, regulatory hurdles
Data sources, types, schemas and quality are very heterogeneous
Active work on harmonized data models
Using the North American Association of Central Cancer Registries (NAACCR) data model
Others: Sentinel, PCORnet, i2b2, OMOP
Automatic Classification for Common Data Model:
BERT-based NLP model
Multimodal ensembles for identifying recurrent disease
Hard task since recurrence not commonly tracked in registries
Integration of diverse social/environmental determinants of health
Distribution of diseases is biased in space and sub-populations
Especially true for rare diseases, where the sample size is small in any local dataset
Focus on privacy-preserving federated learning
Tradeoff between privacy and accuracy
Analysis of data within the individual registries can focus more on accuracy
Federation of results across registries must maintain privacy
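The privacy/accuracy tradeoff in federation can be illustrated with a small sketch. The notes do not say which federated method is used, so this assumes a federated-averaging-style scheme: each registry shares only a parameter vector, optionally perturbed with Gaussian noise, and a coordinator takes a sample-size-weighted mean. The function names and the noise scale `sigma` are hypothetical, and noise added this way (without clipping or a calibrated scale) gives no formal privacy guarantee.

```python
import random
from typing import List, Sequence


def add_gaussian_noise(
    update: Sequence[float], sigma: float, rng: random.Random
) -> List[float]:
    """Perturb a local parameter vector before it leaves the registry."""
    return [w + rng.gauss(0.0, sigma) for w in update]


def federated_average(
    updates: Sequence[Sequence[float]], sizes: Sequence[int]
) -> List[float]:
    """Sample-size-weighted mean of per-registry parameter vectors."""
    total = sum(sizes)
    dim = len(updates[0])
    return [
        sum(u[i] * n for u, n in zip(updates, sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical registries with different sample sizes.
rng = random.Random(0)
local_updates = [[0.25, 1.0], [0.5, 2.0]]
sizes = [100, 300]

exact = federated_average(local_updates, sizes)
noisy = federated_average(
    [add_gaussian_noise(u, sigma=0.1, rng=rng) for u in local_updates], sizes
)
```

Larger `sigma` hides more about each registry's contribution but moves `noisy` further from `exact`, which is the tradeoff the notes describe.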
Near real-time analysis
Many risk factors for disease are socio-environmental
Need to understand these drivers
SEER Residential History Data
LexisNexis data for where individuals have lived 1995-2020
Can connect to individuals' pollution exposure
Requires collaboration across many different domains of experts
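Linking residential history to pollution exposure, as described above, amounts to joining each person's residence intervals against area-level pollutant measurements. The sketch below assumes a hypothetical layout: residences as `(area_id, first_year, last_year)` tuples and an `annual_level` lookup keyed by `(area_id, year)`. Real linkages would need geocoding, finer time resolution, and handling of missing data that this example skips.

```python
from typing import Dict, List, Optional, Tuple

# (area_id, first_year, last_year), with both years inclusive (hypothetical layout)
Residence = Tuple[str, int, int]


def mean_exposure(
    residences: List[Residence],
    annual_level: Dict[Tuple[str, int], float],
) -> Optional[float]:
    """Average the annual pollutant level over every residence-year with data.

    Years with no measurement for the area are skipped; returns None when
    no residence-year has a measurement.
    """
    values = []
    for area_id, first_year, last_year in residences:
        for year in range(first_year, last_year + 1):
            level = annual_level.get((area_id, year))
            if level is not None:
                values.append(level)
    return sum(values) / len(values) if values else None


# One hypothetical person who moved from area "A" to area "B".
history = [("A", 2000, 2001), ("B", 2002, 2002)]
levels = {("A", 2000): 10.0, ("A", 2001): 12.0, ("B", 2002): 8.0}
average_level = mean_exposure(history, levels)
```

Averaging over residence-years weights each location by how long the person lived there, which is one simple way to summarize cumulative exposure.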
Medium-Range goals
EHRLICH: surveillance for biopreparedness
Agent-based models, parameterized by live data feeds, expansion of FrESCO data harmonization/ingest infrastructure
Improved text information retrieval models via neural attention mechanisms
C-HER: centralized repository for environmental determinants of health data
Integrating 73 datasets
Integrated Health Security Surveillance Response Tools
Dual purpose tools for precision medicine AND population health
Data management, early warning, etc.