Disease detection at the speed of life: Near real-time disease surveillance at population scale

Heidi A Hanson, Oak Ridge National Lab

Abstract:

Large amounts of health and environmental data across heterogeneous populations are needed to rapidly identify vulnerable populations and provide near real-time situational readiness for public health threats. However, the effective development of near real-time population health surveillance remains hindered by numerous challenges. Data complexity and regulatory hurdles related to health data privacy prevent pooling of data across health care institutions in the US. Integration of diverse types of social and environmental determinants of health data across space and time requires advanced analytical methods and computational workflows. Computational limitations prevent scaling algorithms to the population level and have hindered the development and deployment of population health research tools. In her presentation, Dr. Hanson will critically examine some of these obstacles, drawing on current projects to illustrate innovative solutions. She will also propose new strategies to expedite progress in near real-time population health surveillance, emphasizing the need for interdisciplinary collaboration. This discussion aims to provide insights for leveraging these complex datasets effectively, thereby enhancing their impact on population health.

Bio:

Dr. Heidi Hanson is the Group Lead of the Biostatistics and Biomedical Informatics Group in the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory. Her training and experience in the fields of demography, statistics, biomedical informatics, and life course epidemiology allow her to bring a unique set of expertise to building computational tools to identify populations at high risk for disease. She is currently the lead on the DOE-National Cancer Institute (NCI) Modeling Outcomes using Surveillance Data and Scalable Artificial Intelligence (MOSSAIC) program, focused on advancing computing, predictive machine learning/deep learning (ML/DL) models, and large language models for near real-time extraction of information from health records for NCI-supported cancer research. The MOSSAIC team was awarded the NCI Director’s Award for Data Science and an R&D 100 Award for the products they have developed to automate cancer surveillance. She also leads the "Data-Driven Population Health Surveillance at Scale for Pandemic Readiness" project, which is advancing a suite of innovative computational tools designed to enhance biopreparedness through the efficient integration of real-world data.

Focus:
- Near real-time analytics of disease and its spread at scale
- Data-driven clinical decision tools
- National health security
- Individual health
MOSSAIC Project: https://www.olcf.ornl.gov/tag/mossaic/
- Collaboration between ORNL and NIH
- ADMIRRAL: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/admirral
- IMPROVE: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/improve
- ATOM: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/atom
- CANDLE: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/candle
Real-world surveillance
- Surveillance, Epidemiology, and End Results (SEER) Registries: https://seer.cancer.gov
  - 850k/year cancer diagnoses collected
  - Many regional registries collecting data
  - Lots of manual extraction
  - 2-year lag in reporting
- MOSSAIC Challenge
  - Using AI to bring cancer surveillance to near-real time
  - >90% of cancers are histologically reported (pathology reports from tumor slide observations; reports are thousands of words long and require domain expertise to understand)
  - Can use ML models to read pathology reports and code them into tabular records:
    - Collect data from multiple SEER registries: 6 million pathology reports
    - Mostly text but starting a pilot on images
    - Combination of multiple models that extract different features and use different documents
  - Cancer categorization:
    - Malignancy
    - Phenotype
    - Pediatric cancer classification
- System deployed at SEER sites, covering 48% of US population
  - Auto-extraction:
    - Model predicts own confidence, presents uncertain predictions to human experts
    - Where confidence is very high, the model auto-codes the report. This covers 23-27% of pathology reports, with >98% accuracy.
  - Collaboration with Veterans Affairs (VA) registry to adapt model to VA’s own data to make predictions out of sample with good accuracy
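
The auto-extraction step above (auto-code when the model is confident, otherwise defer to a human expert) can be illustrated with a minimal Python sketch. This is not the MOSSAIC implementation: the function names `pick_threshold` and `route`, the validation data, the report IDs, and the example codes are all hypothetical, and it assumes the model emits a single confidence score per report. The idea is to choose a confidence cutoff on held-out data so that the auto-coded subset meets a target accuracy, then route new predictions by that cutoff.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    report_id: str     # hypothetical identifier for a pathology report
    label: str         # predicted code (illustrative values only)
    confidence: float  # model's own confidence estimate in [0, 1]


def pick_threshold(validation, target_accuracy):
    """Return the lowest cutoff whose retained set meets the target accuracy.

    validation: list of (confidence, is_correct) pairs from held-out data.
    Returns None if no cutoff reaches the target.
    """
    ranked = sorted(validation, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += ok
        # Only cut between distinct confidence values, so that
        # "confidence >= cutoff" retains exactly the first i items.
        is_boundary = i == len(ranked) or ranked[i][0] < conf
        if is_boundary and correct / i >= target_accuracy:
            best = conf
    return best


def route(predictions, threshold):
    """Split predictions into auto-coded records and a human-review queue."""
    auto, review = [], []
    for p in predictions:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


# Hypothetical held-out results: (confidence, was the prediction correct?)
validation = [(0.99, True), (0.97, True), (0.95, True),
              (0.90, False), (0.85, True), (0.60, False)]
threshold = pick_threshold(validation, target_accuracy=0.98)

new_predictions = [
    Prediction("r1", "C50.9", 0.99),
    Prediction("r2", "C34.1", 0.96),
    Prediction("r3", "C18.7", 0.70),
]
auto, review = route(new_predictions, threshold)
print(f"cutoff={threshold}, auto-coded={len(auto)}, sent to review={len(review)}")
```

Raising the target accuracy shrinks the auto-coded share and grows the review queue, which is the trade-off behind auto-coding only a fraction of reports.
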
- Prevailing challenges:
  - Computational limitations
    - Hospitals produce 50 PB of data, 97% goes unused
    - DOE compute facilities provide lots of compute power; the CITADEL secure facility allows them to work with health data: https://www.olcf.ornl.gov/tag/citadel/
  - Data complexity, regulatory hurdles
    - Data sources, types, schemas and quality are very heterogeneous
    - Active work on harmonized data models
    - Using the North American Association of Central Cancer Registries (NAACCR) data model
    - Others: Sentinel, PCORnet, i2b2, OMOP
    - Automatic Classification for Common Data Model:
      - BERT-based NLP model
      - Multimodal ensembles for identifying recurrent disease
        - Hard task, since recurrence is not commonly tracked in registries
  - Integration of diverse social/environmental determinants of health
    - Distribution of diseases is biased in space and sub-populations
    - Especially true for rare diseases, where the sample size is small in any local dataset
    - Focus on privacy-preserving federated learning
      - Tradeoff between privacy and accuracy
      - Analysis of data within the individual registries can focus more on accuracy
      - Federation of results across registries must maintain privacy
      - Near real-time analysis
    - Many risk factors for disease are socio-environmental
    - Need to understand these drivers
    - SEER Residential History Data
      - LexisNexis data for where individuals have lived 1995-2020
      - Can connect to individuals' pollution exposure
  - Requires collaboration across many different domains of experts
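
The federated learning point above (registries analyze their own data locally and share only results) can be sketched in Python as a toy federated-averaging loop. Everything here is an assumption for illustration: the synthetic data, the tiny linear model, the function names, and the `noise_scale` parameter, which perturbs each site's shared parameters to mimic the privacy/accuracy trade-off. It does not provide a formal privacy guarantee and is not the method used in the projects described.

```python
import random


def local_fit(xs, ys, weights, lr=0.5, epochs=50):
    """One site's local training: gradient descent for y ~ w0 + w1 * x."""
    w0, w1 = weights
    n = len(xs)
    for _ in range(epochs):
        g0 = sum((w0 + w1 * x - y) for x, y in zip(xs, ys)) / n
        g1 = sum((w0 + w1 * x - y) * x for x, y in zip(xs, ys)) / n
        w0 -= lr * g0
        w1 -= lr * g1
    return (w0, w1)


def federated_round(sites, weights, noise_scale, rng):
    """Sites train locally and share only (optionally noised) parameters."""
    total = sum(len(xs) for xs, _ in sites)
    agg = [0.0, 0.0]
    for xs, ys in sites:
        local = local_fit(xs, ys, weights)
        # Raw records never leave the site; only these numbers are shared.
        shared = [w + rng.gauss(0.0, noise_scale) for w in local]
        for i, w in enumerate(shared):
            agg[i] += w * len(xs) / total  # weight by local sample size
    return tuple(agg)


def train(sites, rounds=20, noise_scale=0.0, seed=0):
    rng = random.Random(seed)
    weights = (0.0, 0.0)
    for _ in range(rounds):
        weights = federated_round(sites, weights, noise_scale, rng)
    return weights


def make_site(n, seed):
    """Synthetic site data drawn from the relation y = 1 + 2x."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return xs, [1.0 + 2.0 * x for x in xs]


def error(weights, truth=(1.0, 2.0)):
    return sum(abs(w - t) for w, t in zip(weights, truth))


sites = [make_site(30, 1), make_site(50, 2), make_site(20, 3)]
clean = train(sites)                   # no noise: accuracy first
noisy = train(sites, noise_scale=0.5)  # noised updates: privacy first
print(f"error without noise: {error(clean):.4f}")
print(f"error with noise:    {error(noisy):.4f}")
```

Larger values of `noise_scale` hide more about each site's contribution but push the aggregated model further from the true relationship.
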
- Medium-range goals
  - EHRLICH: surveillance for biopreparedness
    - Agent-based models, parameterized by live data feeds, expansion of FrESCO data harmonization/ingest infrastructure
    - Improved text information retrieval models via neural attention mechanisms
    - C-HER: centralized repository for environmental determinants of health data
      - Integrating 73 datasets
  - Integrated Health Security Surveillance Response Tools
    - Dual-purpose tools for precision medicine AND population health
    - Data management, early warning, etc.