A3.AI (www.a3.ai) is a nonprofit applied AI R&D organization. This presentation provides an overview of a year-long pro bono machine learning project started in July 2020 and shares preliminary findings to date: data analysis, feasibility, feature engineering, and a privacy-preservation proof of concept.
Aims
- Create machine learning models to predict a patient’s risk of severe clinical outcomes if infected with COVID-19. Examples of severe outcomes include hospitalization, ICU stay, ventilation, ECMO (heart-lung bypass) and mortality. These personalized risk scores and their associated risk factors can help citizens make informed work and lifestyle choices and can augment physicians’ clinical prognoses. They can also help health care organizations coordinate care and optimize resources, and help public health agencies devise strategies for planning, responding and reopening.
- Lay the foundation for a novel privacy-preserving, decentralized, collaborative machine learning platform: Collaborative Learning with Obfuscation, Aggregation and Knowledge Distillation (CLOAK). The platform enables organizations to jointly train machine learning models across data silos without sharing sensitive data or models. This modularized framework analyzes data sharing requirements and vulnerabilities, model accuracy goals and computing infrastructure constraints, then adapts advanced ML and cryptography techniques to the use case at hand. It optimizes Privacy-Utility-Efficiency choices. The COVID-19 solution will be the first large-scale reference implementation.
Data
De-identified person-level data is provided by the COVID-19 Research Database Consortium.
Base Population includes:
- 7 years of medical claims history for 90 million patients, over 3 billion claim lines
- Outpatient EHR records for 40 million patients
- Social data for 240 million people
- Death Registry
COVID-19 Population includes:
- 150,000 patients as of 08/2020, with the latest data added weekly.
Approach
Health insurance claims (key attributes: longitudinal records of ICD diagnoses, CPT procedures, NDC drugs), augmented with social, behavioral and EHR data, will be used to predict each individual COVID-19 patient’s clinical trajectory.
We use Machine Learning, Deep Learning and descriptive analytics to not only validate the risk factors identified by healthcare experts, but also discover previously unknown patterns. This requires overcoming data, computing and privacy constraints.
Feature engineering
Medical data is notoriously sparse and noisy. We employ various feature engineering techniques, including hand-crafting 150 features (e.g. underlying health conditions). Another data challenge is high dimensionality. There are over 100,000 distinct diagnosis, procedure and drug codes. As categorical features these are too fine-grained for model training. Instead of relying on common grouping schemes built on rigid clinical code classifications, we build embeddings from billions of claims data points by projecting clinical concepts into 50- to 100-dimensional vector spaces. This not only reduces the dimensionality of the features but also better captures the nuanced “relatedness” between codes, resulting in more predictive features.
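The embedding idea can be illustrated with a minimal sketch. It is not the project's pipeline: it assumes a simple count-based method (a code co-occurrence matrix factored by power iteration) in place of whatever embedding technique is actually used, and the patient records and code names such as `DX_DIABETES` are hypothetical placeholders. The toy data has two clinical clusters, so it is projected into 2 dimensions rather than 50 to 100.

```python
import math
import random

# Hypothetical toy claims: one list of codes per patient (placeholders, not real data).
PATIENT_CODES = [
    ["DX_DIABETES", "RX_METFORMIN", "PX_A1C_TEST"],
    ["DX_DIABETES", "RX_METFORMIN", "PX_A1C_TEST"],
    ["DX_DIABETES", "RX_METFORMIN"],
    ["DX_FOREARM_FRACTURE", "PX_CAST", "RX_IBUPROFEN"],
    ["DX_FOREARM_FRACTURE", "PX_CAST", "RX_IBUPROFEN"],
    ["DX_FOREARM_FRACTURE", "PX_CAST"],
    ["DX_FOREARM_FRACTURE", "PX_CAST"],
]


def cooccurrence_matrix(patients):
    """Count how often each pair of codes appears for the same patient.

    The diagonal holds each code's own frequency, so the matrix equals
    X^T X for a binary patient-by-code matrix X (positive semidefinite).
    """
    codes = sorted({c for p in patients for c in p})
    index = {c: i for i, c in enumerate(codes)}
    m = [[0.0] * len(codes) for _ in codes]
    for p in patients:
        present = [index[c] for c in set(p)]
        for i in present:
            for j in present:
                m[i][j] += 1.0
    return codes, m


def top_eigenpairs(m, k, iters=200, seed=0):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration with deflation.

    Assumes the matrix has rank >= k; otherwise normalization would divide by zero.
    """
    rng = random.Random(seed)
    n = len(m)
    m = [row[:] for row in m]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def embed(patients, dim):
    """Map each code to a dense vector of length `dim`."""
    codes, m = cooccurrence_matrix(patients)
    pairs = top_eigenpairs(m, dim)
    return {
        c: [math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
        for i, c in enumerate(codes)
    }


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


EMBEDDINGS = embed(PATIENT_CODES, dim=2)
```

In this toy setting, `cosine(EMBEDDINGS["DX_DIABETES"], EMBEDDINGS["RX_METFORMIN"])` is close to 1 while the similarity between the diabetes and fracture codes is close to 0, which is the kind of relatedness a rigid code hierarchy would not expose. At the scale described above, a numerical library and a sparse factorization would be needed in place of the pure-Python loops.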
Algorithms
Traditional machine learning algorithms such as Random Forest will be used as the baseline. Deep Learning with a lighter-weight Transformer architecture will also be explored to reduce reliance on expert-guided feature engineering and to uncover new patterns.
Privacy-preserving Collaborative Machine Learning
Based on the COVID-19 solution requirements and constraints, Differential Privacy, Secure Multiparty Computation, and Machine Learning Teacher Ensemble techniques will be applied.
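To make these three building blocks concrete, here is a minimal sketch of one textbook primitive for each: additive secret sharing for a secure sum (Secure Multiparty Computation), the Laplace mechanism for a noisy count (Differential Privacy), and a noisy majority vote over teacher predictions (teacher ensemble). None of this is the CLOAK implementation. The function names, the modulus, the noise parameters and the example numbers are all assumptions chosen for illustration, and privacy accounting is omitted.

```python
import random

PRIME = 2**61 - 1  # assumed modulus for additive secret sharing


def share(value, n_parties, rng):
    """Split an integer into n additive shares that sum to `value` mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values, rng):
    """Sum one private integer per party without revealing any single value.

    Party i splits its value into shares and hands share j to party j.
    Each party publishes only the sum of the shares it received.
    """
    n = len(private_values)
    all_shares = [share(v, n, rng) for v in private_values]
    partials = [sum(s[j] for s in all_shares) % PRIME for j in range(n)]
    return sum(partials) % PRIME


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism: release a count with noise of scale sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def noisy_teacher_vote(teacher_labels, classes, noise_scale, rng):
    """Return the class with the highest noisy vote count across teacher models.

    Noise is added to every class, including classes that received no votes.
    """
    counts = {c: 0.0 for c in classes}
    for label in teacher_labels:
        counts[label] += 1.0
    noisy = {c: n + laplace_noise(noise_scale, rng) for c, n in counts.items()}
    return max(noisy, key=noisy.get)


if __name__ == "__main__":
    rng = random.Random(2020)
    # Hypothetical per-organization counts of severe cases.
    site_counts = [120, 45, 300]
    total = secure_sum(site_counts, rng)
    print("secure total:", total)
    print("DP release:", round(dp_count(total, epsilon=0.5, rng=rng), 1))
    votes = ["severe", "severe", "mild", "severe"]
    print("teacher label:", noisy_teacher_vote(votes, ["mild", "severe"], 1.0, rng))
```

The secure sum is exact, so accuracy is unaffected and the cost is extra communication. The two noise-based mechanisms trade accuracy for privacy through `epsilon` and `noise_scale`. Choosing where to sit on those axes is the kind of Privacy-Utility-Efficiency decision the framework is meant to support.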