Projects

Information extraction from cancer pathology reports:

Surveillance, Epidemiology, and End Results (SEER) is a nationwide program administered by the National Cancer Institute (NCI) at NIH to track trends in cancer incidence across the USA. Cancer pathology reports are collected from pathology laboratories across the country and maintained by dedicated cancer registries. Cancer registrars manually process these reports and extract the pertinent information to monitor cancer incidence trends. Automating this task is critical for a cost-effective, nationwide, and up-to-date cancer surveillance program. In my postdoctoral research, I improved the state-of-the-art convolutional neural network (CNN) model for information extraction from cancer pathology reports in two major ways.

First, I collaborated on the development of a probabilistic deep learning model for information extraction tasks. The state-of-the-art CNN model is prone to overfitting when training examples are scarce and has limited uncertainty quantification capabilities. To overcome these limitations, we built a deep kernel learning (DKL) model with a shallow-wide CNN feature extractor. A DKL model is obtained by feeding the output of a neural network (NN) feature extractor into a Gaussian process (GP) classifier and training the resulting model with stochastic gradient descent in a variational inference framework. We showed that our probabilistic approach has a substantial benefit over the state-of-the-art CNN model, especially in the low training-example regime. The model remains relevant even when the overall training set is large, because cancer pathology reports have severe class imbalance and rare cancer types stay in the low training-example regime regardless.

Second, I built a reduced-order model of the shallow CNN model for reliable and interpretable deployment of information extraction.
The main insight behind the reduction is that information extraction from cancer pathology reports requires only a small number of domain-specific text segments. The shallow CNN model is well suited to identifying these key short text segments from the labeled training set; however, the identified text segments remain opaque to humans. This study filled the gap by developing a model reduction tool that reliably connects CNN filters to the relevant text segments and discards spurious associations. To obtain an interpretable model, we reduced the complexity of the shallow CNN representation by approximating it with a linear transformation of an n-gram presence representation, with non-negativity and sparsity priors on the transformation weights.
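
As a rough illustration of the reduction described above, the following minimal Python sketch approximates a set of filter activations with a non-negative, sparse linear map of a binary n-gram presence matrix. It is not the implementation from the study: the toy reports, the stand-in activations `Z` (which in practice would come from a trained CNN), the penalty `lam`, the proximal gradient solver, and the function names are all assumptions made for illustration.

```python
import numpy as np

def ngram_presence(docs, n=2):
    """Binary document-by-n-gram matrix: 1 if the n-gram occurs in the document."""
    grams = [{tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
             for toks in (d.lower().split() for d in docs)]
    vocab = sorted(set().union(*grams))
    index = {g: j for j, g in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc_grams in enumerate(grams):
        for g in doc_grams:
            X[i, index[g]] = 1.0
    return X, vocab

def fit_nonneg_sparse(X, Z, lam=0.05, n_iter=2000):
    """Minimize (1/2n)||XW - Z||_F^2 + lam * sum(W) subject to W >= 0
    with proximal gradient descent (gradient step, then clip at zero)."""
    n = X.shape[0]
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    W = np.zeros((X.shape[1], Z.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Z) / n
        W = np.maximum(0.0, W - step * (grad + lam))
    return W

# Toy reports (hypothetical, for illustration only).
docs = [
    "invasive ductal carcinoma left breast",
    "ductal carcinoma in situ right breast",
    "adenocarcinoma of the colon",
    "invasive lobular carcinoma left breast",
    "benign tissue from the colon",
    "adenocarcinoma right lung",
]
X, vocab = ngram_presence(docs, n=2)

# Stand-in for two CNN filter activations: filter 0 fires on reports 0 and 1,
# filter 1 fires on reports 2 and 4.
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]], dtype=float)

W = fit_nonneg_sparse(X, Z)
top_ngrams = [vocab[j] for j in W.argmax(axis=0)]
print(top_ngrams)  # the n-gram carrying the largest weight for each filter
```

The non-negativity constraint means an n-gram can only add evidence for a filter, and the sparsity penalty pushes the weights of n-grams that merely co-occur with the relevant segment toward zero, which is what makes the weight matrix readable.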

Resources: paper, paper, slides


Cancer prognosis using histopathology images:

Cancer prognosis has important implications for cancer treatment and monitoring. A growing body of research has shown that the host immune response to the tumor has prognostic and predictive significance in many malignancies, including breast cancer, cutaneous melanoma, colon cancer, and lung adenocarcinoma. Based on this research, tumor-infiltrating lymphocyte (TIL) scoring systems have been introduced for cancer prognosis. TIL scoring systems for different tumor types vary widely in detail, scope, accuracy, and resource requirements, and improving prognosis beyond these traditional approaches remains an active research area. As digital slide scanner technologies have become more reliable, histopathological whole-slide images (WSIs) have become available in large numbers. Because of this increased availability of digital pathology data, computational pathology has seen widespread application of deep learning to cancer diagnosis and prognosis tasks. However, the utility of these methods for prognosis has remained limited because of the gigapixel size of WSIs and the lack of pixel-level annotations, which stems from a poor understanding of the factors that affect prognosis.

I plan to address these issues by proposing a new framework for building a weakly semi-supervised prognosis model. I plan to use the large volume of case-level annotations to learn the morphological features that distinguish high-risk from low-risk cases, and to exploit the small volume of available pixel-level annotations to overcome any ill-posedness in the weakly supervised model. The research objectives and milestones of this project are (1) training a weakly semi-supervised deep survival model on histopathology images and (2) identifying, through counterfactual analysis, the morphological patterns in histopathology images that carry prognostic information. I plan to integrate these two steps into a single end-to-end differentiable model.
This project’s long-term goal is to develop a novel attribution model that links the morphological features of histopathology images to tumor-infiltrating lymphocyte subtypes in the tumor microenvironment, using data from emerging spatial proteomics technologies such as CO-Detection by indEXing (CODEX) multiplexed imaging.
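
To make the idea of learning from case-level annotations concrete, here is a minimal NumPy sketch of one common way to set up such a model: patch features from a slide are pooled with attention into a single slide vector, a linear head maps it to a risk score, and the scores are compared with follow-up data through the negative Cox partial log-likelihood. This is only an illustration under stated assumptions, not the proposed framework: the attention pooling, the Cox loss, the random patch features, and every name below (`attention_pool`, `V`, `w`, `beta`) are hypothetical choices, and the sketch shows the forward computation only, without training or the pixel-level supervision.

```python
import numpy as np

def attention_pool(patches, V, w):
    """Collapse a bag of patch features (n_patches, d) into one slide-level vector."""
    scores = np.tanh(patches @ V) @ w        # one attention score per patch
    scores = scores - scores.max()           # numerical stability for the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ patches, alpha

def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.
    Ties in follow-up time are not given special treatment."""
    order = np.argsort(-time)                # longest follow-up first
    risk, event = risk[order], event[order]
    # Running log-sum-exp: log of summed exp(risk) over cases still at risk.
    log_risk_set = np.logaddexp.accumulate(risk)
    return -np.sum((risk - log_risk_set) * event) / max(event.sum(), 1)

# Hypothetical data: 4 slides, each a bag of random 8-dimensional patch features.
rng = np.random.default_rng(0)
d, h = 8, 4
V = rng.normal(size=(d, h))                  # attention projection (assumed)
w = rng.normal(size=h)                       # attention scoring vector (assumed)
beta = rng.normal(size=d)                    # linear risk head (assumed)

bags = [rng.normal(size=(n_patches, d)) for n_patches in (5, 12, 7, 9)]
time = np.array([30.0, 12.0, 45.0, 8.0])     # follow-up time per case
event = np.array([1, 1, 0, 1])               # 1 = event observed, 0 = censored

slide_vectors = np.stack([attention_pool(bag, V, w)[0] for bag in bags])
risk = slide_vectors @ beta
loss = cox_neg_log_partial_likelihood(risk, time, event)
print(loss)
```

Because the attention weights sum to one over the patches of a slide, they also give a first, coarse indication of which regions drive the predicted risk. In a trainable version the same computation would be written in an automatic differentiation framework so that the feature extractor, the attention module, and the risk head can be optimized jointly.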