This page provides a summary of some of my past and recent research work (updated July 2020).
My research aims to develop methods for analyzing and extracting information from large, complex datasets. In my current and past work, I have collaborated with researchers in neurology, solar physics, immunology, and genetics and have developed methods for analyzing data in a wide variety of applications, including high frequency oscillations (HFOs) occurring in intracranial EEG measurements in epileptic patients, solar images of active regions and sunspots, and biomedical data including single-cell RNA-sequencing (scRNA-seq) measurements, mass cytometry data, gut microbiome data, single nucleotide polymorphism (SNP) measurements, and Hi-C chromatin conformation measurements. In all of these systems, the underlying processes are complex and nonlinear, requiring the development of new models and methods for analysis.
Data Visualization
Figure: Comparison of the PHATE visualization to PCA, t-SNE, and diffusion maps on (A) artificial tree data in 40 dimensions colored by branch and (B) scRNA-seq data from embryoid bodies colored by time sample. PHATE is the only method that captures both the local and global structure of the data in 2 dimensions.
Data visualization is a useful tool for interpreting data, which is a necessary step in scientific discovery. The high dimensionality of many datasets makes it difficult to visualize and interpret the data. Data from many high-throughput biological technologies are also very noisy. While dimensionality reduction and clustering methods are commonly used to denoise and visualize the data, existing methods such as t-SNE, diffusion maps, and PCA are not adapted to visualizing high-dimensional progression or transition structures.
Many datasets contain several different progressions. For example, all the cells in the human body develop from a single cell, bone marrow cells are constantly changing states, and frames from a video exhibit smooth transitions. In many cases, the dynamics that characterize these progressions or transitions between states are of greater interest than the states themselves. I am developing a new data visualization technique called PHATE that is particularly well-suited for revealing these progression structures. In PHATE, we first use the manifold learning technique of data diffusion to learn the lower-dimensional structure. This denoises the data and aids in dimensionality reduction. However, this procedure often encodes different trajectories in different diffusion dimensions, which makes it difficult to visualize multiple trajectories simultaneously. Additionally, the diffusion process can introduce artifacts at boundaries in the data. PHATE corrects both of these issues by transforming the diffused probabilities into a diffusion potential, which is then embedded into two or three dimensions for visualization via multidimensional scaling. This has the effect of encoding the trajectories in fewer dimensions and reducing boundary artifacts, which improves the visualization.
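As a rough illustration, the diffusion and potential steps described above can be sketched as follows. This is a minimal sketch and not the PHATE implementation: the fixed-bandwidth Gaussian kernel, the parameter names `bandwidth` and `t`, and the function name are assumptions made here for illustration, and the final multidimensional scaling step is left out.

```python
import math


def diffusion_potential_distances(points, bandwidth=1.0, t=3):
    """Return pairwise distances between rows of the -log diffusion probabilities."""
    n = len(points)
    # Gaussian affinities between all pairs of points (illustrative kernel choice).
    kernel = [[math.exp(-math.dist(p, q) ** 2 / bandwidth ** 2) for q in points]
              for p in points]
    # Row-normalize so that each row is a probability distribution (Markov matrix).
    markov = []
    for row in kernel:
        total = sum(row)
        markov.append([k / total for k in row])
    # Diffuse: raise the Markov matrix to the t-th power.
    diffused = markov
    for _ in range(t - 1):
        diffused = [[sum(diffused[i][k] * markov[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
    # Potential: negative log of the diffused probabilities.
    eps = 1e-12  # guards against log(0)
    potential = [[-math.log(p + eps) for p in row] for row in diffused]
    # Distances between potential rows; these would be the input to MDS.
    return [[math.dist(potential[i], potential[j]) for j in range(n)]
            for i in range(n)]


# Three points on a line plus one far-away point (toy data).
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
distances = diffusion_potential_distances(points)
```

In this toy example the first three points end up close to each other in potential distance and far from the isolated fourth point. A two- or three-dimensional embedding would then be obtained by applying multidimensional scaling to `distances`.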
An advantage of PHATE over other visualization methods such as t-SNE is that it is metric preserving. Thus relative positions of cells or data points within the visualization are meaningful as PHATE reveals the actual geometry of the data. From the visualization, we can then extract information about cell signaling associated with the various trajectories. This will give valuable insight into cell development and progression. Note that while PHATE is designed to preserve transition structures, it does not impose any structural assumptions on the data. Thus PHATE can also be used to visualize other structures, such as clusters.
We are applying PHATE to Facebook network data, facial images, and multiple biological datasets including bone marrow mass cytometry data, gut microbiome data, SNP data, Hi-C data, and new single-cell RNA-sequencing (scRNA-seq) data obtained from embryoid bodies (EB) by Professor Natalia Ivanova’s lab. With PHATE, we identify previously unknown branches within bone marrow scRNA-seq data and developmental branches within the new EB data. Biological experiments are currently being designed to validate these results. Given its success at identifying biological structure, we also plan to use PHATE to identify branching structures in other data such as cancer data. Analyzing these structures may lead to innovative methods of treatment and prevention.
Relevant Publications and Media:
K.R. Moon, D. van Dijk, Z. Wang, S. Gigante, D. Burkhardt, W. Chen, K. Yim, A. van den Elzen, M.J. Hirn, R.R. Coifman, N.B. Ivanova, G. Wolf, S. Krishnaswamy, "Visualizing Transitions and Structure for Biological Data Exploration," Nature Biotechnology, vol. 37, no. 12, pp. 1482-1492, Dec. 2019. (Link, bioRxiv, code)
YouTube video showing PHATE on the Frey Face dataset: full or short version
W. Zhang, J. Rhodes, A. Garg, J. Takemoto, X. Qi, S. Harihar, C.T. Chang, K.R. Moon, A. Zhou, "Label-free discrimination and quantitative analysis of oxidative stress induced cytotoxicity and potential protection of antioxidants using Raman micro-spectroscopy and machine learning," Analytica Chimica Acta, vol. 1128, pp. 221-230, Sept. 2020. (Link)
Y. Zhao, M. Amodio, B. Vander Wyk, B. Gerritsen, M.M. Kumar, D. van Dijk, K.R. Moon, X. Wang, A. Malawista, M.M. Richards, M.E. Cahill, A. Desai, J. Sivadasan, M.M. Venkataswamy, V. Ravi, P. Kumar, S.H. Kleinstein, S. Krishnaswamy, R.R. Montgomery, "Single cell immune profiling of dengue virus patients reveals distinct immune signatures and intact immune responses to Zika virus," PLOS Neglected Tropical Diseases, vol. 14, no. 3, March 2020. (Link)
M. Shin, K. Yim, K.R. Moon, H. Park, S. Mohanty, J. Kim, R. Montgomery, A. Shaw, S. Krishnaswamy, I. Kang, "Dissecting alterations in human CD8+ T cells with aging by high-dimensional single cell mass cytometry," Clinical Immunology, vol. 200, pp. 24-30, March 2019. (Link)
Extension to Dynamical Data
PHATE assumes that the data are static and does not directly exploit any time structure that may exist in dynamical data. We therefore developed DIG, a visualization method for multivariate time series data that extracts an information geometry from a diffusion framework. The resulting embedding is noise resilient and presents a faithful visualization of the true structure at both local and global scales with respect to time and the overall structure of the data.
Relevant publications:
A. Duque, G. Wolf, K.R. Moon, "Visualizing high dimensional dynamical processes," IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Oct. 2019. (Link, arXiv, code)
Computing an Inverse Function from Visualizations
Common approaches to dimensionality reduction and visualization, including PHATE, use kernel methods for manifold learning. However, these methods typically only provide an embedding of fixed input data and cannot extend easily to new data points. On the other hand, while autoencoders naturally compute feature extractors that are both extendable to new data and invertible (i.e. reconstructing original features from the latent representation), they are limited in their ability to represent intrinsic geometry in comparison to kernel-based manifold learning techniques. In GRAE, we combined the features of both approaches by imposing a geometry regularization on the autoencoder bottleneck that forces the learned latent representation to be similar to the embedding produced by a selected kernel-based manifold learning technique (e.g. PHATE or t-SNE). The resulting embedding function is extendable to new points and invertible, and it gives better reconstructions than standard autoencoders. A potential application of this work is in using the embeddings to generate points in the original space that may have desirable characteristics of points that are nearby (i.e. style transfer).
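One way to picture the geometry regularization is as a two-term training loss: a reconstruction error plus a penalty on the distance between the bottleneck code and a precomputed embedding. The sketch below assumes squared-error terms and a single trade-off parameter `weight`; the function name, argument names, and exact form are illustrative assumptions, and the loss used in GRAE may differ.

```python
def geometry_regularized_loss(x, x_reconstructed, latent, target_embedding, weight=1.0):
    """Reconstruction error plus a penalty tying the latent code to a fixed embedding."""

    def mean_squared_error(a, b):
        # Sum of squared differences over all entries, averaged over samples.
        total = sum((ai - bi) ** 2
                    for row_a, row_b in zip(a, b)
                    for ai, bi in zip(row_a, row_b))
        return total / len(a)

    reconstruction = mean_squared_error(x, x_reconstructed)
    # Geometry term: the bottleneck should match the precomputed embedding
    # (for example, one produced by PHATE) for the same samples.
    geometry = mean_squared_error(latent, target_embedding)
    return reconstruction + weight * geometry


# Toy values: two samples, three features, two latent dimensions.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x_reconstructed = [[1.0, 2.0, 4.0], [4.0, 5.0, 6.0]]
latent = [[0.0, 0.0], [1.0, 1.0]]
target_embedding = [[0.0, 1.0], [1.0, 1.0]]
loss = geometry_regularized_loss(x, x_reconstructed, latent, target_embedding, weight=2.0)
```

In training, `latent` and `x_reconstructed` would come from the encoder and decoder, while `target_embedding` is computed once on the training data and held fixed.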
Relevant publications:
A. Duque, S. Morin, G. Wolf, K.R. Moon, "Extendable and invertible manifold learning with geometry regularized autoencoders," 2020. (arXiv)
Supervised Data Visualization
Most dimensionality reduction and visualization methods are unsupervised and do not take class labels into account even when they are available. The result is that unsupervised methods tend to emphasize the dominating structure in the data. In some settings, we are interested in visualizing the structure of the data that is relevant to the supervised learning problem. We developed a novel supervised visualization technique based on random forest proximities and diffusion-based dimensionality reduction called RF-PHATE. RF-PHATE outperforms other existing supervised dimensionality reduction methods, both quantitatively and qualitatively.
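For readers unfamiliar with random forest proximities, the classical definition is the fraction of trees in which two samples land in the same leaf. The sketch below computes that quantity from a table of leaf indices; the function name and the toy leaf indices are hypothetical, and the proximity definition used in RF-PHATE may differ from this classical one.

```python
def forest_proximities(leaf_ids):
    """leaf_ids[i][t] is the index of the leaf that sample i reaches in tree t."""
    n_samples = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    proximity = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_samples):
            # Count the trees in which samples i and j share a leaf.
            shared = sum(a == b for a, b in zip(leaf_ids[i], leaf_ids[j]))
            proximity[i][j] = shared / n_trees
    return proximity


# Toy leaf indices for three samples in a forest of three trees.
leaf_ids = [[0, 2, 1], [0, 2, 3], [4, 5, 3]]
proximity = forest_proximities(leaf_ids)
```

In practice the leaf indices would come from a trained forest, for example from the `apply` method of a scikit-learn random forest. The resulting proximity matrix can then play the role of the affinity matrix in a diffusion-based embedding.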
Relevant publications:
J.S. Rhodes, A. Cutler, G. Wolf, K.R. Moon, "Supervised visualization for data exploration," 2020. (arXiv)
Estimating Information Theoretic Measures
Many data analysis problems can be solved using information theoretic measures. For example, mutual information measures can be used to learn a network or structure within the data, estimate the relationship strength between variables such as genes in single cell data, and select features for improved classification. Divergence measures (e.g. the Kullback-Leibler divergence) can be used to test the hypothesis that two sets of samples come from the same probability distribution (i.e. a generalized mean comparison test), cluster data with distributions as features, and estimate bounds on the Bayes error for benchmarking classification problems. Entropy measures can be used for image registration, anomaly detection, and estimating the intrinsic dimension for improved dimensionality reduction. Additionally, density estimation is of independent interest as the probability density can be used in anomaly detection and to characterize metastable states for clustering.
An application that is of particular interest is related to the problem of classification, where the goal is to learn a classifier that minimizes the average probability of error for future samples. However, there exist many different classifiers of varying complexity and it is not known a priori which one will perform the best on a given data set. Additionally, the suitability of the measured feature or parameter space to the classification task at hand is often unknown. For example, in biology, there are many different technologies for measuring cell attributes including flow cytometry, mass cytometry, bulk RNA-sequencing, Hi-C chromatin measurements, single-cell protein measurements, and single-cell RNA-sequencing methods, to name a few. Each of these technologies has different costs and, in some cases, measures very different attributes. It is often unknown which technology is best suited in terms of cost and precision for detecting various diseases including cancer. Thus a benchmark is needed for these classification tasks that measures the predictive capabilities of a given feature space.
A common approach to derive such a benchmark is to apply a large corpus of classifiers to the data and choose the classifier with the lowest test error. This can be computationally intensive, especially if some of the classifiers require the selection of tuning parameters. Additionally, many classifiers can overfit the data, especially when the dimension is large, resulting in a poor generalization error.
A better benchmark for classification is the Bayes error. The Bayes error is the lowest average probability of error that any classifier can achieve on a given feature space; thus it is classifier agnostic and only depends on the feature space. Unfortunately, direct estimation of the Bayes error is difficult due to its dependence on the non-smooth min function. However, there are many bounds on the Bayes error that are related to a family of smoother information theoretic quantities known as divergence functionals, which are more easily estimated. Divergence functionals are integral functionals of two probability densities. Some common divergences include the Kullback-Leibler divergence, the Hellinger distance, and the Renyi divergences. Thus good bounds on the Bayes error can be estimated for benchmarking purposes by estimating the appropriate divergence functionals.
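As a concrete example of such a bound, the Bhattacharyya coefficient (which is closely related to the Hellinger distance) gives classical upper and lower bounds on the two-class Bayes error. The sketch below checks them on a pair of small discrete distributions, where the Bayes error can be computed exactly; the distributions and function names are made up for illustration.

```python
import math


def bayes_error(f1, f2, prior1=0.5):
    """Exact two-class Bayes error for discrete class-conditional distributions."""
    prior2 = 1.0 - prior1
    return sum(min(prior1 * a, prior2 * b) for a, b in zip(f1, f2))


def bhattacharyya_bounds(f1, f2, prior1=0.5):
    """Lower and upper bounds on the Bayes error from the Bhattacharyya coefficient."""
    prior2 = 1.0 - prior1
    bc = sum(math.sqrt(a * b) for a, b in zip(f1, f2))
    upper = math.sqrt(prior1 * prior2) * bc
    lower = 0.5 - 0.5 * math.sqrt(1.0 - 4.0 * prior1 * prior2 * bc ** 2)
    return lower, upper


# Toy class-conditional distributions over three outcomes.
f1 = [0.7, 0.2, 0.1]
f2 = [0.1, 0.3, 0.6]
error = bayes_error(f1, f2)
lower, upper = bhattacharyya_bounds(f1, f2)
```

Here the exact Bayes error lies between the two bounds. With real data the densities are unknown, so the coefficient itself has to be estimated from samples.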
In practice, the underlying densities are unknown and so the densities and information measures must be estimated from data. A common approach is to fit the densities to a parametric model such as a Gaussian distribution. However, these models are often a poor fit for high-dimensional data. Additionally, numerical integration may be required which is computationally intensive.
Given these weaknesses in parametric models, I used nonparametric estimation techniques to estimate information measures including divergence functionals. Until recently, little was known about the mean squared error (MSE) convergence rates of existing nonparametric divergence estimators. In my work, I analyzed the MSE convergence rate of standard kernel density plug-in estimators. I showed that the bias of these estimators, as a function of the sample size, scales exponentially with the dimension of the data, resulting in heavily biased estimators even for relatively small dimensions. In these same works, we counter this curse of dimensionality by deriving ensemble methods to obtain estimators that achieve the optimal convergence rate when the densities are sufficiently smooth. I also extended the theory of these ensemble estimators by deriving their asymptotic distribution. This enables us to construct confidence intervals and p-values, which are crucial in scientific research.
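The general idea behind the ensemble approach can be summarized as follows. The notation here is introduced only for illustration, and the precise assumptions and rates are given in the publications listed below. Take plug-in estimators computed at several bandwidths h(l), indexed by l in a finite set, and combine them with weights w(l) that sum to one. If the slowly decaying bias terms of each estimator factor into a part that depends only on l and a part that depends only on the sample size N, the weights can be chosen so that those terms cancel:

```latex
\hat{G}_w = \sum_{l \in \mathcal{L}} w(l)\, \hat{G}_{h(l)},
\qquad \sum_{l \in \mathcal{L}} w(l) = 1,

\mathrm{Bias}\big[\hat{G}_{h(l)}\big]
  = \sum_{i \in J} c_i\, \psi_i(l)\, \phi_i(N) + \text{(faster-decaying terms)},

\sum_{l \in \mathcal{L}} w(l)\, \psi_i(l) = 0 \quad \text{for all } i \in J
\;\;\Longrightarrow\;\;
\mathrm{Bias}\big[\hat{G}_w\big] = \text{(faster-decaying terms only)}.
```

The constants c_i are unknown, but they do not need to be estimated, because the cancellation conditions involve only the known functions of l.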
I have used these estimators to estimate various information theoretic measures of scientific data.
Relevant Publications:
W. Zhang, J. Rhodes, A. Garg, J. Takemoto, X. Qi, S. Harihar, C.T. Chang, K.R. Moon, A. Zhou, "Label-free discrimination and quantitative analysis of oxidative stress induced cytotoxicity and potential protection of antioxidants using Raman micro-spectroscopy and machine learning," Analytica Chimica Acta, vol. 1128, pp. 221-230, Sept. 2020. (Link)
K.R. Moon, K. Sricharan, K. Greenewald, A.O. Hero III, "Ensemble Estimation of Information Divergence," Entropy (Special Issue on Information Theory in Machine Learning and Data Science), vol. 20, no. 8, pp. 560, July 2018. (Link, arXiv)
K.R. Moon, K. Sricharan, A.O. Hero III, "Ensemble estimation of mutual information," IEEE International Symposium on Information Theory (ISIT), pp. 3030-3034, June 2017. (Link, long version at arXiv)
K.R. Moon, V. Delouille, and A.O. Hero III, "Meta learning of bounds on the Bayes classifier error," IEEE Signal Processing and SP Education Workshop, pp. 13-18, Aug. 2015. (Link, arXiv)
K.R. Moon and A.O. Hero III, "Multivariate f-divergence estimation with confidence," Advances in Neural Information Processing Systems (NIPS), pp. 2420-2428, Dec. 2014. (Link, arXiv)
K.R. Moon and A.O. Hero III, "Ensemble estimation of multivariate f-divergence," IEEE International Symposium on Information Theory (ISIT), pp. 356-360, June 2014. (Link, long version at arXiv)
Unsupervised Learning on Sunspot Images
This work is in collaboration with researchers from the Royal Observatory of Belgium and is focused on finding patterns within sunspot and active region images using signal processing and unsupervised learning techniques. Sunspots are dark areas seen in white light images of the Sun. They correspond to regions of locally enhanced magnetic field known as active regions, which are visible in magnetogram images. The morphology of sunspot groups and the associated active region is correlated with solar flare incidence. Sunspot groups are commonly classified by eye to aid in solar flare prediction. However, visual classification introduces bias stemming from the artificial and subjective nature of the discrete categorization. Some studies have attempted to reproduce these classification schemes through supervised learning techniques. While this has resulted in reduced human bias, this does not reduce the bias inherent in the classification scheme.
We have thus focused on unsupervised learning techniques. Our method uses an image patch analysis and matrix factorization approach that has led to natural groupings of sunspot images that coincide with physical features such as the size of the sunspots and the distribution of magnetic field values. The divergence estimators we have derived have been useful in this analysis as well.
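To make the patch representation concrete, the sketch below turns a small image into a matrix whose rows are flattened square patches. A matrix of this kind is the sort of input a matrix factorization would operate on. The patch size, the function name, and the toy image are assumptions for illustration, and the factorization step itself is not shown.

```python
def extract_patches(image, size=3):
    """Return every size-by-size patch of a 2D image as a flattened row."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            # Read the patch row by row, starting at its top-left corner (r, c).
            patch = [image[r + dr][c + dc]
                     for dr in range(size) for dc in range(size)]
            patches.append(patch)
    return patches


# Toy 4x4 "image" with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = extract_patches(image)
```

For this 4x4 image and 3x3 patches, the result has four rows of nine values each.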
Open problems in this area include expanding our analysis to a time series of images and connecting it directly to solar flares.
Relevant Publications:
K.R. Moon, V. Delouille, J.J. Li, R. De Visscher, F. Watson, and A.O. Hero III, "Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization," Topical Issue on Statistical Challenges in Solar Information Processing, Journal of Space Weather and Space Climate, vol. 6, A3, Jan. 2016. (Link, arXiv)
K.R. Moon, J.J. Li, V. Delouille, R. De Visscher, F. Watson, and A.O. Hero III, "Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis," Topical Issue on Statistical Challenges in Solar Information Processing, Journal of Space Weather and Space Climate, vol. 6, A2, Jan. 2016. (Link, arXiv)
K.R. Moon, J.J. Li, V. Delouille, F. Watson, and A.O. Hero III, "Image patch analysis and clustering of sunspots: A dimensionality reduction approach," IEEE International Conference on Image Processing (ICIP), pp. 1623-1627, Oct. 2014. (Link, arXiv)
Other Projects
High Frequency Oscillations in Epilepsy Patients
This project is in collaboration with researchers in the Neurology department at the University of Michigan. This work investigates the relationship between high frequency oscillations (HFOs) and epileptic events in epilepsy patients. We have used the divergence and entropy estimators to estimate intrinsic dimension and estimate bounds on the Bayes error for a classification problem.
Relevant Publications:
S.V. Gliske, K.R. Moon, W.C. Stacey, and A.O. Hero III, "The intrinsic value of HFO features as a biomarker of epileptic activity," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6290-6294, Mar. 2016. (Link, arXiv)
Data Imputation in Single-Cell RNA-Sequencing
Recently, single-cell RNA-sequencing (scRNA-seq) technologies have emerged which measure tens of thousands of gene expression levels in thousands of cells. This higher resolution of gene expression measurements at the single-cell level promises to provide valuable insights into gene-gene relationships at the cell level. However, this technology suffers from prevalent undersampling of molecules and dropout of low-expression genes, which can confound biological relationships. Due to the complexity and noise levels of scRNA-seq data, standard methods of data imputation such as matrix completion have been ineffective.
In my postdoctoral research, I have worked on two projects focused on data imputation for single-cell data. One project is in collaboration with Professor Dana Pe’er’s lab at the Memorial Sloan Kettering Cancer Center. In this approach, we model the data as having been sampled from a low-dimensional manifold. Key aspects of manifold models are that the dimension of the manifold is significantly lower than that of the ambient space and that transitions along the manifold are smooth. The manifold model is a reasonable assumption for scRNA-seq data as many genes are co-expressed (resulting in fewer degrees of freedom) and cells transition smoothly from state to state as they develop and differentiate. Any deviations from the manifold model can then be viewed as noise and the data can be denoised by projecting the data onto the manifold. To do this, the underlying manifold must first be learned. A popular and effective method of manifold learning uses a diffusion process, which learns both the local and global structure of the data. The data is then imputed by projecting the data samples onto the learned manifold. This approach, which we call MAGIC, has shown great success in recovering known gene relationships that are completely unrecognizable before imputation. Based on this success on known relationships, MAGIC can confidently recover previously unknown gene relationships in scRNA-seq data.
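The core of the diffusion-based imputation described above can be sketched as multiplying the data by a powered Markov matrix, so that each cell's expression values become a weighted average over similar cells. This is a minimal sketch and not the MAGIC implementation: the fixed-bandwidth Gaussian kernel, the parameter names, and the toy data are assumptions made for illustration.

```python
import math


def diffuse_impute(data, bandwidth=1.0, t=2):
    """Smooth each cell's expression values over its neighbors via t diffusion steps."""
    n = len(data)
    n_genes = len(data[0])
    # Gaussian affinities between cells (illustrative kernel choice).
    kernel = [[math.exp(-math.dist(a, b) ** 2 / bandwidth ** 2) for b in data]
              for a in data]
    # Row-normalize to obtain a Markov transition matrix between cells.
    markov = []
    for row in kernel:
        total = sum(row)
        markov.append([k / total for k in row])
    # Applying the Markov matrix t times is the same as multiplying by its t-th power.
    imputed = data
    for _ in range(t):
        imputed = [[sum(markov[i][k] * imputed[k][g] for k in range(n))
                    for g in range(n_genes)] for i in range(n)]
    return imputed


# Toy data: rows are cells, columns are genes; cell 1 has a dropout (0.0) in gene 1.
cells = [[1.0, 2.0], [1.2, 0.0], [0.9, 2.1], [5.0, 9.0]]
imputed = diffuse_impute(cells, bandwidth=2.0)
```

In this toy example the zero entry of the second cell is replaced by a positive value borrowed from its two nearby cells, while the distant fourth cell is left almost unchanged.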
The second project uses an autoencoder to perform imputation in addition to clustering and visualization. Autoencoders are unsupervised deep neural networks that build a (typically) lower-dimensional representation of the data that can be used to accurately reconstruct the original data. Regularizers are applied at various layers to control the results. By carefully choosing the architecture and regularizers, the output of an autoencoder is a denoised version of the original data. This denoising property can be exploited to perform imputation. Furthermore, via a careful choice of regularizers, we construct an autoencoder that also automatically clusters the data and creates a representation that is suitable for visualization.
Relevant Publications:
M. Amodio, D. van Dijk, K. Srinivasan, W. Chen, H. Mohsen, K.R. Moon, A. Campbell, Y. Zhao, X. Wang, M. Venkataswamy, A. Desai, V. Ravi, P. Kumar, R. Montgomery, G. Wolf, S. Krishnaswamy, "Exploring Single-Cell Data with Multitasking Deep Neural Networks," Nature Methods, vol. 16, pp. 1139-1145, Oct. 2019. (Link, bioRxiv, code)
D. van Dijk, R. Sharma, J. Nainys, K. Yim, P. Kathail, A. Carr, C. Burdsiak, K.R. Moon, C. Chaffer, D. Pattabiraman, B. Bierie, L. Mazutis, G. Wolf, S. Krishnaswamy, D. Pe'er, "Recovering Gene Interactions from Single-Cell Data Using Data Diffusion," Cell, vol. 174, no. 3, pp. 716-729, July 2018. (Link, bioRxiv)
Detrending Mass Cytometry Data
Another complex source of technical noise in biological data arises from the fact that the measurement noise may be time-dependent. For example, as cell markers are measured in a mass cytometer, the measurement bias may drift during the measurement process. This drift in the bias may also result in biologically unrelated differences between samples. The current approach for correcting this machine bias in mass cytometry data is known as bead normalization. In this approach, small metal beads are included in several channels during the measurement process. The time trend in each bead channel is estimated and this time trend is subtracted from the other channels. There are several limitations of this approach. First, the inclusion of beads can be expensive in large-scale studies. Second, generally at most six bead channels are used to normalize 30 or more channels, so the time trend for non-bead channels must be interpolated. However, the time trends of the bead channels often differ greatly from each other, suggesting that it is unlikely that the time trend can be interpolated. Third, the beads used for normalization are very different from biological samples and there is evidence that the bead trends underestimate the actual time trends introduced by the machine in the biological data. Given these limitations, I am currently developing a normalization method that uses only the data and does not require beads. We are also extending our method to mass cytometry imaging data, which suffer from similar artifacts.
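The bead-based correction described above can be sketched for a single bead channel and a single marker channel as follows. This is a simplified illustration of the estimate-and-subtract idea only: the moving-average trend estimate, the choice to measure drift relative to the first time point, and all names are assumptions, and it is not the bead-free method under development.

```python
def moving_average(values, window=3):
    """Estimate a smooth time trend with a centered moving average."""
    half = window // 2
    trend = []
    for i in range(len(values)):
        # The window is truncated at the start and end of the series.
        neighbors = values[max(0, i - half): i + half + 1]
        trend.append(sum(neighbors) / len(neighbors))
    return trend


def subtract_bead_trend(bead_channel, marker_channel, window=3):
    """Remove the drift estimated from a bead channel from a marker channel."""
    trend = moving_average(bead_channel, window)
    baseline = trend[0]
    # Drift is the change in the bead trend relative to the first time point.
    drift = [value - baseline for value in trend]
    return [m - d for m, d in zip(marker_channel, drift)]


# Toy measurements ordered by acquisition time; both channels drift upward.
bead = [10, 11, 12, 13, 14, 15]
marker = [5, 6, 7, 8, 9, 10]
corrected = subtract_bead_trend(bead, marker)
```

After the correction the marker channel is nearly flat, apart from small edge effects where the moving-average window is truncated.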