Research

I am currently an associate professor of Clinical Biostatistics (in Psychiatry) at Columbia University. My research interests broadly include machine learning and time series. My recent work focuses on latent feature extraction in high-dimensional structured data and brain imaging. My postdoctoral work focused on longitudinal trajectory estimation and clustering via dimension reduction for high-dimensional longitudinal brain imaging data. My dissertation focused on independent component analysis (ICA), a statistical method for separating a mixture of autocorrelated signals.

Multivariate Mediation Analysis

TBA...

Longitudinal High-dimensional Data Analysis - LFPCA

Magnetic resonance imaging (MRI) is commonly used in the study of brain structure. Many studies are based on measurements of tissue volumes within a number of predefined regions of interest (ROIs), which must be defined before the analysis is conducted. In disease studies, this can be difficult without sufficient prior knowledge of which regions will be affected and how. Alternatively, voxel-based morphometry (VBM) is a complementary technique that measures local brain volume changes in a normalized space and thus does not suffer from these limitations [1]. In longitudinal VBM analysis, existing statistical methods are based on voxel-level statistical testing, which often fails to detect longitudinal changes because of substantial registration or non-biological error. To overcome these limitations, we consider a data-driven analysis that provides a more complete statistical framework for high-dimensional longitudinal brain images, specifically in the context of Regional Analysis of Volumes Examined in Normalized Space (RAVENS) imaging. A key insight is the ability of longitudinal functional principal component analysis (LFPCA) to uncover interesting directions of variation in the presence of error from registration to a template. Previously, registration errors were handled either by extremely aggressive smoothing during post-registration processing or by improved normalization algorithms. While improved algorithms are certainly a desirable goal, all normalization algorithms must be tuned and suffer from bias/variance trade-offs. Our results suggest the possibility of employing less aggressive normalization.
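
To make the modeling idea concrete, the display below is a minimal sketch of a longitudinal functional decomposition in the spirit of LFPCA; the notation is illustrative and the exact model in the manuscript may differ.

\[
Y_{ij}(v) \;=\; \eta(v, T_{ij}) \;+\; X_{i,0}(v) \;+\; T_{ij}\, X_{i,1}(v) \;+\; U_{ij}(v) \;+\; \varepsilon_{ij}(v),
\]

where \(Y_{ij}(v)\) is the image of subject \(i\) at visit \(j\) and voxel \(v\), \(T_{ij}\) is the visit time, \(\eta\) is a fixed mean surface, \(X_{i,0}\) and \(X_{i,1}\) are subject-specific random intercept and slope functions, \(U_{ij}\) is a visit-specific deviation, and \(\varepsilon_{ij}\) is measurement error. Each random function is then expanded in a data-driven principal component basis, e.g. \(X_{i,0}(v) = \sum_k \xi_{ik}\,\phi_k^0(v)\), so that a subject's longitudinal trajectory is summarized by a small number of scores even when \(v\) ranges over hundreds of thousands of voxels.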

Seonjoo Lee, Vadim Zipunnikov, Brian S. Caffo, Daniel S. Reich, Dzung L. Pham (2012)

"Statistical Image Analysis of Longitudinal RAVENS Images: Methodology and Case Study"

Submitted to Biostatistics. [Manuscript] [Figures]

Time-lag Independent Component Analysis

Although researchers have had great success in developing efficient ICA algorithms, there has not been much research on ICA model validation. Existing ICA algorithms are built on assumptions about the nature of the mixing (instantaneous or convolutive) and the nature of the sources (independent or autocorrelated). We investigated the performance of ICA algorithms under various mixing conditions, and as part of this validation I proposed a convolutive ICA algorithm for echoic mixing. Our simulation studies show that the performance of ICA algorithms depends heavily on the mixing conditions and on the temporal independence of the sources. Most instantaneous ICA algorithms fail to separate autocorrelated sources, while convolutive ICA algorithms seem to depend strongly on the model specification and the approximation accuracy of the unmixing filters.
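
As a toy illustration of the issue (a sketch of my own, not the simulation design used in the manuscript), the short Python snippet below generates Gaussian AR(1) sources, mixes them instantaneously, and checks how well FastICA, a standard instantaneous ICA algorithm, recovers them; with near-Gaussian marginals, recovery is typically poor.

# Toy sketch: autocorrelated Gaussian AR(1) sources, instantaneous mixing,
# recovery attempted with a standard instantaneous ICA algorithm (FastICA).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
T, p = 2000, 2

def ar1(phi, T, rng):
    """Simulate a zero-mean AR(1) series with Gaussian innovations."""
    s = np.zeros(T)
    e = rng.standard_normal(T)
    for t in range(1, T):
        s[t] = phi * s[t - 1] + e[t]
    return s

S = np.column_stack([ar1(0.9, T, rng), ar1(-0.5, T, rng)])   # true sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                                    # instantaneous mixing matrix
X = S @ A.T                                                   # observed mixtures

ica = FastICA(n_components=p, whiten="unit-variance", random_state=0)
S_hat = ica.fit_transform(X)

# Absolute correlations between recovered components and true sources:
# values near 1 on a permutation of the diagonal indicate good recovery.
corr = np.abs(np.corrcoef(S_hat.T, S.T))[:p, p:]
print(np.round(corr, 2))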

Seonjoo Lee, Brian S. Caffo, Balajie Lakshmanan, and Dzung L. Pham (2012).

"Independent Component Analysis of the Mixture with Timelag and its Evaluation"

Submitted to Biometrics. [Manuscript].

Color Independent Component Analysis

ICA is an effective data-driven technique for extracting source signals from their mixtures. It aims to solve the blind source separation problem by expressing a set of observed mixed signals as linear combinations of independent latent random variables. The majority of existing ICA methods are based on independence measures of the sources, either parametrically or nonparametrically. However, marginal density based ICA methods, the most common form of ICA, do not use information about the correlation structures within the source signals. By not accounting for this intra-correlation information, interesting signals can sometimes be left unidentified. In fMRI studies, for example, the experiment-stimulus-related signals and physiological signals such as heartbeat or breathing are usually periodic; colored noise structures within the signals are therefore embedded in the fMRI data. In order to exploit the correlation structure within the sources, I formulated ICA in the spectral domain. To model the spectral densities, I considered two approaches: 1) a parametric approach using autoregressive moving average (ARMA) models and 2) a nonparametric approach using logspline density estimation. In my dissertation, I developed both methods, established their theoretical properties, and illustrated their advantages over established ICA methods via simulation and real fMRI and EEG data analyses.
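
For intuition, one way to write a frequency-domain ICA objective of this kind is a Whittle-type log-likelihood; the display below is a sketch in my own notation, and the exact criterion in the paper may differ.

\[
\ell(W, \theta) \;\approx\; T \log \lvert \det W \rvert \;-\; \sum_{j=1}^{p} \sum_{k} \left\{ \log f_j(\lambda_k; \theta_j) + \frac{I_{u_j}(\lambda_k)}{f_j(\lambda_k; \theta_j)} \right\},
\]

where \(W\) is the unmixing matrix, \(u_j(t) = w_j^\top x(t)\) is the \(j\)-th unmixed series, \(I_{u_j}(\lambda_k)\) is its periodogram at Fourier frequency \(\lambda_k\), and \(f_j(\cdot\,; \theta_j)\) is the spectral density of source \(j\), parameterized by an ARMA model or estimated nonparametrically via logspline. Maximizing over \(W\) and \(\theta\) jointly exploits both independence across sources and autocorrelation within each source.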

Seonjoo Lee, Haipeng Shen, Young Truong, Mechelle Lewis and Xuemei Huang (2011),

"Independent Component Analysis Involving Autocorrelated Sources

with an Application to Functional Magnetic Resonance Imaging"

Journal of the American Statistical Association, Volume 106, Issue 495 [link]

Poisson Factor Analysis with Normalization for miRNA-seq Data Analysis - PSVDOS

miRNA-sequencing data come in a new format: read counts of pre-specified miRNAs from a cell. Statistical challenges arise from special features of next-generation sequencing data: the data are read counts that are extremely skewed and non-negative, and the total number of reads varies dramatically across samples, requiring appropriate normalization. Statistical tools developed for microarray expression data, such as principal component analysis, are therefore sub-optimal for analyzing sequencing data. We proposed a family of Poisson factor models that explicitly takes into account the count nature of sequencing data and automatically incorporates sample normalization through the use of offsets. We developed an efficient algorithm for estimating the Poisson factor model, entitled Poisson Singular Value Decomposition with Offset (PSVDOS).
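
As a rough sketch of the type of model meant here (illustrative notation; the manuscript's exact parameterization may differ), a rank-\(K\) Poisson factor model with an offset can be written as

\[
y_{ij} \;\sim\; \mathrm{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} \;=\; \log N_i \;+\; \sum_{k=1}^{K} u_{ik}\, d_k\, v_{jk},
\]

where \(y_{ij}\) is the read count of miRNA \(j\) in sample \(i\), \(N_i\) is the total number of reads in sample \(i\) (the offset, which absorbs sample-level normalization), and \(u_{ik}\), \(d_k\), \(v_{jk}\) play roles analogous to the left singular vectors, singular values, and right singular vectors of an SVD. Estimation maximizes the Poisson likelihood rather than minimizing squared reconstruction error.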

Seonjoo Lee, Polly Chugh, Haipeng Shen, and Dirk Dittmer (2012)

"Poisson Singular Value Decomposition for Non­normalized miRNA­sequencing Data"

Submitted to Bioinformatics. [Manuscript] [Supplements]