Experience

Biostatistics and Modelling for Bioscience

Within DSM, I am part of the Biotechnology Center, which performs research and improvement to provide solutions to both business units within the company and external clients. My task is to provide statistical support to the assigned business projects. Currently I am involved in one of the largest business projects within DSM, related to bio-based processes, as well as several corporate research projects related to enzyme activities in food applications. My responsibilities are:

  1. to provide optimized experimental designs, so that the data generated by each experiment can answer the research questions with statistically valid statements
  2. to perform statistical analysis and produce informative reports and visualizations of the data
  3. to deliver the results of the analysis in a way that facilitates interpretation of the data by scientists from multi-disciplinary fields: biochemistry, fermentation, downstream processing and genetics.

Post-doctoral Research - Data Analysis for Clinical Research

In my post-doctoral research, my main task was to set up a fast and robust data analysis pipeline tailored for metabolomics analysis, especially the characterization and quantification of phospholipids involved in metabolic diseases such as MEGDEL syndrome, leukodystrophy and Barth syndrome. The work was part of the Leukotreat project, funded by the FP7 European research consortium. XCMS, a freely available pre-processing package, was modified and validated to automate and enhance data pre-processing for various projects performed within the Academisch Medisch Centrum Amsterdam and in collaboration with other universities in the Netherlands. On top of the automated data processing pipeline, a clustering method was developed to separate the peaks of different phospholipids in three-dimensional LC-MS chromatograms.

PhD Thesis: Proteomics and Multivariate Statistics

Proteomics

Most of my PhD project was dedicated to designing and developing algorithms to improve the in-house data processing pipeline for highly complex LC-MS data sets. The work focused on optimizing the time alignment algorithms, which were one of the main bottlenecks in processing complex LC-MS chromatograms. A modification of Correlation Optimized Warping that operates on high-quality mass traces pre-selected by the Component Detection Algorithm (CODA) was published in Analytical Chemistry (2008). The improvement of time alignment algorithms continued with the adaptation of Dynamic Time Warping and Parametric Time Warping to work on CODA-selected mass traces. A thorough assessment of the three methods can be found in this paper. Several generic methods, such as a reference selection algorithm and local and global evaluation methods, were developed to assess the quality of time alignment and to compare the performance of different algorithms. The new algorithms were evaluated on existing data sets, such as serum and urine samples, with different analytical and biological variability.


Multivariate Statistics

The last part of the thesis involved the post-processing pipeline for LC-MS data sets. The work compared the performance of widely used feature selection algorithms under different data set conditions, such as the number of samples and the true difference between groups. The behavior of these algorithms was assessed to gain insight and to advise biologists and biochemists, as the users of these methods, on which kinds of data sets and which research questions each method handles best.

The methods compared were the univariate t-test with Benjamini-Hochberg multiple testing correction, Nearest Shrunken Centroid based on probability score, Principal Component Discriminant Analysis, Partial Least Squares Discriminant Analysis and Linear Support Vector Machine combined with Recursive Feature Elimination. Forty urine samples were spiked at eight levels (5 samples per level) and measured by LC-MS. The spikes detected by LC-MS resulted in 153 peaks. Each feature selection method was evaluated based on its False Discovery Rate and True Positive Rate.
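
As a small illustration of this kind of evaluation (a minimal sketch, not the code used in the thesis), the Python snippet below applies the Benjamini-Hochberg step-up procedure to a list of p-values and then scores the selected features against a known set of spiked features. The function names, the significance level of 0.05 and the toy p-values are assumptions made only for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of features selected by the BH step-up procedure."""
    m = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value lies under the BH threshold.
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    return sorted(order[:k_max])


def fdr_tpr(selected, true_features):
    """Observed false discovery rate and true positive rate of a selection."""
    selected, true_features = set(selected), set(true_features)
    true_pos = len(selected & true_features)
    false_pos = len(selected - true_features)
    fdr = false_pos / len(selected) if selected else 0.0
    tpr = true_pos / len(true_features) if true_features else 0.0
    return fdr, tpr


# Toy data (hypothetical): five features, of which 0, 1, 2 and 4 are truly spiked.
p_values = [0.001, 0.008, 0.039, 0.0395, 0.6]
spiked = {0, 1, 2, 4}

selected = benjamini_hochberg(p_values, alpha=0.05)
fdr, tpr = fdr_tpr(selected, spiked)
```

Because the spiked peaks are known in advance, both rates can be computed directly for each method, which is what makes a spiked data set suitable for this kind of comparison.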