I have worked with Susan Bridges, Zhaoua Peng and many others on several computational biology research projects. My role in these projects has ranged from algorithm development to data analysis and system administration. The collaborations have ranged in size from two or three people to teams spanning several research groups at several universities.
Together with Susan Bridges, Shane Sanders, Nan Wang (from USM) and many other researchers, I developed a tool to identify new transcribed genomic regions. The basic idea is to generate proteomics data using mass spectrometry. The resulting peptide sequences are reverse-translated and mapped to the genome. Sequences which do not map to known genes are potentially new genes or alternate splice forms. This process is called proteogenomic mapping. I adapted an existing implementation of the Aho-Corasick algorithm to efficiently map the reverse-translated sequences to the genome. I also implemented an algorithm to predict new genes based on the reverse-translated sequences.
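As a rough illustration of the mapping step, the following Python sketch builds an Aho-Corasick automaton over a set of nucleotide patterns and scans a genome string once, reporting every match. It is not the adapted Java implementation. The function names (build_automaton, find_all), the toy genome and the patterns are hypothetical, and the sketch assumes that each reverse-translated peptide has already been expanded into concrete nucleotide strings and that only the forward strand is searched.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick trie with failure links (hypothetical helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # indices of patterns recognised at each state

    # Phase 1: insert every pattern into the trie.
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)

    # Phase 2: breadth-first traversal to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(genome, patterns):
    """Return (start, pattern) for every occurrence of any pattern in genome."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    state = 0
    for pos, ch in enumerate(genome):
        # Follow failure links until a transition on ch exists or we hit the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits


# Toy data, for illustration only.
toy_hits = find_all("ATGGCCATGAAA", ["ATG", "GCC", "TGA"])
print(toy_hits)  # [(0, 'ATG'), (3, 'GCC'), (6, 'ATG'), (7, 'TGA')]
```

The appeal of this approach for proteogenomic mapping is that the genome is read once regardless of how many patterns are searched, so the scan cost grows with the genome length and the number of matches rather than with the size of the pattern set.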
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) is a recent technological innovation that allows researchers to quickly and inexpensively identify DNA segments associated with a particular epigenetic feature, such as a histone modification or transcription factor binding site. I have worked with Zhaoua Peng to characterize the relationship between the histone modification H3K27me3 and gene expression using ChIP-Seq and gene expression microarray data. Our results in rice endosperm clearly show a relationship between the presence of H3K27me3 and the suppression of gene expression.
The epigenetic code hypothesis suggests that epigenetic features interact in complex ways and that the combination of features present at a particular genomic locus affects transcription of surrounding genes. Abiola and I have coupled ChIP-Seq data with the Bayesian network structure learning algorithms we have developed in an attempt to better understand the relationships among some of the features (variables) that make up the epigenetic code. Initial results suggest that the exact, optimal algorithms recover more of the relationships among features reported in the literature than approximate, greedy learning algorithms do.
Each of these projects has different system administration concerns. The proteogenomic mapping tool was written in Java. Our development process was highly iterative, especially when fine-tuning the behavior of the gene prediction algorithm. I set up and maintain a public Mercurial version control repository on Bitbucket for the source code. In addition to the quantitative data analysis for the rice endosperm project, we also used GBrowse and JBrowse from GMOD to visually collect qualitative information about important genes and their ChIP-Seq profiles. I installed and continue to maintain servers running these applications, as well as the ChIP-Seq data analysis software. The epigenetic code project entails working with tens of gigabyte-sized datasets. I have written both scalable machine learning algorithms capable of handling such large datasets and scripts that automate running them.