Home

Sadaf Tafazoli | Ph.D. Student

Curious about Trees and Words

Research Interest | Machine Learning , Data Mining , Time Series Analysis

University of California, Riverside

Contact: stafa002@ucr.edu

CS 234 Project Progress

Phase 1: Looking for a Research Project Idea!

Looking for an interdisciplinary project idea, that connect my research interests with computational molecular biology
Studied a technique for time series representation of DNA (DNA to 1D time series )

Studied a technique for a 3D time series representation of DNA (DNA to 3D time series ) [1]

[1] E. Keogh, S. Lonardi, V. B. Zordan, S.H. Lee and M. Jara.(2005). Visualizing the Similarity of Human and Chimp DNA. (Multimedia Video). http://www.cs.ucr.edu/~eamonn/DNA/

dnaDivX.avi

Studied a technique for clustering DNA sequences using an encoding technique called Intelligence Icon[2].

[2] Keogh, E., Wei, L., Xi, X., Lonardi, S., Shieh, J. and Sirowy, S., 2006, December. Intelligent icons: Integrating lite-weight data mining and visualization into GUI operating systems. In Data Mining, 2006. ICDM'06. Sixth International Conference on (pp. 912-916). IEEE.

This project Idea seems interesting:

Baum-Welch. Implement the Baum-Welch algorithm for HMM described in class. Test the training of the HMM on a biological relevant dataset (e.g. TF binding sites), and test in on a chrosomome. Collect experimental time measurements.

I did some study on HMM, Forward-Backward Algorithm, Viterbi Algorithm and Baum-Welch algorithm

Project idea : implementing an HMM algorithm for solving the problem of recognizing CpG Island described in class, using Viterbi algorithm.

Phase 2: Finding CpG islands using Viterbi algorithm

I did g some experiment with hmmlearn library with some synthetic data.
hmmlearn implements the Hidden Markov Models (HMMs). The HMM is a generative probabilistic model, in which a sequence of observable X variables is generated by a sequence of internal hidden states Z. The hidden states are not observed directly. The transitions between hidden states are assumed to have the form of a (first-order) Markov chain. They can be specified by the start probability vector π and a transition probability matrix A. The emission probability of an observable can be any distribution with parameters θ conditioned on the current hidden state. The HMM is completely determined by π, A and θ. There are three fundamental problems for HMMs:

Given the model parameters and observed data, estimate the optimal sequence of hidden states.
Given the model parameters and observed data, calculate the likelihood of the data.
Given just the observed data, estimate the model parameters.

The first and the second problem can be solved by the dynamic programming algorithms known as the Viterbi algorithm and the Forward-Backward algorithm, respectively. The last one can be solved by an iterative Expectation-Maximization (EM) algorithm, known as the Baum-Welch algorithm.
in this library they have an implemented Viterbi algorithm that can be used to estimate the optimal sequence of hidden states, given the model parameters and observed data. for my project I am going to implement a Viterbi algorithm that learns its parameters from a training data, and then predict the hidden states.

2.1 Data Set

To make a sample training and test data set I used EMBOSS Cpgplot software to annotate a nucleotide sequence.

2.2 Making an HMM (Hidden Markov Model)

At this phase I made an HMM for CpG island problem. In the CpG island problem there are two hidden states ( a state that represent an island with High CpGs , and another state that represent an island with Low CpG) and in each state there can be four outcome: A, C, G, T .

I calculated the emission probabilities for each state and also I calculated transition probabilities between hidden states based on the training data. with having these probabilities(this HMM) we can now predict the region of CpG island a given sequence using Viterbi algorithm .

2.3 Viterbi Algorithm implementation

I implemented Viterbi algorithm and I tested it on both real and fake data.