Sadaf Tafazoli | Ph.D. Student
Curious about Trees and Words
Research Interests | Machine Learning, Data Mining, Time Series Analysis
University of California, Riverside
Contact: stafa002@ucr.edu
CS 234 Project Progress
Phase 1: Looking for a Research Project Idea!
- Looking for an interdisciplinary project idea that connects my research interests with computational molecular biology
- Studied a technique for representing DNA as a 1D time series
- Studied a technique for representing DNA as a 3D time series [1]
[1] Keogh, E., Lonardi, S., Zordan, V. B., Lee, S.-H. and Jara, M. (2005). Visualizing the Similarity of Human and Chimp DNA (multimedia video). http://www.cs.ucr.edu/~eamonn/DNA/
- Studied a technique for clustering DNA sequences using an encoding technique called Intelligent Icons [2].
[2] Keogh, E., Wei, L., Xi, X., Lonardi, S., Shieh, J. and Sirowy, S. (2006). Intelligent icons: Integrating lite-weight data mining and visualization into GUI operating systems. In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM '06), pp. 912–916. IEEE.
- This project idea seems interesting:
Baum-Welch. Implement the Baum-Welch algorithm for HMMs described in class. Test the training of the HMM on a biologically relevant dataset (e.g., TF binding sites), and test it on a chromosome. Collect experimental time measurements.
- I studied HMMs, the Forward-Backward algorithm, the Viterbi algorithm, and the Baum-Welch algorithm.
- Project idea: implement an HMM for the CpG-island recognition problem described in class, using the Viterbi algorithm.
Phase 2: Finding CpG islands using Viterbi algorithm
- I ran some experiments with the hmmlearn library on synthetic data.
- hmmlearn implements Hidden Markov Models (HMMs). The HMM is a generative probabilistic model in which a sequence of observable variables X is generated by a sequence of internal hidden states Z. The hidden states are not observed directly. The transitions between hidden states are assumed to have the form of a (first-order) Markov chain; they can be specified by the start probability vector π and a transition probability matrix A. The emission probability of an observable can be any distribution with parameters θ conditioned on the current hidden state. The HMM is completely determined by π, A and θ. There are three fundamental problems for HMMs:
- Given the model parameters and observed data, estimate the optimal sequence of hidden states.
- Given the model parameters and observed data, calculate the likelihood of the data.
- Given just the observed data, estimate the model parameters.
- The first and second problems can be solved by the dynamic programming algorithms known as the Viterbi algorithm and the Forward-Backward algorithm, respectively. The last one can be solved by an iterative Expectation-Maximization (EM) algorithm known as the Baum-Welch algorithm.
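As a small sanity check on these definitions, the second problem (the likelihood of the data given the model) can be sketched with the forward recursion; all the parameters below are made up for illustration, not learned from any data:

```python
import numpy as np

# Toy two-state HMM over the nucleotide alphabet (illustrative values only).
pi = np.array([0.5, 0.5])                       # start probability vector
A = np.array([[0.9, 0.1], [0.1, 0.9]])          # transition probability matrix
B = np.array([[0.15, 0.35, 0.35, 0.15],         # emission probabilities, state 0
              [0.30, 0.20, 0.20, 0.30]])        # emission probabilities, state 1
idx = {c: i for i, c in enumerate("ACGT")}

def likelihood(seq):
    """P(seq | model) via the forward recursion."""
    alpha = pi * B[:, idx[seq[0]]]
    for c in seq[1:]:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * A[i, j] * B[j, obs_t]
        alpha = (alpha @ A) * B[:, idx[c]]
    return alpha.sum()

print(likelihood("CGCG"))
```

For a length-1 sequence this reduces to a simple mixture, e.g. P("A") = 0.5·0.15 + 0.5·0.30 = 0.225, which is an easy way to spot-check the recursion.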
- The library provides an implementation of the Viterbi algorithm that can be used to estimate the optimal sequence of hidden states given the model parameters and observed data. For my project, I am going to estimate the model parameters from training data myself and then use the Viterbi algorithm to predict the hidden states.
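The generative process described above is easy to sketch directly, and sampling from it is also one way to produce synthetic test data; the two states and all probabilities below are illustrative placeholders, not learned values:

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["High-CpG", "Low-CpG"]              # hidden states Z
symbols = ["A", "C", "G", "T"]                # observable outcomes X

pi = np.array([0.5, 0.5])                     # start probability vector
A = np.array([[0.9, 0.1],                     # transition probability matrix
              [0.1, 0.9]])
theta = np.array([[0.15, 0.35, 0.35, 0.15],   # emissions in the High-CpG state
                  [0.30, 0.20, 0.20, 0.30]])  # emissions in the Low-CpG state

def sample(length):
    """Generate (hidden state path, observed sequence) from the HMM."""
    z = rng.choice(2, p=pi)                   # draw the initial state from pi
    zs, xs = [], []
    for _ in range(length):
        zs.append(z)
        xs.append(symbols[rng.choice(4, p=theta[z])])  # emit conditioned on z
        z = rng.choice(2, p=A[z])             # step the Markov chain
    return zs, "".join(xs)

path, seq = sample(20)
print(seq)
```

The model really is "completely determined by π, A and θ": every line of the sampler reads one of those three objects and nothing else.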
2.1 Data Set
- To create sample training and test data sets, I used the EMBOSS Cpgplot software to annotate a nucleotide sequence.
2.2 Making an HMM (Hidden Markov Model)
- At this phase I built an HMM for the CpG-island problem. In this problem there are two hidden states (one representing CpG-rich regions and one representing CpG-poor regions), and each state can emit one of four symbols: A, C, G, T.
- I calculated the emission probabilities for each state and the transition probabilities between the hidden states from the training data. With these probabilities (i.e., the full HMM), we can predict the CpG-island regions of a given sequence using the Viterbi algorithm.
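The counting step can be sketched as follows; the labeled toy sequence is illustrative, not the actual Cpgplot-annotated data ("H" marks positions inside an island, "L" positions outside):

```python
from collections import Counter

# Toy labeled training data (placeholder, not the real EMBOSS annotation).
seq    = "ACGCGCGTATATCGCGAT"
labels = "HHHHHHHHLLLLHHHHLL"

# Count transitions between consecutive hidden-state labels,
# and (state, nucleotide) emission pairs.
trans = Counter(zip(labels, labels[1:]))
emit = Counter(zip(labels, seq))

states, symbols = "HL", "ACGT"

# Normalize counts into probabilities (maximum-likelihood estimates).
A = {s: {t: trans[(s, t)] / sum(trans[(s, u)] for u in states)
         for t in states} for s in states}
B = {s: {x: emit[(s, x)] / sum(emit[(s, y)] for y in symbols)
         for x in symbols} for s in states}

print(A["H"]["H"], B["H"]["C"])
```

Each row of A and B sums to 1 by construction, which is a quick check that the normalization is right.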
2.3 Viterbi Algorithm implementation
- I implemented the Viterbi algorithm and tested it on both real and synthetic data.
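A minimal log-space version of such an implementation might look like this; the probabilities are placeholders rather than the values estimated in 2.2, and log-space is used to avoid underflow on long sequences:

```python
import math

# Two-state CpG-island HMM with illustrative parameters.
states = ["H", "L"]                       # H = CpG island, L = background
pi = {"H": 0.5, "L": 0.5}
A = {"H": {"H": 0.9, "L": 0.1},
     "L": {"H": 0.1, "L": 0.9}}
B = {"H": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
     "L": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

def viterbi(seq):
    """Return the most likely hidden-state path for an observed sequence."""
    # v[s] holds the log-probability of the best path ending in state s.
    v = {s: math.log(pi[s]) + math.log(B[s][seq[0]]) for s in states}
    back = []                             # back-pointers, one dict per step
    for x in seq[1:]:
        ptr, nv = {}, {}
        for t in states:
            # Best predecessor state for reaching t at this step.
            best = max(states, key=lambda s: v[s] + math.log(A[s][t]))
            ptr[t] = best
            nv[t] = v[best] + math.log(A[best][t]) + math.log(B[t][x])
        back.append(ptr)
        v = nv
    # Trace back from the best final state.
    last = max(states, key=v.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("CGCGCGATATAT"))   # → HHHHHHLLLLLL
```

On this toy input the decoder labels the CG-rich prefix as island ("H") and the AT-rich suffix as background ("L"), which is the qualitative behavior we want on real annotated sequences.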