Convolutional Neural Network and Probabilistic Model for Transcription Factor Binding Site Motif Finding
--Jinli Zhang. GGB program. PhD student
Convolutional Neural Network
The 1D convolutional neural network (CNN) has become popular in natural language processing. During training, the network learns high-dimensional weight vectors that capture recurring patterns in the input, and these learned patterns are then used to classify unseen input. This pattern-learning process closely resembles DNA motif finding: discovering the string patterns hidden in DNA fragments enriched in ChIP-seq data.
In this project, I will reimplement a CNN and a probabilistic model to find DNA motifs, and then compare the prediction precision of the two models.
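To illustrate the analogy drawn above between a 1D convolution filter and a motif scanner, here is a minimal sketch in plain Python. The names `one_hot`, `conv1d_scan`, and `toy_filter`, the base order ACGT, and the filter weights are assumptions made for illustration and are not taken from the project code; a trained CNN would learn its filter weights from data rather than have them written by hand.

```python
# Base order used for the one-hot encoding (an arbitrary choice for this sketch).
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Any character outside ACGT (for example N) becomes an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_scan(seq, kernel):
    """Slide a (width x 4) filter along the sequence; return one score per window."""
    x = one_hot(seq)
    width = len(kernel)
    scores = []
    for start in range(len(x) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += x[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


# Hypothetical hand-written filter that rewards the pattern "TATA".
# Columns follow BASES: A, C, G, T.
toy_filter = [
    [-1.0, -1.0, -1.0, 1.0],  # position 1 prefers T
    [1.0, -1.0, -1.0, -1.0],  # position 2 prefers A
    [-1.0, -1.0, -1.0, 1.0],  # position 3 prefers T
    [1.0, -1.0, -1.0, -1.0],  # position 4 prefers A
]

scores = conv1d_scan("GGCTATAGC", toy_filter)
best_start = max(range(len(scores)), key=scores.__getitem__)
```

In this toy input the highest-scoring window starts at index 3, where "TATA" occurs. A CNN stacks many such filters and follows them with a nonlinearity and pooling; those layers are omitted here.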
Project progress
Jan/04 - Jan/15
Reading papers about convolutional neural networks and motif finding; learned how a probabilistic model is used to find motifs.
Graph convolutional networks for epigenetic state prediction using both sequence and 3D genome data. Jack Lanchantin, Yanjun Qi. Bioinformatics, 36, i659–i667.
Chapter 4 Exhaustive search. Neil C. Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
Jan/16 - Jan/31
Download datasets (BED files) from ENCODE; prepare .fasta files as input for the position weight matrix (PWM) and the CNN.
Watching a YouTube video about how Gibbs sampling is used for motif finding.
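For the data preparation step listed above, the following is a rough sketch of reading FASTA records and turning a set of aligned, equal-length sites into a log-odds matrix. It assumes the matrix model is a position weight matrix with a uniform background of 0.25 per base; the function names `read_fasta` and `build_pwm`, the pseudocount, and the toy records are hypothetical and only serve the example.

```python
import math

BASES = "ACGT"


def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds position weight matrix from equal-length aligned sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            if site[pos] in counts:  # skip ambiguous bases such as N
                counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


# Toy input standing in for a real .fasta file (made-up records).
toy_fasta = [">s1", "TATA", ">s2", "TATA", ">s3", "TAGA"]
records = read_fasta(toy_fasta)
pwm = build_pwm([seq for _, seq in records])
```

With a real file, the same parser can be fed an open file handle, for example `read_fasta(open(path))`, where `path` is whatever .fasta file was prepared.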
Feb/1 - Feb/13
Implementing Gibbs sampling and a convolutional neural network on small test data.
Reading about how a Hidden Markov Model is used for motif finding.
Implementing expectation-maximization on small test data and watching the YouTube video https://www.youtube.com/watch?v=kq5NAd4pnkU&t=587s
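The Gibbs sampling step listed above can be sketched roughly as follows. This is a simplified textbook-style sampler, not necessarily the variant implemented in the project: it assumes every sequence contains exactly one motif occurrence, contains only A, C, G, and T, and is at least `k` bases long. The names `profile_from`, `kmer_prob`, and `gibbs_sampler` and all default parameter values are illustrative assumptions.

```python
import random

BASES = "ACGT"


def profile_from(motifs, pseudocount=1.0):
    """Column-wise base probabilities (with pseudocounts) for aligned k-mers."""
    k = len(motifs[0])
    profile = []
    for pos in range(k):
        counts = {b: pseudocount for b in BASES}
        for motif in motifs:
            counts[motif[pos]] += 1
        total = sum(counts.values())
        profile.append({b: counts[b] / total for b in BASES})
    return profile


def kmer_prob(kmer, profile):
    """Probability of a k-mer under the profile, treating columns as independent."""
    p = 1.0
    for pos, base in enumerate(kmer):
        p *= profile[pos][base]
    return p


def gibbs_sampler(seqs, k, n_iter=1000, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(n_iter):
        # Hold one sequence out and build a profile from all the others.
        i = rng.randrange(len(seqs))
        others = [
            s[st:st + k]
            for j, (s, st) in enumerate(zip(seqs, starts))
            if j != i
        ]
        profile = profile_from(others)
        # Resample the held-out start in proportion to its profile probability.
        weights = [
            kmer_prob(seqs[i][p:p + k], profile)
            for p in range(len(seqs[i]) - k + 1)
        ]
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy sequences with "TATA" planted in each (made-up data).
toy_seqs = ["GGTATAGC", "CTATACGG", "GCGCTATA"]
starts = gibbs_sampler(toy_seqs, k=4, n_iter=200, seed=1)
motifs = [s[st:st + 4] for s, st in zip(toy_seqs, starts)]
```

Because the sampler is stochastic, the returned positions are a sample rather than a guaranteed optimum; running several restarts with different seeds and keeping the best-scoring alignment is a common way to use it.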
Feb/14 - Feb/27
Finishing expectation-maximization for motif finding.
Watching the YouTube video https://www.youtube.com/watch?v=0Zuqytgf6yY&t=1286s for CNN methods.
Implementing the CNN.
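For the expectation-maximization item in this period, a minimal sketch of the two alternating steps might look like the following. It assumes a model with exactly one motif occurrence per sequence, a uniform background of 0.25 per base, and sequences made only of A, C, G, and T; the names `e_step`, `m_step`, and `em_motif`, the pseudocount, and the starting profile are hypothetical choices for this example.

```python
BASES = "ACGT"


def e_step(seqs, profile, k):
    """Posterior probability of each motif start position in each sequence."""
    posteriors = []
    for s in seqs:
        likes = []
        for start in range(len(s) - k + 1):
            ratio = 1.0
            for pos in range(k):
                # Motif probability relative to the uniform background.
                ratio *= profile[pos][s[start + pos]] / 0.25
            likes.append(ratio)
        total = sum(likes)
        posteriors.append([x / total for x in likes])
    return posteriors


def m_step(seqs, posteriors, k, pseudocount=0.1):
    """Re-estimate the profile from posterior-weighted base counts."""
    counts = [{b: pseudocount for b in BASES} for _ in range(k)]
    for s, post in zip(seqs, posteriors):
        for start, weight in enumerate(post):
            for pos in range(k):
                counts[pos][s[start + pos]] += weight
    profile = []
    for col in counts:
        total = sum(col.values())
        profile.append({b: col[b] / total for b in BASES})
    return profile


def em_motif(seqs, k, init_profile, n_iter=50):
    """Alternate E and M steps; return the final profile and posteriors."""
    profile = init_profile
    for _ in range(n_iter):
        profile = m_step(seqs, e_step(seqs, profile, k), k)
    return profile, e_step(seqs, profile, k)


# Toy data with "TATA" planted in each sequence, and a starting profile
# that only weakly favours that consensus (both are made up).
toy_seqs = ["GGTATAGC", "CTATACGG", "GCGCTATA"]
init_profile = [{b: (0.4 if b == c else 0.2) for b in BASES} for c in "TATA"]
profile, posteriors = em_motif(toy_seqs, 4, init_profile)
best_starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
```

Unlike the Gibbs sampler, this procedure is deterministic given the starting profile, so its result depends on initialization; trying several starting profiles is one way to reduce that sensitivity.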
Feb/27 - end
Reading a Hidden Markov Model implementation from a GitHub repository: https://github.com/hamzarawal/HMM-Baum-Welch-Algorithm
Implementing the Hidden Markov Model.
Wrapping up the project.
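For the Hidden Markov Model item above, the forward algorithm is one building block of Baum-Welch training. The sketch below is not taken from the linked repository; the two-state layout (background and motif), the state names, and every probability are made-up assumptions used only to show the recursion.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) under the HMM, summing over all hidden state paths."""
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recursion: extend every partial path by one observed symbol.
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    # Termination: sum over the final state.
    return sum(alpha.values())


# Hypothetical two-state model: a uniform background and an AT-rich motif state.
states = ("background", "motif")
start_p = {"background": 0.9, "motif": 0.1}
trans_p = {
    "background": {"background": 0.9, "motif": 0.1},
    "motif": {"background": 0.5, "motif": 0.5},
}
emit_p = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "motif": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

likelihood = forward("GGTATAGC", states, start_p, trans_p, emit_p)
```

On long sequences the raw probabilities underflow, so a practical implementation would rescale `alpha` at each step or work in log space.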