Project Proposal

Background

As biological datasets increase in size, conventional analyses are sometimes not enough to extract biologically relevant information from the data. To handle this problem, machine learning methods have become popular. In recent years, deep learning in particular has shown promising results in the analysis of various forms of biological data. Three commonly implemented forms of deep learning are the Deep Neural Network (DNN), the Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN). These models have shown good results in multiple areas, including natural language processing and image recognition, but have only just begun to be applied to computational biology.

Problem

Even though deep learning has shown promise in computational biology, it is very much a black box and is not always well understood. There have been a few recent surveys on deep learning in computational biology [1,2], but we feel that they either do not cover all the common methods or do not explain the methods well enough that they stop being black-box tools. It is also not a topic we will be covering in class. It is often hard to understand how and why a given deep learning approach works the way it does. We believe that a better understanding of deep learning methods can help scientists extract biologically relevant findings from their models.

Specific Aim

Our specific aim is to create a survey that gives not only a basic introduction to deep learning, but also an in-depth look at DNNs, CNNs, and RNNs. We will cover how these models are trained and why they work, as well as how to interpret a trained model in order to fully explore what it has captured from the data.

Implementation Strategy

We will implement our survey in two stages of literature search before finally presenting our findings. Our first literature search will focus on the nature of deep learning. We will explore what separates deep learning from other machine learning approaches. We will also take the three black box models mentioned in the specific aim and build an understanding of how to build and train each model, how to interpret it, and finally why it works the way it does. With a better understanding of deep learning and how to apply it, we will then seek out specific journal articles in which deep learning has been applied in an impactful way within computational biology. For example, Basset [3] used a CNN to capture DNA binding motifs and interpreted the trained network through in silico saturation mutagenesis. In another example, Gómez-Bombarelli et al. used an autoencoder as a generative model for automated drug design [4]. We will then publish our findings on our website and present them to the rest of our class.
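As a rough illustration of the kind of model interpretation we plan to survey, in silico saturation mutagenesis can be sketched in a few lines of Python. This is only a sketch under stated assumptions and is not the Basset implementation: toy_motif_score is a hypothetical stand-in for a trained CNN, and the sequence and the "TATA" motif are made up for the example. The idea is to substitute every base at every position, re-score the mutated sequence, and record how much the prediction changes.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix with columns A, C, G, T."""
    encoded = np.zeros((len(seq), len(BASES)))
    for position, base in enumerate(seq):
        encoded[position, BASES.index(base)] = 1.0
    return encoded

def toy_motif_score(encoded, motif="TATA"):
    """Hypothetical stand-in for a trained model (assumption, not a real CNN).

    Slides a single one-hot "filter" for `motif` along the sequence and
    returns the best number of matching bases over all windows.
    """
    filt = one_hot(motif)
    width = len(motif)
    scores = [np.sum(encoded[i:i + width] * filt)
              for i in range(len(encoded) - width + 1)]
    return float(max(scores))

def saturation_mutagenesis(seq, score_fn):
    """Return a (length, 4) matrix of score changes for every single-base substitution."""
    reference = score_fn(one_hot(seq))
    deltas = np.zeros((len(seq), len(BASES)))
    for position in range(len(seq)):
        for column, base in enumerate(BASES):
            if base == seq[position]:
                continue  # the reference base has zero change by definition
            mutant = seq[:position] + base + seq[position + 1:]
            deltas[position, column] = score_fn(one_hot(mutant)) - reference
    return deltas

sequence = "GGCTATAAGC"  # made-up example sequence containing one TATA site
effects = saturation_mutagenesis(sequence, toy_motif_score)

# Positions where a mutation lowers the score the most are the ones
# the scoring function relies on.
most_sensitive = int(np.argmin(effects.min(axis=1)))
```

With a real trained network, score_fn would wrap the model's prediction for one output of interest, and the resulting matrix is typically shown as a heat map over positions and bases.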

References

  1. Angermueller, C., Pärnamaa, T., Parts, L. and Stegle, O., 2016. Deep learning for computational biology. Molecular systems biology, 12(7), p.878.
  2. Ekins, S., 2016. The Next Era: Deep Learning in Pharmaceutical Research. Pharmaceutical research, 33(11), pp.2594-2603.
  3. Kelley, D.R., Snoek, J. and Rinn, J.L., 2016. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research, 26(7), pp.990-999.
  4. Gómez-Bombarelli, R., Duvenaud, D., Hernández-Lobato, J.M., Aguilera-Iparraguirre, J., Hirzel, T.D., Adams, R.P. and Aspuru-Guzik, A., 2016. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.