Deep Learning for Biologists: A Tutorial

Introduction

What is this tutorial about?

You have probably found your way to this tutorial because you have heard about the power of deep learning to solve complex problems involving lots of data. Deep learning approaches have produced a slew of remarkable applications in the past few years. We can caption images with text using deep learning. We can enhance low-res images using deep learning. We can write Shakespeare and blog posts using deep learning. We can style images to look like Van Gogh's Starry Night.

With these advances, biology researchers have begun to ask whether these methods could be helpful in solving big data problems in biology. However, the process of importing deep learning methods, which have often been designed for text or image data, requires some amount of translation.

To use deep learning in biology, researchers need to address:

  • Whether deep learning is appropriate for a particular problem or dataset
  • How to map the problem and data into a deep learning approach
  • How to understand the output from these models and use the approach to answer questions

We wrote this tutorial to help biologists and collaborating computer scientists address these questions and ultimately understand when and how to apply deep learning to biological problems.

Why do we need a tutorial that focuses on deep learning for biology?

We wrote this tutorial because we noticed that most online tutorials, videos and presentations on deep learning are directed at computer scientists who likely already have significant background in machine learning and programming. In addition, the examples presented in these tutorials often draw heavily from image analysis (categorizing images by their content or captioning images with text) or from text analysis. While these examples can help illustrate the inner workings of deep learning approaches and showcase how effective deep learning can be in practice, they may not be useful for biologists who wish to work with biological data. Based on these observations, we decided to write a guide to help biologists understand how deep learning works, focusing on the core mechanics and key parameters without getting too far into complex math or programming. Ideally, this presentation will help biologists determine whether to use deep learning for their analysis.

In addition, computer scientists who are familiar with deep learning may be interested in discovering how to apply these techniques to biological problems, since there are many interesting public biological datasets to work with and many biologists looking for collaborators with computational backgrounds. However, introductory biology material tends to be directed toward first-year undergraduate students, focusing on fundamental biological processes rather than on the datasets and computational problems that form the basis for CS/biology collaborations. A computer scientist may be deeply familiar with applications of deep learning to text or image analysis, but unfamiliar with how to translate biological data into appropriate inputs for deep learning models, select appropriate models for this data, and translate the output into something relevant for domain scientists.

We aim to help bridge this gap by providing:

1. A high-level overview of machine learning, neural networks and deep learning

2. A high-level overview of several fundamental biological data types, for which deep learning approaches may be appropriate

3. An in-depth examination of one deep learning application in biology

Sounds great! Where should I begin?

Machine learning basics

Basic principles of how computers 'learn' from data to solve problems

Neural network basics

Basic principles of how neural nets 'learn' from data

Deep learning capabilities and limitations

What can and can't we do with deep learning?

Deep learning models in brief: Fully connected DNNs, CNNs, RNNs, Autoencoders

Different models for different data and problems

Peeking inside the 'black box'

How can we 'see' what is happening between input and output?

Data and model selection: a few important biological data types

A basic primer for computer scientists to understand important biological data types and how deep learning may be applicable

In-depth example

We will walk through a paper that uses CNNs to predict DNA accessibility