A Deep Dive into a Convolutional Neural Network Example

Predicting Gene Accessibility using CNNs

For a gene to be transcribed, its DNA must be accessible so that transcription factor proteins can bind to it. Mutations in the genetic code can substantially change DNA accessibility, which in turn can affect gene expression. Understanding how these mutations perturb genetic mechanisms can lead to more targeted medicine and personalized treatment. However, the current inability to efficiently interpret noncoding variants in the genome has slowed this progress. In “Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks”, the authors address this challenge by implementing a convolutional neural network (CNN) that learns DNA accessibility from sequence data.

Their CNN uses three convolutional layers, each with rectified linear units (ReLU) and max pooling, followed by two fully connected hidden layers. A final sigmoid layer produces the model's output. The model is trained on DNase-seq data from 164 different cell lines. As described in the "Biological Data in Deep Learning" section, DNase-seq captures accessible DNA by cleaving it with the enzyme DNase I. The data were collected from the ENCODE Project Consortium and the Roadmap Epigenomics Consortium. Each input is the 600 base pair region around a site found to be accessible in at least one cell line, encoded in one-hot vector format. The output is a vector of length 164 that predicts the probability that the sequence is accessible in each cell line.
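
The one-hot input format described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' preprocessing code: the function name `one_hot_encode`, the A, C, G, T row order, and the handling of ambiguous bases such as N are all assumptions made for the example.

```python
# Row order of the one-hot matrix; the A, C, G, T ordering is an assumption.
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return a 4 x len(sequence) matrix (a list of four rows) of 0.0/1.0 values.

    Characters outside A, C, G, T (for example N) become all-zero columns.
    That is an illustrative choice, not the paper's stated handling.
    """
    sequence = sequence.upper()
    return [
        [1.0 if base == letter else 0.0 for base in sequence]
        for letter in ALPHABET
    ]


# Each row marks where one nucleotide occurs in the sequence.
encoded = one_hot_encode("ACGTN")
for letter, row in zip(ALPHABET, encoded):
    print(letter, row)
```

Under this encoding, a 600 base pair input becomes a 4 x 600 matrix, which the first convolutional layer scans along the sequence axis.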

After training, the model outperformed the state-of-the-art method, which was based on a support vector machine (SVM). This result supports the strength of CNNs for learning patterns from DNA sequences.

The authors then attempted to interpret parts of their model by analyzing the kernel weights of the first convolutional layer. Examining the 300 filters of Basset's first convolutional layer, they found that many of the filters captured known, annotated motifs. The feature maps also captured many regions of high GC enrichment, indicating potential CpG sites. Transcription start sites of genes often contain CpG sites that help control regulation. When these regions are highly methylated, the DNA becomes less accessible and the gene is downregulated. If the site is not heavily methylated, transcription factors can bind to the DNA and allow transcription to occur. Another interesting finding from these feature maps was a set of potential novel motifs that have not yet been annotated. Further analysis of these motifs is a possible direction for future research.
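
To make the link between first-layer filters and motifs concrete, the sketch below slides a single filter along a sequence, applies ReLU, and takes the maximum activation. The filter weights are hypothetical (hand-written to reward the GC-rich 4-mer CGCG), and reading a consensus directly from the largest weight in each column is a simplification used here for illustration, not necessarily how the authors matched filters to annotated motifs.

```python
ALPHABET = "ACGT"  # assumed row order for the filter weights


def scan_filter(sequence, weights):
    """Slide a 4 x width filter along a sequence and apply ReLU at each offset.

    weights[r][j] is the weight for nucleotide ALPHABET[r] at filter column j.
    Returns one activation per valid offset (no padding).
    """
    width = len(weights[0])
    row_of = {letter: r for r, letter in enumerate(ALPHABET)}
    activations = []
    for start in range(len(sequence) - width + 1):
        total = 0.0
        for j, base in enumerate(sequence[start:start + width]):
            if base in row_of:  # unknown bases contribute nothing
                total += weights[row_of[base]][j]
        activations.append(max(0.0, total))  # ReLU
    return activations


def filter_consensus(weights):
    """Read off the highest-weight nucleotide in each filter column."""
    width = len(weights[0])
    return "".join(
        ALPHABET[max(range(4), key=lambda r: weights[r][j])]
        for j in range(width)
    )


# Hypothetical filter: +1 where the base matches CGCG, -1 everywhere else.
toy_filter = [
    [-1.0, -1.0, -1.0, -1.0],  # A
    [1.0, -1.0, 1.0, -1.0],    # C
    [-1.0, 1.0, -1.0, 1.0],    # G
    [-1.0, -1.0, -1.0, -1.0],  # T
]

sequence = "ATCGCGTA"
activations = scan_filter(sequence, toy_filter)
print(filter_consensus(toy_filter))  # consensus read from the weights
print(activations)                   # one ReLU activation per offset
print(max(activations))              # strongest match anywhere in the sequence
```

The filter fires only at the offset where the sequence matches its preferred pattern, which is why a trained first-layer filter can be compared against databases of known motifs.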

The authors expanded on this research by taking a trained model and changing single nucleotides to a different nucleotide in order to observe the change in predicted DNA accessibility. They assign each position a loss score and a gain score, defined as the largest possible decrease and the largest possible increase in accessibility when the trained model scores the altered sequence. From these scores they could infer the effect of a single nucleotide polymorphism (SNP) on the sequence. A high gain score can indicate that the mutation may lead to a gain of function if it makes a certain gene more accessible. On the other hand, a high loss score may indicate a mutation that would lead to a loss of function, where the DNA becomes less accessible and gene expression could decrease or disappear. This experiment, saturation mutagenesis, can take a long time to perform in a wet lab because an oligonucleotide must be made for every mutation at every position. Their model performs in silico saturation mutagenesis in just a few minutes.
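
The in silico saturation mutagenesis procedure can be sketched as a loop over every position and every alternative nucleotide. The code below is a minimal illustration under stated assumptions: `toy_accessibility` is a hypothetical stand-in for a trained model (it simply rewards copies of the motif CACGTG), and reporting both scores as non-negative magnitudes is a choice made for this example.

```python
import math

ALPHABET = "ACGT"


def toy_accessibility(sequence, motif="CACGTG"):
    """Hypothetical stand-in for a trained model: more motif copies, higher score."""
    count = sum(
        sequence[i:i + len(motif)] == motif
        for i in range(len(sequence) - len(motif) + 1)
    )
    return 1.0 / (1.0 + math.exp(-(2.0 * count - 1.0)))  # sigmoid, in (0, 1)


def saturation_mutagenesis(sequence, predict):
    """Score every single-nucleotide substitution against the reference prediction.

    Returns one (loss, gain) pair per position. Loss is the largest decrease in
    predicted accessibility over the three alternative nucleotides, and gain is
    the largest increase. Both are reported as magnitudes and are 0.0 when no
    substitution moves the prediction in that direction.
    """
    reference = predict(sequence)
    scores = []
    for pos, ref_base in enumerate(sequence):
        deltas = [
            predict(sequence[:pos] + alt + sequence[pos + 1:]) - reference
            for alt in ALPHABET
            if alt != ref_base
        ]
        loss = max(0.0, -min(deltas))
        gain = max(0.0, max(deltas))
        scores.append((loss, gain))
    return scores


# One intact motif (positions 2-7) and one near-motif CACGAG (positions 10-15).
sequence = "TTCACGTGTTCACGAG"
scores = saturation_mutagenesis(sequence, toy_accessibility)
for pos, (loss, gain) in enumerate(scores):
    print(pos, sequence[pos], round(loss, 3), round(gain, 3))
```

In this toy setting, positions inside the intact motif receive a high loss score because any substitution destroys it, while the single mismatched base in the near-motif receives a high gain score because one substitution completes a second copy. With a real model, `predict` would be replaced by the network's forward pass for the cell line of interest.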

The authors then combined the gain and loss scores with data from genome-wide association studies (GWAS), which are commonly used to identify non-coding variants associated with disease or other phenotypic changes. Using their predictions, they believe they can identify the SNPs in GWAS-implicated non-coding regions that are most promising for future research. To test this, they compared their model's loss and gain scores against already annotated causal GWAS SNPs. In particular, they showed that a known mutation associated with vitiligo, which lies in a gene desert millions of base pairs from the affected gene, was assigned a very large gain score. The mutation from T to C creates a motif for a master transcription factor called CTCF. This transcription factor is known for altering the physical structure of the genomic region.

Finally, the authors wanted to show that a pretrained model could be applied efficiently to new data sets. To do this, they withheld 15 of the cell lines and trained their CNN model on the rest. Then, for each of the 15 withheld data sets, they sampled an equal number of sequences from the training set to serve as negative examples. They showed that using the pretrained model as a head start allowed faster training (a single pass) for future CNN models on this kind of data.

In this paper, the authors show that a CNN model can accurately predict DNA accessibility and can also be used to find the critical nucleotides that control it. Because the model identifies non-coding variants and critical SNPs at a higher resolution than any previous method, the authors believe it can help identify more important non-coding variants and the SNPs involved, and link these variants to disease or physiological phenotypes.

Reference

Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research. 2016;26(7):990-999. doi:10.1101/gr.200535.115.