The central dogma of molecular biology states that DNA is transcribed to mRNA, which is then translated into proteins. We know that different genes are expressed at different levels, and these expression levels can vary from cell to cell. These differences in gene expression are what make cells behave differently, even when they share the same DNA 'code'.
RNA-Seq is a method by which we can quantify gene expression in a cell sample [1]. Based on the idea that the level of mRNA is directly correlated with the level of protein made from that gene, RNA-Seq attempts to quantify the abundance of mRNA. This effectively gives us a snapshot of how 'active' each gene is in a particular cell type or under a particular condition.
RNA-Seq begins with the extraction of mRNA from the cells. Since mRNA is transcribed from DNA and translated into protein, we can use the amount of mRNA to infer gene expression (i.e. the more mRNA that is found, the higher the gene's expression and the more of the respective protein is made).
The mRNA is fragmented and converted into cDNA (complementary DNA), which represents the DNA sequence from which the mRNA was transcribed. These small sequences can then be mapped back onto a reference genome to determine which gene each read comes from. Genes with higher expression should accumulate a higher number of mapped reads, since there was a larger amount of their mRNA in the cell. Read abundance can be reported as a raw expression level or as a fold change, the ratio of a gene's expression between two conditions.
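For a concrete sense of what these quantities look like, the following minimal Python sketch computes counts-per-million normalization and a log2 fold change from hypothetical read counts; the counts, the normalization scheme, and the pseudocount are illustrative assumptions, not part of any particular published pipeline.

```python
import numpy as np

# Hypothetical read counts for three genes in two samples
# (rows: genes, columns: [control, treatment]).
counts = np.array([
    [500, 1500],   # gene A: roughly 3x up in treatment
    [200,  190],   # gene B: roughly unchanged
    [ 80,   10],   # gene C: down in treatment
], dtype=float)

# Normalize to counts-per-million so samples sequenced to different
# depths become comparable.
cpm = counts / counts.sum(axis=0) * 1e6

# Log2 fold change of treatment over control; the pseudocount of 1
# avoids dividing by (or taking the log of) zero.
log2_fc = np.log2((cpm[:, 1] + 1) / (cpm[:, 0] + 1))
print(log2_fc)
```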
Gene expression abundance is commonly used for phenotype classification in disease as well as for gene expression inference. Such experiments often compare healthy and diseased samples, for example samples from healthy patients versus samples from patients with cancer, or they may analyze the gene expression differences between samples from cancer patients and the same samples after some drug treatment. Once these differentially expressed genes are identified, they can be annotated with functional terms or with pathways to give some insight into which cellular mechanisms are changing between the case and control samples.
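As a minimal sketch of this case-versus-control comparison, the code below simulates log-expression values for two groups and calls a gene differentially expressed when a per-gene t-test passes a Bonferroni-corrected threshold; real studies typically rely on dedicated tools such as DESeq2 or edgeR rather than this simplified test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated log-expression for 1,000 genes in 5 healthy and 5 disease
# samples; the first 50 genes are shifted upward in the disease group.
healthy = rng.normal(5.0, 1.0, size=(1000, 5))
disease = rng.normal(5.0, 1.0, size=(1000, 5))
disease[:50] += 2.0

# Per-gene two-sample t-test between the two groups.
t, p = stats.ttest_ind(disease, healthy, axis=1)

# Call a gene differentially expressed if it passes a simple
# Bonferroni-corrected significance threshold.
de_genes = np.where(p < 0.05 / len(p))[0]
print(len(de_genes), "genes called differentially expressed")
```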
It is believed that there is a small subset of landmark genes whose expression is correlated with that of a much larger set of target genes [2]. By knowing the abundance of these landmark genes, either as raw counts or as fold changes, many prediction models attempt to infer the full expression profile of the target genes. A deep learning example of this type of problem is D-GEX by Chen et al. [2]. Their model uses a multi-layer perceptron neural network (MLPNN) to infer the expression of the target genes from the set of landmark genes. In particular, the authors use expression data from 943 landmark genes to generate an output representing the expression levels of 9,520 target genes.
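A minimal Keras sketch of this landmark-to-target regression is shown below; the hidden-layer sizes, activation, and training settings are illustrative assumptions rather than the exact published D-GEX configuration, and random arrays stand in for a real expression matrix.

```python
import numpy as np
from tensorflow import keras

# MLP mapping 943 landmark gene expression values to 9,520 target
# gene expression values (a regression task, hence the linear output).
model = keras.Sequential([
    keras.Input(shape=(943,)),
    keras.layers.Dense(3000, activation="tanh"),
    keras.layers.Dense(3000, activation="tanh"),
    keras.layers.Dense(9520),
])
model.compile(optimizer="adam", loss="mse")

# Train on (landmark, target) expression pairs; random data stands in
# for a real expression matrix here.
x = np.random.rand(256, 943).astype("float32")
y = np.random.rand(256, 9520).astype("float32")
model.fit(x, y, epochs=1, batch_size=32, verbose=0)
```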
In order for genes to be transcribed into RNA, and therefore 'expressed' as described above, transcription factor proteins and other co-factors must come together at the transcription start site. These proteins regulate gene expression through activation and repression. Chromatin immunoprecipitation with high-throughput sequencing (ChIP-Seq) is an experimental assay designed to study where these transcription factor proteins bind on the DNA, in order to infer which genes they control [3].
ChIP-Seq begins by shearing the DNA into small fragments. An antibody specific to the protein of interest is then selected; it binds to that protein and allows the protein, along with the bound DNA, to be isolated through immunoprecipitation.
The DNA is then purified and sequenced using high-throughput sequencing, and the reads are mapped back onto a reference genome. If a protein binds to a specific area of the sequence, the number of reads mapping there will be larger, creating a peak in the read distribution across the sequence. By comparing these peaks against a control experiment that generates the background signal, the significant peaks can be identified to determine where the chosen transcription factor binds.
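The toy sketch below imitates this idea: per-base coverage from the ChIP sample is scored against a Poisson background whose rate is estimated from the control. This is a crude stand-in for a real peak caller such as MACS, and all of the coverage values are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated per-base read coverage for a 1,000 bp region in the ChIP
# sample and an input control; a binding site around position 500
# produces a pile-up of reads in the ChIP sample only.
background = rng.poisson(5, size=1000)
chip = rng.poisson(5, size=1000)
chip[480:520] += rng.poisson(30, size=40)

# Score each position against a Poisson background whose rate is
# estimated from the control.
lam = background.mean()
pvals = stats.poisson.sf(chip, lam)

# Significantly enriched positions form the candidate peak.
peak = np.where(pvals < 1e-5)[0]
print("peak spans positions", peak.min(), "to", peak.max())
```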
By analyzing repeated patterns in the sequences found at the peaks, studies have been able to find the short sequences that are believed to be the binding motifs for different transcription factors.
Since sequence data is ordered, it is important to preserve the sequential information. However, these 'letters', or bases, must be encoded into a format that represents each base numerically. The most common method for this is one-hot vector encoding, where each letter is represented by a zero vector with a single one in a unique position. Since we have four different letters (A, C, T, G), our vectors will have a length of four, and the position of the one will indicate which base is being represented.
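A minimal implementation of this encoding might look like the following; BASE_INDEX is a hypothetical helper that assigns each base its position in the vector.

```python
import numpy as np

# Position of each base within the one-hot vector.
BASE_INDEX = {"A": 0, "C": 1, "T": 2, "G": 3}

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    encoding = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        encoding[i, BASE_INDEX[base]] = 1.0
    return encoding

print(one_hot("ACTG"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```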
Another high-throughput sequencing technology that focuses on identifying regulatory function is DNase-Seq [4]. Within chromatin, DNA is often tightly wound around histone proteins, making it inaccessible for protein binding. For transcription to occur, the DNA must be unwound from these histones, allowing transcription factors to bind to the open sequence.
In this assay, DNA is isolated from the sample cells and digested with DNase I, an enzyme that cleaves accessible DNA into smaller fragments. As in the previous workflows, these fragments are sequenced using high-throughput sequencing and mapped back onto a reference genome. Mapped reads will create peaks along the genome, and by comparing these signals to a background input, significant regions can be found.
DanQ is a deep learning approach that combines ChIP-Seq and DNase-Seq data [5]. It does this by binning the genome into 200 base pair bins and finding targets that contain a significant peak in both a ChIP-Seq assay and a DNase-Seq assay. This provides a binary vector indicating where a transcription factor binds near accessible DNA, suggesting a possible functional interaction for that transcription factor at that sequence.
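As a sketch of this target construction, the code below bins a toy chromosome into 200 base pair bins and marks each bin that overlaps a peak; the chromosome length and the list of peak intervals are hypothetical.

```python
import numpy as np

chrom_length = 10_000                 # toy chromosome
bin_size = 200
peaks = [(950, 1100), (4020, 4080)]   # hypothetical (start, end) peak intervals

# One binary target per 200 bp bin: 1 if the bin overlaps a peak.
targets = np.zeros(chrom_length // bin_size, dtype=np.int8)
for start, end in peaks:
    first, last = start // bin_size, (end - 1) // bin_size
    targets[first : last + 1] = 1

print(np.flatnonzero(targets))   # bins 4, 5, and 20 overlap a peak
```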
DanQ takes the 1,000 base pair sequence around the 200 base pair bin as its input, transformed using one-hot vector encoding. A convolutional layer with rectifier activation acts as a motif scanner across the input matrix. As the model trains, the convolutional filters converge toward position weight matrices that produce a signal when they find their respective motif in the input sequence. Max pooling is used to reduce the size of the output matrix.
The subsequent BRNN layer is used to consider the orientations and spatial distances between the motifs. LSTM units are used to speed up training. The outputs are passed to a fully connected layer of rectified linear units, and the final layer uses a sigmoid activation function to create a vector of probability predictions for the functional marks, which is compared to the true target vector.
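Put together, a DanQ-flavored network can be written in a few lines of Keras. The layer sizes below approximate the published architecture, but this should be read as a sketch rather than an exact reproduction of the model.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(1000, 4)),                    # one-hot 1,000 bp window
    keras.layers.Conv1D(320, 26, activation="relu"), # filters act as motif scanners
    keras.layers.MaxPooling1D(13),                   # reduce the output matrix
    keras.layers.Dropout(0.2),
    keras.layers.Bidirectional(                      # motif orientation and spacing
        keras.layers.LSTM(320, return_sequences=True)),
    keras.layers.Dropout(0.5),
    keras.layers.Flatten(),
    keras.layers.Dense(925, activation="relu"),      # fully connected ReLU layer
    keras.layers.Dense(919, activation="sigmoid"),   # one probability per mark
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
```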
The second fundamental principle in biology is that a protein's sequence determines its shape, and in turn its shape determines its function. Proteins are constructed from amino acids, each of which is usually represented by a unique letter. Similar to DNA and RNA, proteins can therefore be represented by a sequence of letters; however, proteins have an alphabet of size 20 rather than just 4. The order of these amino acids plays an important role in the structure of the protein. There are four levels of protein structure:
1. Primary – the one-dimensional sequence of amino acids
2. Secondary – localized motif structures formed by the amino acids, such as alpha-helices, beta-sheets, and turns
3. Tertiary – the full three-dimensional shape of the protein
4. Quaternary – the joining of multiple subunits if the protein has more than one subunit
Many computational methods attempt to predict the tertiary structure of a protein based on the primary structure (i.e. the amino acid sequence). Determining the three-dimensional structure of a protein can identify its functional domains, or areas (i.e. the areas where it binds to other molecules). However, since the amino acid molecules can be rotated along multiple axes and their locations in three-dimensional space can shift, the search space of possible structures is massive. Additionally, different amino acids differ in size and charge, and modeling the interaction forces between amino acids at a quantum-mechanical level is computationally expensive. Therefore, most computational prediction methods are based on heuristics and may not achieve optimal results.
On the other hand, since we know that the structure of a protein depends on its sequence, there have also been attempts to use the sequence to jump straight to function. One such example comes from Xueliang Liu, who proposes a novel RNN model that takes the protein sequence as input and predicts the protein's function as output [6]. Since protein sequences vary in length, an RNN model is very appropriate for protein sequence analysis. His model is a bi-directional RNN (BRNN) and uses Long Short-Term Memory (LSTM) units to help the model train faster. The bi-directional aspect of this model allows the amino acid sequence to be scanned from left to right as well as from right to left, giving it both past and future context.
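A minimal sketch of such a model, assuming integer-encoded amino acid sequences and a hypothetical set of 100 function labels, is shown below; masking lets one batch hold zero-padded sequences of different lengths, reflecting the variable-length input the model must handle.

```python
from tensorflow import keras

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20-letter protein alphabet
NUM_CLASSES = 100                      # hypothetical number of function labels

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),            # variable-length input
    keras.layers.Embedding(len(AMINO_ACIDS) + 1, 32,      # index 0 reserved
                           mask_zero=True),               # for padding
    keras.layers.Bidirectional(keras.layers.LSTM(128)),   # past + future context
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```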
Biomedical images are often used in the treatment of patients. Different technologies, such as MRI, PET, and CT scans, can provide physicians with insight into patient injuries and disease markers. However, depending on the imaging technology and the physiological condition, there may be visual similarity between different classes and visual variance between images of the same class.
The accompanying image, taken from a study by Qing Li et al. [7], shows high-resolution computed tomography (HRCT) scans of interstitial lung disease (ILD). ILD comprises a large group of diseases affecting the lung parenchyma. In the image, the top row shows normal, healthy lungs; the second row shows lungs from patients with emphysema; the third row shows lungs with ground-glass opacity; the fourth row shows lungs with fibrosis; and the final row shows lungs containing micronodules.
From these images it becomes clear that there can be similarities between different classes and differences within the same class. This creates a problem for physicians, especially if there are multiple problems, or classes, within the same image. Therefore, deep learning approaches have begun to be used to classify biomedical images. The goal, however, is not only to classify these images but also to use the model to find discriminatory features between the classes.
The study by Qing Li et al. proposes a CNN model for the classification of ILD images generated with HRCT [7]. They use this model to handle images of a fixed size, as well as to generate feature maps which in turn can be interpreted as a form of feature selection. The network consists of one convolutional layer followed by three fully connected layers. Given an HRCT image of the lung as input, they trained the model to classify the image as one of the five classes mentioned above. Once trained, they used the kernels from the convolutional layer to visualize the features that the model found to be significant.
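A sketch of this kind of architecture is given below. The patch size, filter count, and layer widths are placeholders rather than the values reported in the paper, but the structure, one convolutional layer followed by three fully connected layers, follows the description above.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),                # grayscale HRCT patch
    keras.layers.Conv2D(16, 7, activation="relu"), # single convolutional layer
    keras.layers.MaxPooling2D(2),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),    # three fully connected layers
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(5, activation="softmax"),   # normal, emphysema, ground
])                                                 # glass, fibrosis, micronodules
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```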