The goal of this project is to create an automated bioinformatics pipeline that analyzes, and provides statistics on, the differences in RNA sequencing data between naturally derived dermal papilla and induced pluripotent stem cell derived dermal papilla.
Cufflinks was used for RNA sequencing analysis: it assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples.
It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.
This design uses Enrichr, developed by the Ma’ayan lab at Mt. Sinai. Enrichr takes a gene list as input and determines which pathways the list is highly enriched in, drawing on a large collection of libraries from sources such as KEGG and DAVID. The full Enrichr output is a large list of libraries with associated pathways, p-values, and enrichment scores; the next step is to scrape this output and parse it for the pathways with the highest enrichment scores. Once these pathways are determined, we will use Transfac and JASPAR to determine which of the genes are transcription factors that regulate other genes. Together, the transcription factors and the highly enriched pathways will give us a sense of which genes in our original gene lists are strongly up- or downregulated. In addition to determining transcription factors, we will use the overlapping genes reported for each pathway to determine which pairs of genes in our original gene list are frequently found together. This will allow us to build correlation heat maps that show which genes share pathways and how their functions correlate.
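The parsing and gene-pair counting steps described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the record layout (`term`, `combined_score`, `genes`), the function names, and the toy gene names are all hypothetical stand-ins for parsed Enrichr output.

```python
from collections import Counter
from itertools import combinations


def top_terms(results, n):
    """Return the n records with the highest enrichment score."""
    return sorted(results, key=lambda r: r["combined_score"], reverse=True)[:n]


def cooccurrence_counts(terms):
    """Count, for each unordered gene pair, how many terms contain both genes."""
    counts = Counter()
    for term in terms:
        genes = sorted(set(term["genes"]))  # sort so each pair has one canonical order
        for pair in combinations(genes, 2):
            counts[pair] += 1
    return counts


def to_matrix(counts, genes):
    """Expand pair counts into a symmetric matrix suitable for a heat map."""
    index = {g: i for i, g in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genes]
    for (a, b), c in counts.items():
        if a in index and b in index:
            matrix[index[a]][index[b]] = c
            matrix[index[b]][index[a]] = c
    return matrix


# Hypothetical, hand-made records; real input would be parsed Enrichr output.
example_results = [
    {"term": "pathway A", "combined_score": 42.0, "genes": ["GENE1", "GENE2", "GENE3"]},
    {"term": "pathway B", "combined_score": 30.5, "genes": ["GENE2", "GENE3"]},
    {"term": "pathway C", "combined_score": 2.1, "genes": ["GENE1", "GENE4"]},
]

selected = top_terms(example_results, 2)
pair_counts = cooccurrence_counts(selected)
gene_order = sorted({g for t in selected for g in t["genes"]})
heatmap_matrix = to_matrix(pair_counts, gene_order)
```

In this sketch, each matrix cell holds the number of selected pathways in which both genes appear, so a larger value corresponds to a pair that is found together more often.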
This design uses the Database for Annotation, Visualization and Integrated Discovery (DAVID) for gene set enrichment analysis. The system determines which genes are highly related based on many different libraries of gene association data. For this project, we have focused on Gene Ontology (GO) and KEGG Pathways for analysis. GO is a curated database that assigns genes to terms within three categories: cellular components, molecular functions, and biological processes. The terms are assembled into a directed acyclic graph (DAG), so that the hierarchical relationships between the various categories can be seen. KEGG Pathways is a curated database of known metabolic pathways. From the initial RNA-seq data, we extract the top 100 upregulated genes and the top 100 downregulated genes, which make up our gene set. The gene set is passed into DAVID for enrichment analysis using GO and KEGG, which returns a set of annotated clusters of enriched gene groups. From the clusters returned by DAVID, we are able to construct the hierarchical relationships seen in GO using GOATOOLS and to find the pathways of potential interest from KEGG.
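The gene set extraction step could look roughly like the sketch below. The text does not say how "top" is defined, so ranking by log2 fold change is an assumption here, as are the function name `select_gene_set`, the tuple layout, and the toy values.

```python
def select_gene_set(records, n=100):
    """Pick the n most upregulated and n most downregulated genes.

    Each record is a (gene, log2_fold_change) tuple. Positive values are
    treated as upregulated and negative values as downregulated; this
    ranking criterion is an assumption for illustration.
    """
    up = sorted((r for r in records if r[1] > 0), key=lambda r: r[1], reverse=True)
    down = sorted((r for r in records if r[1] < 0), key=lambda r: r[1])
    return [g for g, _ in up[:n]], [g for g, _ in down[:n]]


# Toy values for illustration only, not measured data.
toy = [("GENE1", 3.2), ("GENE2", -4.1), ("GENE3", 0.7), ("GENE4", -0.3), ("GENE5", 5.0)]
up_genes, down_genes = select_gene_set(toy, n=2)
```

The two returned lists would then be submitted for enrichment analysis, either together or separately, depending on the question being asked.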
This graph shows the correlation heat map of the top 100 genes downregulated in hair and upregulated in skin.
A more intense red color indicates a higher frequency of genes overlapping in the same pathway.
Leaves of the graph correspond to specific upregulated processes in the input gene list. Upper levels of the graph are more general cellular processes associated with the input gene list.