Data Exploration

"Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended [1]"

Within the Galaxy website, we will be using a tool called DESeq2.

"DESeq2 is a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression [2]"

PCA Analysis (page 1)

This PCA plot projects the samples onto their first two principal components, so that samples with similar expression profiles fall close together. In the performed PCA, the replicates of undifferentiated meristem tissue cluster tightly together, and so do the young leaf replicates. Along principal component 2, the meristem and young leaf samples still group closely within their tissue types, but the mature leaf samples begin to spread apart.
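To make the idea of a principal component concrete, here is a minimal sketch that finds the first principal component of a tiny expression table by power iteration. The sample names and values are invented for illustration, and this is not the procedure DESeq2 or Galaxy runs internally; it only shows why replicates with similar profiles end up with similar scores.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (PC1 direction, per-sample PC1 scores) via power iteration.

    rows: one list of expression values per sample (samples x genes).
    Assumes the data are not constant and that the start vector is not
    orthogonal to PC1; a sketch, not a general-purpose PCA.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix between genes.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dims)]
           for a in range(dims)]
    vec = [1.0] + [0.0] * (dims - 1)
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(dims)) for c in centered]
    return vec, scores

# Hypothetical log-scale expression of two genes in four samples.
names = ["SAM_Rep1", "SAM_Rep2", "Leaf_Rep1", "Leaf_Rep2"]
table = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]

pc1, scores = first_principal_component(table)
for name, score in zip(names, scores):
    print(f"{name:>10}  PC1 = {score:.2f}")
```

In this toy table the two meristem replicates receive nearly identical PC1 scores, as do the two leaf replicates, which is the same kind of grouping seen in the plot.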

Heatmap (page 2)

This heatmap shows sample-to-sample distances, which represent how similar the samples of each tissue type are to one another. The darker the color, the smaller the distance and the more similar the samples. Consistent with the PCA figure, the SAM (meristem) samples are highly similar to each other, shown by the dark blue between them, while the leaf samples are less similar to each other, shown by a lighter blue. As in the PCA figure, the only sample that differs from the rest of its group is Mature Leaf Rep2.
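As a rough illustration of what a sample-to-sample distance is, the sketch below computes Euclidean distances between small expression profiles. The sample names and numbers are made up and are not taken from this dataset; the heatmap itself is built from transformed counts produced by the tool.

```python
import math

# Hypothetical log-scale expression values, one list of genes per sample.
# These numbers are invented for illustration only.
samples = {
    "SAM_Rep1": [5.0, 8.0, 2.0],
    "SAM_Rep2": [5.0, 8.0, 3.0],
    "MatureLeaf_Rep1": [9.0, 4.0, 2.0],
}

def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles):
    """Return {(name_i, name_j): distance} for every pair of samples."""
    return {
        (i, j): euclidean(profiles[i], profiles[j])
        for i in profiles
        for j in profiles
    }

dist = distance_matrix(samples)
for (i, j), d in dist.items():
    print(f"{i:>16} vs {j:<16} {d:.2f}")
```

A small distance, such as the one between the two hypothetical SAM replicates here, corresponds to a dark cell in the heatmap.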

Gene Dispersion plot (page 3)

This plot shows the dispersion estimates. It plots each gene's dispersion, a measure of how variable its counts are, against its mean of normalized counts. Dispersion tends to decrease as the mean normalized count increases. The red line is the fitted trend of dispersion as a function of the mean. Each dot on the graph is an expressed gene. The points should scatter roughly around the fitted line.
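To show how dispersion relates to the mean and variance, here is a minimal sketch based on the negative binomial relationship variance = mean + dispersion × mean², solved for the dispersion. This simple method-of-moments estimate is an assumption made for illustration; DESeq2 uses a more elaborate estimation with shrinkage [2]. The counts are invented.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion: solve var = mu + alpha * mu**2 for alpha."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    if mu == 0:
        return 0.0
    # Clamp at zero: counts less variable than Poisson give alpha = 0.
    return max((var - mu) / mu ** 2, 0.0)

# Hypothetical normalized counts for two genes across four samples.
low_expressed = [2, 8, 3, 11]
high_expressed = [900, 1100, 950, 1050]

print(f"low-count gene:  {moment_dispersion(low_expressed):.4f}")
print(f"high-count gene: {moment_dispersion(high_expressed):.4f}")
```

With these made-up values the low-count gene has the larger dispersion, matching the downward trend seen in the plot.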

Histogram (page 4)

This figure is a histogram of the p-values for each gene being compared. Over 20,000 genes fall at the low end of the histogram and are likely significant at p < 0.05. These p-values are unadjusted, so adjusting them for multiple testing would give more conservative values. Taking this extra step is recommended, using either a permissive or a conservative p-value adjustment.
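The text does not say which adjustments are meant, so the sketch below assumes two common choices: Benjamini-Hochberg (a more permissive, false-discovery-rate adjustment) and Bonferroni (a more conservative one). The raw p-values are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def bonferroni(pvalues):
    """Return Bonferroni adjusted p-values, capped at 1."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical unadjusted p-values
bh = benjamini_hochberg(raw)
bonf = bonferroni(raw)

for p, a, b in zip(raw, bh, bonf):
    print(f"raw={p:.2f}  BH={a:.4f}  Bonferroni={b:.2f}")
```

Every adjusted value is at least as large as the raw one, and the Bonferroni values are never smaller than the Benjamini-Hochberg values, which is why fewer genes pass a 0.05 cutoff after adjustment.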

MA plot (page 5)

This plot shows differentially expressed genes. Each dot is a gene. Blue dots with a positive fold change are significantly upregulated, and blue dots with a negative fold change are significantly downregulated.
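As a sketch of where each dot's position comes from, the code below computes a mean of normalized counts and a log2 fold change for a few genes. The gene names, counts, and the pseudocount of 1 are assumptions for illustration; DESeq2 estimates fold changes with its own model [2], and the significance that determines the blue coloring is not computed here.

```python
import math

def ma_point(control, treated, pseudocount=1.0):
    """Return (mean of all counts, log2 fold change treated vs control).

    The pseudocount avoids division by zero and is an illustrative choice,
    not part of DESeq2's estimation.
    """
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    base_mean = (sum(control) + sum(treated)) / (len(control) + len(treated))
    log2_fc = math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))
    return base_mean, log2_fc

# Hypothetical normalized counts: (control replicates, treated replicates).
genes = {
    "gene_up": ([7, 7], [31, 31]),
    "gene_down": ([63, 63], [15, 15]),
    "gene_flat": ([20, 20], [20, 20]),
}

points = {name: ma_point(c, t) for name, (c, t) in genes.items()}
for name, (base_mean, log2_fc) in points.items():
    print(f"{name:>10}  mean={base_mean:.1f}  log2FC={log2_fc:+.2f}")
```

Genes above zero on the fold change axis are higher in the treated group and genes below zero are lower, which is how the up and down sides of the plot are read.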

DeSeq CCV W82.pdf

[1] Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Cech M, Chilton J, Clements D, Coraor N, Grüning BA, Guerler A, Hillman-Jackson J, Hiltemann S, Jalili V, Rasche H, Soranzo N, Goecks J, Taylor J, Nekrutenko A, Blankenberg D. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018 Jul 2;46(W1):W537-W544. doi: 10.1093/nar/gky379.
[2] Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8.
