In this section, I include examples of the analyses we covered in class. FastQC was used to assess the quality of our sequencing data. The statistical analysis of my sequences was performed in Galaxy, a web-based platform with a graphical interface. Within Galaxy we used the DESeq2 tool, which is an R package. Comparisons were made between Peacevine cherry tomato, wild-type (Wt), and tomato meristem and leaf samples.
I ran the analysis of my own data on Galaxy; however, when I went to check the files I thought I had downloaded, they were apparently corrupted. So, for the Galaxy analysis, I will include examples from my Solanum analysis instead.
Col-0_Leaf_Rep1_1.fq.gz
Basic statistics - sequence length = 100 bp
Per base sequence quality - Good = all green
Per sequence quality scores - Mean = 36
Per base sequence content - Good = levels off after the initial noise
Per sequence GC content - Good because it still follows a bell curve (47%)
Per base N content - horizontal at 0, good
Sequence length distribution - Good
Sequence duplication levels - Bad (percent of sequences remaining if deduplicated = 22.38%)
Overrepresented sequences - 5 sequences listed, Warning
Adapter content - Good
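To make two of the numbers in the list above more concrete, here is a minimal sketch of how a mean per-read quality score and a "percent remaining if deduplicated" value could be computed. It assumes Phred+33 quality encoding and counts only exact duplicate reads. The function names and the toy reads are hypothetical, and FastQC itself estimates duplication from a subset of the file, so this would not reproduce its exact figure.

```python
from collections import Counter


def mean_phred(quality_line, offset=33):
    """Mean Phred score of one read, assuming Phred+33 ASCII encoding."""
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def percent_remaining_if_deduplicated(sequences):
    """Distinct sequences as a percentage of all sequences (exact matches only)."""
    counts = Counter(sequences)
    return 100.0 * len(counts) / len(sequences)


# Hypothetical toy reads, not taken from the real FASTQ file.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC"]

print(percent_remaining_if_deduplicated(reads))  # 3 distinct reads out of 5
print(mean_phred("IIII"))  # 'I' is ASCII 73, so each base scores 73 - 33 = 40
```

A low percentage means many reads are copies of each other, which is common in RNA-seq because highly expressed transcripts are sequenced many times.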
The FastQC results for the 3x tomato sequences are similar in quality to those for the Arabidopsis sequences we analyzed. Most of the sequences still need to be trimmed, but overall quality is very good.
These sequences are good enough for RNA-seq analysis because some of the warnings, such as high sequence duplication, are characteristic of RNA-seq data, and the 'Per base sequence content' result can be improved by trimming. Specific trimming measures will be discussed further in the following section.
Principal Component Analysis
This figure projects the datasets we loaded onto principal components, so the distance between points reflects how similar the samples are. The closer two points are to each other, the more similar the samples; the further apart they are, the more different they are from one another. From this, I can see that the SAM (shoot apical meristem) samples are more similar to one another than the leaf samples are.
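The idea behind the figure can be sketched in a few lines. This is only an illustration under my own assumptions, not the DESeq2 implementation: the expression values are made up, they are assumed to be already log-transformed, and only the first principal component is computed, using power iteration on the sample-by-sample Gram matrix.

```python
import math


def center_columns(rows):
    """Subtract each gene's (column's) mean so every gene is centered at zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_pc_scores(rows, iterations=200):
    """Score of each sample on PC1, via power iteration on the Gram matrix."""
    centered = center_columns(rows)
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    # Start from a non-constant vector: a constant vector is mapped to zero
    # by the Gram matrix of column-centered data.
    vec = [1.0] + [0.0] * (len(rows) - 1)
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(gram_row, vec)) for gram_row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    eigenvalue = sum(v * sum(g * w for g, w in zip(gram_row, vec))
                     for v, gram_row in zip(vec, gram))
    # Sample scores on PC1 are the unit eigenvector scaled by the singular value.
    return [v * math.sqrt(eigenvalue) for v in vec]


# Hypothetical log-scale expression values (rows = samples, columns = genes).
names = ["SAM_1", "SAM_2", "Leaf_1", "Leaf_2"]
values = [
    [8.0, 2.0, 5.0, 7.0],
    [8.2, 2.1, 5.1, 6.9],
    [3.0, 7.0, 5.0, 2.0],
    [2.5, 7.8, 4.6, 2.6],
]

for name, score in zip(names, first_pc_scores(values)):
    print(name, round(score, 2))
```

With these toy values the two SAM samples land on one side of PC1 and the two leaf samples on the other. The sign of a principal component is arbitrary, so only the relative positions matter.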
Heatmap
This analysis shows sample-to-sample distances, which represent how similar the samples of each tissue type are. The darker the color in the heatmap, the smaller the distance and the higher the similarity between the samples. Consistent with the Principal Component figure, the SAM samples are highly similar to each other, shown by the dark blue color, whereas the leaf samples are less similar to each other, shown by a lighter blue.
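As a rough sketch of what goes into such a heatmap, the distance between every pair of samples can be computed as a Euclidean distance across genes. The sample names and values below are hypothetical and assumed to be already transformed; `math.dist` needs Python 3.8 or newer.

```python
import math

# Hypothetical transformed expression values, one list of genes per sample.
samples = {
    "SAM_1": [8.0, 2.0, 5.0, 7.0],
    "SAM_2": [8.2, 2.1, 5.1, 6.9],
    "Leaf_1": [3.0, 7.0, 5.0, 2.0],
    "Leaf_2": [2.5, 7.8, 4.6, 2.6],
}


def distance_matrix(data):
    """Euclidean distance between every pair of samples, keyed by name pair."""
    names = list(data)
    return {(a, b): math.dist(data[a], data[b]) for a in names for b in names}


distances = distance_matrix(samples)
for pair in [("SAM_1", "SAM_2"), ("Leaf_1", "Leaf_2"), ("SAM_1", "Leaf_1")]:
    print(pair, round(distances[pair], 3))
```

In this toy example the SAM pair has the smallest distance, the leaf pair a larger one, and the SAM-to-leaf pair the largest, which is the same pattern described for the real heatmap.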
Gene Dispersion plot
This plot shows the dispersion estimates. At first glance there is a red trend line (the fitted curve), with blue points (the final estimates) on both sides of it and black points (the gene-wise estimates) scattered further out. The figure plots each gene's dispersion against its mean of normalized counts. The trend should start high at low mean counts and fall as the mean increases. Roughly, the points should lie around the fitted line.
Our data follow this trend: dispersion decreases as the mean of normalized counts increases.
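To show where a dispersion value comes from, here is a simplified sketch. Under a negative binomial model the variance is mean + dispersion × mean², so a crude method-of-moments estimate is (variance − mean) / mean². DESeq2 uses a more careful estimate with shrinkage toward the fitted line, so this is only an illustration; the counts are made up and assumed to be already normalized.

```python
import statistics


def moment_dispersion(counts):
    """Crude dispersion estimate from var = mean + dispersion * mean**2."""
    mean = statistics.mean(counts)
    if mean == 0:
        return 0.0  # no counts at all, nothing to estimate
    variance = statistics.variance(counts)  # sample variance (n - 1)
    # Clamp at zero: variance below the mean would give a negative value.
    return max((variance - mean) / mean ** 2, 0.0)


# Hypothetical normalized counts for two genes across four samples.
low_count_gene = [2, 9, 1, 12]
high_count_gene = [1000, 1100, 950, 1050]

print(round(moment_dispersion(low_count_gene), 4))
print(round(moment_dispersion(high_count_gene), 4))
```

The low-count gene gets a much larger dispersion than the high-count gene, which matches the downward trend in the plot.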
Histogram
This figure is a histogram of the p-values for each gene being compared. Over 14,000 genes appear to fall under the 0.05 threshold. These are unadjusted p-values; adjusting them for multiple testing would give more conservative values (fewer than 14,000 significant genes) and show more reliably what is changing the most between these groups. Some data transformation may be required to make the distribution less skewed. Taking the extra step to adjust these p-values is recommended.
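One common way to adjust p-values is the Benjamini-Hochberg procedure, which is also the default method DESeq2 uses for its adjusted p-value column. The sketch below is a minimal version written for illustration, with made-up p-values rather than values from this analysis.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print([round(p, 4) for p in adjusted])
print(significant)
```

Each adjusted value is at least as large as the raw one, so fewer genes pass the 0.05 threshold after adjustment than before.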
MA plot
The plot looks like a fish bone. It shows fold change: how much higher or lower transcript abundance is in the leaf vs. meristem comparison, with upregulation or downregulation of genes represented by the log fold change. Moving along the x-axis, the mean count increases and the spread of fold changes becomes smaller. Because there are more counts further along the axis, the differences seen there are more likely to be significant. From what I can observe, there seem to be more downregulated points; however, that is up for debate.
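The two coordinates of each point in this plot can be sketched as follows: the x value is the mean count across all samples and the y value is the log2 fold change between the two groups. This is a simplified illustration, not how DESeq2 estimates fold changes; the pseudocount, the function name, and the counts are my own assumptions, and which group goes in the numerator is a choice.

```python
import math


def ma_point(group_a, group_b, pseudocount=1.0):
    """Return (mean count over all samples, log2 fold change of A over B)."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    mean_all = (sum(group_a) + sum(group_b)) / (len(group_a) + len(group_b))
    # The pseudocount (an assumption here) avoids dividing by zero.
    log2_fc = math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    return mean_all, log2_fc


# Hypothetical normalized counts for one gene in two tissues.
meristem = [31, 31]
leaf = [7, 7]

print(ma_point(meristem, leaf))  # positive fold change: higher in meristem
print(ma_point(leaf, meristem))  # same gene, comparison flipped
```

A positive value means the gene is higher in the first group and a negative value means it is lower, so flipping the comparison flips the sign of every point.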