CPT | SPRING 2023 - FastQC Analysis

reads quality analysis

Picture obtained from https://pxhere.com/

LIL LEAF GROUP

FastQC Glycine max

FastQC was run on all of the sequencing reads provided in the RawData/Glycine_Max folder by submitting a job script on the HPC. Each team member ran this job from the directory /share/bitcpt/S23/UnityID/Soy. The following job script, provided by the instructors, was used to generate the FastQC report for each set of reads.

Job Script

#!/bin/tcsh
#BSUB -J fastqc_Soy_LilLeaf  # job name
#BSUB -n 20                  # number of cores (one per FastQC thread)
#BSUB -W 2:0                 # wall-clock limit for the job (hours:minutes)
#BSUB -o fastqc.out.%J       # standard output file (%J is the job ID)
#BSUB -e fastqc.err.%J       # standard error file

# Run FastQC on all of the Soy samples.
# Run from the working directory /share/bitcpt/S23/UnityID/Soy,
# which must already contain a subdirectory named fastqc.

module load conda
conda activate /usr/local/usrapps/bitcpt/fastqc

# -t specifies the number of threads; -outdir specifies where the reports are written
fastqc /share/bitcpt/Spring2022/RawDataGlycine_max/* -t 20 -outdir ./fastqc
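Because the job script writes its reports to ./fastqc, that subdirectory has to exist before the job starts. The following is a minimal sketch of a submission helper, written in POSIX shell rather than tcsh. The function name submit_fastqc_job and the script file name fastqc.csh are hypothetical; `bsub < script` is the usual way to submit a script containing #BSUB lines to LSF.

```shell
# Hypothetical helper: create the output directory the job script expects,
# then hand the script to LSF. Run it from the working directory.
submit_fastqc_job() {
    script="$1"
    if [ ! -f "$script" ]; then
        echo "job script not found: $script" >&2
        return 1
    fi
    mkdir -p fastqc || return 1   # reports are written to ./fastqc
    bsub < "$script"              # LSF reads the #BSUB directives from stdin
}
```

From the Soy working directory this would be called as `submit_fastqc_job fastqc.csh`, assuming the job script was saved under that name.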

FastQC Reports LIL LEAF GROUP (young leaves)

Once all of the reports had been generated in each team member's directory, /share/bitcpt/S23/UnityID/Soy/fastqc, the Young Leaf files were transferred to each individual's personal computer using Globus Connect. FastQC produces two files for each set of reads: a zip file and an html file. Only the html files were transferred and, after the transfer was confirmed, they were opened in a web browser. These reports can then be used to determine whether the sequencing reads are of good quality. If the quality is poor toward the beginning or the end of the reads, those regions can be trimmed to give cleaner data for downstream analysis. The following analysis modules were evaluated for each Young Leaf sample:
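Before transferring anything, it can help to confirm that every input file produced both outputs. The sketch below assumes the raw reads end in .fastq.gz (the file extension is not stated above) and relies on FastQC's convention of naming its outputs NAME_fastqc.html and NAME_fastqc.zip. The function name missing_fastqc_reports is hypothetical.

```shell
# Print the base name of every input FASTQ file that lacks an HTML report
# or a zip archive in the FastQC output directory.
# Usage: missing_fastqc_reports <input_dir> <fastqc_output_dir>
missing_fastqc_reports() {
    indir="$1"
    outdir="$2"
    for f in "$indir"/*.fastq.gz; do
        [ -e "$f" ] || continue                   # the glob matched nothing
        base="$(basename "$f" .fastq.gz)"
        if [ ! -f "$outdir/${base}_fastqc.html" ] || [ ! -f "$outdir/${base}_fastqc.zip" ]; then
            echo "$base"
        fi
    done
}
```

If every read file has both outputs, the function prints nothing.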

Basic Statistics

This category does not tell us anything about the quality of the data; it gives summary information about the data, such as the file name, sequence length, and %GC. An example of this can be seen in the image to the right.

Per Base Sequence Quality

This category is presented as a graph in which the y-axis shows the quality score of the base calls and the x-axis shows the position in the read. Green indicates good-quality calls, yellow indicates acceptable calls, and red indicates poor-quality calls. An example of this is shown in the image to the left. For each of the Young Leaf Soy samples analyzed, all base calls were within the green zone, indicating that they were all good-quality calls.
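The same check can be made without opening each graph, because the zip file for each sample contains a tab-separated fastqc_data.txt with the per-position values behind the plot. The sketch below lists positions whose mean quality falls under a cutoff. The function name low_quality_positions is hypothetical, and the default cutoff of 28 is an assumption taken to be the lower edge of the green zone.

```shell
# Print "position<TAB>mean quality" for every position in the
# "Per base sequence quality" module whose mean is below the cutoff.
# Usage: low_quality_positions <fastqc_data.txt> [cutoff]
low_quality_positions() {
    data_file="$1"
    cutoff="${2:-28}"     # assumed boundary of the green zone
    awk -F '\t' -v min="$cutoff" '
        /^>>Per base sequence quality/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ && ($2 + 0) < min { print $1 "\t" $2 }
    ' "$data_file"
}
```

No output means every position has a mean quality at or above the cutoff, which matches what was seen in the Young Leaf reports.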

Per Sequence Quality Scores

This graph shows the distribution of the mean quality score across all reads. A peak around 36 indicates good quality. If the peak is at 27 or lower, it indicates poorer-quality reads. An example is shown to the right. All of the samples analyzed peaked at or close to 36, which is expected and indicates that the reads are of good quality.
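To see what these scores mean, a Phred quality score Q corresponds to a base-calling error probability of P = 10^(-Q/10). The short sketch below does that conversion with awk; the function name phred_to_error_prob is hypothetical.

```shell
# Convert a Phred quality score to the probability that the base call is wrong.
# Usage: phred_to_error_prob <Q>
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

phred_to_error_prob 36   # prints 0.000251
phred_to_error_prob 27   # prints 0.001995
```

A score of 36 is therefore roughly one expected error in 4,000 base calls, while a score of 27 is roughly one in 500.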

Per Base Sequence Content

This category looks at the proportion of each of the four bases at every position in the reads. According to Babraham Bioinformatics, libraries produced using random hexamer primers, such as those used in RNA-Seq, "inherit an intrinsic bias in the positions at which reads start." Therefore, if the beginning of the graph appears erratic, this is no cause for concern. All of the Young Leaf soy samples showed large differences among the bases in the first 10-15 nucleotides. This is not concerning, because a non-uniform distribution of bases over the first 10-15 nucleotides is expected with RNA-seq library prep. An example can be seen on the left.

Per Sequence GC Content

Sequences are generally expected to contain roughly equal amounts of GC and AT. In RNA-seq, however, introns, which are typically AT-rich, have been removed, so the %GC is usually higher than anticipated. This is no cause for concern. In all of the samples, the GC distribution appeared slightly narrower, with a higher peak, than the theoretical distribution, but not to a degree that warranted further investigation. An example can be seen on the right.
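As an illustration of what the overall %GC value measures, the sketch below counts G and C bases across every read in a FASTQ file. It assumes uncompressed, four-line FASTQ records, and the function name fastq_gc_percent is hypothetical. It is not FastQC's own code and does not produce the per-read distribution shown in the graph.

```shell
# Print the overall %GC of all bases in a four-line-per-record FASTQ file.
# Reads standard input when no file is given.
# Usage: fastq_gc_percent [reads.fastq]
fastq_gc_percent() {
    awk 'NR % 4 == 2 {                           # line 2 of each record is the sequence
             seq = $0
             total += length(seq)
             gc += gsub(/[GCgc]/, "", seq)       # gsub returns the number of G/C bases
         }
         END {
             if (total > 0) printf "%.1f\n", 100 * gc / total
             else print "NA"
         }' "${1:--}"
}
```

For gzipped reads, the file can be streamed in, for example `gzip -dc reads.fastq.gz | fastq_gc_percent`, where reads.fastq.gz is a placeholder name.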

Per Base N Content

A base is marked as N when the sequencer is unable to make a base call with sufficient confidence. A high percentage of Ns is a concern for sequence alignment. None of the samples analyzed appeared to have any N base calls, which means that the sequencer was able to call every base in each read. An example of this can be seen on the left.

Sequence Length Distribution

We expect all of the sequence reads to be about the same length. From the Basic Statistics section, we know that all of the sequences analyzed were 100 bp long. This section showed that the sequence length of each sample peaked at 100, which is expected. An example of this can be seen on the right.

Sequence Duplication Levels

This category is better suited to DNA-seq, which is what FastQC was originally created for. In RNA-seq, highly expressed transcripts naturally produce many identical reads, so a warning in this category is not a cause for concern as long as the graph looks similar to the one on the left.
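To make the idea of duplication concrete, the sketch below reports the percentage of reads that would remain if exact duplicates were removed. It is a simplification and an assumption-laden stand-in for FastQC's estimate: it compares full-length reads across the whole file, so its numbers will not match the report exactly. The function name fastq_dedup_percent is hypothetical, and uncompressed four-line FASTQ records are assumed.

```shell
# Print the percentage of reads remaining after removing exact duplicate sequences.
# Reads standard input when no file is given.
# Usage: fastq_dedup_percent [reads.fastq]
fastq_dedup_percent() {
    awk 'NR % 4 == 2 {
             total++
             if (!seen[$0]++) distinct++         # count each sequence the first time only
         }
         END {
             if (total > 0) printf "%.1f\n", 100 * distinct / total
             else print "NA"
         }' "${1:--}"
}
```

A low percentage from an RNA-seq sample is consistent with a few transcripts dominating the library rather than with a technical problem.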

Overrepresented Sequences

This category checks whether any single sequence makes up an unusually large share of the reads. A large amount of overrepresentation can indicate contamination, but when working with RNA from specialized tissues it can also reflect a highly expressed gene. None of the samples analyzed showed any overrepresented sequences.

Adapter Content

This last section shows, for each position in the read, the cumulative percentage of reads in which a known adapter sequence has been detected. Adapter sequence remaining in the reads would need to be trimmed before alignment. None of the soy samples analyzed showed any adapter content.

Overall, it can be concluded that the sequence reads for the Young Leaf soy data are of high quality and do not need to be trimmed.
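A quick way to back up this kind of conclusion across many samples is to read the summary.txt file inside each FastQC zip, which lists one tab-separated line per module: the status (PASS, WARN, or FAIL), the module name, and the input file name. The sketch below prints only the modules that did not pass; the function name fastqc_flagged_modules is hypothetical.

```shell
# Print "file<TAB>module<TAB>status" for every module that is not PASS.
# Takes one or more summary.txt files, or reads standard input when none are given.
fastqc_flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $3 "\t" $2 "\t" $1 }' "$@"
}
```

On the HPC, the summaries could be streamed straight out of the archives, for example `for z in fastqc/*_fastqc.zip; do unzip -p "$z" '*/summary.txt'; done | fastqc_flagged_modules`. Any warnings listed still need to be judged in context, as with the modules where a warning is expected for RNA-seq data.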

FastQC Reports FOR MERISTEM

Overall, all of the replicates for the mature leaf samples were consistent, and the Phred scores obtained were above 35. Therefore, all of the samples can be used for downstream analysis.

FastQC Reports BIG BOY LEAF (MATURE LEAVES)

Old Leaf Soybean Reads:

Rep 1, R1:

Sequence Length: 100 bp
Phred Scores: 36-37
%GC: 44
Trimming needed? No

Rep 1, R2:

Sequence Length: 100 bp
Phred Scores: 36
%GC: 45
Trimming needed? No

Rep 2, R1:

Sequence Length: 100 bp
Phred Scores: 35
%GC: 44
Trimming needed? No

Rep 2, R2:

Sequence Length: 100 bp
Phred Scores: 36
%GC: 45
Trimming needed? No

Rep 3, R1:

Sequence Length: 100 bp
Phred Scores: 36
Trimming needed? No

Rep 3, R2:

Sequence Length: 100 bp
Phred Scores: 36
Trimming needed? No

Rep 4, R1:

Sequence Length: 100 bp
Phred Scores: 36-37
%GC: 44
Trimming needed? No

Rep 4, R2:

Sequence Length: 100 bp
Phred Scores: 36
%GC: 45
Trimming needed? No

Rep 5, R1:

Sequence Length: 100 bp
Phred Scores: 36-37
%GC: 45
Trimming needed? No

Rep 5, R2:

Sequence Length: 100 bp
Phred Scores: 36-37
%GC: 45
Trimming needed? No

Overall, the results for the mature leaves are consistent between replicates, and all reads showed high Phred scores. Therefore, all of the samples can be used in downstream analysis.
