In this step I use FastQC to make sure that the data I have are of good quality and meet the requirements for downstream work.
FastQC is a program that provides a simple yet comprehensive way to run quality control checks on raw sequence data from high-throughput sequencing platforms such as Illumina and PacBio.
# Load conda in MobaXterm or a terminal and activate the FastQC environment
module load conda
conda activate /usr/local/usrapps/bitcpt/fastqc
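Before going further, it can help to confirm that activating the environment actually put `fastqc` on the PATH. The following is a minimal sketch in POSIX shell; the helper name `require_tool` is hypothetical and is not part of conda or FastQC.

```shell
# require_tool: hypothetical helper; succeeds only if the named command is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (is the conda environment active?)" >&2
        return 1
    fi
}

# Usage, after the conda activate step:
#   require_tool fastqc && fastqc --version
```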
# View the help manual by running
fastqc -h
# Script for running the job:
#!/bin/tcsh
#BSUB -J fastqc_At_GroupName #job name
#BSUB -n 20 #number of cores (matches -t 20 below)
#BSUB -W 2:00 #wall-clock time limit for the job (hours:minutes)
#BSUB -o fastqc.out.%J #output file
#BSUB -e fastqc.err.%J #error file
fastqc /share/bitcpt/Fall2022/RawData/Arabidopsis_thaliana/* -t 20 --outdir ./fastqc
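According to the FastQC help text, the output directory must already exist because the program will not create it. A small pre-flight check before submitting the job could look like the sketch below. It is written in POSIX shell rather than tcsh, the helper name `preflight` is hypothetical, and the `.fq.gz` suffix is assumed from the file names used in these notes.

```shell
# preflight: hypothetical helper that checks the inputs and prepares the output directory.
# $1 = directory holding the raw reads, $2 = FastQC output directory
preflight() {
    in_dir=$1
    out_dir=$2
    n=0
    # Count gzipped FASTQ files; an unmatched glob stays literal, so test that each path exists.
    for f in "$in_dir"/*.fq.gz; do
        [ -e "$f" ] && n=$((n + 1))
    done
    if [ "$n" -eq 0 ]; then
        echo "no .fq.gz files found in $in_dir" >&2
        return 1
    fi
    # Create the output directory only once we know there is something to analyze.
    mkdir -p "$out_dir"
    echo "$n input file(s); output directory ready: $out_dir"
}

# Usage, from the directory where the job will be submitted:
#   preflight /share/bitcpt/Fall2022/RawData/Arabidopsis_thaliana ./fastqc
```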
# Copy the script from Dr. Sjogren into the “At” directory (run this from inside that directory)
cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh .
# Repeat in the Tom directory, copying the script under a new name
cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh Tom.fastqc.sh
# The cp command can rename the copied file if you give the new file name as the last argument
# If you copied the file without giving a new name, you can rename it afterwards with the mv command
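The difference between the two approaches can be tried safely in a throwaway directory. This is only an illustration: the one-line stand-in script and the name `Renamed.fastqc.sh` are made up for the demo.

```shell
# Work in a temporary directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir scripts
echo '#BSUB -J fastqc_At_GroupName' > scripts/At.fastqc.sh   # stand-in for the shared script

cp scripts/At.fastqc.sh Tom.fastqc.sh   # copy and rename in one step
cp scripts/At.fastqc.sh .               # copy only; the name stays At.fastqc.sh
mv At.fastqc.sh Renamed.fastqc.sh       # rename afterwards with mv

ls   # list the resulting files
```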
# The script can be edited with vi
vi Tom.fastqc.sh
# The file was changed so that it points to the Tom (tomato) files
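As an alternative to editing by hand, the same kind of change can be made non-interactively with `sed`. This is a sketch only: the helper name `retarget` is hypothetical, and the strings to swap depend on what the tomato job name and input directory are actually called.

```shell
# retarget: hypothetical helper that copies a job script while swapping one string for another.
# $1 = source script, $2 = new script, $3 = text to replace, $4 = replacement text
retarget() {
    # '|' is the sed delimiter so that paths containing '/' need no escaping.
    # Note: $3 is treated as a regular expression, so keep it to plain text.
    sed "s|$3|$4|g" "$1" > "$2"
}

# Usage (the replacement values are placeholders, not the real names):
#   retarget At.fastqc.sh Tom.fastqc.sh fastqc_At fastqc_Tom
#   diff At.fastqc.sh Tom.fastqc.sh    # review exactly which lines changed
```

Running `diff` afterwards is a cheap way to confirm that only the intended lines differ before submitting the job.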
# Submit the jobs for both At and Tom
bsub < At.fastqc.sh
and
bsub < Tom.fastqc.sh
# Check the directory for the output and error files being generated, or run bjobs to check on jobs in progress
# If the job was terminated, inspect the error file for errors and things to correct (replace JOB# with the job ID)
more fastqc.err.JOB#
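When there are several error files, scanning them for suspicious lines can be quicker than paging through each one. FastQC typically writes its progress messages to the error stream, so a non-empty error file does not by itself mean the job failed. The sketch below assumes the `fastqc.err.*` naming from the job script; the helper name `scan_logs` and the keyword list are my own and only a starting point.

```shell
# scan_logs: hypothetical helper that prints lines hinting at a failure.
# $1 = directory containing the fastqc.err.* files
scan_logs() {
    # -i ignores case, -n shows line numbers, -E enables the alternation pattern.
    # /dev/null is added so grep always prefixes matches with the file name.
    grep -i -n -E 'error|exception|failed|no such file|killed' "$1"/fastqc.err.* /dev/null
}

# Usage, from the directory where the job was submitted:
#   scan_logs .
```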
Transfer the FastQC reports for viewing; complete this step via the Globus File Manager.
Col-0_Leaf_Rep1_1.fq.gz
Basic statistics - Sequence length = 100 bp
Per base sequence quality - Good (all green)
Per sequence quality scores - Mean = 36
Per base sequence content - Good (the lines level off after the initial noise)
Per sequence GC content - Good (still follows a bell curve; 47%)
Per base N content - Good (flat at 0)
Sequence length distribution - Good
Sequence duplication levels - Bad (percent of sequences remaining if deduplicated = 22.38%)
Overrepresented sequences - Warning (5 sequences listed)
Adapter content - Good
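The module-by-module review above can also be pulled from the `summary.txt` file that FastQC writes inside each `*_fastqc.zip` archive, where every line holds a status (PASS, WARN or FAIL), the module name and the input file name, separated by tabs. A minimal sketch for listing only the modules that need attention; the helper name `flagged_modules` is hypothetical.

```shell
# flagged_modules: hypothetical helper listing every module whose status is not PASS.
# $1 = path to a FastQC summary.txt (tab-separated: status, module, file name)
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { printf "%s\t%s\n", $1, $2 }' "$1"
}

# Usage, assuming the reports were written to ./fastqc:
#   unzip -o fastqc/Col-0_Leaf_Rep1_1_fastqc.zip -d fastqc
#   flagged_modules fastqc/Col-0_Leaf_Rep1_1_fastqc/summary.txt
```

This makes it easier to compare many samples at once than opening each HTML report.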
The FastQC results for the 3x tomato sequences are similar in quality to those for the Arabidopsis sequences we analyzed. We still need to trim most of the sequences, but the overall quality is very good.
These sequences are good enough for RNA-seq analysis: some of the warnings are characteristic of RNA-seq data, and the 'Per base sequence content' can be improved by trimming. Specific trimming measures are discussed in the following section.