Final Student Portfolio

FastQC Analysis - Examples given for an Arabidopsis leaf

Overview

In this step I use FastQC to check that my data is of good quality and meets the requirements for downstream work.

What is FastQC?

FastQC is a program that provides a simple, comprehensive way to run quality-control checks on raw sequence data from high-throughput sequencing pipelines such as Illumina and PacBio.

Running FastQC

# Load conda in MobaXterm (or another terminal) and activate the FastQC environment

module load conda

conda activate /usr/local/usrapps/bitcpt/fastqc

# View the help manual with

fastqc -h
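Before asking for the help text, it can be useful to confirm that the environment activation actually put fastqc on the PATH. The following is a minimal sketch; the helper name require_tool is hypothetical and not part of the course scripts.

```shell
# Hypothetical helper: succeed only if the named program is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (is the conda environment activated?)" >&2
        return 1
    fi
}
```

A possible use is `require_tool fastqc && fastqc --version`, which prints the installed FastQC version only when the program is available.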

# Script for running the job:

#!/bin/tcsh

#BSUB -J fastqc_At_GroupName #job name

#BSUB -n 20 #number of cores (job slots)

#BSUB -W 2:0 #wall-clock time limit for the job (hours:minutes)

#BSUB -o fastqc.out.%J #output file

#BSUB -e fastqc.err.%J #error file

fastqc /share/bitcpt/Fall2022/RawData/Arabidopsis_thaliana/* -t 20 --outdir ./fastqc
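The job script above passes an output directory to FastQC. FastQC expects that directory to exist already, so a wrapper can create it first. This is only a sketch under assumptions: the function name run_fastqc, the FASTQC_BIN override, and the .fq.gz file pattern are illustrative and not part of the course script.

```shell
# Hypothetical wrapper: run FastQC on every .fq.gz file in a directory.
# FASTQC_BIN defaults to "fastqc"; point it at another command for a dry run.
run_fastqc() {
    in_dir=$1            # directory holding the raw reads
    out_dir=$2           # directory for the FastQC reports
    threads=${3:-1}      # number of files processed in parallel
    # FastQC does not create the output directory itself, so create it first.
    mkdir -p "$out_dir" || return 1
    "${FASTQC_BIN:-fastqc}" -t "$threads" -o "$out_dir" "$in_dir"/*.fq.gz
}
```

Inside the job script this could be called as `run_fastqc /share/bitcpt/Fall2022/RawData/Arabidopsis_thaliana ./fastqc 20`, keeping the thread count equal to the number of cores requested with `#BSUB -n`.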

# Copy the script from Dr Sjogren into the “At” directory (run this from inside that directory)

cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh .

# Copy the script again into the Tom directory, giving it a new name

cp /share/bitcpt/Fall2022/scripts/At.fastqc.sh Tom.fastqc.sh

# cp can rename the copied file if you give the new name as the last argument of the command

# If you copied the file without giving it a new name, you can rename it afterwards with mv
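The two ways of renaming described above can be tried safely in a scratch directory. In this sketch the file contents and the names At_copy.sh and Tom_renamed.fastqc.sh are placeholders chosen for illustration.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
printf '#!/bin/tcsh\n' > "$workdir/At.fastqc.sh"   # stand-in for the shared script

# Option 1: copy and rename in one step (new name is the last argument).
cp "$workdir/At.fastqc.sh" "$workdir/Tom.fastqc.sh"

# Option 2: copy first, then rename the copy with mv.
cp "$workdir/At.fastqc.sh" "$workdir/At_copy.sh"
mv "$workdir/At_copy.sh" "$workdir/Tom_renamed.fastqc.sh"

ls "$workdir"
```

After mv, the old name At_copy.sh no longer exists, while cp always leaves the original file in place.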

# The script can be edited with vi

vi Tom.fastqc.sh

# The script was edited so that it points to the Tom files

# Submit the jobs for both At and Tom

bsub < At.fastqc.sh

bsub < Tom.fastqc.sh

# Check the directory for error files being generated, or use the bjobs command to check on jobs in progress

# If the job was terminated, inspect the error file for errors and things to correct (replace JOB# with the job ID)

more fastqc.err.JOB#
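Reading a long error file with more can be slow, and the file may also contain routine progress messages, so being non-empty does not by itself mean the job failed. A rough filter can pull out the lines that look like problems. The keyword list below is an assumption for illustration, not an official list of FastQC or LSF messages, and scan_err_file is a hypothetical helper name.

```shell
# Hypothetical helper: print numbered lines of an error file that look like problems.
# Exits non-zero (grep's "no match" status) when nothing suspicious is found.
scan_err_file() {
    grep -n -i -E 'error|exception|fail|skipping|terminated' "$1"
}
```

For example, `scan_err_file fastqc.err.JOB#` (with JOB# replaced by the job ID) lists only the suspicious lines; any line it reports should still be read in context in the full file.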

Saving files from HPC

Transfer the FastQC output files from the HPC to your own computer using the Globus File Manager.

FastQC Analysis - Examples given for an Arabidopsis leaf

Col-0_Leaf_Rep1_1.fq.gz

Basic statistics - sequence length = 100 bp

Per base sequence quality - Good = all green

Per sequence quality scores - Mean = 36

Per base sequence content - Good = lines level off after the initial noise

Per sequence GC content - Good because it still follows a bell curve (47%)

Per base N content - horizontal at 0, good

Sequence length distribution - Good

Sequence duplication levels - Bad (percent of sequences remaining if deduplicated = 22.38%)

Overrepresented sequences - 5 sequences listed, Warning

Adapter content - Good
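The module-by-module results above can also be read from the summary.txt file that FastQC places inside each report archive, where every line holds a status (PASS, WARN or FAIL), the module name, and the file name, separated by tabs. The sketch below lists only the modules that did not pass; flagged_modules is a hypothetical helper name.

```shell
# Hypothetical helper: print status and module name for every non-PASS module.
# Reads a FastQC summary.txt given as an argument, or standard input if none is given.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { printf "%s\t%s\n", $1, $2 }' "$@"
}
```

Assuming the default report naming, something like `unzip -p fastqc/Col-0_Leaf_Rep1_1_fastqc.zip '*/summary.txt' | flagged_modules` would print the flagged modules for the leaf sample without unpacking the archive; the archive name here is a guess based on the input file name.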

Tomato - FastQC Analysis

The FastQC data we received for the 3x tomato sequences is of similar quality to the data we analyzed for the Arabidopsis sequences. We still need to trim most of the sequences, but overall quality is very good.

Conclusions

These sequences are good enough for RNA-seq analysis, since some of the warnings are characteristic of RNA-seq data and the 'Per base sequence content' can be improved by trimming. Specific trimming measures are discussed in the following section.

Solanum lycopersicum FastQC


Get in touch at (mmohamm8@ncsu.edu)