This is for informational purposes. Fastp trimming will be done by instructors.
Sometimes raw sequence data needs to be cleaned and trimmed before use. To find out what kind of cleaning and trimming is needed, run fastqc and read the fastqc reports.
Our fastqc reports flagged several quality issues that need attention:
Per-base sequence content at the front (beginning) of reads
Unidentified over-represented sequences
Poly-X tails
Sequence duplication
#!/bin/tcsh
#BSUB -J fastp #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastp_%J.out #output file
#BSUB -e fastp_%J.err #error file
# Each line of the fastp command ends with a backslash so the shell reads it as one command
/usr/local/usrapps/bitcpt/fastp/bin/fastp \
-i RNA-seq_input_1.fq.gz \
-I RNA-seq_input_2.fq.gz \
-o RNA-seq_output_1.fp.fq.gz \
-O RNA-seq_output_2.fp.fq.gz \
--json summary_of_results.json \
--html summary_of_results.html \
--length_required 50 \
--detect_adapter_for_pe \
--trim_poly_g --trim_poly_x \
--qualified_quality_phred 15 \
--unqualified_percent_limit 40
-i RNA-seq_input_1.fq.gz
-I RNA-seq_input_2.fq.gz
1) Data
-o RNA-seq_output_1.fp.fq.gz
-O RNA-seq_output_2.fp.fq.gz
2) Summary of results
--json summary_of_results.json
--html summary_of_results.html
1) Trims poly-G and poly-X tails
--trim_poly_g --trim_poly_x
2) Automatically detects adapter sequences and trims them
--detect_adapter_for_pe
3) Sets the quality threshold: base calls below Phred 15 count as poor quality
--qualified_quality_phred 15
4) Removes reads in which more than 40% of the base calls are poor quality
--unqualified_percent_limit 40
5) Removes reads that are shorter than 50 bases
--length_required 50
Here's an example of the script with the Col-0_Leaf_Rep1 sample:
#!/bin/tcsh
#BSUB -J fastp_At-hardcode-example #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastp_At-hardcode-example_%J.out #output file
#BSUB -e fastp_At-hardcode-example_%J.err #error file
# Each line of the fastp command ends with a backslash so the shell reads it as one command
/usr/local/usrapps/bitcpt/fastp/bin/fastp \
-i /share/bitcpt/S23/RawData/Arabidopsis_thaliana/Col-0_Leaf_Rep1_1.fq.gz \
-I /share/bitcpt/S23/RawData/Arabidopsis_thaliana/Col-0_Leaf_Rep1_2.fq.gz \
-o /share/bitcpt/S23/cleandata/Arabidopsis_thaliana/Col-0_Leaf_Rep1_1.fp.fq.gz \
-O /share/bitcpt/S23/cleandata/Arabidopsis_thaliana/Col-0_Leaf_Rep1_2.fp.fq.gz \
--json /share/bitcpt/S23/cleandata/Arabidopsis_thaliana/Col-0_Leaf_Rep1.json \
--html /share/bitcpt/S23/cleandata/Arabidopsis_thaliana/Col-0_Leaf_Rep1.html \
--length_required 50 \
--detect_adapter_for_pe \
--trim_poly_g --trim_poly_x \
--qualified_quality_phred 15 \
--unqualified_percent_limit 40
Input file directory: /share/bitcpt/S23/RawData/Arabidopsis_thaliana/
Input RNA-seq file 1: Col-0_Leaf_Rep1_1.fq.gz
Input RNA-seq file 2: Col-0_Leaf_Rep1_2.fq.gz
These are existing names that you look up and insert into the code.
Output file directory: /share/bitcpt/S23/cleandata/Arabidopsis_thaliana
Output RNA-seq file 1: Col-0_Leaf_Rep1_1.fp.fq.gz
Output RNA-seq file 2: Col-0_Leaf_Rep1_2.fp.fq.gz
You come up with these names. They should match the input names for consistency. I added 'fp' to denote that these files are fastp-trimmed.
Output json summary file: Col-0_Leaf_Rep1.json
Output html summary file: Col-0_Leaf_Rep1.html
You also come up with these names. They should match the input names for consistency.
This example is 'hard coded': all of the inputs and outputs are written out in full. Hard coding requires a lot of typing, which increases the chance of a typo. In the next section we'll talk about setting variables to save time and minimize typos.
Instead of hard coding the inputs and outputs, we can store them in variables. Then we only have to change the variables once per fastp command. The fastp command itself doesn't change each time we run it.
Variables are set with the set command.
set variable_name=value
For example: set sample=Col-0_Leaf_Rep1
Then call the sample variable with ${sample}
#!/bin/tcsh
#BSUB -J fastp_At-example #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastp_At-Leaf_%J.out #output file
#BSUB -e fastp_At-Leaf_%J.err #error file
# Set directory variables. These stay the same in this script so we only need to set them once
set IN=/share/bitcpt/S23/RawData/Arabidopsis_thaliana
set OUT=/share/bitcpt/S23/cleandata/Arabidopsis_thaliana
# Set sample. This changes each time we run fastp.
set sample=Col-0_Leaf_Rep1
# The fastp command. This stays the same.
# Each line ends with a backslash so the shell reads it as one command.
/usr/local/usrapps/bitcpt/fastp/bin/fastp \
-i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz \
-o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz \
--json ${OUT}/${sample}.json --html ${OUT}/${sample}.html \
--length_required 50 \
--detect_adapter_for_pe \
--trim_poly_g --trim_poly_x \
--qualified_quality_phred 15 \
--unqualified_percent_limit 40
#!/bin/tcsh
#BSUB -J fastp_At #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastp_At_%J.out #output file
#BSUB -e fastp_At_%J.err #error file
# Set the path to the fastp executable
set fastp=/usr/local/usrapps/bitcpt/fastp/bin/fastp
# Set our input directory and our output directory
set IN=/share/bitcpt/S23/RawData/Arabidopsis_thaliana
set OUT=/share/bitcpt/S23/cleandata/Arabidopsis_thaliana
##################################
# Leaf Rep 1
##################################
# Need to update the sample name each time!
# File structure: Col-0_Leaf_Rep1_1.fq.gz
set sample=Col-0_Leaf_Rep1
# So we don't need to edit the command
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Leaf Rep 2
##################################
set sample=Col-0_Leaf_Rep2
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 1
##################################
set sample=Col-0_SAM_rep1_L002
${fastp} -i ${IN}/${sample}_R1.fastq.gz -I ${IN}/${sample}_R2.fastq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 2
##################################
set sample=Col-0_SAM_rep2_L002
${fastp} -i ${IN}/${sample}_R1.fastq.gz -I ${IN}/${sample}_R2.fastq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 3
##################################
set sample=Col-0_SAM_rep3_L002
${fastp} -i ${IN}/${sample}_R1.fastq.gz -I ${IN}/${sample}_R2.fastq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:57 Gm_OldLeaf_Rep1_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:57 Gm_OldLeaf_Rep2_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:57 Gm_OldLeaf_Rep3_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:57 Gm_OldLeaf_Rep4_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:57 Gm_OldLeaf_Rep5_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:57 Gm_SA_Rep1_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:58 Gm_SA_Rep2_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:58 Gm_SA_Rep3_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:58 Gm_SA_Rep4_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.8G Mar 20 15:58 Gm_SA_Rep5_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:58 Gm_YoungLeaf_Rep1_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:58 Gm_YoungLeaf_Rep2_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:58 Gm_YoungLeaf_Rep3_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 2.0G Mar 20 15:58 Gm_YoungLeaf_Rep4_1.fq.gz
-rw-r-----. 1 casjogre bitcpt 1.9G Mar 20 15:58 Gm_YoungLeaf_Rep5_1.fq.gz
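The sample name used in a script is the file name with the _1.fq.gz suffix removed. A minimal sketch of deriving those names with sed, where printf stands in for the directory listing and only three of the file names above are used for brevity. It relies only on standard tools (printf, sed), so it should behave the same in tcsh and bash:

```shell
# Strip the forward-read suffix "_1.fq.gz" from each file name to get the sample name.
printf '%s\n' Gm_OldLeaf_Rep1_1.fq.gz Gm_SA_Rep1_1.fq.gz Gm_YoungLeaf_Rep1_1.fq.gz \
  | sed 's/_1\.fq\.gz$//'
```

Each printed line is a value that can be used in a set sample=... line.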
#!/bin/tcsh
#BSUB -J fastp_Gm #job name
#BSUB -n 20 #number of cores
#BSUB -W 2:0 #time for job to complete
#BSUB -o fastp_Gm_%J.out #output file
#BSUB -e fastp_Gm_%J.err #error file
# Set the path to the fastp executable
set fastp=/usr/local/usrapps/bitcpt/fastp/bin/fastp
# Set our input directory and our output directory
set IN=/share/bitcpt/S23/RawData/Glycine_max
set OUT=/share/bitcpt/S23/cleandata/Glycine_max
##################################
# Old Leaf Rep 1
##################################
# Need to update the sample name each time!
# File structure: Gm_OldLeaf_Rep1_1.fq.gz
set sample=Gm_OldLeaf_Rep1
# So we don't need to edit the command
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Old Leaf Rep 2
##################################
set sample=Gm_OldLeaf_Rep2
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Old Leaf Rep 3
##################################
set sample=Gm_OldLeaf_Rep3
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Old Leaf Rep 4
##################################
set sample=Gm_OldLeaf_Rep4
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Old Leaf Rep 5
##################################
set sample=Gm_OldLeaf_Rep5
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Young Leaf Rep 1
##################################
set sample=Gm_YoungLeaf_Rep1
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Young Leaf Rep 2
##################################
set sample=Gm_YoungLeaf_Rep2
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Young Leaf Rep 3
##################################
set sample=Gm_YoungLeaf_Rep3
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Young Leaf Rep 4
##################################
set sample=Gm_YoungLeaf_Rep4
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# Young Leaf Rep 5
##################################
set sample=Gm_YoungLeaf_Rep5
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 1
##################################
set sample=Gm_SA_Rep1
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 2
##################################
set sample=Gm_SA_Rep2
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 3
##################################
set sample=Gm_SA_Rep3
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 4
##################################
set sample=Gm_SA_Rep4
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
##################################
# SAM Rep 5
##################################
set sample=Gm_SA_Rep5
${fastp} -i ${IN}/${sample}_1.fq.gz -I ${IN}/${sample}_2.fq.gz -o ${OUT}/${sample}_1.fp.fq.gz -O ${OUT}/${sample}_2.fp.fq.gz --json ${OUT}/${sample}.json --html ${OUT}/${sample}.html --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x --qualified_quality_phred 15 --unqualified_percent_limit 40
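Every fastp command above differs only in the sample name, so the commands can also be generated from a list of sample names. A minimal sketch of a dry run that prints one command per sample without running anything. It assumes GNU xargs (as found on Linux), uses SAMPLE as an arbitrary placeholder string, and lists only three of the soybean samples for brevity:

```shell
# Dry run: echo prints each fastp command instead of executing it.
# SAMPLE is a placeholder that xargs replaces with each input line.
printf '%s\n' Gm_OldLeaf_Rep1 Gm_YoungLeaf_Rep1 Gm_SA_Rep1 \
  | xargs -I SAMPLE echo /usr/local/usrapps/bitcpt/fastp/bin/fastp \
      -i /share/bitcpt/S23/RawData/Glycine_max/SAMPLE_1.fq.gz \
      -I /share/bitcpt/S23/RawData/Glycine_max/SAMPLE_2.fq.gz \
      -o /share/bitcpt/S23/cleandata/Glycine_max/SAMPLE_1.fp.fq.gz \
      -O /share/bitcpt/S23/cleandata/Glycine_max/SAMPLE_2.fp.fq.gz \
      --json /share/bitcpt/S23/cleandata/Glycine_max/SAMPLE.json \
      --html /share/bitcpt/S23/cleandata/Glycine_max/SAMPLE.html \
      --length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_poly_x \
      --qualified_quality_phred 15 --unqualified_percent_limit 40
```

Reading the printed commands is a quick way to catch a wrong sample name or path before submitting the job. Removing echo would run the commands one after another. Inside a job script, tcsh's built-in foreach loop is another way to repeat the command.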