The folowing is from a bash shell script which uses multiple platforms to map next-generation sequencing read data to a reference genome. The key feature is the soft-coding of input files meaning that any file in the working directory matching the search terms will be read into the function. Designed for use with Drosophila melanogaster but suitable for a wide range of species with known reference genomes. Also adaptable to RNAseq data. I've omitted lines of code for loading modules as these are mostly idiosyncratic, but generally come in the form module load bio/1.5 module load stampy1.0 etc. Also omitted are commands to remove old files. The full script is available here https://zenodo.org/record/809328. The publication is here https://f1000research.com/articles/5-2644/v3.
1. Cleans NGS reads (fastq), maps to reference genome to form bam file.
2. Removes duplicate reads and poor quality mapping.
3. Generates quality statistics and graphs.
4. Assumes reference genome has been indexed.
Script is activated from the Linux SGE server with the qsub ngs_read_mapper.sh function. To run locally, load module commands should be omitted. Activate script with caffeinate -i sh ngs_read_mapper.sh > log_ngs_read_mapper.txt
The settings below specify the language, job name, number of compute nodes to use, which working directory, and server queue to use.
#!/bin/sh
#$ -N map_ngs
#$ -pe openmp 10
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -q bioinf.q
. /etc/profile.d/modules.sh
Record the date and time, state information about the operating system, log everything, quit with any error, list all files in the current directory, load java
echo "`date +%y/%m/%d_%H:%M:%S`"
uname -sa
set -ex
ls -la
module load jre/1.7.0_25
Input file paths for directing all output files, notably temporary ones, also the indexed and formatted reference genome assembly file, and the adapter sequences used when multiplexing the sequencing reaction, which are to be removed from the read data.
wpath=/lustre/scratch/bioenv/wg39/lhm_sequencing/
refgen=local_reference/dm6.fa
adaptors=adaptor_sequences/adaptors_nextera_130214.fasta
Remove adaptor sequences from NGS reads
for i in *.fastq ; do fastq-mcf ${adaptors} ${i} -o ${i}.cleaned ; done
Map reads to reference genome, using BWA mem.
for i in *.R1.fastq.cleaned ; do
base=${i%.R1.fastq.cleaned}
bwa mem -t 5 -M -L0 -P -R "@RG\tID:${base}-id\tSM:${base}-sm\tLB:${base}-lib\tPL:ILLUMINA" \
${refgen} "${base}.R1.fastq" "${base}.R2.fastq" > "${base}.sam" ; done
Compress sam files to bam files.
for i in *.sam ; do
samtools view -bS -q 20 ${i}_first.sam > ${i%.sam}_first.bam ; done
Secondary read-mapping, using Stampy.
for i in *_first.bam ; do
stampy.py \
-g ${wpath}${refgen} \
-h ${wpath}${refgen} \
--bamkeepgoodreads -t5 \
-M ${wpath}${i} \
--output=${wpath}${i%_first.bam}_goodreads.bam ; done
Remove bad reads
for i in *_goodreads.bam ; do
CleanSam \
INPUT=${wpath}${i} \
OUTPUT=${wpath}${i%_goodreads.bam}_clean.bam \
TMP_DIR=${wpath} ; done
Add read-group info
for i in *_clean.bam ; do
AddOrReplaceReadGroups \
INPUT=${wpath}${i} \
OUTPUT=${wpath}${i%_clean.bam}_srtd.bam \
SORT_ORDER=coordinate RGID=${i%_clean.bam}-id RGLB=RGl-lib RGPL=ILLUMINA RGPU=${i%_clean.bam}-01 RGSM=${i%_clean.bam} \
VALIDATION_STRINGENCY=LENIENT \
TMP_DIR=${wpath} ; done
Remove duplicate reads.
for i in *_srtd.bam ; do
MarkDuplicates \
INPUT=${wpath}${i} \
OUTPUT=${wpath}${i%_srtd.bam}_noDuplicates.bam \
METRICS_FILE=${wpath}${i}_noDuplicates.metrics.txt \
REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT \
TMP_DIR=${wpath} ; done
Index the bam file.
for i in *_noDuplicates.bam ; do
BuildBamIndex \
INPUT=${wpath}${i} \
VALIDATION_STRINGENCY=LENIENT \
TMP_DIR=${wpath} ; done
Mark realignment intervals.
module load gatk/3.4-0
for i in *_noDuplicates.bam ; do
GenomeAnalysisTK -R ${refgen} \
-T RealignerTargetCreator \
-I ${i} \
--filter_mismatching_base_and_quals \
-o ${i}_realignment.intervals ; done
Re-map around realignment intervals
for i in *_noDuplicates.bam ; do
GenomeAnalysisTK -R ${refgen} \
-T IndelRealigner \
-I ${i} \
-targetIntervals ${i}_realignment.intervals \
--filter_mismatching_base_and_quals \
-o ${i%_noDuplicates.bam}.bam ; done
!! Note this uses all files in the directory ending in .bam.
Calculate base mismatches per PCR cycle.
for i in *.bam ; do
GenomeAnalysisTK -R ${refgen} \
-T ErrorRatePerCycle \
-I ${i} \
-o ${i%.bam}_errsPerCycle_gatk.txt ; done
Calculate Read length distributions.
for i in *.bam ; do
GenomeAnalysisTK -R ${refgen} \
-T ReadLengthDistribution \
-I ${i} \
-o ${i%.bam}_readLengths_gatk.tbl ; done
Calculate base coverage distributions.
for i in *.bam ; do
GenomeAnalysisTK -R ${refgen} \
-T BaseCoverageDistribution \
-I ${i} \
-o ${i%.bam}_baseCoverage_gatk.txt ; done
Calculate callable loci.
for i in *.bam ; do
GenomeAnalysisTK -R ${refgen} \
-T CallableLoci \
-I ${i} \
-o ${i%.bam}_CallableLoci_gatk.txt \
-summary ${i%.bam}_CallableLocSumm_gatk.txt ; done
Calculate bam file quality metrics using Picard
module load picard-tools/1.77 R/3.0.2
for i in *.bam ; do
CollectMultipleMetrics \
INPUT=${wpath}${i} \
OUTPUT=${wpath}${i}_picard.txt \
VALIDATION_STRINGENCY=LENIENT \
TMP_DIR=${wpath} ; done
Calculate Base quality distribution using Picard.
for i in *.bam ; do
QualityScoreDistribution \
INPUT=${wpath}${i} \
OUTPUT=${wpath}${i}_qualScoreDis \
CHART_OUTPUT=${wpath}${i%.bam}_qualScore.chart \
ALIGNED_READS_ONLY=true \
VALIDATION_STRINGENCY=LENIENT \
TMP_DIR=${wpath} ; done
Unload modules, record date and time.
module unload picard-tools/1.77
module unload R/3.0.2
module unload jre/1.7.0_25
echo "`date +%y/%m/%d_%H:%M:%S` Run completed "
echo "William P. Gilks, University of Sussex 2017 "
exit
EA utils for read-cleaning https://expressionanalysis.github.io/ea-utils/
BWA for mapping reads to reference http://bio-bwa.sourceforge.net/
Picard, for formatting tools https://broadinstitute.github.io/picard/
Stampy, for fine-scale re-mapping http://www.well.ox.ac.uk/project-stampy
GATK for formatting and QC https://software.broadinstitute.org/gatk/
Publication here https://f1000research.com/articles/5-2644/v3.
25th March 2018