The folowing is from a bash shell script which uses multiple platforms to map next-generation sequencing read data to a reference genome. The key feature is the soft-coding of input files meaning that any file in the working directory matching the search terms will be read into the function. Designed for use with Drosophila melanogaster but suitable for a wide range of species with known reference genomes. Also adaptable to RNAseq data. I've omitted lines of code for loading modules as these are mostly idiosyncratic, but generally come in the form module load bio/1.5 module load stampy1.0 etc. Also omitted are commands to remove old files. The full script is available here https://zenodo.org/record/809328. The publication is here https://f1000research.com/articles/5-2644/v3.
1. Cleans NGS reads (fastq), maps to reference genome to form bam file.
2. Removes duplicate reads and poor quality mapping.
3. Generates quality statistics and graphs.
4. Assumes reference genome has been indexed.
Script is activated from the Linux SGE server with the qsub ngs_read_mapper.sh function. To run locally, load module commands should be omitted. Activate script with caffeinate -i sh ngs_read_mapper.sh > log_ngs_read_mapper.txt
The settings below specify the language, job name, number of compute nodes to use, which working directory, and server queue to use.
#!/bin/sh#$ -N map_ngs#$ -pe openmp 10#$ -S /bin/sh#$ -cwd#$ -j y#$ -q bioinf.q. /etc/profile.d/modules.shRecord the date and time, state information about the operating system, log everything, quit with any error, list all files in the current directory, load java
echo "`date +%y/%m/%d_%H:%M:%S`" uname -sa set -ex ls -la module load jre/1.7.0_25Input file paths for directing all output files, notably temporary ones, also the indexed and formatted reference genome assembly file, and the adapter sequences used when multiplexing the sequencing reaction, which are to be removed from the read data.
wpath=/lustre/scratch/bioenv/wg39/lhm_sequencing/refgen=local_reference/dm6.faadaptors=adaptor_sequences/adaptors_nextera_130214.fastaRemove adaptor sequences from NGS reads
for i in *.fastq ; do fastq-mcf ${adaptors} ${i} -o ${i}.cleaned ; doneMap reads to reference genome, using BWA mem.
for i in *.R1.fastq.cleaned ; do base=${i%.R1.fastq.cleaned} bwa mem -t 5 -M -L0 -P -R "@RG\tID:${base}-id\tSM:${base}-sm\tLB:${base}-lib\tPL:ILLUMINA" \ ${refgen} "${base}.R1.fastq" "${base}.R2.fastq" > "${base}.sam" ; doneCompress sam files to bam files.
for i in *.sam ; do samtools view -bS -q 20 ${i}_first.sam > ${i%.sam}_first.bam ; doneSecondary read-mapping, using Stampy.
for i in *_first.bam ; do stampy.py \ -g ${wpath}${refgen} \ -h ${wpath}${refgen} \ --bamkeepgoodreads -t5 \ -M ${wpath}${i} \ --output=${wpath}${i%_first.bam}_goodreads.bam ; doneRemove bad reads
for i in *_goodreads.bam ; do CleanSam \ INPUT=${wpath}${i} \ OUTPUT=${wpath}${i%_goodreads.bam}_clean.bam \ TMP_DIR=${wpath} ; doneAdd read-group info
for i in *_clean.bam ; do AddOrReplaceReadGroups \ INPUT=${wpath}${i} \ OUTPUT=${wpath}${i%_clean.bam}_srtd.bam \ SORT_ORDER=coordinate RGID=${i%_clean.bam}-id RGLB=RGl-lib RGPL=ILLUMINA RGPU=${i%_clean.bam}-01 RGSM=${i%_clean.bam} \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=${wpath} ; doneRemove duplicate reads.
for i in *_srtd.bam ; do MarkDuplicates \ INPUT=${wpath}${i} \ OUTPUT=${wpath}${i%_srtd.bam}_noDuplicates.bam \ METRICS_FILE=${wpath}${i}_noDuplicates.metrics.txt \ REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=${wpath} ; doneIndex the bam file.
for i in *_noDuplicates.bam ; do BuildBamIndex \ INPUT=${wpath}${i} \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=${wpath} ; doneMark realignment intervals.
module load gatk/3.4-0for i in *_noDuplicates.bam ; do GenomeAnalysisTK -R ${refgen} \ -T RealignerTargetCreator \ -I ${i} \ --filter_mismatching_base_and_quals \ -o ${i}_realignment.intervals ; doneRe-map around realignment intervals
for i in *_noDuplicates.bam ; do GenomeAnalysisTK -R ${refgen} \ -T IndelRealigner \ -I ${i} \ -targetIntervals ${i}_realignment.intervals \ --filter_mismatching_base_and_quals \ -o ${i%_noDuplicates.bam}.bam ; done!! Note this uses all files in the directory ending in .bam.
Calculate base mismatches per PCR cycle.
for i in *.bam ; do GenomeAnalysisTK -R ${refgen} \ -T ErrorRatePerCycle \ -I ${i} \ -o ${i%.bam}_errsPerCycle_gatk.txt ; doneCalculate Read length distributions.
for i in *.bam ; do GenomeAnalysisTK -R ${refgen} \ -T ReadLengthDistribution \ -I ${i} \ -o ${i%.bam}_readLengths_gatk.tbl ; doneCalculate base coverage distributions.
for i in *.bam ; do GenomeAnalysisTK -R ${refgen} \ -T BaseCoverageDistribution \ -I ${i} \ -o ${i%.bam}_baseCoverage_gatk.txt ; doneCalculate callable loci.
for i in *.bam ; do GenomeAnalysisTK -R ${refgen} \ -T CallableLoci \ -I ${i} \ -o ${i%.bam}_CallableLoci_gatk.txt \ -summary ${i%.bam}_CallableLocSumm_gatk.txt ; doneCalculate bam file quality metrics using Picard
module load picard-tools/1.77 R/3.0.2for i in *.bam ; do CollectMultipleMetrics \ INPUT=${wpath}${i} \ OUTPUT=${wpath}${i}_picard.txt \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=${wpath} ; doneCalculate Base quality distribution using Picard.
for i in *.bam ; do QualityScoreDistribution \ INPUT=${wpath}${i} \ OUTPUT=${wpath}${i}_qualScoreDis \ CHART_OUTPUT=${wpath}${i%.bam}_qualScore.chart \ ALIGNED_READS_ONLY=true \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=${wpath} ; doneUnload modules, record date and time.
module unload picard-tools/1.77module unload R/3.0.2module unload jre/1.7.0_25echo "`date +%y/%m/%d_%H:%M:%S` Run completed "echo "William P. Gilks, University of Sussex 2017 "exitEA utils for read-cleaning https://expressionanalysis.github.io/ea-utils/
BWA for mapping reads to reference http://bio-bwa.sourceforge.net/
Picard, for formatting tools https://broadinstitute.github.io/picard/
Stampy, for fine-scale re-mapping http://www.well.ox.ac.uk/project-stampy
GATK for formatting and QC https://software.broadinstitute.org/gatk/
Publication here https://f1000research.com/articles/5-2644/v3.
25th March 2018