William Gilks - Processing NGS data

Processing next-generation sequencing data on a Linux cluster

The folowing is from a bash shell script which uses multiple platforms to map next-generation sequencing read data to a reference genome. The key feature is the soft-coding of input files meaning that any file in the working directory matching the search terms will be read into the function. Designed for use with Drosophila melanogaster but suitable for a wide range of species with known reference genomes. Also adaptable to RNAseq data. I've omitted lines of code for loading modules as these are mostly idiosyncratic, but generally come in the form module load bio/1.5 module load stampy1.0 etc. Also omitted are commands to remove old files. The full script is available here https://zenodo.org/record/809328. The publication is here https://f1000research.com/articles/5-2644/v3.

Description

1. Cleans NGS reads (fastq), maps to reference genome to form bam file.

2. Removes duplicate reads and poor quality mapping.

3. Generates quality statistics and graphs.

4. Assumes reference genome has been indexed.

Script is activated from the Linux SGE server with the qsub ngs_read_mapper.sh function. To run locally, load module commands should be omitted. Activate script with caffeinate -i sh ngs_read_mapper.sh > log_ngs_read_mapper.txt

The settings below specify the language, job name, number of compute nodes to use, which working directory, and server queue to use.

#!/bin/sh

#$ -N map_ngs

#$ -pe openmp 10

#$ -S /bin/sh

#$ -cwd

#$ -j y

#$ -q bioinf.q

. /etc/profile.d/modules.sh

Record the date and time, state information about the operating system, log everything, quit with any error, list all files in the current directory, load java

echo "`date +%y/%m/%d_%H:%M:%S`"

uname -sa

set -ex

ls -la

module load jre/1.7.0_25

Set variables

Input file paths for directing all output files, notably temporary ones, also the indexed and formatted reference genome assembly file, and the adapter sequences used when multiplexing the sequencing reaction, which are to be removed from the read data.

wpath=/lustre/scratch/bioenv/wg39/lhm_sequencing/

refgen=local_reference/dm6.fa

adaptors=adaptor_sequences/adaptors_nextera_130214.fasta

Remove adaptor sequences from NGS reads

for i in *.fastq ; do fastq-mcf ${adaptors} ${i} -o ${i}.cleaned ; done

Read maping and formatting

Map reads to reference genome, using BWA mem.

for i in *.R1.fastq.cleaned ; do

    base=${i%.R1.fastq.cleaned}

        bwa mem -t 5 -M -L0 -P -R "@RG\tID:${base}-id\tSM:${base}-sm\tLB:${base}-lib\tPL:ILLUMINA" \

        ${refgen} "${base}.R1.fastq" "${base}.R2.fastq" > "${base}.sam" ; done

Compress sam files to bam files.

for i in *.sam ; do

        samtools view -bS -q 20 ${i}_first.sam > ${i%.sam}_first.bam ; done

Secondary read-mapping, using Stampy.

for i in *_first.bam ; do

    stampy.py \

        -g ${wpath}${refgen} \

        -h ${wpath}${refgen} \

        --bamkeepgoodreads -t5 \

        -M ${wpath}${i} \

        --output=${wpath}${i%_first.bam}_goodreads.bam ; done

Remove bad reads

for i in *_goodreads.bam ; do

    CleanSam \

        INPUT=${wpath}${i} \

        OUTPUT=${wpath}${i%_goodreads.bam}_clean.bam \

        TMP_DIR=${wpath} ; done

Add read-group info

for i in *_clean.bam ; do

    AddOrReplaceReadGroups \

        INPUT=${wpath}${i} \

        OUTPUT=${wpath}${i%_clean.bam}_srtd.bam \

        SORT_ORDER=coordinate RGID=${i%_clean.bam}-id RGLB=RGl-lib RGPL=ILLUMINA RGPU=${i%_clean.bam}-01 RGSM=${i%_clean.bam} \

        VALIDATION_STRINGENCY=LENIENT \

        TMP_DIR=${wpath} ; done

Remove duplicate reads.

for i in *_srtd.bam ; do

    MarkDuplicates \

        INPUT=${wpath}${i} \

        OUTPUT=${wpath}${i%_srtd.bam}_noDuplicates.bam \

        METRICS_FILE=${wpath}${i}_noDuplicates.metrics.txt \

        REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT \

        TMP_DIR=${wpath} ; done

Index the bam file.

for i in *_noDuplicates.bam ; do

    BuildBamIndex \

        INPUT=${wpath}${i} \

        VALIDATION_STRINGENCY=LENIENT \

        TMP_DIR=${wpath} ; done

Final re-mapping using GATK

Mark realignment intervals.

module load gatk/3.4-0

for i in *_noDuplicates.bam ; do

    GenomeAnalysisTK -R ${refgen} \

        -T RealignerTargetCreator \

        -I ${i} \

        --filter_mismatching_base_and_quals \

        -o ${i}_realignment.intervals ; done

Re-map around realignment intervals

for i in *_noDuplicates.bam ; do

    GenomeAnalysisTK -R ${refgen} \

        -T IndelRealigner \

        -I ${i} \

        -targetIntervals ${i}_realignment.intervals \

        --filter_mismatching_base_and_quals \

        -o ${i%_noDuplicates.bam}.bam ; done

Generate bam file quality metrics

!! Note this uses all files in the directory ending in .bam.

Calculate base mismatches per PCR cycle.

for i in *.bam ; do

    GenomeAnalysisTK -R ${refgen} \

        -T ErrorRatePerCycle \

        -I ${i} \

        -o ${i%.bam}_errsPerCycle_gatk.txt ; done

Calculate Read length distributions.

for i in *.bam ; do

    GenomeAnalysisTK -R ${refgen} \

        -T ReadLengthDistribution \

        -I ${i} \

        -o ${i%.bam}_readLengths_gatk.tbl ; done

Calculate base coverage distributions.

for i in *.bam ; do

        GenomeAnalysisTK -R ${refgen} \

        -T BaseCoverageDistribution \

        -I ${i} \

        -o ${i%.bam}_baseCoverage_gatk.txt ; done

Calculate callable loci.

for i in *.bam ; do

    GenomeAnalysisTK -R ${refgen} \

        -T CallableLoci \

        -I ${i} \

        -o ${i%.bam}_CallableLoci_gatk.txt \

        -summary ${i%.bam}_CallableLocSumm_gatk.txt ; done

Calculate bam file quality metrics using Picard

module load picard-tools/1.77 R/3.0.2

for i in *.bam ; do

        CollectMultipleMetrics \

            INPUT=${wpath}${i} \

            OUTPUT=${wpath}${i}_picard.txt \

            VALIDATION_STRINGENCY=LENIENT \

            TMP_DIR=${wpath} ; done

Calculate Base quality distribution using Picard.

for i in *.bam ; do

        QualityScoreDistribution \

            INPUT=${wpath}${i} \

            OUTPUT=${wpath}${i}_qualScoreDis \

            CHART_OUTPUT=${wpath}${i%.bam}_qualScore.chart \

            ALIGNED_READS_ONLY=true \

            VALIDATION_STRINGENCY=LENIENT \

            TMP_DIR=${wpath} ; done

Unload modules, record date and time.

module unload picard-tools/1.77

module unload R/3.0.2

module unload jre/1.7.0_25

echo "`date +%y/%m/%d_%H:%M:%S` Run completed "

echo "William P. Gilks, University of Sussex 2017 "

exit

Links to programes used

EA utils for read-cleaning https://expressionanalysis.github.io/ea-utils/

BWA for mapping reads to reference http://bio-bwa.sourceforge.net/

Picard, for formatting tools https://broadinstitute.github.io/picard/

Stampy, for fine-scale re-mapping http://www.well.ox.ac.uk/project-stampy

GATK for formatting and QC https://software.broadinstitute.org/gatk/

Publication here https://f1000research.com/articles/5-2644/v3.

25th March 2018

Google Sites

Report abuse