Post date: Sep 22, 2017 4:21:12 PM
Given the poor performance of bwa aln/samse, I re-aligned the data with bwa mem. This gave alignment rates closer to 50% (totally fine). Here are the commands:
perl ../scripts/wrap_qsub_slurm_bwa.pl PSSP_*fastq
Here is the most relevant portion of the perl script, which includes the options for bwa mem:
## build array of jobs to be run individually (serially) by slurm
my $dir = '/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/parsed/';
my @jobarray;
my $aln;
#my $samse;
my $ind;
my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta";
my $cd = "cd $dir\n";
my @job;
my $job;
my $cnt = 0;
foreach my $file (@ARGV){
    if ($file =~ m/(PSSP_[A-Za-z0-9_]+)\.fastq/){
        $ind = $1;
    }
    else {
        die "Failed to match $file\n";
    }
    $aln = "bwa mem -t 12 -k 15 -r 1.3 -T 30 -R \'\@RG\\tID:$ind\\tPL:ILLUMINA\\tLB:$ind\\tSM:$ind"."\' $genome $file > aln"."$ind".".sam \n";
    $cnt++;
    push (@job, "$aln");
    if($cnt==20){ ## every 20 samples gets a node of its own
        $job = join("",@job);
        push (@jobarray, "$cd"."$job");
        print "$cd"."$job";
        $cnt = 0;
        @job = ();
    }
}
## last (partial) job, if any samples remain after the final batch of 20
if (@job){
    $job = join("",@job);
    push (@jobarray, "$cd"."$job");
    print "$cd"."$job";
}
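To make the wrapper's output concrete, here is a sketch of the per-sample bwa mem line it emits, built the same way the perl script builds $aln. The sample ID PSSP_001 is hypothetical; paths are the ones used above.

```shell
# Build one job line the way the perl wrapper does. PSSP_001 is a
# hypothetical sample ID parsed from a file name like PSSP_001.fastq.
dir=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/parsed/
genome=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta
ind=PSSP_001
# The \t sequences stay literal; bwa expands them inside the -R read-group string.
cmd="bwa mem -t 12 -k 15 -r 1.3 -T 30 -R '@RG\tID:${ind}\tPL:ILLUMINA\tLB:${ind}\tSM:${ind}' $genome ${ind}.fastq > aln${ind}.sam"
echo "cd $dir"
echo "$cmd"
```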
This generates the SAM alignment files (one per sample), which should then be moved to the alignments sub-directory.
We then compressed, sorted, and indexed the SAM alignment files with samtools (ver).
This was done by submitting a series of batch submissions (jobs) with a perl wrapper script from the alignment sub-directory (where the sam files should be at this point):
perl ../scripts/wrap_qsub_slurm_sam2bam-par.pl alnPSSP_*sam
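The per-sample samtools steps the wrapper runs presumably look something like the sketch below (samtools 1.x syntax; the helper function name, the .sorted.bam naming, and the exact options are assumptions, as the wrapper's internals are not shown here).

```shell
# Sketch: compress, coordinate-sort, and index one sample's SAM file.
# sam2bam is a hypothetical helper; alnPSSP_001 is a hypothetical prefix.
sam2bam() {
  s=$1                                        # sample prefix, e.g. alnPSSP_001
  samtools view -b -o "$s.bam" "$s.sam"       # compress SAM -> BAM
  samtools sort -o "$s.sorted.bam" "$s.bam"   # coordinate-sort the BAM
  samtools index "$s.sorted.bam"              # writes $s.sorted.bam.bai
}
# sam2bam alnPSSP_001
```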
This generates *.bam and *sorted.bam files. Only the latter are needed, so after this finishes the *.sam and unsorted *.bam files can be deleted (carefully: a bare rm *bam would delete the sorted.bam files too).
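One safe way to do that cleanup is to let find exclude the sorted files explicitly; a sketch, demonstrated in a throwaway directory (file names are hypothetical):

```shell
# Delete intermediate .sam and unsorted .bam files while keeping *sorted.bam.
# Demo uses a scratch directory so nothing real is touched.
mkdir -p /tmp/aln_demo && cd /tmp/aln_demo
touch alnPSSP_001.sam alnPSSP_001.bam alnPSSP_001.sorted.bam
find . -maxdepth 1 -name '*.sam' -delete
find . -maxdepth 1 -name '*.bam' ! -name '*sorted.bam' -delete
ls    # only alnPSSP_001.sorted.bam remains
```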
GATK's HaplotypeCaller was then used for variant calling. This was chosen for compatibility with POSE (a polyploid), as bcftools call only handles diploids.
1. Create a dict file for the reference genome (required for GATK)
## using interactive job
module load picard
java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta O=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.dict
2. Index the reference sequence
## using interactive job
module load samtools
samtools faidx /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta
3. Batch submission for variant calling (SEE 3b)
sbatch slurmRun_gatk.sh
Here is the slurmRun_gatk.sh script:
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/alignments
java -Xmx920g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta -I bams.list -o PSSP_var.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 20 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE
early()
{
echo ' '
echo ' ############ WARNING: EARLY TERMINATION ############# '
echo ' '
}
This will generate the variant file PSSP_var.vcf
3b. NOTE: we ran into problems with GATK taking too long and failing. We will need to revisit this for POSE, but for PSSP I am now moving forward with samtools/bcftools instead. I ran the batch submission like this:
sbatch callvar.sh
The script looks like this:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=samtools
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/alignments
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta -q 20 -Q 30 -I -u -g -t DP,AD,ADF,ADR -o PSSP_variants.bcf
bcftools call -v -m -p 0.01 -P 0.001 -O v -o PSSP_variants.vcf PSSP_variants.bcf
This generated ~2 million variants, which are in PSSP_variants.vcf in the variants sub-directory.
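A quick way to get that variant count is to exclude the '#'-prefixed header lines and count what remains; a sketch using a tiny made-up VCF:

```shell
# Count variant records in a VCF: every non-header line is one record.
# /tmp/demo.vcf is a minimal hypothetical example with 2 variants.
cat > /tmp/demo.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
contig1	10	.	A	G	99	.	DP=100
contig1	42	.	C	T	80	.	DP=90
EOF
grep -vc '^#' /tmp/demo.vcf    # prints 2
```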
4. Variant filtering. First with a local version of vcfFilter.pl
#### stringency variables, edit as desired
# N = 617 inds
my $minCoverage = 1234; # minimum number of sequences; DP, 1234 = 2X coverage
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 123; # maximum number of individuals with no data = 20% of N
my $minMaf = 6; # minimum minor allele count, ~0.005 as a frequency
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
##### this set is for whole genome shotgun data
perl vcfFilter.pl PSSP_variants.vcf
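For reference, the thresholds above follow directly from N = 617 individuals; a sketch of the arithmetic (integer shell math, so the MAF count is the floor of 0.005 x 2N):

```shell
# How the vcfFilter.pl stringency values derive from N = 617 individuals.
N=617
minCoverage=$((2 * N))            # 2X mean depth across individuals -> 1234
miss=$((N * 20 / 100))            # 20% missing data allowed -> 123
minMaf=$((N * 2 * 5 / 1000))      # 0.005 of 2N allele copies -> 6
echo $minCoverage $miss $minMaf
```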