Post date: Sep 22, 2017 4:21:12 PM
Given the poor performance of bwa aln/samse, I re-aligned the data with bwa mem. This gave alignment rates closer to 50% (totally fine). Here are the commands:
perl ../scripts/wrap_qsub_slurm_bwa.pl PSSP_*fastq
Here is the most relevant portion of the perl script, which includes the options for bwa mem:
## build array of jobs to be run individually (serially) by slurm
my $dir = '/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/parsed/';
my @jobarray;
my $aln;
#my $samse;
my $ind;
my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta";
my $cd = "cd $dir\n";
my @job;
my $job;
my $cnt = 0;
foreach my $file (@ARGV){
    if ($file =~ m/(PSSP_[A-Za-z0-9_]+)\.fastq/){
        $ind = $1;
    }
    else {
        die "Failed to match $file\n";
    }
    $aln = "bwa mem -t 12 -k 15 -r 1.3 -T 30 -R \'\@RG\\tID:$ind\\tPL:ILLUMINA\\tLB:$ind\\tSM:$ind"."\' $genome $file > aln"."$ind".".sam \n";
    $cnt++;
    push (@job, "$aln");
    if($cnt==20){ ## every 20 samples gets a node of its own
        $job = join("",@job);
        push (@jobarray, "$cd"."$job");
        print "$cd"."$job";
        $cnt = 0;
        @job = ();
    }
}
## last (partial) job, if any samples remain after the final batch of 20
if (@job){
    $job = join("",@job);
    push (@jobarray, "$cd"."$job");
    print "$cd"."$job";
}
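To make the wrapper's output concrete, here is a sketch of the per-sample bwa mem line it emits, built the same way the perl script builds $aln. The sample ID PSSP_001 is hypothetical; paths are the ones used above.

```shell
# Build one job line the way the perl wrapper does. PSSP_001 is a
# hypothetical sample ID parsed from a file name like PSSP_001.fastq.
dir=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/parsed/
genome=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta
ind=PSSP_001
# The \t sequences stay literal; bwa expands them inside the -R read-group string.
cmd="bwa mem -t 12 -k 15 -r 1.3 -T 30 -R '@RG\tID:${ind}\tPL:ILLUMINA\tLB:${ind}\tSM:${ind}' $genome ${ind}.fastq > aln${ind}.sam"
echo "cd $dir"
echo "$cmd"
```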
This generates the SAM alignment files (one per sample), which should then be moved to the alignments sub-directory.
We then compressed, sorted, and indexed the SAM alignment files with samtools (ver).
This was done by submitting a series of batch submissions (jobs) with a perl wrapper script from the alignment sub-directory (where the sam files should be at this point):
perl ../scripts/wrap_qsub_slurm_sam2bam-par.pl alnPSSP_*sam
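The per-sample samtools steps the wrapper runs presumably look something like the sketch below (samtools 1.x syntax; the helper function name, the .sorted.bam naming, and the exact options are assumptions, as the wrapper's internals are not shown here).

```shell
# Sketch: compress, coordinate-sort, and index one sample's SAM file.
# sam2bam is a hypothetical helper; alnPSSP_001 is a hypothetical prefix.
sam2bam() {
  s=$1                                        # sample prefix, e.g. alnPSSP_001
  samtools view -b -o "$s.bam" "$s.sam"       # compress SAM -> BAM
  samtools sort -o "$s.sorted.bam" "$s.bam"   # coordinate-sort the BAM
  samtools index "$s.sorted.bam"              # writes $s.sorted.bam.bai
}
# sam2bam alnPSSP_001
```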
This generates *.bam and *sorted.bam files. Only the latter are needed, so after this finishes the *.sam and unsorted *.bam files can be deleted (carefully: a bare rm *bam would delete the sorted.bam files too).
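One safe way to do that cleanup is to let find exclude the sorted files explicitly; a sketch, demonstrated in a throwaway directory (file names are hypothetical):

```shell
# Delete intermediate .sam and unsorted .bam files while keeping *sorted.bam.
# Demo uses a scratch directory so nothing real is touched.
mkdir -p /tmp/aln_demo && cd /tmp/aln_demo
touch alnPSSP_001.sam alnPSSP_001.bam alnPSSP_001.sorted.bam
find . -maxdepth 1 -name '*.sam' -delete
find . -maxdepth 1 -name '*.bam' ! -name '*sorted.bam' -delete
ls    # only alnPSSP_001.sorted.bam remains
```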
GATK's HaplotypeCaller was then used for variant calling. This was chosen for compatibility with POSE (a polyploid), as bcftools call only handles diploids.
1. Create a dict file for the reference genome (required for GATK)
## using interactive job
module load picard
java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta O=/uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.dict
2. Index the reference sequence
## using interactive job
module load samtools
samtools faidx /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta
3. Batch submission for variant calling (SEE 3b)
sbatch slurmRun_gatk.sh
Here is the slurmRun_gatk.sh script:
#!/bin/sh
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/alignments
java -Xmx920g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta -I bams.list -o PSSP_var.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 20 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE
early()
{
echo ' '
echo ' ############ WARNING: EARLY TERMINATION ############# '
echo ' '
}
This will generate the variant file PSSP_var.vcf
3b. NOTE: we ran into problems with GATK taking too long and failing. We will need to revisit this for POSE, but for PSSP I am now moving forward with samtools/bcftools instead. I ran the batch submission like this:
sbatch callvar.sh
The script looks like this:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=samtools
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/alignments
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/grasses/clustering/PSSP_reference_con90.fasta -q 20 -Q 30 -I -u -g -t DP,AD,ADF,ADR -o PSSP_variants.bcf
bcftools call -v -m -p 0.01 -P 0.001 -O v -o PSSP_variants.vcf PSSP_variants.bcf
This generated ~2 million variants, which are in PSSP_variants.vcf in the variants sub-directory.
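A quick way to get that variant count is to exclude the '#'-prefixed header lines and count what remains; a sketch using a tiny made-up VCF:

```shell
# Count variant records in a VCF: every non-header line is one record.
# /tmp/demo.vcf is a minimal hypothetical example with 2 variants.
cat > /tmp/demo.vcf <<'EOF'
##fileformat=VCFv4.2
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
contig1	10	.	A	G	99	.	DP=100
contig1	42	.	C	T	80	.	DP=90
EOF
grep -vc '^#' /tmp/demo.vcf    # prints 2
```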
4. Variant filtering. First with a local version of vcfFilter.pl
#### stringency variables, edit as desired
# N = 617 inds
my $minCoverage = 1234; # minimum number of sequences; DP, 1234 = 2X coverage
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 123; # maximum number of individuals with no data = 20% of N
my $minMaf = 6; # minimum minor allele count, ~0.005 as a frequency
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
##### this set is for whole genome shotgun data
perl vcfFilter.pl PSSP_variants.vcf
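For reference, the thresholds above follow directly from N = 617 individuals; a sketch of the arithmetic (integer shell math, so the MAF count is the floor of 0.005 x 2N):

```shell
# How the vcfFilter.pl stringency values derive from N = 617 individuals.
N=617
minCoverage=$((2 * N))            # 2X mean depth across individuals -> 1234
miss=$((N * 20 / 100))            # 20% missing data allowed -> 123
minMaf=$((N * 2 * 5 / 1000))      # 0.005 of 2N allele copies -> 6
echo $minCoverage $miss $minMaf
```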