Post date: Jun 11, 2018 7:1:50 PM
I am aligning the DNA sequence data from the CHC caterpillars to the Dovetail version of the L. melissa genome. Note that this is the version where I removed the '=' in the header. This is all done in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/. There are 192 samples and I am working from the clean* fastq files that have the poly G tails removed.
1. Alignment with bwa aln and samse (version 0.7.17-r1188).
From /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Parsed/:
perl ../Scripts/wrap_qsub_slurm_bwa.pl clean_*fastq
Here is an example command.
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Parsed
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnclean_USL-17-15-12.sai /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta clean_USL-17-15-12.fastq
bwa samse -n 1 -r '@RG\tID:lyc-clean_USL-17-15-12\tPL:ILLUMINA\tLB:lyc-clean_USL-17-15-12\tSM:lyc-clean_USL-17-15-12' -f alnclean_USL-17-15-12.sam /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta alnclean_USL-17-15-12.sai clean_USL-17-15-12.fastq
1b. These are longer reads, and as such the bwa aln samse approach did not go well (very few reads aligned). Perhaps this is a NextSeq issue, but I really think (for now) that it is just the longer reads. So I am trying again with bwa mem instead (modified the perl submission script).
perl ../Scripts/wrap_qsub_slurm_bwa.pl clean_*fastq
Here is an example command.
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Parsed
bwa mem -t 8 -k 20 -r 1.3 -R '@RG\tID:lyc-clean_USL-17-15-12\tPL:ILLUMINA\tLB:lyc-clean_USL-17-15-12\tSM:lyc-clean_USL-17-15-12' -o aln_clean_USL-17-15-12.sam -T 30 /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta clean_USL-17-15-12.fastq
And for completeness, here are the current bwa mem defaults:
Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
Algorithm options:
-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-y INT seed occurrence for the 3rd round seeding [20]
-c INT skip seeds with more than INT occurrences [500]
-D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
-W INT discard a chain if seeded bases shorter than INT [0]
-m INT perform at most INT rounds of mate rescues for each read [50]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
Scoring options:
-A INT score for a sequence match, which scales options -TdBOELU unless overridden [1]
-B INT penalty for a mismatch [4]
-O INT[,INT] gap open penalties for deletions and insertions [6,6]
-E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
-L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]
-U INT penalty for an unpaired read pair [17]
-x STR read type. Setting -x changes multiple parameters unless overridden [null]
pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
intractg: -B9 -O16 -L5 (intra-species contigs to ref)
Input/output options:
-p smart pairing (ignoring in2.fq)
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
-H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]
-o FILE sam file to output results to [stdout]
-j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
-5 for split alignment, take the alignment with the smallest coordinate as primary
-q don't modify mapQ of supplementary alignments
-K INT process INT input bases in each batch regardless of nThreads (for reproducibility) []
-v INT verbosity level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-h INT[,INT] if there are <INT hits with score >80% of the max score, output all in XA [5,200]
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-V output the reference FASTA header in the XR tag
-Y use soft clipping for supplementary alignments
-M mark shorter split hits as secondary
-I FLOAT[,FLOAT[,INT[,INT]]]
specify the mean, standard deviation (10% of the mean if absent), max
(4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
2. Used samtools (1.5 with htslib 1.5) to convert sam to bam, etc.
perl ../Scripts/Sam2BamFork.pl aln_clean_*sam
3. Variant calling, first using GATK's haplotype caller to generate individual g.vcf files.
From the Alignments subdirectory I ran
perl ../Scripts/wrap_qsub_slurm_gatk.pl *sorted.bam
This runs the 192 alignments, here is one example.
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Alignments
java -Xmx48g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta -I /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Alignments/aln_clean_USL-17-15-12.sorted.bam -o /scratch/general/lustre/lycCHC/aln_clean_USL-17-15-12.sorted.g.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 30 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
cp /scratch/general/lustre/lycCHC/aln_clean_USL-17-15-12.sorted.g.vcf /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Variants/aln_clean_USL-17-15-12.sorted.g.vcf
4. Joint variant calling with GATK.
First, I combined the 192 g.vcf files into two combined g.vcf files with 96 individuals each (this speeds up variant calling).
sbatch SubComb.sh
#!/bin/sh -f
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Variants
perl runCombineGVCFs.pl
This runs runCombineGVCFs.pl
#!/usr/bin/perl
# fix scaffold names
use Parallel::ForkManager;
my $max = 10;
my $pm = Parallel::ForkManager->new($max);
open(IN,"jobs.txt") or die;
while(<IN>){
push(@jobs,$_);
}
FILES:
foreach $job (@jobs){
$pm->start and next FILES; ## fork
print "JOB=$job\n\n";
system "$job\n";
$pm->finish;
}
$pm->wait_all_children;
Here is one of the two commands in jobs.txt.
java -Xmx80g -jar ~/bin/GenomeAnalysisTK.jar -T CombineGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta --variant aln_clean_PSP-17-04-02.sorted.g.vcf --variant aln_clean_PSP-17-04-03.sorted.g.vcf --variant aln_clean_PSP-17-04-04.sorted.g.vcf --variant aln_clean_PSP-17-04-05.sorted.g.vcf --variant aln_clean_PSP-17-04-07.sorted.g.vcf --variant aln_clean_PSP-17-41-01.sorted.g.vcf --variant aln_clean_PSP-17-41-02.sorted.g.vcf --variant aln_clean_PSP-17-41-03.sorted.g.vcf --variant aln_clean_PSP-17-41-04.sorted.g.vcf --variant aln_clean_PSP-17-41-05.sorted.g.vcf --variant aln_clean_PSP-17-41-06.sorted.g.vcf --variant aln_clean_PSP-17-41-07.sorted.g.vcf --variant aln_clean_PSP-17-41-09.sorted.g.vcf --variant aln_clean_PSP-17-42-01.sorted.g.vcf --variant aln_clean_PSP-17-42-02.sorted.g.vcf --variant aln_clean_PSP-17-42-04.sorted.g.vcf --variant aln_clean_PSP-17-42-05.sorted.g.vcf --variant aln_clean_PSP-17-42-06.sorted.g.vcf --variant aln_clean_PSP-17-42-08.sorted.g.vcf --variant aln_clean_PSP-17-43-01.sorted.g.vcf --variant aln_clean_PSP-17-43-02.sorted.g.vcf --variant aln_clean_PSP-17-43-03.sorted.g.vcf --variant aln_clean_PSP-17-43-04.sorted.g.vcf --variant aln_clean_PSP-17-43-06.sorted.g.vcf --variant aln_clean_PSP-17-43-09.sorted.g.vcf --variant aln_clean_PSP-17-43-11.sorted.g.vcf --variant aln_clean_PSP-17-44-01.sorted.g.vcf --variant aln_clean_PSP-17-44-02.sorted.g.vcf --variant aln_clean_PSP-17-44-04.sorted.g.vcf --variant aln_clean_PSP-17-44-05.sorted.g.vcf --variant aln_clean_PSP-17-44-07.sorted.g.vcf --variant aln_clean_PSP-17-44-10.sorted.g.vcf --variant aln_clean_PSP-17-45-08.sorted.g.vcf --variant aln_clean_RNV-17-12-01.sorted.g.vcf --variant aln_clean_RNV-17-12-03.sorted.g.vcf --variant aln_clean_RNV-17-12-06.sorted.g.vcf --variant aln_clean_SKI-17-01-02.sorted.g.vcf --variant aln_clean_SKI-17-01-03.sorted.g.vcf --variant aln_clean_SKI-17-01-04.sorted.g.vcf --variant aln_clean_SKI-17-01-06.sorted.g.vcf --variant aln_clean_SKI-17-01-07.sorted.g.vcf --variant aln_clean_SKI-17-01-08.sorted.g.vcf --variant aln_clean_SKI-17-01-10.sorted.g.vcf --variant aln_clean_SKI-17-04-03.sorted.g.vcf --variant aln_clean_SKI-17-04-04.sorted.g.vcf --variant aln_clean_SKI-17-04-08.sorted.g.vcf --variant aln_clean_SKI-17-04-09.sorted.g.vcf --variant aln_clean_SKI-17-04-10.sorted.g.vcf --variant aln_clean_SKI-17-06-01.sorted.g.vcf --variant aln_clean_SKI-17-06-02.sorted.g.vcf --variant aln_clean_SKI-17-06-03.sorted.g.vcf --variant aln_clean_SKI-17-06-05.sorted.g.vcf --variant aln_clean_SKI-17-06-06.sorted.g.vcf --variant aln_clean_SKI-17-06-12.sorted.g.vcf --variant aln_clean_SKI-17-12-01.sorted.g.vcf --variant aln_clean_SKI-17-12-02.sorted.g.vcf --variant aln_clean_SKI-17-12-04.sorted.g.vcf --variant aln_clean_SKI-17-12-09.sorted.g.vcf --variant aln_clean_SKI-17-12-10.sorted.g.vcf --variant aln_clean_SKI-17-17-02.sorted.g.vcf --variant aln_clean_SKI-17-17-03.sorted.g.vcf --variant aln_clean_SKI-17-17-08.sorted.g.vcf --variant aln_clean_SKI-17-17-11.sorted.g.vcf --variant aln_clean_USL-17-06-03.sorted.g.vcf --variant aln_clean_USL-17-06-04.sorted.g.vcf --variant aln_clean_USL-17-06-05.sorted.g.vcf --variant aln_clean_USL-17-06-06.sorted.g.vcf --variant aln_clean_USL-17-06-08.sorted.g.vcf --variant aln_clean_USL-17-06-09.sorted.g.vcf --variant aln_clean_USL-17-06-10.sorted.g.vcf --variant aln_clean_USL-17-06-11.sorted.g.vcf --variant aln_clean_USL-17-06-12.sorted.g.vcf --variant aln_clean_USL-17-08-01.sorted.g.vcf --variant aln_clean_USL-17-08-03.sorted.g.vcf --variant aln_clean_USL-17-08-12.sorted.g.vcf --variant aln_clean_USL-17-10-02.sorted.g.vcf --variant aln_clean_USL-17-10-03.sorted.g.vcf --variant aln_clean_USL-17-10-04.sorted.g.vcf --variant aln_clean_USL-17-10-05.sorted.g.vcf --variant aln_clean_USL-17-10-06.sorted.g.vcf --variant aln_clean_USL-17-10-07.sorted.g.vcf --variant aln_clean_USL-17-12-04.sorted.g.vcf --variant aln_clean_USL-17-12-06.sorted.g.vcf --variant aln_clean_USL-17-12-07.sorted.g.vcf --variant aln_clean_USL-17-12-08.sorted.g.vcf --variant aln_clean_USL-17-12-11.sorted.g.vcf --variant aln_clean_USL-17-12-12.sorted.g.vcf --variant aln_clean_USL-17-15-01.sorted.g.vcf --variant aln_clean_USL-17-15-02.sorted.g.vcf --variant aln_clean_USL-17-15-04.sorted.g.vcf --variant aln_clean_USL-17-15-05.sorted.g.vcf --variant aln_clean_USL-17-15-08.sorted.g.vcf --variant aln_clean_USL-17-15-09.sorted.g.vcf --variant aln_clean_USL-17-15-10.sorted.g.vcf --variant aln_clean_USL-17-15-11.sorted.g.vcf --variant aln_clean_USL-17-15-12.sorted.g.vcf -o combinded_2.g.vcf
Then, for joint variant calling I ran sbatch jointVarCall.sh.
#!/bin/sh
#SBATCH --time=140:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_chcs/Variants
java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta -hets 0.001 -nt 20 --variant combinded_1.g.vcf --variant combinded_2.g.vcf -o lyc_CHCs_variants.vcf