Post date: May 01, 2019 8:28:4 PM
I want to use the whole genome data to determine whether there is a deficit of introgression on the Z chromosome in the sense that the Z shows the species tree more often than the autosomes. I have essentially done this already, but want to redo it with the new dovetail reference genome. I can thus verify that the pattern still holds and also this will make it easier to add to Sam's Dubois paper (my plan) which also uses the dovetail genome assembly.
Analyses, files, etc. described are in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/.
1. Align the standard, shot-gun sequence data for the five genomes (L. anna, L. idas, L. melissa, Sierra Nevada and Warners) to the dovetail L. melissa genome. This was done with bwa version 0.7.17-r1188.
perl wrap_qsub_slurm_bwa.pl sequences/*180*1.fastq.gz
which runs this:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:Lanna\tPL:ILLUMINA\tLB:Lanna\tSM:Lanna' /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta sequences/anna_180bp_1.fastq.gz sequences/anna_180bp_2.fastq.gz > aln_Lanna.sam 2> error_Lanna.log
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:Lidas\tPL:ILLUMINA\tLB:Lidas\tSM:Lidas' /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta sequences/idas_180bp_1.fastq.gz sequences/idas_180bp_2.fastq.gz > aln_Lidas.sam 2> error_Lidas.log
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:Lmelissa\tPL:ILLUMINA\tLB:Lmelissa\tSM:Lmelissa' /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta sequences/melissa_180bp_1.fastq.gz sequences/melissa_180bp_2.fastq.gz > aln_Lmelissa.sam 2> error_Lmelissa.log
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:Lsierra\tPL:ILLUMINA\tLB:Lsierra\tSM:Lsierra' /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta sequences/sierra_180bp_1.fastq.gz sequences/sierra_180bp_2.fastq.gz > aln_Lsierra.sam 2> error_Lsierra.log
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:Lwarner\tPL:ILLUMINA\tLB:Lwarner\tSM:Lwarner' /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta sequences/warner_180bp_1.fastq.gz sequences/warner_180bp_2.fastq.gz > aln_Lwarner.sam 2> error_Lwarner.log
2. Compress, sort and index the alignments (samtools 1.5)
sbatch SubSam2Bam.sh
3. Use samtools rmdup (version 1.5) to remove PCR duplicates (note that this is similar to but more aggressive/conservative than using picardtools).
sbatch SubRmdup.sh
which runs
#!/usr/bin/perl
#
# removes pcr duplicates
#
use Parallel::ForkManager;
my $max = 24;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $bam (@ARGV){
$pm->start and next FILES; ## fork
$ubam = $bam;
$ubam =~ s/sorted/unique/ or die "failed sub $ubam\n";
system "samtools rmdup -S $bam $ubam\n";
system "samtools index -b $ubam\n";
$pm->finish;
}
$pm->wait_all_children;
Note that the final alignments were moved to the bwa_alignments_2019 subdirectory.
4. Use GATK for variant calling. As a first step, I created g.vcf files for variant calling with GATK v 3.5 HaplotypeCcaller.
sbatch RunGatk.sh
which runs
use Parallel::ForkManager;
my $max = 6;
my $pm = Parallel::ForkManager->new($max);
my $idir = '/uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes';
my $odir = '/uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/variants_2019';
my $tdir = '/scratch/general/lustre/lycGatk';
my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta";
FILES:
foreach $bam (@ARGV){
$pm->start and next FILES; ## fork
$out = $bam;
$out =~ s/bam/g.vcf/ or die "failed here: $out\n";
system "java -Xmx96g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R $genome -I $idir/$bam -o $tdir/$out -gt_mode DISCOVERY -hets 0.001 -mbq 30 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000\n";
system "cp $tdir/$out $odir/$out\n";
$pm->finish;
}
$pm->wait_all_children;
5. Joint variant calling with GATK, with results in the variants_2019 subdirectory.
sbatch genomeVarCall.sh
which runs
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk-var
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load gatk
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/whole_genomes/variants_2019
java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta --variant aln_Lanna.unique.g.vcf --variant aln_Lidas.unique.g.vcf --variant aln_Lmelissa.unique.g.vcf --variant aln_Lsierra.unique.g.vcf --variant aln_Lwarner.unique.g.vcf -ploidy 2 -o LycWgVars.vcf