Post date: Aug 22, 2018, 9:08:01 PM
There are 12 lanes of aspen data in /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/. This is all from UT on a HiSeq 2500 with 100 bp reads. Four of the lanes were included in a job I ran with some Lycaeides data. This includes data from the pando project and from a western US sampling phylogeography project. I intend to lead the former, whereas Karen Mock will lead the latter (but I am going to take it through variant calling). Here is what I am doing.
1. Filter out phiX. The script is described here.
perl Scripts/PhiXFilterFork.pl *fastq
This generates the clean*fastq files.
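A quick way to see how much phiX was removed is to compare read counts before and after filtering. This is a minimal sketch with made-up line counts standing in for `wc -l < file.fastq`; a fastq record is 4 lines, so reads = lines / 4.

```shell
# Hypothetical line counts, e.g. from: wc -l < BP0103a_S2_L002_R1_001.fastq
# and: wc -l < clean_BP0103a_S2_L002_R1_001.fastq
raw_lines=400000
clean_lines=392000
# Each fastq record spans 4 lines (header, sequence, +, quality)
echo "raw reads:   $((raw_lines / 4))"
echo "clean reads: $((clean_lines / 4))"
```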
2. Split the fastq files and parse barcodes.
First, files are split for faster parsing,
sbatch SplitFq.sh
which runs
#!/bin/sh
#SBATCH --time=48:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=split
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Sequences
split -l 200000000 clean_BP0103a_S2_L002_R1_001.fastq BP0103a_
split -l 200000000 clean_BP0103b_S1_L001_R1_001.fastq BP0103b_
split -l 200000000 clean_BP0103c_S2_L002_R1_001.fastq BP0103c_
split -l 200000000 clean_BP0104a_S3_L003_R1_001.fastq BP0104a_
split -l 200000000 clean_BP0104_S3_L003_R1_001.fastq BP0104_
split -l 200000000 clean_BP0508a_S4_L004_R1_001.fastq BP0508a_
split -l 200000000 clean_BP0508_S4_L004_R1_001.fastq BP0508_
split -l 200000000 clean_BP0912a_S5_L005_R1_001.fastq BP0912a_
split -l 200000000 clean_BP0912_S5_L005_R1_001.fastq BP0912_
split -l 200000000 clean_BP1316a_S6_L006_R1_001.fastq BP1316a_
split -l 200000000 clean_BP1316b_S7_L007_R1_001.fastq BP1316b_
split -l 200000000 clean_BP1718_S8_L008_R1_001.fastq BP1718_
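Since a fastq record is 4 lines, each 200,000,000-line chunk holds 50 million reads. split names the pieces with two-letter suffixes appended to the given prefix (BP0103a_aa, BP0103a_ab, ...), which is what the BP*_a[a-z] glob in the parsing step picks up.

```shell
# Reads per split chunk: 4 fastq lines per read
reads_per_chunk=$((200000000 / 4))
echo "$reads_per_chunk"
```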
Then, I parsed the sequences using the barcode files provided by Jim Walton. I ran,
sbatch SubParse.sh
which looks like this,
#!/bin/sh
#SBATCH --time=144:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=parseseq
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load perl
cd /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Sequences/
perl RunParseFork.pl BP*_a[a-z]
which in turn runs this (RunParseFork.pl),
#!/usr/bin/perl
#
# manage barcode parsing
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $fq (@ARGV){
$pm->start and next FILES; ## fork
$fq =~ m/(BP[a-zA-Z0-9]+)/ or die "failed to match $fq\n";
$bc = $1;
if($bc =~ m/BP0103/){
$bc = "p1722_$bc".".csv";
}
else{
$bc = "p1614_$bc".".csv";
}
print "Parsing $fq with $bc\n";
system "perl parse_barcodes768.pl $bc $fq\n";
$pm->finish;
}
$pm->wait_all_children;
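The barcode-file selection in the Perl above can be restated in shell terms: the BP... stem is pulled from the split-file name, and the BP0103 lanes get the p1722 barcode file while everything else gets p1614. This sketch is mine (the function name is hypothetical), but the mapping matches the script.

```shell
# Mirror of the barcode-file choice in RunParseFork.pl
pick_barcode_file() {
  stem=${1%%_[a-z][a-z]}        # e.g. BP0103a_aa -> BP0103a
  case "$stem" in
    BP0103*) echo "p1722_${stem}.csv" ;;
    *)       echo "p1614_${stem}.csv" ;;
  esac
}
pick_barcode_file BP0103a_aa
pick_barcode_file BP0508_ab
```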
3. Align the data to the P. tremuloides draft reference genome with bwa aln and samse (version 0.7.17-r1188). I am using these instead of mem because the reads are only 100 bp, and reads that short seem to do as well or better with aln and samse. I am using the same draft genome we used previously, which is highly fragmented (described here and in Gompert and Mock 2017).
There are 2024 parsed fastq files to align. I submitted a series of jobs (with a perl wrapper) to run all of the alignments. From /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Parsed, I ran,
perl ../Scripts/wrap_qsub_slurm_bwa.pl *fastq
Here is one example command:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Parsed/
bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -f alnWYW-1650.sai /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa WYW-1650.fastq
bwa samse -n 1 -r '@RG\tID:potr-WYW-1650\tPL:ILLUMINA\tLB:potr-WYW-1650\tSM:potr-WYW-1650' -f alnWYW-1650.sam /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa alnWYW-1650.sai WYW-1650.fastq
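The wrapper derives the per-sample file and read-group names from the fastq name; here is that naming logic in isolation (a sketch; the variable names are mine, not from wrap_qsub_slurm_bwa.pl).

```shell
fq=WYW-1650.fastq
id=${fq%.fastq}                # sample ID from the fastq name
echo "aln${id}.sai"            # .sai output passed from aln to samse
echo "potr-${id}"              # RG ID/LB/SM used in the samse read-group string
```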
4. Convert the sam alignments to sorted bam files with samtools 1.5 (htslib 1.5).
From /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Alignments, I ran,
sbatch SubSam2Bam.sh
which runs
perl Sam2BamFork.pl *sam
which is shown here:
#!/usr/bin/perl
#
# convert sam to bam, then sort and index
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $sam (@ARGV){
$pm->start and next FILES; ## fork
$sam =~ m/^([A-Za-z0-9_\-]+\.)sam/ or die "failed to match $sam\n";
$base = $1;
system "samtools view -b -O BAM -o $base"."bam $sam\n";
system "samtools sort -O BAM -o $base"."sorted.bam $base"."bam\n";
system "samtools index -b $base"."sorted.bam\n";
$pm->finish;
}
$pm->wait_all_children;
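Note how Sam2BamFork.pl builds its output names: the Perl capture ([A-Za-z0-9_\-]+\.) keeps the trailing dot, so "bam" and "sorted.bam" are simply appended to the base. A shell mirror of that naming:

```shell
sam=alnWYW-1650.sam
base=${sam%sam}              # strips the "sam" suffix; dot retained
echo "${base}bam"            # unsorted bam
echo "${base}sorted.bam"     # sorted, indexed bam
```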
5. For now I am using the standard GATK GVCF approach (generate g.vcf files with HaplotypeCaller, combine them, then joint-call genotypes) with GATK v3.5-0-g36282e4. Ploidy is set to 2 for now. I think pando is triploid (I asked Karen) and we will want to re-call variants with the correct ploidies later for the rangewide samples. Also, we might end up using a different model for variant calling for the somatic mutations (something like what is done with tumors; see, e.g., GATK's MuTect), but this seems like a fine place to start to at least get ploidies.
I ran,
sbatch RunGatk.sh
which runs
perl GatkFork.pl *sorted.bam
which looks like this:
#!/usr/bin/perl
#
# make g.vcf files
#
use Parallel::ForkManager;
my $max = 48;
my $pm = Parallel::ForkManager->new($max);
my $idir = '/uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Alignments';
my $odir = '/uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Variants';
my $tdir = '/scratch/general/lustre/potrGatk';
my $genome = '/uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa';
FILES:
foreach $bam (@ARGV){
$pm->start and next FILES; ## fork
$out = $bam;
$out =~ s/bam/g.vcf/ or die "failed here: $out\n";
system "java -Xmx48g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R $genome -I $idir/$bam -o $tdir/$out -gt_mode DISCOVERY -hets 0.001 -mbq 20 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000\n";
system "cp $tdir/$out $odir/$out\n";
$pm->finish;
}
$pm->wait_all_children;
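The g.vcf name comes from the bam name via s/bam/g.vcf/; since "bam" occurs only as the suffix in these file names, that is the same as swapping the extension, e.g.:

```shell
bam=aln001-B.sorted.bam
out=${bam%bam}g.vcf          # equivalent to the Perl s/bam/g.vcf/ here
echo "$out"
```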
6. Combine g.vcf files and call variants. This is all done in /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Variants/.
First, I used GATK to combine the 2024 *g.vcf files into sets of 96 files. Right now these sets span both projects (this can be redone or the projects split later; pando is mostly in the first two combined files, with a bit in the third). I ran,
sbatch runComb.sh
which runs runCombineGVCFs.pl. This runs all of the jobs in jobs.txt. Here is the first one (that is, the first 96 files):
java -Xmx80g -jar ~/bin/GenomeAnalysisTK.jar -T CombineGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa --variant aln001-B.sorted.g.vcf --variant aln001-S.sorted.g.vcf --variant aln003-B.sorted.g.vcf --variant aln003-S-rep.sorted.g.vcf --variant aln003-S-Rp.sorted.g.vcf --variant aln003-S.sorted.g.vcf --variant aln005-B.sorted.g.vcf --variant aln005-S.sorted.g.vcf --variant aln007-B.sorted.g.vcf --variant aln007-S.sorted.g.vcf --variant aln009-B.sorted.g.vcf --variant aln009-S.sorted.g.vcf --variant aln011-B.sorted.g.vcf --variant aln011-S.sorted.g.vcf --variant aln013-B.sorted.g.vcf --variant aln013-S.sorted.g.vcf --variant aln015-B.sorted.g.vcf --variant aln015-S.sorted.g.vcf --variant aln017-B.sorted.g.vcf --variant aln017-S.sorted.g.vcf --variant aln019-B.sorted.g.vcf --variant aln019-S.sorted.g.vcf --variant aln021-B.sorted.g.vcf --variant aln021-S.sorted.g.vcf --variant aln022-S.sorted.g.vcf --variant aln023-B.sorted.g.vcf --variant aln023-S.sorted.g.vcf --variant aln025-B.sorted.g.vcf --variant aln025-S.sorted.g.vcf --variant aln027-B.sorted.g.vcf --variant aln029-B.sorted.g.vcf --variant aln031-B.sorted.g.vcf --variant aln031-S.sorted.g.vcf --variant aln033-B.sorted.g.vcf --variant aln033-S.sorted.g.vcf --variant aln035-B.sorted.g.vcf --variant aln035-S.sorted.g.vcf --variant aln037-B-rep.sorted.g.vcf --variant aln037-B-Rp.sorted.g.vcf --variant aln037-B.sorted.g.vcf --variant aln037-S.sorted.g.vcf --variant aln039-B.sorted.g.vcf --variant aln039-S.sorted.g.vcf --variant aln041-B.sorted.g.vcf --variant aln041-S.sorted.g.vcf --variant aln043-B.sorted.g.vcf --variant aln043-S.sorted.g.vcf --variant aln045-S.sorted.g.vcf --variant aln047-B2.sorted.g.vcf --variant aln047-B.sorted.g.vcf --variant aln047-S.sorted.g.vcf --variant aln049-B.sorted.g.vcf --variant aln049-S.sorted.g.vcf --variant aln051-B.sorted.g.vcf --variant aln051-S.sorted.g.vcf --variant aln053-B.sorted.g.vcf --variant aln053-S.sorted.g.vcf --variant 
aln055-B.sorted.g.vcf --variant aln055-S-rep.sorted.g.vcf --variant aln055-S-Rp.sorted.g.vcf --variant aln055-S.sorted.g.vcf --variant aln059-S.sorted.g.vcf --variant aln061-B.sorted.g.vcf --variant aln061-S.sorted.g.vcf --variant aln063-S.sorted.g.vcf --variant aln065-B.sorted.g.vcf --variant aln065-S.sorted.g.vcf --variant aln067-B.sorted.g.vcf --variant aln067-S.sorted.g.vcf --variant aln069-B.sorted.g.vcf --variant aln069-S.sorted.g.vcf --variant aln071-B.sorted.g.vcf --variant aln073-B.sorted.g.vcf --variant aln073-S.sorted.g.vcf --variant aln075-B.sorted.g.vcf --variant aln075-S.sorted.g.vcf --variant aln079-B.sorted.g.vcf --variant aln079-S.sorted.g.vcf --variant aln081-B.sorted.g.vcf --variant aln081-S.sorted.g.vcf --variant aln083-B.sorted.g.vcf --variant aln083-S.sorted.g.vcf --variant aln085-B.sorted.g.vcf --variant aln085-S.sorted.g.vcf --variant aln086-S.sorted.g.vcf --variant aln087-B.sorted.g.vcf --variant aln087-S.sorted.g.vcf --variant aln089-B.sorted.g.vcf --variant aln091-B.sorted.g.vcf --variant aln091-S.sorted.g.vcf --variant aln093-B.sorted.g.vcf --variant aln093-S.sorted.g.vcf --variant aln095-B.sorted.g.vcf --variant aln095-S.sorted.g.vcf --variant aln097-B.sorted.g.vcf --variant aln099-B.sorted.g.vcf -o combinded_1.g.vcf
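The long --variant argument strings above can be generated mechanically from a file list; this is a hypothetical sketch of that step (runCombineGVCFs.pl presumably does something similar over jobs.txt in groups of 96; the "combinded" spelling matches the actual file names).

```shell
# Build a "--variant a --variant b ..." string from a (shortened) file list
files="aln001-B.sorted.g.vcf aln001-S.sorted.g.vcf aln003-B.sorted.g.vcf"
args=""
for f in $files; do
  args="$args --variant $f"
done
echo "java -jar ~/bin/GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fa$args -o combinded_1.g.vcf"
```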
7. Call variants and genotypes. Note that this took me several iterations, as it was running very slowly; the trick in the end was to run it without any parallelization (-nt 1).
sbatch jointVarCall.sh
which runs
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk-geno
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load gatk
cd /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Variants
java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa -hets 0.001 -nt 1 --variant combinded_10.g.vcf --variant combinded_11.g.vcf --variant combinded_12.g.vcf --variant combinded_13.g.vcf --variant combinded_14.g.vcf --variant combinded_15.g.vcf --variant combinded_16.g.vcf --variant combinded_17.g.vcf --variant combinded_18.g.vcf --variant combinded_19.g.vcf --variant combinded_1.g.vcf --variant combinded_20.g.vcf --variant combinded_21.g.vcf --variant combinded_2.g.vcf --variant combinded_3.g.vcf --variant combinded_4.g.vcf --variant combinded_5.g.vcf --variant combinded_6.g.vcf --variant combinded_7.g.vcf --variant combinded_8.g.vcf --variant combinded_9.g.vcf -o ptr_variants.vcf