Post date: Jul 09, 2018 10:5:48 PM
0. Index fasta file
path: Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed
bwa index cmacgenome.fasta
1. Alignment with bwa aln and samse (version 0.7.17-r1188).
Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/parsed
perl ../scripts/wrap_qsub_slurm_bwa.pl clean_*fastq
The above perl script runs bwa with the mem option enabled rather than aln. This is done due to the longer reads present.
Outputs aln_clean_*.sam files which are moved to /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/alignments
1.5 Check proportion of reads aligned
for i in *.sam; do samtools flagstat $i; done
I didn't observe any which were below 83%
2. Used samtools (1.5 with htslib 1.5) to convert sam to bam, etc.
Path: /uufs/chpc.utah.edu/common/home/u0795641/data/cmac_qtl_AR/alignments
perl ../Scripts/Sam2BamFork.pl aln_clean_*sam
Edit for GATK: GATK did not run correctly, perhaps something wrong with how CHPC handles multithreading. Instead used variant calling steps from the L14 paper.
3. Variant calling, first using GATK's haplotype caller to generate individual g.vcf files.
3.1 Generate fasta file index for GATK
samtools faidx cmacgenome.fasta
3.2 Generate dictionary file for GATK
java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=cmacgenome.fasta O=cmacgenome.dict
From the alignments subdirectory I ran
perl ../scripts/wrap_qsub_slurm_gatk.pl *sorted.bam
4. Combine GVCF files
First, I combined the 748 g.vcf files into 4 combined g.vcf files with 196 individuals each (this speeds up variant calling).
ls *.g.vcf > vcfs.txt
sed -i 's/^/--variant /' vcfs.txt
split -l 187 vcfs.txt splitvcf
cat splitvcfaa | tr -d '\n' >> basea.txt
Edit so that it's a continuous command
cat base* > jobs.txt
sed -i -e 's/vcf/vcf /g' jobs.txt
sbatch SubComb.sh
The above shell script runs runCombineGVCFs.pl
5. Joint variant calling
3. Call Var
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/cmac_QTL_AR/alignments/bams/cmacgenome.fasta -q 20 -Q 30 -I -u -g -t DP,DPR -o calloVars.bcf
~/bin/bcftools call -v -m -p 0.01 -P 0.001 -O v -o calloVars.vcf calloVars.bcf
4. Filter Variants based on depth, mapping quality, etc; using version with data for >= 80% of individuals, keeping 28,249 SNPs. Creates filtered2x_calloVars.vcf
perl ../scripts/vcfFilter.pl calloVars.vcf
#stringency variables
my $minCoverage = 1526; # minimum number of sequences; DP, 1344 = 2X
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 152; # maximum number of individuals with no data
my $minMaf = 7; # minimum MAF, as allele count ~ 0.005
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
5. Filter some more based on high SNP density. SNPs must be at least 4bp apart, plus gets rid of high coverage snps. Creates morefilter_filtered2x_calloVars.vcf
grep ^sc filtered2x_calloVars.vcf | cut -f 8 | cut -d ';' -f 1 | grep -o '[0-9]*' > depth.txt
4841+12037
grep ^sc filtered2x_calloVars.vcf | cut -f 1 > scafcal.txt
open the scaf file in vim and remove the scaf- part.
%s/scaffold_//
grep ^sc filtered2x_calloVars.vcf | cut -f 2 > poscal.txt
#create final file
paste scafBart.txt posBart.txt > snpsBart.txt
#stringency variables
my $maxCoverage = 16878; # maximum depth to avoid repeats, mean + 2sd
my $mind = 4; # don't call if SNPs are closer together than this
Retained 34284 variable loci
6. Convert to genotype likelihood format. Creates cmacL14.gl
perl ../scripts/vcf2gl.pl 0 morefilter_filtered2x_calloVars.vcf
Number of loci: 34284; number of individuals 748
Split genotype likelihoods by population
perl projects/cmac_AR/scripts/splitsPops.pl projects/cmac_AR/likelihood_data/cmacL14.gl
Create allele freq by population:
path: /uufs/chpc.utah.edu/common/home/u0795641/projects/cmac_qtl/likelihood_data
perl wrap_qsub_slurm_popmod.pl cmacseries*
paremeters:
~/bin/popmod -i $file -o outp_bayes_$pop".".hdf5 -b 1000 -n 10000 -t 5 -a 0.5
Create txt files of allele frequency (p) and genotypes (g)
module load hdf5 gsl gcc
~/bin/estpost_popmod -p g -s 0 -w 1 -o g_L1.txt outp_bayes_L1.hdf5
~/bin/estpost_popmod -p g -s 0 -w 0 -o g_L2.txt outp_bayes_L2.hdf5
~/bin/estpost_popmod -p g -s 0 -w 0 -o g_L14A.txt outp_bayes_L14A.hdf5
7. Variant calling for L14 and L1/2 references results:
L14: Number of loci: 13404; number of individuals 48 (from F16A)
L1/2/M: Number of loci: 21591; number of individuals 128