Code files here https://zenodo.org/record/159272
Publication https://f1000research.com/articles/5-2644/v3
GATK software here https://software.broadinstitute.org/gatk/
Section of a vcf genotypes file
Haplocaller on the GATK platform is frequently upgraded but commands described here are general enough to be easily modified. Other commands will depend on Java and server settings.
#!/bin/sh
#$ -N MegaGenome
## Use GATK HaplotypeCaller to genotype, merge and regenotype Dmel NGS-WGS samples.
#$ -pe openmp 24
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -q bioinf.q
. /etc/profile.d/modules.sh
module load jre/1.7.0_25
module load gatk/3.4-0
myref=local_reference/dm6.fa
Note that the for loop wild-card picks-up on a bam file with four characters in the unique name part (????.bam). It then sets this as base name for the output file.
for i in ../../read_mapping/gwas_samples/*/????.bam
do
base_name=${i%.bam}
GenomeAnalysisTK \
-nct 24 \
-T HaplotypeCaller \
-R ${myref} \
-I $i \
--emitRefConfidence GVCF \
-stand_call_conf 31 -stand_emit_conf 31 -mbq 30 \
-L chr2L -L chr2R -L chr3L -L chr3R -L chrX \
--out $base_name.g.vcf;
done;
Using the wild-card for pattern recognition.
find ../../read_mapping/gwas_samples/*/ -type f \( -name "????.g.vcf" \) -exec ls "{}" \; > all_GVCF.list
This is memory-demanding, so it's best to trial on a few samples to get an idea of the performance and output.
GenomeAnalysisTK -R ${myref} \
-T CombineGVCFs \
-V all_GVCF.list \
--out lhm_rg.g.vcf
Log from GATK CombineGVCFs function. Note estimated run-time is 92.1 hours.
GenomeAnalysisTK -R ${myref} \
-T GenotypeGVCFs -nt 24 \
-V lhm_rg.g.vcf \
--out lhm_rg_HC_2015-09-15.vcf
Detail of the genotypes field in a vcf file. GT=genotypes, AD=allelic depth, DP=depth, GQ=genotype quality, PL=posterior likelihood.
if [ -d "QC_calling_2015-09-15" ]; then
rm -r QC_calling_2015-09-15
fi
mkdir QC_calling_2015-09-15/
GenomeAnalysisTK -R ${myref} \
-T VariantsToTable -V lhm_rg_HC_2015-09-15.vcf \
-F CHROM -F POS -F QUAL -F AC -F DP -F MQ \
--out QC_calling_2015-09-15/lhm_rg_HC_2015-09-15.QCtable.txt
gzip QC_calling_2015-09-15/lhm_rg_HC_2015-09-15.QCtable.txt
GenomeAnalysisTK -R ${myref} \
-T VariantsToTable -V lhm_rg_HC_2015-09-15.vcf \
-F CHROM -F POS -GF GT \
--out QC_calling_2015-09-15/lhm_rg_HC_2015-09-15.genotype.table.txt
gzip QC_calling_2015-09-15/lhm_rg_HC_2015-09-15.genotype.table.txt
GenomeAnalysisTK -R ${myref} \
-T VariantsToTable -V lhm_rg_HC_2015-09-15.vcf \
-F CHROM -F POS -GF GQ \
--out QC_calling_2015-09-15/lhm_rg_HC_2015-09-15.genotype.qual.txt
gzip QC_calling_2015-09-15/lhm_rg_HC_2015-09-15.genotype.qual.txt
Variant metrics table.
Large section of a vcf file viewed with bash
less -S lhm*.vcf
25th March 2018