created by slee
on 2016-06-16
The first step in the GATK ACNV workflow (http://gatkforums.broadinstitute.org/gatk/discussion/7387/description-and-examples-of-the-steps-in-the-acnv-case-workflow) is to naively call heterozygous SNPs in a case sample from the pileups at common SNP sites. These SNP sites are specified by a Picard interval list and can be constructed from any suitable database of SNPs. We outline below one possible method of building such a list with the GATK and Picard from the filtered 1000 Genomes Project Phase 3 markers used by BEAGLE 4.x.
First, download and extract the BEAGLE 1000G VCFs to a working folder:
wget -r -nH -nd -np -R index.html -A ‘chr*.1kg.phase3.v5a.vcf.zip’ http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/ unzip ‘chr*.1kg.phase3.v5a.vcf.zip’
Structural variants have already been filtered from these VCFs. We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than a given threshold (we choose 10% here) using SelectVariants. We then use CatVariants to merge the per-chromosome VCFs and produce a single VCF, which is finally converted to the desired interval list using Picard. Note that sex chromosomes are excluded as they are not currently supported by the GATK CNV/ACNV workflows.
PICARD_JAR=/seq/software/picard/current/bin/picard.jar GATK_JAR=/humgen/gsa-hpprojects/GATK/bin/current/GenomeAnalysisTK.jar HG19_FASTA=/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta CATVARIANTS_COMMAND=“java -cp $GATK_JAR org.broadinstitute.gatk.tools.CatVariants -R $HG19_FASTA” for chr in {1..22} do # Index the per-chromosome vcf.gz files using tabix. tabix -p vcf chr$chr.1kg.phase3.v5a.vcf.gz # Structural variants have already been filtered from these VCFs. We further filter out indels, multiallelic sites, and sites with minor-allele frequency less than 10%. java -jar $GATK_JAR -T SelectVariants -R $HG19_FASTA -V chr$chr.1kg.phase3.v5a.vcf.gz -selectType SNP —restrictAllelesTo BIALLELIC -select ‘vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount >= 0.10 && vc.getCalledChrCount(vc.altAlleleWithHighestAlleleCount) * 1.0 / vc.calledChrCount < 0.90’ -o chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz # Add filtered VCF for this chromosome to the list of inputs for CatVariants. CATVARIANTS_COMMAND=”$CATVARIANTS_COMMAND -V chr$chr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz“ done # Finish constructing the CatVariants command and execute it. CATVARIANTS_COMMAND=”$CATVARIANTS_COMMAND -out allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz —assumeSorted“ $CATVARIANTS_COMMAND # Convert to interval list. java -jar $PICARD_JAR VcfToIntervalList I=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.vcf.gz O=allchr.1kg.phase3.v5a.snp.maf10.biallelic.hg19.interval_list
Depending on the amount of free space in the working folder, commands to remove intermediate files can be added to the appropriate places above.
From AngeSverch on 2019-03-20
Hello, the code and BEAGLE vcfs are made for GRCh37. Is it possible to download the similar files for GRCh38?
From jml96 on 2019-07-23
Hi,
Is there an equivalent command for gatk 4.1 to this one?
java -cp gatk.jar org.broadinstitute.gatk.tools.CatVariants …
I am getting the following error.
Error: Could not find or load main class org.broadinstitute.gatk.tools.CatVariants
Thank you.
Regards,
João