Post date: Oct 04, 2017 4:35:4 PM
The SNP data set is described in the previous post. Now I am filtering the data to only retain the lines we phenotyped and to remove SNPs with too much missing data, etc.
1. I used vcftools 0.1.15 for filtering after converting the bcf files to vcf with bcftools (1.3). This was all done within a perl fork/wrapper script. Here is what I ran (from /uufs/chpc.utah.edu/common/home/u6000989/projects/mtrunc_mel/mtruncSnps):
perl forkFilter.pl 10 *bcf
Here is the script:
#!/usr/bin/perl
##
## run vcftools filter, fork over bcf files
##
## usage: #forks, *bcf
use warnings;
use strict;
use Parallel::ForkManager;
my $max = shift(@ARGV); ## get number of cores to use at one time
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach my $bcf (@ARGV){ ## loop through vcf/chromosome files
$pm->start and next FILES; ## fork
my $base = $bcf;
$base =~ s/\.[a-z]+$// or die "failed sub\n";
print "bcftools view --output-file $base.vcf --output-type v $bcf\n";
system "bcftools view --output-file $base.vcf --output-type v $bcf\n";
print "vcftools --vcf $base.vcf --out $base --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode\n";
system "vcftools --vcf $base.vcf --out $base --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode\n";
$pm->finish; ## exit the child process
}
$pm->wait_all_children;
And here is the screen output:
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chru-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chru-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr6-filtered-set-2014Apr15.vcf --out chr6-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr6-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr6-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr2-filtered-set-2014Apr15.vcf --out chr2-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr2-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr2-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr8-filtered-set-2014Apr15.vcf --out chr8-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr8-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr8-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr5-filtered-set-2014Apr15.vcf --out chr5-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr5-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr5-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr7-filtered-set-2014Apr15.vcf --out chr7-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr7-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr7-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr4-filtered-set-2014Apr15.vcf --out chr4-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr4-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr4-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr3-filtered-set-2014Apr15.vcf --out chr3-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr3-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr3-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
vcftools --vcf chr1-filtered-set-2014Apr15.vcf --out chr1-filtered-set-2014Apr15 --remove-indels --remove-filtered-all --maf 0.01 --min-alleles 2 --max-alleles 2 --min-meanDP 2 --max-missing 0.20 --minQ 30 --keep indIds.txt --recode
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf chr1-filtered-set-2014Apr15.vcf
--keep indIds.txt
--maf 0.01
--max-alleles 2
--min-alleles 2
--min-meanDP 2
--minQ 30
--max-missing 0.2
--out chr1-filtered-set-2014Apr15
--recode
--remove-filtered-all
--remove-indels
Keeping individuals in 'keep' list
After filtering, kept 94 out of 262 Individuals
Outputting VCF file...
After filtering, kept 1032131 out of a possible 2630449 Sites
Run Time = 636.00 seconds
After filtering, kept 1394374 out of a possible 3650081 Sites
Run Time = 874.00 seconds
After filtering, kept 1459932 out of a possible 4331584 Sites
Run Time = 1005.00 seconds
After filtering, kept 1498821 out of a possible 4377934 Sites
Run Time = 1005.00 seconds
After filtering, kept 1568488 out of a possible 4496375 Sites
Run Time = 1063.00 seconds
After filtering, kept 1653656 out of a possible 4761519 Sites
Run Time = 1124.00 seconds
After filtering, kept 1641196 out of a possible 4978489 Sites
Run Time = 1114.00 seconds
After filtering, kept 1820612 out of a possible 5414544 Sites
Run Time = 1199.00 seconds
After filtering, kept 1936794 out of a possible 5424868 Sites
Run Time = 1274.00 seconds
The filtered data are in the *recode files, here are the number of SNPs (note that we kept none of the u = unassigned):
chr1-filtered-set-2014Apr15.recode.vcf:1641196
chr2-filtered-set-2014Apr15.recode.vcf:1459932
chr3-filtered-set-2014Apr15.recode.vcf:1936794
chr4-filtered-set-2014Apr15.recode.vcf:1820612
chr5-filtered-set-2014Apr15.recode.vcf:1568488
chr6-filtered-set-2014Apr15.recode.vcf:1394374
chr7-filtered-set-2014Apr15.recode.vcf:1653656
chr8-filtered-set-2014Apr15.recode.vcf:1498821
Total = 12,973,873 SNPs.
2. Next, we removed SNPs in high LD, specifically we looked at 10 kb windows, sliding in 2 kb increments and removed SNPs with r^2 LD > 0.8.
First, added names to the SNPs, the names combine the LG and position,
perl addSnpNums.pl chr[0-9]*recode.vcf
Used PLINK (version 1.09) to get the set of SNPs in high LD,
perl ldPrune.pl mod_chr*vcf
3. Actually remove high LD SNPs and extract genotype likelihoods:
perl vcf2gl.pl mod_chr*recode.vcf
This produces mtrunc_pruned.gl, which includes 94 lines (individuals) and 5,648,772 SNPs.
4. Use my expectation-maximization ML allele frequency estimation program to estimate allele frequencies for all of the SNPs (EM toleranace = 0.001, 20 iterations for convergence).
estpEM -i mtrunc_pruned.gl -o p_mtruc_pruned.txt -h 1
Reading data from mtrunc_pruned.gl
Number of loci: 5648772
Number of individuals: 94
Get just the ML estimate:
cut -f 3 -d " " p_mtruc_pruned.txt > ml_p_mtruc_pruned.txt
5. Apply allele frequency prior to generate Bayesian genotype point estimates. Note that here p and (1-p) are the priors for the homozygotes and the prior for the het. is 0. This is because we are working with inbred lines.
perl gl2genestInbred.pl mtrunc_pruned.gl ml_p_mtruc_pruned.txt
This generates the genotype point estimate file pntest_mtrunc_pruned.txt, which can be used for mapping (with slight modifications for gemma).