Post date: Jun 20, 2018 11:20:8 PM
I am using the entropy model to obtain genotype estimates for the CHC bugs, which will then be used for the genomic prediction with gemma. This was done in /uufs/chpc.utah.edu/common/home/gompert-group1/projects/lyc_chcs/Gemma/Entropy/.
1. Convert vcf to gl without applying an additional MAF filter.
perl vcf2gl.pl 0.0 morefilter_filtered2x_lyc_CHCs_variants.vcf
The generated the gl file with 42,350 SNPs and 192 individuals.
2. I then generated point estimates of genotpyes (from GATK allele freq. estimates) to use as input for entropy starting values
perl gl2genest.pl af_lyc_chcs.txt lyc_chcs.gl
3. When generating initial values using the PCA-Kmean-LDA approach I noticed some individuals had very low coverage. Those with < 2x coverage were dropped leaving me with 125. This could be a plate effect or a size effect (need to check this still!). Here is what I did.
Run initq.R, this also subsets to coverage of 2x or greater.
Run keepInds.pl to do the same for gl.
4. Run entropy with k = {2,3}, 2 chains each.
sbatch RunEntropy.sh
#!/bin/sh
#SBATCH --time=140:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=entropy
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load gsl
module load hdf5
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy
perl RunEntropyFork.pl
Which runs:
#!/usr/bin/perl
#
# run entropy jobs
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
my $in = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/lyc_chc_1.gl";
my $k2 = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/ldak2.txt";
my $k3 = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/ldak3.txt";
my $odir = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/";
@C = ("entropy -i $in -l 10000 -b 5000 -t 5 -k 2 -Q 0 -s 50 -q $k2 -o $odir"."out_k2_ch1.hdf5 -w 0 -m 1\n",
"entropy -i $in -l 10000 -b 5000 -t 5 -k 2 -Q 0 -s 50 -q $k2 -o $odir"."out_k2_ch2.hdf5 -w 0 -m 1\n",
"entropy -i $in -l 10000 -b 5000 -t 5 -k 3 -Q 0 -s 50 -q $k3 -o $odir"."out_k3_ch1.hdf5 -w 0 -m 1\n",
"entropy -i $in -l 10000 -b 5000 -t 5 -k 3 -Q 0 -s 50 -q $k3 -o $odir"."out_k3_ch2.hdf5 -w 0 -m 1\n");
COMMANDS:
foreach $command (@C){
$pm->start and next COMMANDS; ## fork
system $command;
$pm->finish;
}
$pm->wait_all_children;
5. Convert (across chains) entropy output to genotype point estimates and visualize/check things in R. Next I need to make the input file for gemma, but I am waiting on the CHC data from Gary.
estpost_entropy -o g_k23.txt -p gprob -w 0 -s 0 out_k2_ch1.hdf5 out_k2_ch2.hdf5 out_k3_ch1.hdf5 out_k3_ch2.hdf5
parameter dimensions for gprob: loci = 42350, ind = 125, genotypes = 3, chains = 4
In R:
L<-42350
N<-125
G<-matrix(scan("g_k23.txt",n=N*L,sep=","),nrow=N,ncol=L,byrow=TRUE)
pc<-prcomp(G,center=TRUE,scale=FALSE)
ids<-read.table("ids.txt",header=FALSE)
keep<-read.table("IndsToKeep.txt",header=FALSE)
ids<-ids[keep[,1]==1,1]
plot(pc$x[,1],pc$x[,2],type='n')
text(pc$x[,1],pc$x[,2],ids)
plot(pc$x[,3],pc$x[,4],type='n')
text(pc$x[,3],pc$x[,4],ids)