Genotype estimates for CHC data with entropy

Post date: Jun 20, 2018 11:20:8 PM

I am using the entropy model to obtain genotype estimates for the CHC bugs, which will then be used for the genomic prediction with gemma. This was done in /uufs/chpc.utah.edu/common/home/gompert-group1/projects/lyc_chcs/Gemma/Entropy/.

1. Convert vcf to gl without applying an additional MAF filter.

perl vcf2gl.pl 0.0 morefilter_filtered2x_lyc_CHCs_variants.vcf

The generated the gl file with 42,350 SNPs and 192 individuals.

2. I then generated point estimates of genotpyes (from GATK allele freq. estimates) to use as input for entropy starting values

perl gl2genest.pl af_lyc_chcs.txt lyc_chcs.gl

3. When generating initial values using the PCA-Kmean-LDA approach I noticed some individuals had very low coverage. Those with < 2x coverage were dropped leaving me with 125. This could be a plate effect or a size effect (need to check this still!). Here is what I did.

Run initq.R, this also subsets to coverage of 2x or greater.

Run keepInds.pl to do the same for gl.

4. Run entropy with k = {2,3}, 2 chains each.

sbatch RunEntropy.sh

#!/bin/sh

#SBATCH --time=140:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=8

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=entropy

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

module load gsl

module load hdf5

cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy

perl RunEntropyFork.pl

Which runs:

#!/usr/bin/perl

# run entropy jobs

use Parallel::ForkManager;

my $max = 20;

my $pm = Parallel::ForkManager->new($max);

my $in = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/lyc_chc_1.gl";

my $k2 = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/ldak2.txt";

my $k3 = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/ldak3.txt";

my $odir = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_chcs/Gemma/Entropy/";

@C = ("entropy -i $in -l 10000 -b 5000 -t 5 -k 2 -Q 0 -s 50 -q $k2 -o $odir"."out_k2_ch1.hdf5 -w 0 -m 1\n",

"entropy -i $in -l 10000 -b 5000 -t 5 -k 2 -Q 0 -s 50 -q $k2 -o $odir"."out_k2_ch2.hdf5 -w 0 -m 1\n",

"entropy -i $in -l 10000 -b 5000 -t 5 -k 3 -Q 0 -s 50 -q $k3 -o $odir"."out_k3_ch1.hdf5 -w 0 -m 1\n",

"entropy -i $in -l 10000 -b 5000 -t 5 -k 3 -Q 0 -s 50 -q $k3 -o $odir"."out_k3_ch2.hdf5 -w 0 -m 1\n");

COMMANDS:

foreach $command (@C){

$pm->start and next COMMANDS; ## fork

system $command;

$pm->finish;

}

$pm->wait_all_children;

5. Convert (across chains) entropy output to genotype point estimates and visualize/check things in R. Next I need to make the input file for gemma, but I am waiting on the CHC data from Gary.

estpost_entropy -o g_k23.txt -p gprob -w 0 -s 0 out_k2_ch1.hdf5 out_k2_ch2.hdf5 out_k3_ch1.hdf5 out_k3_ch2.hdf5

parameter dimensions for gprob: loci = 42350, ind = 125, genotypes = 3, chains = 4

In R:

L<-42350

N<-125

G<-matrix(scan("g_k23.txt",n=N*L,sep=","),nrow=N,ncol=L,byrow=TRUE)

pc<-prcomp(G,center=TRUE,scale=FALSE)

ids<-read.table("ids.txt",header=FALSE)

keep<-read.table("IndsToKeep.txt",header=FALSE)

ids<-ids[keep[,1]==1,1]

plot(pc$x[,1],pc$x[,2],type='n')

text(pc$x[,1],pc$x[,2],ids)

plot(pc$x[,3],pc$x[,4],type='n')

text(pc$x[,3],pc$x[,4],ids)

Page updated

Google Sites

Report abuse