Post date: Apr 24, 2019 8:0:27 PM
We need genotype estimates for trait mapping/genomic prediction. Also, I wanted to look for any oddness in genotypes for individuals before moving on as another means to contrast the different data sets and genotype likelihoods. I obtained genotype estimates from entropy and the results and scripts are all in /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Entropy/.
1. I ran entropy with 3 chains and k = 2 or 3 with 10,000 step chains, 5000 iteration burnins and a thinning interval of 5. This was done for each of the 3 data sets. I ran, sbatch RunEntropy.sh, which runs
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=entropy
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
module load gsl
module load hdf5
cd /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Entropy
perl RunEntropyFork.pl lyc_timeseries_*.gl
which runs
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
my $odir = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Entropy/";
my $base = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Entropy/";
foreach $in (@ARGV){
$in =~ m/series_([A-Z]+_[a-z]+)/;
$dat = $1;
foreach $k (2..3){
foreach $ch (0..2){
$pm->start and next; ## fork
system "entropy -i $in -l 10000 -b 5000 -t 5 -k $k -Q 0 -s 50 -q $base"."ldak$k".".txt -o $odir"."out_$dat"."_k$k"."_ch$ch".".hdf5 -w 0 -m 1\n";
$pm->finish;
}
}
}
$pm->wait_all_children;
2. I then extracted posterior genotype estimates by integrating over the 3 chains and two values of k.
estpost_entropy -o G_GATK_full.txt -p gprob -w 0 -s 0 out_GATK_full_k* &
estpost_entropy -o G_GATK_sub.txt -p gprob -w 0 -s 0 out_GATK_sub_k* &
estpost_entropy -o G_SAM_sub.txt -p gprob -w 0 -s 0 out_SAM_sub_k* &
3. I then read these into R and made a PCA. Again, the samtools/bcftools data set appears to give the cleanest results with a clear spatial signal but not much else (as would be expected). See commands.R and gPcs.pdf. Nothing looks amiss.
## descriptive analyses based on individual genotype estimates, again with comparisons across data sets
L<-c(30033,12886,12886)
N<-1536
files<-c("G_GATK_full.txt","G_GATK_sub.txt","G_SAM_sub.txt")
G<-vector("list",3)
for(i in 1:3){
## rows = inds, cols = SNPs
G[[i]]<-matrix(scan(files[i],n=L[i]*N,sep=","),nrow=N,ncol=L[i],byrow=TRUE)
}
PC<-vector("list",3)
for(i in 1:3){
PC[[i]]<-prcomp(G[[i]],center=TRUE,scale=FALSE)
}
## plots
library(RColorBrewer)
ids<-read.table("indIds.txt",header=FALSE)
tit<-c("GATK full","GATK sub","SAM sub")
myc<-brewer.pal(n=10,"Paired")
myc<-c(myc[1:6],"black",myc[7:10])
pdf("gPcs.pdf",width=6,height=6)
par(mar=c(4,5.5,2.5,1))
for(i in 1:3){
plot(PC[[i]]$x[,1],PC[[i]]$x[,2],pch=ids[,2]+8,col=myc[as.numeric(ids[,1])],xlab="PC1",ylab="PC2",cex.lab=1.5)
title(main=tit[i],cex.main=1.4)
if(i == 3){
legend(-4,6.7,unique(ids[,1]),fill=myc,ncol=2,bty='n')
}
}
dev.off()