Post date: Jan 02, 2017 6:13:58 PM
1. Identify ancestry informative markers (AIMs), here defined as those with an allele frequency difference of > 0.2. I used the L. anna and L. melissa (west) infiles (genotype point estimates) for this. Here is the R code.
## allele frequency estimates from point genotype estimates from combine pops.
anna<-read.table("comb_lychybAutos_Anna.txt",skip=1,header=F)
anna_m<-as.matrix(anna[,-1])
anna_p<-apply(anna_m,1,mean)/2
mel<-read.table("comb_lychybAutos_MelWest.txt",skip=1,header=F)
mel_m<-as.matrix(mel[,-1])
mel_p<-apply(mel_m,1,mean)/2
## define aims as those with |p_anna - p_mel| >= 2
aims<-which(abs(anna_p-mel_p) > 0.2)
d<-(abs(anna_p-mel_p))
## this includes 1388 SNPs, most of which are fixed or nearly fixed in one population
## but variable in the other
aimSnps<-data.frame(anna[aims,1],aims)
colnames(aimSnps)<-c("mappos","snpno")
write.table(aimSnps,"aimsnps.txt",row.names=F,col.names=T,quote=F)
The file with the SNP ids is aimsnps.txt and is now in /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/entropy_aims/.
Here are the numbers of AIMs by chromosome (# LG)
104 1
68 10
72 11
68 12
67 13
50 14
42 15
64 16
54 17
34 18
38 19
109 2
29 20
28 21
14 22
81 3
90 4
79 5
94 6
79 7
76 8
48 9
2. Generate entropy infiles for the AIMs that contain the L. anna and L. melissa west populations and one of the admixed populations (thus, there will be one infile per admixed population). This should maximize comparisons with popanc. I made the 13 infiles using the perl script mkInfiles.pl in the entropy_aims sub directory. This makes the lycaims_*.gl files and starting values files (0.05 for anna, 0.95 for melissa and 0.5 for admixed):
perl mkInfiles.pl CLH &
perl mkInfiles.pl CRP &
perl mkInfiles.pl LAE &
perl mkInfiles.pl SWM &
perl mkInfiles.pl SOF &
perl mkInfiles.pl REF &
perl mkInfiles.pl EGP &
perl mkInfiles.pl STM &
perl mkInfiles.pl MTR &
perl mkInfiles.pl SOP &
perl mkInfiles.pl TIC &
perl mkInfiles.pl CSP &
perl mkInfiles.pl BKM &
3. Run entropy analyses, 13 runs with five chains each, all k = 2. I am using fork for this.
I ran this submission wrapper:
perl wrap_qsub_slurm_ent.pl
Which calls this fork script forkRunPopanc.pl for each population. Here is the content:
#!/usr/bin/perl
#
# creates a child process per chain
#
$pop = shift(@ARGV);
$idir = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/entropy_aims/infiles/";
$qdir = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/entropy_aims/startingvals/";
$odir = "/uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/entropy_aims/mcmc/";
foreach $ch (0..4){
$pid = fork;
if ($pid) { ## fork successful
$forks++;
system "/uufs/chpc.utah.edu/common/home/u6000989/bin/entropy -i $idir"."lycaims_$pop.gl -l 15000 -b 5000 -t 5 -k 2 -Q 0 -s 50 -q $qdir"."svk2_$pop.txt -o /scratch/general/lustre/ento/eo_lycaim$pop"."_"."$ch.hdf5 -w 0 -m 1\n";
system "mv /scratch/general/lustre/ento/eo_lycaim$pop"."_"."$ch.hdf5 /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/entropy_aims/mcmc/\n";
exit;
}
}
for (1 .. $forks){
$pid = wait();
print "Parent saw $pid exit\n";
}
Results will be in here: /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/entropy_aims/mcmc/
NOTE: I re-ran this with the little and big Q model. The latter probably makes more sense, but I want to see how much it matters.
4. Extract ancestry information. Posterior means can be obtained with a hdf5 utility, which I have wrapped in this perl wrapper:
perl getLocalAnc.pl
Estimates of Q from estpost:
perl runEstpost.pl