Post date: Dec 09, 2015 10:9:41 PM
Data were obtained from a transect along the Bavarian hybrid zone. The data come from the Dryad submission that is part of Turner and Harr (2014). SNPs were already screened for high LD and I extracted the subset of SNPs with no missing data and low LD from sample.data_noVinonoNA1_FstALL_5MAF.txtHYBS.sorted.pruned using plink. Results are in the musHzLd.* files in /labs/evolution/projects/popanc_sims/mouse/infiles. This left me with 93,699 SNPs across 23 chromosomes (no X-linked SNPs). Note that the individual IDs did not contain sample localities, so I obtained these data from Tina Harr (sample data). I then extracted the data for parental populations and 3 hybrid populations with good samples sizes using the following R script (getMusIds.R):
## parental: M. m. domesticus (P0) = SO + ST; M. m. musculus (P1) = GL + RE + RF
## hybrid pops: FS (31), HA (32), TU (21)
mus<-read.table("mouseHzTable.txt",header=T)
FS<-as.character(mus$individualID[which(mus$Siresite=='FS' & mus$Damsite=='FS')])
HA<-as.character(mus$individualID[which(mus$Siresite=='HA' & mus$Damsite=='HA')])
TU<-as.character(mus$individualID[which(mus$Siresite=='TU' & mus$Damsite=='TU')])
P0<-as.character(mus$individualID[ which((as.character(mus$Siresite)==as.character(mus$Damsite)) & (mus$Siresite=='SĂ–' | mus$Siresite=='ST'))])
P1<-as.character(mus$individualID[ which((as.character(mus$Siresite)==as.character(mus$Damsite)) & (mus$Siresite=='GL' | mus$Siresite=='RE' | mus$Siresite=='RF'))])
write.table(FS,"idsFS.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(HA,"idsHA.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(TU,"idsTU.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(P0,"idsP0.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)
write.table(P1,"idsP1.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)
Note that the best description of these populations comes from an earlier paper, which is (here, see Figure 1). Note that the parental samples sizes are low, 5 each, but that the admixed population sizes are good, 31, 32, and 21. Also, these are lab reared offspring from wild-caught individuals, and I only retained those individuals with both parents from the same site (SS in the spreadsheet).
I next wrote and ran a perl script to convert the ped genotype file to the input format for popanc. The script is called makeGenotypeFiles.pl (see below), and generated the geno* files for the parents (P0 = domesticus and P1 = musculus).
#!/usr/bin/perl
#
# generate genotype files for popanc for each of 3 admixed populations and two parental populations
#
# usage ids*txt
# read in sets of ids from ids*txt files
foreach $idf (@ARGV){
$idf =~ m/ids([A-Z0-9]+)/ or die "failed to match population for $idf\n";
$pop = $1;
$pops{$pop} = 1;
open(IN, $idf) or die "couldn't open $idf\n";
while(<IN>){
chomp;
s/^M/X/;
$mice{$_} = $pop;
}
close (IN);
}
# open outfiles for each population
foreach $pop (keys %pops){
$fh = "F"."$pop";
open ($fh, "> geno$pop".".txt") or die "failed to write for $fh\n";
$files{$pop} = $fh;
}
# read locus information
open(IN, "musHzLD.map") or die "failed to read map file\n";
while(<IN>){
chomp;
@line = split(/\s+/,$_);
$locus = "$line[0]".":"."$line[2]";
push(@lid,$locus);
}
close(IN);
# read genotype data
%gdata = ();
open(IN, "musHzLD.ped") or die "failed to read ped file\n";
while(<IN>){
chomp;
@line = split(" ",$_);
@genotypes = ();
if(defined $mice{$line[0]}){
$pop = $mice{$line[0]};
for($i=0;$i<7;$i++){## get rid of non-genotype data
shift(@line);
}
$cnt = 0;
foreach $a (@line){
$b = shift(@line);
$gen = ($a+$b-2);
push(@genotypes, $gen);
$cnt++;
}
$gen = join(" ",@genotypes);
push(@{$gdata{$pop}},$gen);
print "found one from $pop with $cnt SNPs\n";
}
}
close(IN);
foreach $pop (keys %gdata){
@gmat = ();
$ln = @{$gdata{$pop}}; ## no. inds
print {$files{$pop}} "$cnt $ln\n";
for($j=0;$j<$ln;$j++){
$gen = shift(@{$gdata{$pop}});
@gen = split(" ",$gen);
$i=0;
foreach $g (@gen){
$gmat[$i][$j] = $g;
$i++;
}
}
for($i=0; $i<$cnt; $i++){
print {$files{$pop}} "$lid[$i]";
for($j=0; $j<$ln; $j++){
print {$files{$pop}} " $gmat[$i][$j]";
}
print {$files{$pop}} "\n";
}
}
# close outfiles
foreach $pop (keys %files){
close($files{$pop});
}
I used popanc to estimate ancestry frequencies for each admixed population. I ran 3 chains, 30,000 steps each with 10,000 burnin and 5 thinning interval, and SNP windows of 15. The results will be in /labs/evolution/projects/popanc_sims/mcmc/mus/. Here is an example command:
cd /local/scratch/
sleep 10
popanc -o estpanc_musTUChain2.hdf5 -m 30000 -b 10000 -t 5 -n 15 -d 10 -s 1 -u 20 -l 0.05 -a 0.1 -z 1 /labs/evolution/projects/popanc_sims/mouse/infiles/genoP0.txt /labs/evolution/projects/popanc_sims/mouse/infiles/genoP1.txt /labs/evolution/projects/popanc_sims/mouse/infiles/genoTU.txt
scp estpanc_musTUChain2.hdf5 /labs/evolution/projects/popanc_sims/mcmc/mus/