popanc with mouse hybrid zone data

Post date: Dec 09, 2015 10:9:41 PM

Data were obtained from a transect along the Bavarian hybrid zone. The data come from the Dryad submission that is part of Turner and Harr (2014). SNPs were already screened for high LD and I extracted the subset of SNPs with no missing data and low LD from sample.data_noVinonoNA1_FstALL_5MAF.txtHYBS.sorted.pruned using plink. Results are in the musHzLd.* files in /labs/evolution/projects/popanc_sims/mouse/infiles. This left me with 93,699 SNPs across 23 chromosomes (no X-linked SNPs). Note that the individual IDs did not contain sample localities, so I obtained these data from Tina Harr (sample data). I then extracted the data for parental populations and 3 hybrid populations with good samples sizes using the following R script (getMusIds.R):

## parental: M. m. domesticus (P0) = SO + ST; M. m. musculus (P1) = GL + RE + RF

## hybrid pops: FS (31), HA (32), TU (21)

mus<-read.table("mouseHzTable.txt",header=T)

FS<-as.character(mus$individualID[which(mus$Siresite=='FS' & mus$Damsite=='FS')])

HA<-as.character(mus$individualID[which(mus$Siresite=='HA' & mus$Damsite=='HA')])

TU<-as.character(mus$individualID[which(mus$Siresite=='TU' & mus$Damsite=='TU')])

P0<-as.character(mus$individualID[ which((as.character(mus$Siresite)==as.character(mus$Damsite)) & (mus$Siresite=='SÖ' | mus$Siresite=='ST'))])

P1<-as.character(mus$individualID[ which((as.character(mus$Siresite)==as.character(mus$Damsite)) & (mus$Siresite=='GL' | mus$Siresite=='RE' | mus$Siresite=='RF'))])

write.table(FS,"idsFS.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)

write.table(HA,"idsHA.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)

write.table(TU,"idsTU.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)

write.table(P0,"idsP0.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)

write.table(P1,"idsP1.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)

Note that the best description of these populations comes from an earlier paper, which is (here, see Figure 1). Note that the parental samples sizes are low, 5 each, but that the admixed population sizes are good, 31, 32, and 21. Also, these are lab reared offspring from wild-caught individuals, and I only retained those individuals with both parents from the same site (SS in the spreadsheet).

I next wrote and ran a perl script to convert the ped genotype file to the input format for popanc. The script is called makeGenotypeFiles.pl (see below), and generated the geno* files for the parents (P0 = domesticus and P1 = musculus).

#!/usr/bin/perl

#

# generate genotype files for popanc for each of 3 admixed populations and two parental populations

#

# usage ids*txt

# read in sets of ids from ids*txt files

foreach $idf (@ARGV){

$idf =~ m/ids([A-Z0-9]+)/ or die "failed to match population for $idf\n";

$pop = $1;

$pops{$pop} = 1;

open(IN, $idf) or die "couldn't open $idf\n";

while(<IN>){

chomp;

s/^M/X/;

$mice{$_} = $pop;

}

close (IN);

}

# open outfiles for each population

foreach $pop (keys %pops){

$fh = "F"."$pop";

open ($fh, "> geno$pop".".txt") or die "failed to write for $fh\n";

$files{$pop} = $fh;

}

# read locus information

open(IN, "musHzLD.map") or die "failed to read map file\n";

while(<IN>){

chomp;

@line = split(/\s+/,$_);

$locus = "$line[0]".":"."$line[2]";

push(@lid,$locus);

}

close(IN);

# read genotype data

%gdata = ();

open(IN, "musHzLD.ped") or die "failed to read ped file\n";

while(<IN>){

chomp;

@line = split(" ",$_);

@genotypes = ();

if(defined $mice{$line[0]}){

$pop = $mice{$line[0]};

for($i=0;$i<7;$i++){## get rid of non-genotype data

shift(@line);

}

$cnt = 0;

foreach $a (@line){

$b = shift(@line);

$gen = ($a+$b-2);

push(@genotypes, $gen);

$cnt++;

}

$gen = join(" ",@genotypes);

push(@{$gdata{$pop}},$gen);

print "found one from $pop with $cnt SNPs\n";

}

close(IN);

foreach $pop (keys %gdata){

@gmat = ();

$ln = @{$gdata{$pop}}; ## no. inds

print {$files{$pop}} "$cnt $ln\n";

for($j=0;$j<$ln;$j++){

$gen = shift(@{$gdata{$pop}});

@gen = split(" ",$gen);

$i=0;

foreach $g (@gen){

$gmat[$i][$j] = $g;

$i++;

}

for($i=0; $i<$cnt; $i++){

print {$files{$pop}} "$lid[$i]";

for($j=0; $j<$ln; $j++){

print {$files{$pop}} " $gmat[$i][$j]";

}

print {$files{$pop}} "\n";

}

# close outfiles

foreach $pop (keys %files){

close($files{$pop});

}

I used popanc to estimate ancestry frequencies for each admixed population. I ran 3 chains, 30,000 steps each with 10,000 burnin and 5 thinning interval, and SNP windows of 15. The results will be in /labs/evolution/projects/popanc_sims/mcmc/mus/. Here is an example command:

cd /local/scratch/

sleep 10

popanc -o estpanc_musTUChain2.hdf5 -m 30000 -b 10000 -t 5 -n 15 -d 10 -s 1 -u 20 -l 0.05 -a 0.1 -z 1 /labs/evolution/projects/popanc_sims/mouse/infiles/genoP0.txt /labs/evolution/projects/popanc_sims/mouse/infiles/genoP1.txt /labs/evolution/projects/popanc_sims/mouse/infiles/genoTU.txt

scp estpanc_musTUChain2.hdf5 /labs/evolution/projects/popanc_sims/mcmc/mus/