PCA and genotype freqs. for non-T. cristinae species

Post date: Mar 15, 2017 3:46:58 PM

In T. cristinae we see an excess of m/g heterozygotes (u or s, whichever is more common). We are beginning to check for similar signals in other species, focusing to the extent possible on the portion of LG8 where the GWA signal for stripe is the greatest.

The data I am using for this are from Romain and on diogenes (I have copied some over). Here is what he said:

What I was starting from for PCAs are the posterior genotype probabilities (same input file than gemma .geno file). I don't have workspaces loaded with this data but everything (input files scripts and results) is stored and shall be easy to access on diogenes:

for T. californicum, T. chumash, T.curi and T. poppensis :

/stash/timema/gwas_trad/10_gwas/

for T. bartmani (different population, but I mainly used JL for latter graphics), T. knulli, T. landelsensis and T. petita

/stash/timema/gwas_trad_Patrik_clines/10_gwas/

for 2013FHA t.cristinae and T. podura:

/stash/timema/2013fha_gwas/draft_1.3b2/10_gwas/

for n1 T. cristinae:

/stash/timema/Farkas_project/timema/draft_1.3b2/07_gwas/

I started with two T. bartmani populations (JL and PCT) where the phenotypes are green, red or red-headed.

First for JL:

I ran a PCA on scaffold 128 5mb-6mb, which corresponds with the indel/gwa signal. The first two PCs explain > 50% of the variation, so with just this region they are really capturing the story (pca plot). Colors denote phenotypes and numbers clusters based on kmeans clustering with k = 6. Based on Romain's PCA, I think orange = read, green = green, and blue = red-head. Overall the clustering worked well, but I would have expected the split between clusters 5 and 6 to carve off just the corner as a different cluster. The algorithm disagrees, in part I think because there are so few in what I would call the homozygous red-head cluster. Because of this (and because it is where we have good numbers), I focused on clusters 2-4, which cover the g/g, g/r, and r/r genotypes. Here are the observed and expected (HW) for these (normalized to sum to 1, ignoring clusters 1, 5 and 6):

obs hw

g/r 0.45 0.50

r/r 0.26 0.23

g/g 0.29 0.27

So, absolutely no excess of hets., and in fact a slight (and marginally significant with p = 0.043 from binomial sampling) deficit of g/r hets. This is of course just one population and I think there is some ambiguity in even the boundaries of these clusters, but clearly in this case the story is not the same as for T. cristinae.

I next tried JL (all files are in /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_fluct/tim_sp_pca/Tbartmani_PCT/, the JL ones are in a different sub-directory but are currently incomplete as I ran this from my laptop).

Similar to JL, three cluster for PCT are well defined, the others less so. In the PCA plot clutser 3 = g/g, 2 = g/r and 5 =r/r. I focused on these. You have near perfect HW expectations, with a slight (probably trivial) deficit of hets.

obs hw obs-hw

g/g 0.2267442 0.2217753 0.004968902

g/r 0.4883721 0.4983099 -0.009937804

r/r 0.2848837 0.2799148 0.004968902

So, similar to JL and again quite distinct from T. cristinae.

Here is the R code (from commands.R):

## read data

G<-as.matrix(read.table("indel_TbartPCTGenSc128.txt",header=F))

pc<-prcomp(t(G),center=TRUE,scale=FALSE)

#Importance of components:

# PC1 PC2 PC3 PC4 PC5 PC6 PC7

#Standard deviation 2.5094 2.1373 1.6688 1.03913 0.96081 0.80385 0.78977

#Proportion of Variance 0.2364 0.1715 0.1046 0.04054 0.03466 0.02426 0.02342

#Cumulative Proportion 0.2364 0.4079 0.5125 0.55300 0.58766 0.61191 0.63533

library(MASS)

## kmeans clustering

ko<-kmeans(x=pc$x[,1:2],centers=6,iter.max=100,nstart=100)

plot(pc$x[,1],pc$x[,2],type='n')

text(pc$x[,1],pc$x[,2],ko$cluster)

## some noise, but core homos are 3 and 5 and hets are 2

## note, don't have proper phenotype data, or it doesn't match

gfrq<-c(sum(ko$cluster==3),sum(ko$cluster==2),sum(ko$cluster==5))

n<-sum(gfrq)

gfrq<-gfrq/n

p<-gfrq[1]+0.5*gfrq[2]

hw<-c(p^2,2*p*(1-p),(1-p)^2)

o<-cbind(gfrq,hw,gfrq-hw)

rownames(o)<-c("g/g","g/r","r/r")

colnames(o)<-c("obs","hw","obs-hw")

pdf("pcaTbartPCT.pdf",width=6,height=6)

par(mar=c(5,5,0.5,0.5))

plot(pc$x[,1],pc$x[,2],type='n',xlab="PC1",ylab="PC2",cex.lab=1.4,cex.axis=1.2)

text(pc$x[,1],pc$x[,2],ko$cluster)

dev.off()

Page updated

Google Sites

Report abuse