Post date: Oct 24, 2013 6:45:49 PM
I transferred a file to diogenes (diogenes:/stash/timema/zgompert/parallelismInformation.txt.gz) that contains information about parallelism in the whole genome sequence data set for all linkage group SNPs. This is the file that we will want to use to ask whether parallel change SNPs are enriched for coding regions, etc. We have three measures of parallelism, (i) consistent high Fst, (ii) consistent allele frequency differences in the same direction, and (iii) a combined score from both (i) and (ii). These are all in the file. Specifically, the file has the following data,
[1,] "lg" = linkage group
[2,] "scaf_order" = order of the scaffold within the linkage group
[3,] "scaf" = scaffold number
[4,] "pos" = position of SNP within the scaffold
[5,] "maf" = average minor allele frequency across all 8 populations
[6,] "fstCompQ" = composite Fst quantile for each SNP (this parallelism measure (i))
[7,] "fstCompQ_rank" = ranking based on fstCompQ, 1 is the SNP with the largest composite Fst quantile, i.e., most evidence of consistent high Fst
[8,] "afDifCompQ" = composite allele frequency different quantile for each SNP (this parallelism measure (ii))
[9,] "afDifCompQ_rank" = ranking based on afDifCompQ, 1 is the SNP with the largest composite allele frequency difference quantile
[10,] "combinedCompQ" = combined score (this parallelism measure (iii))
[11,] "combinedCompQ_rank" = ranking based on combined 1 is the SNP with the largest combined score
We want to test for functional enrichments based on each of these three measures of parallelism. This means we want to ask whether the annotation (for now just the intergenic, coding, non-coding classification) differs between the top parallel SNPs and the genome for each set of SNPs. We had used the top 100 SNPs, but Patrik and I thought it might be better to use a percentage cut-off. The top 0.001% of SNPs is equal to the top 44 SNPs (ranks 1-44), meaning the top 0.002% would be 88 SNPs (ranks 1-88). We might want to start with one of these cut-offs. We could even go to the top 0.005% = 220 SNPs. Maybe start with the top 88 if no one has any other thoughts (this is closest to 100).
Here is the new, relevant R code,
foldedQsumAfDif<-qsumAfDif
foldedQsumAfDif[qsumAfDif < 2]<-4-qsumAfDif[qsumAfDif < 2]
foldedQsumAfDif<-(foldedQsumAfDif-2)*2## this rescales to original range of 0..4
rankFst<-rev(order(qsum,na.last=FALSE))
rankAfD<-rev(order(foldedQsumAfDif,na.last=FALSE))
rankComp<-rev(order(qsum+foldedQsumAfDif,na.last=FALSE))
out<-cbind(locinfo,round(mafall,5),round(qsum,5),order(rankFst),round(foldedQsumAfDif,5),order(rankAfD),round(qsum+foldedQsumAfDif,5),order(rankComp))
colnames(out)<-c("lg","scaf_order","scaf","pos","maf","fstCompQ","fstCompQ_rank","afDifCompQ","afDifCompQ_rank","combinedCompQ","combinedCompQ_rank")
write.table(out,"parallelismInformation.txt",row.names=TRUE,quote=FALSE)