23andcolm‎ > ‎

Day-13 - PLINK annotation

posted Jul 20, 2011, 6:56 AM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:57 AM ]
I use PLINK alot in my day to day work. It's and excellent tool for analysis and manipulation of genomewide association datasets. I had the idea of using its annotation feature to see what I could find out. The website provides a pre-computed SNP attributes file for dbSNP v129 (I re-did this for 132 but there isn't that much difference for gene annotations). The method is described on this page. What I wanted to do is ask the following: for SNPs in my genome that are rare in the HapMap Caucasian population (as a reference), highlight any genes with interesting functional mutations. Even though any findings might not have any disease associations, I think it's a useful exercise. In addition, you can use the method to annotate however you like, e.g. other functional annotations, conserved elements, sites under selection etc.

I'm just going the process as a series of steps:

# Download the data and prepare my own
 wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/snp129.attrib.gz
 wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/glist-hg18
 grep '^rs' ../genome_Colm_O_Dushlaine_Full_20110124134637.txt | gawk '{print $2,$1,$3,"1",$4 }' > cod.snps

# manual edit - add header - CHR         SNP        BP         P

# Annotate genes, dbSNP annotations and subset on rare SNPs (1%) (the snps.1pc.tmp file I had prepared earlier, pulling out SNP genotypes that I had that were <1% in frequency in CEU)
 plink --annotate cod.snps attrib=snp129.attrib.gz ranges=glist-hg18 snps=snps.1pc.tmp --out cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu

# The functional annotations cover the following, missense being the most frequent
 gawk '{print $2 }' snp129.attrib | sort | uniq -c
    348 =frameshift
     19 =FRAMESHIFT
 203933 =missense
  30874 =MISSENSE
   7682 =nonsense
    849 =NONSENSE
    963 =splice
    297 =SPLICE
== 244765

# Look at functional categories of interest that have gene annotations. Note: these results are first-pass and the snp129.attrib file includes annotations based on a SNP being in LD with a functional SNP, i.e. it may not necessarily be the true SNP
 grep -P "nonsense|NONSENSE|frameshift|FRAMESHIFT|SPLICE|splice" cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu.annot | grep '('
1 rs969788 86689444 1 AG CLCA2(0)|=missense|=nonsense
5 rs1423414 41179513 1 AA C6(0)|=missense|=nonsense
5 rs4957377 41228756 1 CC C6(0)|=missense|=nonsense
5 rs4957152 41270160 1 CC C6(0)|=missense|=nonsense
5 rs2004385 41286190 1 AA C6(0)|=missense|=nonsense
5 rs7724199 156922535 1 AA ADAM19(0)|=missense|=nonsense
5 rs7719224 156931918 1 TT ADAM19(0)|=missense|=nonsense
5 rs11134822 156933880 1 TT ADAM19(0)|=missense|=nonsense
6 rs3778638 31200103 1 AA PSORS1C1(0)|=missense|=nonsense
6 rs2233945 31215340 1 AA PSORS1C1(0)|=missense|=nonsense
6 rs12364 31218765 1 AA CCHCR1(0)|=missense|=nonsense
6 rs9263739 31219335 1 TT CCHCR1(0)|=missense|=nonsense
6 rs9263740 31219379 1 CC CCHCR1(0)|=missense|=nonsense
6 rs130077 31230309 1 AA CCHCR1(0)|=missense|=nonsense
6 rs9263785 31233802 1 GG CCHCR1(0)|=missense|=nonsense
6 rs9263794 31237998 1 GG TCF19(0)|=missense|=nonsense
6 rs1044870 31238800 1 TT TCF19(0)|=missense|=nonsense
6 rs9263796 31240862 1 TT POU5F1(0)|=missense|=nonsense
6 rs9263800 31242578 1 AA POU5F1(0)|=missense|=nonsense
7 rs596572 4174875 1 TT SDK1(0)|=missense|=nonsense
7 rs623667 4177807 1 AA SDK1(0)|=missense|=nonsense
8 rs3213604 17770696 1 AT FGL1(0)|=frameshift|=missense
11 rs4910844 5733060 1 TT OR52N4(0)|=NONSENSE|=missense
11 rs10838637 5766053 1 GG OR52N1(0)|=missense|=nonsense
11 rs10769224 5766249 1 TT OR52N1(0)|=MISSENSE|=nonsense
11 rs10742787 5766322 1 TT OR52N1(0)|=MISSENSE|=nonsense
13 rs11568656 94512943 1 GT ABCC4(0)|=frameshift
13 rs7997839 94514336 1 AG ABCC4(0)|=frameshift

CLCA2: calcium channel: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CLCA2
C6: immune-related: http://www.genecards.org/cgi-bin/carddisp.pl?gene=C6
ADAM19: lots of roles, including neurogenesis (explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=ADAM19
PSORS1C1: unclear: http://www.genecards.org/cgi-bin/carddisp.pl?gene=PSORS1C1
CCHCR1: keratinocyte: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CCHCR1
TCF19: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=TCF19
POU5F1: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=POU5F1
SDK1: cell adhesion protein that guides axonal terminals to specific synapses in developing neurons (definitely explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=SDK1
FGL1: hepatocyte mitogenic activity: http://www.genecards.org/cgi-bin/carddisp.pl?gene=FGL1
OR52N4: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N4
OR52N1: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N1
ABCC4: organic anion pump relevant to cellular detoxification http://www.genecards.org/cgi-bin/carddisp.pl?gene=ABCC4

Comments