23andcolm

The idea here is to keep track of projects carried out using my genotypes from 23andMe. So far, I have data on ancestry, traits, carrier status and disease risk, but it will be interesting to see how I can add to what 23andMe currently provides.

Of most interest are:
- imputation to increase the number of SNPs
- polygenic score: where do I fall along the distribution of polygenic loading for e.g. schizophrenia?
- various ways to explore the most damaging SNPs I have, e.g. polyphen2, SIFT
- runs of homozygosity analysis
- do I have any interesting copy-number variants?

I will add to these ideas when/if I come up with more.

Please get in touch if you would like any advice on playing around with your own 23andMe data: colm _at_ broadinstitute.org

23andGG

posted Sep 5, 2013, 1:05 PM by Colm O'Dushlaine

Check out this awesome post by my colleague Giulio Genovese: http://apol1.blogspot.com/2013/08/impute-apoe-and-apol1-with-23andme.html 
He provides some neat tricks for rapidly getting information on variants that a dataset does not cover.

Day 15 - More CNVs

posted Oct 24, 2011, 4:36 PM by Colm O'Dushlaine   [ updated Oct 24, 2011, 4:44 PM ]

I found a list of CNVs reported by the Wellcome Trust a while ago (http://www.wtccc.org.uk/wtcccplus_cnv/supplemental.shtml). This study was interesting, even though it didn't find anything amazing. The last line of the abstract says it all really (!): "We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases".
...nevertheless, I decided to check if I have any SNPs that are (a) rare in HapMap CEU and (b) tag CNVs.

Very simply, I obtained minor allele frequencies for my genotypes from HapMap and kept those at <=10%. I then cross-referenced these with the CNV table listed here. I took care to use the bestLocalIlluminaTagRSID column because this is the platform my data is on. Clearly this is a quick-and-dirty analysis. For example, I could perhaps have expanded to a list of imputed genotypes. Anyway, long story short, I found just 3 genes within CNVs tagged by rare genotypes in my genome:
(hg18)
          Gene CTGLF3 chr 10 51418083 51418083 within CNVR4732.1 chr10:51403541-51447456 loss rs17178655
                http://www.genecards.org/cgi-bin/carddisp.pl?gene=AGAP6
                 - Putative GTPase-activating protein .. low expression everywhere, so probably not interesting

          Gene OR52N5 chr 11 5755465 5755465 within CNVR5049.1 chr11:5741435-5765861 gain rs1453415
                - Olfactory receptor, we know these vary in all sorts of ways

          Gene CHST5 chr 16 74119928 74119928 within WTCCC1_181 chr16:74096937-74132230 unknown rs11149841
                - Catalyzes the transfer of sulfate to position 6 of non-reducing N-acetylglucosamine (GlcNAc) residues and O-linked sugars of mucin-type acceptors.
                - not sure about this. It doesn't interact with anything according to http://string-db.org. Maybe a connection with poor eyesight, http://www.ncbi.nlm.nih.gov/pubmed/21440637 http://www.ncbi.nlm.nih.gov/pubmed/16938851 (though I might be clutching at straws here). Some interesting pharmacogenetic possibilities also, http://www.ncbi.nlm.nih.gov/pubmed/11829137

That's all I could find! Besides, this is not an ideal CNV dataset given the lack of any convincing disease association.
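
For anyone wanting to repeat the cross-referencing step, here is a minimal sketch in Python. It is not the code I ran: it assumes the CNV table has been saved as tab-separated text with a header, that frequencies are already in a dictionary keyed by rsID, and the `cnv_id` column and the toy rsIDs are made up for illustration. Only the bestLocalIlluminaTagRSID column name comes from the table itself.

```python
import csv
import io


def rare_cnv_tags(my_rsids, maf_by_rsid, cnv_rows, max_maf=0.10,
                  tag_column="bestLocalIlluminaTagRSID"):
    """Return CNV rows whose Illumina tag SNP I carry and which is rare in the reference."""
    hits = []
    for row in cnv_rows:
        tag = row.get(tag_column, "")
        maf = maf_by_rsid.get(tag)
        # Keep the row only if I have the tag SNP and a frequency is known and low enough.
        if tag in my_rsids and maf is not None and maf <= max_maf:
            hits.append(row)
    return hits


# Toy CNV table (hypothetical IDs), standing in for the downloaded supplementary table.
example_table = io.StringIO(
    "cnv_id\tbestLocalIlluminaTagRSID\n"
    "CNV_A\trs1\n"
    "CNV_B\trs2\n"
    "CNV_C\trs3\n"
)
example_rows = list(csv.DictReader(example_table, delimiter="\t"))
example_hits = rare_cnv_tags(
    my_rsids={"rs1", "rs2"},
    maf_by_rsid={"rs1": 0.05, "rs2": 0.40, "rs3": 0.01},
    cnv_rows=example_rows,
)
print([row["cnv_id"] for row in example_hits])
```

In the toy data only CNV_A survives: rs2 is too common and rs3 is not in my genotype set.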


Day 14 - CNVs

posted Jul 22, 2011, 7:54 AM by Colm O'Dushlaine   [ updated Jul 22, 2011, 8:56 AM ]

I am keen to look at my CNVs. Unfortunately, it isn't that easy. Ideally, I'd like my BeadStudio file from 23andMe and then I would run Kai Wang's PennCNV to call CNVs. I would then look up PubMed and the Database of Genomic Variants to get a feel for the potentially damaging effects of any CNVs I have.
...but I don't have the files. There's a great article at http://www.genomesunzipped.org/2010/08/dude-where-are-my-copy-number-variants.php that goes through 2 interesting methods to detect CNVs from array data. I'm in the process of convincing my parents to do 23andMe, which would give me one way to look for CNVs. I was also intrigued by the imputation method described at the above link. In the end, however, I just did a really quick-and-dirty analysis using an April 2010 Conrad et al. article. The table I used is here. It summarizes CNVs that are highly correlated with SNPs and likely to be the causal variant. So I pulled these SNPs from the table and searched for them in my genome, taking care to check strandedness (SNPedia).

- e.g. pull out SNPs from my genome
  grep -wE "rs10492927|rs11809207|..." mygenome
- look at the genotypes (not all SNPs were in there)
- how frequent in a Caucasian population? Does this tell me anything about which of my SNPs, if any, are likely to tag deleterious CNVs?
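
The lookup step above can also be sketched in Python. This assumes the usual 23andMe raw layout of whitespace-separated rsid, chromosome, position and genotype columns (the same columns the gawk command on Day 13 relies on). The function names and the toy rows are made up for illustration. Unlike a plain substring grep, a dictionary lookup matches rsIDs exactly, so rs111 does not also pull out rs1111.

```python
def load_genotypes(lines):
    """Map rsID -> genotype for every data line that starts with 'rs'."""
    genotypes = {}
    for line in lines:
        if not line.startswith("rs"):
            continue  # skip comment lines and non-rs identifiers
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        rsid, _chrom, _pos, genotype = fields[:4]
        genotypes[rsid] = genotype
    return genotypes


def lookup(genotypes, wanted):
    """Return the genotype for each wanted rsID, or None if it is not on the chip."""
    return {rsid: genotypes.get(rsid) for rsid in wanted}


# Toy rows (hypothetical IDs and genotypes), standing in for the raw data file.
example_lines = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs111\t1\t1000\tAG",
    "rs1111\t1\t2000\tCC",
    "i5000\t2\t3000\tTT",
]
example_genotypes = load_genotypes(example_lines)
print(lookup(example_genotypes, ["rs111", "rs999"]))
```

SNPs that come back as None are simply not in the file, which is what I meant by "not all SNPs were in there".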

In general, most of my genotypes were common in HapMap CEU. So even though these are associated with some traits, they don't appear to be under major selective constraint. However, three SNPs that I have were relatively rare in HapMap CEU: rs9291683, rs3129934 and rs2705293. These are associated with bone mineral density, multiple sclerosis and schizophrenia respectively, and my genotypes have frequencies of about 20%, 5% and 15%. So the multiple sclerosis variant seems to be potentially the most damaging. (Interesting that I have strong bones though!) The reported gene is HLA-DRB1, and the risk increase is 3.3-fold, based on SNPedia.

Now, based on 23andMe data, I have a 0.2% risk of MS, which is roughly the same as the average, albeit with a slight decrease in risk of 0.77x (0.34 per 100 on average, 0.2 with my genotype). Note that this estimate comes from 2 SNPs, one in IL7RA and one in HLA-DRB1. Even though HLA-DRB1 is involved in both cases, the references are different. The 23andMe results are pretty good and have a 4-star rating, with multiple reference sources from large populations. The SNPedia reference here was for work done on a reasonably small cohort of 242 individuals, so it is less likely to be as credible (though they did replicate their findings).

So it's nice to get stuck into your data in this way and to consider some of the literature that might not be directly apparent when you first look at your results. I have protective and risk variants for MS, some in the same gene. This is interesting but, given the odds ratios, it represents only very modest effects on risk.




Day 13 - PLINK annotation

posted Jul 20, 2011, 6:56 AM by Colm O'Dushlaine   [ updated Jul 20, 2011, 7:33 AM ]

I use PLINK a lot in my day-to-day work. It's an excellent tool for analysis and manipulation of genome-wide association datasets. I had the idea of using its annotation feature to see what I could find out. The website provides a pre-computed SNP attributes file for dbSNP v129 (I re-did this for 132 but there isn't that much difference for gene annotations). The method is described on this page. What I wanted to do is ask the following: for SNPs in my genome that are rare in the HapMap Caucasian population (as a reference), highlight any genes with interesting functional mutations. Even though any findings might not have any disease associations, I think it's a useful exercise. In addition, you can use the method to annotate however you like, e.g. other functional annotations, conserved elements, sites under selection etc.

I'll just go through the process as a series of steps:

# Download the data and prepare my own
 wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/snp129.attrib.gz
 wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/glist-hg18
 grep '^rs' ../genome_Colm_O_Dushlaine_Full_20110124134637.txt | gawk '{print $2,$1,$3,"1",$4 }' > cod.snps

# manual edit - add header - CHR         SNP        BP         P

# Annotate genes, dbSNP annotations and subset on rare SNPs (1%) (the snps.1pc.tmp file I had prepared earlier, pulling out SNP genotypes that I had that were <1% in frequency in CEU)
 plink --annotate cod.snps attrib=snp129.attrib.gz ranges=glist-hg18 snps=snps.1pc.tmp --out cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu

# The functional annotations cover the following, missense being the most frequent
 gunzip -c snp129.attrib.gz | gawk '{print $2 }' | sort | uniq -c
    348 =frameshift
     19 =FRAMESHIFT
 203933 =missense
  30874 =MISSENSE
   7682 =nonsense
    849 =NONSENSE
    963 =splice
    297 =SPLICE
== 244765

# Look at functional categories of interest that have gene annotations. Note: these results are first-pass and the snp129.attrib file includes annotations based on a SNP being in LD with a functional SNP, i.e. it may not necessarily be the true SNP
 grep -iE "nonsense|frameshift|splice" cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu.annot | grep '('
1 rs969788 86689444 1 AG CLCA2(0)|=missense|=nonsense
5 rs1423414 41179513 1 AA C6(0)|=missense|=nonsense
5 rs4957377 41228756 1 CC C6(0)|=missense|=nonsense
5 rs4957152 41270160 1 CC C6(0)|=missense|=nonsense
5 rs2004385 41286190 1 AA C6(0)|=missense|=nonsense
5 rs7724199 156922535 1 AA ADAM19(0)|=missense|=nonsense
5 rs7719224 156931918 1 TT ADAM19(0)|=missense|=nonsense
5 rs11134822 156933880 1 TT ADAM19(0)|=missense|=nonsense
6 rs3778638 31200103 1 AA PSORS1C1(0)|=missense|=nonsense
6 rs2233945 31215340 1 AA PSORS1C1(0)|=missense|=nonsense
6 rs12364 31218765 1 AA CCHCR1(0)|=missense|=nonsense
6 rs9263739 31219335 1 TT CCHCR1(0)|=missense|=nonsense
6 rs9263740 31219379 1 CC CCHCR1(0)|=missense|=nonsense
6 rs130077 31230309 1 AA CCHCR1(0)|=missense|=nonsense
6 rs9263785 31233802 1 GG CCHCR1(0)|=missense|=nonsense
6 rs9263794 31237998 1 GG TCF19(0)|=missense|=nonsense
6 rs1044870 31238800 1 TT TCF19(0)|=missense|=nonsense
6 rs9263796 31240862 1 TT POU5F1(0)|=missense|=nonsense
6 rs9263800 31242578 1 AA POU5F1(0)|=missense|=nonsense
7 rs596572 4174875 1 TT SDK1(0)|=missense|=nonsense
7 rs623667 4177807 1 AA SDK1(0)|=missense|=nonsense
8 rs3213604 17770696 1 AT FGL1(0)|=frameshift|=missense
11 rs4910844 5733060 1 TT OR52N4(0)|=NONSENSE|=missense
11 rs10838637 5766053 1 GG OR52N1(0)|=missense|=nonsense
11 rs10769224 5766249 1 TT OR52N1(0)|=MISSENSE|=nonsense
11 rs10742787 5766322 1 TT OR52N1(0)|=MISSENSE|=nonsense
13 rs11568656 94512943 1 GT ABCC4(0)|=frameshift
13 rs7997839 94514336 1 AG ABCC4(0)|=frameshift
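
To get from this output to a per-gene list, something like the following Python sketch would do. It assumes each line has the six whitespace-separated fields shown above, with the gene and the `=label` annotations packed into the last field and separated by `|`. The function name is made up, and this is not part of PLINK.

```python
from collections import defaultdict


def genes_with_annotations(annot_lines, wanted=("nonsense", "frameshift", "splice")):
    """Group rsIDs by gene for lines carrying at least one wanted annotation label."""
    by_gene = defaultdict(list)
    for line in annot_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # not a data line in the expected layout
        rsid = fields[1]
        parts = fields[5].split("|")
        # Labels look like "=missense" or "=NONSENSE"; compare case-insensitively.
        labels = {p.lstrip("=").lower() for p in parts if p.startswith("=")}
        if not labels.intersection(wanted):
            continue
        for part in parts:
            # Gene entries look like "CLCA2(0)".
            if not part.startswith("=") and "(" in part:
                by_gene[part.split("(")[0]].append(rsid)
    return dict(by_gene)


example_lines = [
    "1 rs969788 86689444 1 AG CLCA2(0)|=missense|=nonsense",
    "8 rs3213604 17770696 1 AT FGL1(0)|=frameshift|=missense",
    "13 rs11568656 94512943 1 GT ABCC4(0)|=frameshift",
    "13 rs7997839 94514336 1 AG ABCC4(0)|=frameshift",
]
example_genes = genes_with_annotations(example_lines)
for gene, rsids in example_genes.items():
    print(gene, ",".join(rsids))
```

The same caveat applies as for the grep: a label here may reflect LD with a functional SNP, not the SNP itself.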

CLCA2: calcium channel: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CLCA2
C6: immune-related: http://www.genecards.org/cgi-bin/carddisp.pl?gene=C6
ADAM19: lots of roles, including neurogenesis (explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=ADAM19
PSORS1C1: unclear: http://www.genecards.org/cgi-bin/carddisp.pl?gene=PSORS1C1
CCHCR1: keratinocyte: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CCHCR1
TCF19: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=TCF19
POU5F1: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=POU5F1
SDK1: cell adhesion protein that guides axonal terminals to specific synapses in developing neurons (definitely explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=SDK1
FGL1: hepatocyte mitogenic activity: http://www.genecards.org/cgi-bin/carddisp.pl?gene=FGL1
OR52N4: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N4
OR52N1: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N1
ABCC4: organic anion pump relevant to cellular detoxification: http://www.genecards.org/cgi-bin/carddisp.pl?gene=ABCC4

Day 12 - Promethease revisited

posted Jul 14, 2011, 1:00 PM by Colm O'Dushlaine   [ updated Jul 19, 2011, 7:21 AM ]

I mentioned this on day 5. It is pretty good for Windows users, but slow. I wrote a script that emulates the process and runs quite quickly on UNIX. 23andMe genotypes are all on the forward strand, so when the SNPedia genotypes are on the minus strand the script reverse complements the alleles before comparing. Note: Promethease does this also. So, in summary:

- loop over all 23andme SNPs
- check if any genotype information in SNPedia
- [if yes] ensure genotypes are matched for strand and reverse complement if necessary
- [if yes] parse out disease/association annotation from wiki data

I've set the script up to only give me an output when there is something interesting for my genotype. Example of output is:

 snp_and_genotype strand magnitude disease_annotations
 rs1000113(C;C) plus na [ normal ]
 rs10008492(C;C) plus na [ normal ]
 rs10033464(G;G) plus 0 }} [ norm ]
 rs1004819(C;T) minus na [ 1.5x_risk ]
 rs10050860(C;C) plus 0  [ normal_risk ]
 rs10086908(T;T) plus 0  [ normal_risk ]
 rs10090154(C;C) plus 0  [ normal ]
 rs1010(A;G) minus na [ 1.75x_risk_of_MI ]
 rs1012729(A;A) plus na [ normal ]
 rs10134944(C;C) plus 0  [ normal ]
 rs1015362(A;G) minus na [ 2-4x_higher_risk_of_sun_sensitivity_if_part_of_risk_haplotype ]
 ...

This is crude, but a quick look:

 grep -vP "common|normal|norm|None" genome_Colm_O_Dushlaine_Full_20110124134637.txt.out | gawk '$2 != "?" { print }' | more


Most interesting here are:
  • rs11983225(T;T), http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs11983225 shows that T/T is at a 0% frequency in CEU. It seems to be associated with a 7x reduced likelihood of responding to certain antidepressants. So that's not good if ever I need them! Also http://www.snpedia.com/index.php/Rs7787082(G;G)
  • rs1570360(A;A): 3x increased risk of sudden infant death syndrome. I'm 30 so ok I think
  • Confirmation of many phenotypes (taste bitter, blue eyes etc.) reported by 23andMe
  • Some cardiac and stroke-related SNPs. These are mostly common and I would expect this given the incidence in the Irish population
  • rs2270641(G;G), which is pretty rare in CEU and increases schizophrenia risk 3.7-fold. That study was small, however, and we now know that a lot of SNPs are implicated
  • rs2834167(A;A): 2.67x_increased_risk_for_systemic_sclerosis. Might have something to do with my Raynaud's-like symptoms
  • rs4606(C;C): complex association with anxiety disorders. I definitely have anxiety, but this association is not a simple one
  • Some variants associated with neuroblastoma. This is a condition generally restricted to children (~2% of cases in adults (>18yrs))
  • rs6596075(C;C): 2x risk of Crohn's disease
  • rs7442295(A;A): increased risk of hyperuricemia, but this is common in Europeans, see http://www.snpedia.com/index.php/Rs7442295
  • rs762551(A;A): CYP1A2*1F_homozygote;_Fast_metabolizer (of caffeine). Definitely true!
Overall, many of these variants are well covered by 23andMe, but it is certainly nice to do something like this as it makes the sources more transparent I think. You can go straight to SNPedia and look at the papers referenced. It also has a nice simple digest of the implicated variants.

So this approach works well I think, though it is a very much stripped-down version of Promethease. Apologies to Mike Cariaso for pestering him so much when I was working out how to query SNPedia properly. Get in touch if you want the script, or download it here. You are free to use it in a non-commercial context.

Day 11 - population genetics

posted Jul 14, 2011, 12:47 PM by Colm O'Dushlaine   [ updated Jul 14, 2011, 12:56 PM ]

My colleagues - Jim Wilson and Angelika Kritz - did this for me a few months ago. It's a cool service they provide as part of ethnoancestry.com. As expected, it confirms my status as European and the proportion of my genome within runs of homozygosity doesn't appear to be too high, which is nice to know. Check out the attachment for details.

Day 10 - NHGRI GWAS catalogue

posted Jul 13, 2011, 7:44 AM by Colm O'Dushlaine   [ updated Jul 13, 2011, 8:04 AM ]

There is another nice resource out there called the catalog of published genome-wide association studies, available at http://www.genome.gov/gwastudies/. It has its limitations, but is a nice list of associations that we can be pretty confident about. So what I decided to do here was to compare my 23andMe results with this list.

I first made a "light" version of the download (www.genome.gov/admin/gwascatalog.txt), selecting fields of interest - positions, gene-names, control frequency:

Adiposity       13      80959207        SPRY2   SPRY2 - ARF4P4  rs534870-A      rs534870        0.68
Adiposity       16      53816275        FTO     FTO     rs8050136-C     rs8050136       0.60
Schizophrenia   19      42066279        ATP5SL, CEACAM21        PLEKHA3P1 - CEACAM21    rs4803480-A     rs4803480       0.13
...

I wanted some Caucasian minor allele frequency data also (get_maf.pl is attached)

 wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/hapmapSnpsCEU.txt.gz
 gunzip hapmapSnpsCEU.txt.gz
 perl get_maf.pl > ceu.maf
 grep -v arning ceu.maf > ceu.maf.clean

I then ran parse_gwas_catalogue.pl (attached) to make a file called gwascatalogue.res.txt

 perl parse_gwas_catalogue.pl gwascatalog.txt.lite


I wanted SNPs that were rare. The file contains both the control frequency from each study and the CEU frequency. Here, I'm pulling out SNPs with a frequency of <=2.5% in CEU (I think this is more accurate than the frequencies listed in the studies).

 gawk -F "\t" '$11 <= 0.025 {print  }' gwascatalogue.res.txt

disease chr     pos     geneP..down     snp_allelesnp   risk_all_cont   cod_geno        dose    ceu_maf
Type 2 diabetes 9       10430602        PTPRD   PTPRD   rs649891-C      rs649891        0.35    CC      2       0.0167
Optic disc parameters   1       92077097        CDC7, TGFBR3    RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       NR      GG      2       0.0167
Parkinson's disease     17      44828931        MAPT    NSF     rs199533-C      rs199533        0.78    AA      0       0.0167
Non-alcoholic fatty liver disease histology (AST)       9       78425925        Intergenic      OSTF1 - PCSK5   rs12344488-A    rs12344488      0.07    AA      2       0.0169
Optic disc parameters   1       92077097        CDC7,TGFBR3     RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       0.18    GG      2       0.0167
Optic disc size (disc)  1       92077097        HSP90B3P        RPL39P13 - HSP90B3P     rs1192415-A     rs1192415       NR      GG      0       0.0167
Ankylosing spondylitis  5       96129512        ERAP1   ERAP1   rs27434-A       rs27434 0.23    AA      2       0.0167
Parkinson's disease     17      44828931        NSF     NSF     rs199533-C      rs199533        0.83    AA      0       0.0167
Parkinson's disease     17      43719143        MAPT,  C17orf69, KIAA1267, LOC644246, IMP5      C17orf69        rs393152-A      rs393152        0.82    GG      0       0.0167
Obesity (extreme)       10      37982097        ZNF248  MTRNR2L7 - TLK2P2       rs7474896-T     rs7474896       0.14    TT      2       0.0167
Rheumatoid arthritis    6       138002637       TNFAIP3, OLIG3  OLIG3 - TNFAIP3 rs10499194-C    rs10499194      0.71    TT      0       0.0167

The third-last column (e.g. 'CC') is my genotype, and the 'dose' field is the number of copies of the risk allele that I have. So, of the above results, I think the following are most interesting:

Type 2 diabetes 9       10430602        PTPRD   PTPRD   rs649891-C      rs649891        0.35    CC      2       0.0167
Optic disc parameters   1       92077097        CDC7, TGFBR3    RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       NR      GG      2       0.0167
Non-alcoholic fatty liver disease histology (AST)       9       78425925        Intergenic      OSTF1 - PCSK5   rs12344488-A    rs12344488      0.07    AA      2       0.0169
Optic disc parameters   1       92077097        CDC7,TGFBR3     RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       0.18    GG      2       0.0167
Ankylosing spondylitis  5       96129512        ERAP1   ERAP1   rs27434-A       rs27434 0.23    AA      2       0.0167
Obesity (extreme)       10      37982097        ZNF248  MTRNR2L7 - TLK2P2       rs7474896-T     rs7474896       0.14    TT      2       0.0167

I have bad eyesight (very bad!) so I'm intrigued by the optic disc parameters findings! The obesity/T2D findings etc. are less interesting, as these are common traits with complex genetic bases.
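
For reference, the 'dose' column in the tables above is just a count of risk alleles in my genotype. Here is a minimal Python sketch of that calculation. It is an illustration and not the attached Perl script: the function names are made up, and it assumes the catalogue's risk allele and my genotype are reported on the same strand.

```python
def parse_snp_allele(snp_allele):
    """Split a catalogue entry such as 'rs649891-C' into (rsid, risk_allele)."""
    rsid, _, allele = snp_allele.rpartition("-")
    return rsid, allele


def risk_allele_dose(genotype, risk_allele):
    """Count how many copies of the risk allele a two-letter genotype carries."""
    return genotype.count(risk_allele)


# Rows taken from the table above: (snp_allele, my genotype).
examples = [("rs649891-C", "CC"), ("rs199533-C", "AA"), ("rs1192415-A", "GG")]
for snp_allele, genotype in examples:
    rsid, allele = parse_snp_allele(snp_allele)
    print(rsid, allele, genotype, risk_allele_dose(genotype, allele))
```

These reproduce the doses of 2, 0 and 0 listed for those rows.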


Day 9 - OMIM

posted Jul 11, 2011, 9:10 AM by Colm O'Dushlaine   [ updated Jul 11, 2011, 11:07 AM ]

Can we identify any SNPs, predicted to be damaging, in OMIM? This would point to any Mendelian traits/disorders I might have. There is a file in dbSNP that is useful for this - ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/OmimVarLocusIdSNP.bcp.gz. The file links OMIM IDs to SNP IDs.

In UNIX:

  wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/OmimVarLocusIdSNP.bcp.gz
 gunzip OmimVarLocusIdSNP.bcp.gz
 cut -f 1,9 OmimVarLocusIdSNP.bcp | gawk '{print "rs"$2,$1 }' | sort > snps_2_omim.txt

Then, sort my missense SNPs and join both lists. None of my rare missense SNPs matched, so I checked the larger list of all 476 missense SNPs:

 join all_missense.sorted.txt snps_2_omim.txt 
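
The same join can be sketched in Python, which avoids having to pre-sort both files. This assumes, as the cut command above does, that column 1 of the .bcp file holds the OMIM ID and column 9 holds the SNP number without the "rs" prefix. The function names are made up, and the filler columns in the toy rows are placeholders, not real file content.

```python
from collections import defaultdict


def load_snp_to_omim(bcp_lines):
    """Map 'rs<number>' -> set of OMIM IDs from tab-separated .bcp lines."""
    mapping = defaultdict(set)
    for line in bcp_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip short or malformed lines
        omim_id, snp_number = fields[0], fields[8]
        mapping["rs" + snp_number].add(omim_id)
    return mapping


def omim_hits(my_rsids, mapping):
    """Return sorted (rsid, omim_id) pairs for my SNPs that have OMIM entries."""
    return sorted((rsid, omim) for rsid in my_rsids for omim in mapping.get(rsid, ()))


def toy_line(omim_id, snp_number):
    # Columns 2-8 are placeholders; only columns 1 and 9 matter here.
    return "\t".join([omim_id] + ["x"] * 7 + [snp_number])


example_bcp = [
    toy_line("607751", "10246939"),
    toy_line("122700", "2108622"),
    toy_line("604426", "2108622"),
]
example_mapping = load_snp_to_omim(example_bcp)
example_pairs = omim_hits({"rs10246939", "rs2108622", "rs999"}, example_mapping)
for rsid, omim in example_pairs:
    print(rsid, omim)
```

One SNP can map to more than one OMIM entry, which is why rs2108622 appears twice.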

You can then just look up the OMIM entries listed, e.g. http://omim.org/entry/607751 

rs10246939 607751 <-- This polymorphism, in conjunction with other SNPs in the gene, gives rise to the ability to taste or not taste phenylthiocarbamide
rs1049550 612388 <--  strongly associated with sarcoidosis
rs1051740 132810 <-- LYMPHOPROLIFERATIVE DISORDERS, SUSCEPTIBILITY TO
rs1126809 601800 <-- association with skin sensitivity to sun (p = 7.1 x 10(-13)) and blue versus green eye color (p = 4.6 x 10(-21)).
rs1126809 606933 <-- blue eye colour
rs1154510 609695 <-- HAWKINSINURIA (autosomal dominant inborn error of metabolism) (elevated levels of blood tyrosine and massive excretion of tyrosine derivatives into urine)
rs2108622 122700 <-- resistance to warfarin
rs2108622 604426 <-- resistance to warfarin
rs288326 605083 <-- OSTEOARTHRITIS SUSCEPTIBILITY 1 (strong risk factor for primary osteoarthritis of the hip in females)
rs3775291 603029 <-- protective against the development of geographic atrophy or advanced dry age-related macular degeneration
rs3775291 603075 <-- associated with protection from progression to geographic atrophy in patients with age-related macular degeneration
rs4673 608508 <-- cardiovascular disease related
rs4917 138680 <-- an increased risk for leanness (OR, 1.90; p = 0.027)
rs602662 182100 <-- ...genetic factors affecting circulating vitamin B12 levels and identified rs602662 in the FUT2 gene
rs602662 612542 <-- vit B12
rs61630004 602767 <-- ECTODERMAL DYSPLASIA, 'PURE' HAIR-NAIL TYPE
rs7080536 603924 <-- CAROTID STENOSIS, SUSCEPTIBILITY TO (basically a heart disease risk factor)
rs7133914 609007 <-- Decreased susceptibility to Parkinson Disease
rs7308720 609007 <-- Decreased susceptibility to Parkinson Disease

So, overall I think there is some interesting stuff in there, although a lot of this is covered nicely by 23andMe already.

Day 8 - SIFT, GRAIL and STRING

posted Jul 11, 2011, 8:11 AM by Colm O'Dushlaine   [ updated Jul 11, 2011, 8:40 AM ]

I wouldn't put too much emphasis on these, but I did a quick SIFT and GRAIL analysis using only the roughly 476 missense SNPs reported in my 23andMe data (I got these by just loading all the data into PolyPhen-2). SIFT was roughly concordant with PolyPhen-2, and I made a spreadsheet of those SNPs reported by both PolyPhen-2 and SIFT as being deleterious. Some of the results are interesting (as reported in Day 7). The olfactory receptor findings are not surprising. This is highlighted when I put these SNPs/genes into STRING (see image below): we see loads of connections between olfactory genes with missense mutations. Note though that I did not filter out common mutations here; this is the entire list of 476 SNPs. For the SIFT and PolyPhen-2 analyses, I sorted by rarest in HapMap CEU first. Genes with <5% CEU frequency and predicted to be probably damaging by SIFT and PolyPhen-2 are below:

C13orf26 
YSK4 Sps1/Ste20-related kinase homolog
ZKSCAN2 zinc finger with KRAB and SCAN domains 2 
KIAA0564 
LY6G5B lymphocyte antigen 6 complex, locus G5B 
Carboxypeptidase N, polypeptide 2   
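
I built this list in a spreadsheet, but the filter is simple enough to sketch in Python. Everything below is illustrative: the function name, the toy rsIDs and the exact label strings ("probably damaging", "deleterious") are assumptions, so check them against what your own PolyPhen-2 and SIFT output files contain. SNPs with no CEU frequency are treated as common and dropped.

```python
def concordant_rare(pph2_calls, sift_calls, ceu_freq, max_freq=0.05):
    """Return rsIDs called damaging by both tools and below max_freq in CEU."""
    both = {
        rsid
        for rsid, call in pph2_calls.items()
        if call == "probably damaging" and sift_calls.get(rsid) == "deleterious"
    }
    # A missing frequency defaults to 1.0, so such SNPs never pass the filter.
    return sorted(rsid for rsid in both if ceu_freq.get(rsid, 1.0) < max_freq)


# Toy inputs with hypothetical rsIDs.
example_pph2 = {"rsA": "probably damaging", "rsB": "probably damaging",
                "rsC": "benign", "rsD": "probably damaging"}
example_sift = {"rsA": "deleterious", "rsB": "tolerated",
                "rsC": "deleterious", "rsD": "deleterious"}
example_freq = {"rsA": 0.02, "rsB": 0.01, "rsC": 0.01, "rsD": 0.40}
example_rare = concordant_rare(example_pph2, example_sift, example_freq)
print(example_rare)
```

Only rsA passes here: rsB and rsC fail the concordance check, and rsD is too common.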

Now, GRAIL: (a) all SNPs, (b) rare SNPs only (13 SNPs < 5% freq). 

(a) results
(b) results

So nothing is screaming out here. For (b), which is more interesting as the mutations are rare, 2 genes had significant connections:

 TOE1    0.0096856492    ZNF687(5), ZKSCAN2(52), VPS72(251), BAT4(615), BAT3(1476), SLC44A4(1655), BAT5(1937), DOM3Z(1979) 
 HPDL    0.034077166     C13orf26(43), ZNF687(76), ZSWIM5(801), VPS72(936), SLC44A4(1289), BAT4(1496)  

TOE1: Inhibits cell growth rate and cell cycle. Induces CDKN1A expression as well as TGF-beta expression. Mediates the inhibitory growth effect of EGR1 (specifically expressed in Schwann cells, induced by a wide variety of extracellular stimuli, involved in cell proliferation, macrophage differentiation, synaptic activation and long term potentiation, coactivated by CREBBP)
HPDL: may have dioxygenase activity (Potential)

Can't immediately relate these to my phenotype, but good to record anyway(!)

We can also do a GRAIL GO analysis, though I wouldn't expect much to come out of this...

I think this gives you a taste of some of the interesting tools out there. STRING is particularly cool I think, especially the ability to look for interactions between genes harboring potentially interesting mutations.

Day 7 - polyphen2 screening for damaging SNPs

posted Jul 6, 2011, 9:19 AM by Colm O'Dushlaine   [ updated Jul 6, 2011, 9:36 AM ]

So imputation (day 6) revealed very few additional SNPs. To be honest, it's probably not a good idea given that I only have one individual.

PolyPhen-2 is another interesting way to look at your 23andMe genotypes. Here, I converted my data into a PolyPhen-2 compatible file and ran the analysis. It took about 20 minutes or so. I then downloaded the results and kept only what I thought was most interesting: SNPs with a "probably damaging" label. Now it's one thing to do that, but don't rush to look at the results. I found some interesting mutations that caused alarm, e.g. "essential for sperm motility"...but when you look at the frequency of this mutation in Caucasians you find that it's at 100%...and we seem to be procreating without much difficulty! So it's very important to pull in some estimates of the frequency of your SNP in your population. I did this and then pulled out the 20 rarest SNPs. Some of these relate to MHC and olfactory genes, which is not surprising I think. One or two genes are involved in transcriptional regulation, etc.
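
The filter-then-rank step can be sketched in Python as below. This is not the code I used: the function name and toy rsIDs are made up, and it assumes you already have each SNP's prediction label and the frequency of your variant in a matched reference population in two dictionaries.

```python
def rarest_damaging(predictions, freq_by_rsid, n=20, label="probably damaging"):
    """Return up to n rsIDs with the given label, rarest in the reference first."""
    candidates = [
        (freq_by_rsid[rsid], rsid)
        for rsid, prediction in predictions.items()
        if prediction == label and rsid in freq_by_rsid
    ]
    # Sorting the (frequency, rsid) pairs puts the rarest variants first.
    return [rsid for _freq, rsid in sorted(candidates)[:n]]


# Toy inputs with hypothetical rsIDs; rsA mimics a "damaging" call at 100% frequency.
example_predictions = {"rsA": "probably damaging", "rsB": "benign",
                       "rsC": "probably damaging", "rsD": "probably damaging"}
example_frequencies = {"rsA": 1.0, "rsB": 0.01, "rsC": 0.02, "rsD": 0.30}
example_top = rarest_damaging(example_predictions, example_frequencies, n=2)
print(example_top)
```

Ranking by frequency pushes variants like rsA, which everyone carries, to the bottom of the list, which is exactly the sanity check described above.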

Overall, I think this is a very useful endeavour, and it complements the established research presented by 23andMe and may give some interesting additional leads into your phenotype. Be careful when interpreting the findings though and especially be sure to interpret your mutation in the context of a matched population.
