The idea here is to keep track of projects carried out using my genotypes from 23&me. So far, I have data on ancestry, traits, carrier status and disease risk, but it will be interesting to see how I can add to what 23&me currently provides no longer provides! 

Of most interest are:
- imputation to increase the number of SNPs
- polygenic scoring: where do I fall along the distribution of polygenic loading for e.g. schizophrenia?
- various ways to explore the most damaging SNPs I have, e.g. polyphen2, SIFT
- runs of homozygosity analysis
- do I have any interesting copy-number variants?

I will add to these ideas when/if I come up with more.

Please get in touch if you would like any advice on playing around with your own 23andMe data: colm _at_ broadinstitute.org

Day-17 - functional annotation with VEP and LOFTEE

posted Oct 28, 2014, 5:40 AM by Colm O'Dushlaine   [ updated Oct 28, 2014, 9:30 AM ]

I'm going to start with the results from DNA.LAND and run VEP and the LOFTEE plugin to see what loss-of-function variants I have.

(1) make a file of only the variants I have

bcft view cod_imputed.vcf.gz --min-ac 1 | bgzip > cod_imputed.variant.vcf.gz

This reduces the number of variants from 39,128,182 to 3,757,309 variants (1,550,062 of which are homozygous). 

(2) annotate

perl variant_effect_predictor.pl \
      -i cod_imputed.variant.vcf.gz  \
      -o cod_VEP_LOFTEE  \
      --dir_plugins /bin/vep_plugins/  \
      --cache --force_overwrite --fork 14 --buffer_size 10000 \
      --dir /bin/vep_data/ \
      --plugin LoF,human_ancestor_fa:/bin/vep_data/human_ancestor.fa.rz \

bgzip cod_VEP_LOFTEE && cod_VEP_LOFTEE.vcf.gz  
I now have a list of results consisting of LOFTEE-annotated variants. Generally, they are pretty harmless.

Pull out the variants with damaging consequences, for example 61 frameshifts:

grep 'frameshift_variant' cod_VEP_LOFTEE.vcf | grep '1/1' | grep 'HC' | gawk '{print $1,$2,$3,$4,$5,$8 }' | sort -nrk1 -nrk2 | cut -f1 -d ','

22 45810275 rs202120654 G GA NS=1;AC=2;AN=2;CSQ=A|ENSG00000128408|ENST00000342894|Transcript|5_prime_UTR_variant&feature_elongation|398-399|||||||1||||
22 18222868 rs71690189 GGCCACGCTCAACT G NS=1;AC=2;AN=2;CSQ=-|ENSG00000015475|ENST00000342111|Transcript|frameshift_variant&feature_truncation|375-387|285-297|95-99|||||-1|POSITION:0.717391304347826|||HC
21 34169316 rs199532219 G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000205929|ENST00000382373|Transcript|intron_variant&feature_elongation||||||||-1||||
19 58413071 rs59723971 T TG NS=1;AC=2;AN=2;CSQ=G|ENSG00000269476|ENST00000602124|Transcript|frameshift_variant&NMD_transcript_variant&feature_elongation|453-454|46-47|16|||||-1||||
19 52803669 rs3217319 CTG C NS=1;AC=2;AN=2;CSQ=-|ENSG00000198464|ENST00000334564|Transcript|frameshift_variant&feature_truncation|69-70|5-6|2|||||1|POSITION:0.00405679513184584|||HC
19 52097560 rs141848717 G GA NS=1;AC=2;AN=2;CSQ=A|ENSG00000267808|ENST00000601315|Transcript|upstream_gene_variant|||||||4981|1||||
19 44589998 rs201737232 TTC T NS=1;AC=2;AN=2;CSQ=-|ENSG00000267022|ENST00000591793|Transcript|3_prime_UTR_variant&NMD_transcript_variant&feature_truncation|2294-2295|||||||1||||
19 41928867 rs3217385 G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000177191|ENST00000601379|Transcript|downstream_gene_variant|||||||3093|-1||||
19 21299773 rs202144538 T TTA NS=1;AC=2;AN=2;CSQ=TA|ENSG00000160352|ENST00000596143|Transcript|frameshift_variant&feature_elongation|628-629|303-304|101-102|||||1|POSITION:0.181981981981982|||HC
19 20981870 rs74172370 TC T NS=1;AC=2;AN=2;CSQ=-|ENSG00000160229|ENST00000594534|Transcript|frameshift_variant&feature_truncation|429|291|97|||||1|POSITION:0.470873786407767|||HC
19 20807177 rs35575803 G GA NS=1;AC=2;AN=2;CSQ=A|ENSG00000188171|ENST00000601440|Transcript|frameshift_variant&feature_elongation|1652-1653|1505-1506|502|||||-1|POSITION:0.94833018273472|||HC
18 3452222 rs11571510 CT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000177426|ENST00000345133|Transcript|intron_variant&feature_truncation||||||||1||||
18 3452221 rs202189171 CCT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000177426|ENST00000345133|Transcript|intron_variant&feature_truncation||||||||1||||
17 26708547 rs199750023 CT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000004139|ENST00000582323|Transcript|upstream_gene_variant|||||||3297|1||||
16 88599700 rs67322929 CT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000179588|ENST00000319555|Transcript|frameshift_variant&feature_truncation|1657|1335|445|||||1|POSITION:0.441906653426018|||HC
16 88599695 rs67712719 GGA G NS=1;AC=2;AN=2;CSQ=-|ENSG00000179588|ENST00000319555|Transcript|frameshift_variant&feature_truncation|1652-1653|1330-1331|444|||||1|POSITION:0.440582588546839|||HC
16 12027474 rs71139503 CAT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000234719|ENST00000356023|Transcript|upstream_gene_variant|||||||3981|-1||||
16 732287 rs3216838 GC G NS=1;AC=2;AN=2;CSQ=-|ENSG00000161999|ENST00000565302|Transcript|non_coding_exon_variant&nc_transcript_variant&feature_truncation|1385|||||||-1||||
15 27516740 rs76888516 GAGTC G NS=1;AC=2;AN=2;CSQ=-|ENSG00000182256|ENST00000554696|Transcript|frameshift_variant&feature_truncation|53-56|53-56|18-19|||||1|POSITION:0.0723514211886305|||HC
13 113688286 rs57693403 TC T NS=1;AC=2;AN=2;CSQ=-|ENSG00000126217|ENST00000434480|Transcript|intron_variant&feature_truncation||||||||1||||
11 119993897 rs66674912 CCT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000137699|ENST00000532195|Transcript|downstream_gene_variant|||||||4217|-1||||
11 56380546 rs72003051 CCAGA C NS=1;AC=2;AN=2;CSQ=-|ENSG00000255012|ENST00000526538|Transcript|frameshift_variant&feature_truncation|429-432|429-432|143-144|||||-1|POSITION:0.455696202531646|SINGLE_EXON||HC
11 704604 rs35782494 GA G NS=1;AC=2;AN=2;CSQ=-|ENSG00000177106|ENST00000527199|Transcript|upstream_gene_variant|||||||612|1||||
10 27702256 rs112067123 G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000182077|ENST00000438700|Transcript|frameshift_variant&feature_elongation|1041-1042|923-924|308|||||-1|POSITION:0.400607638888889|||HC
9 137968914 rs199830264 G GGA NS=1;AC=2;AN=2;CSQ=GA|ENSG00000130558|ENST00000371801|Transcript|downstream_gene_variant|||||||581|1||||
8 74581290 . A AG NS=1;AC=2;AN=2;CSQ=G|ENSG00000040341|ENST00000521451|Transcript|intron_variant&feature_elongation||||||||-1||||
7 44180079 rs5883888 A AG NS=1;AC=2;AN=2;CSQ=G|ENSG00000106631|ENST00000457910|Transcript|downstream_gene_variant|||||||132|-1||||
6 41721682 rs59644697 G GT NS=1;AC=2;AN=2;CSQ=T|ENSG00000096088|ENST00000415707|Transcript|frameshift_variant&feature_elongation|165-166|24-25|8-9|||||-1|POSITION:0.0930232558139535|||HC
6 33048693 rs202162010 CA C NS=1;AC=2;AN=2;CSQ=-|ENSG00000223865|ENST00000488575|Transcript|non_coding_exon_variant&nc_transcript_variant&feature_truncation|395|||||||1||||
6 33048688 rs141530233 G GA NS=1;AC=2;AN=2;CSQ=A|ENSG00000223865|ENST00000488575|Transcript|non_coding_exon_variant&nc_transcript_variant&feature_elongation|389-390|||||||1||||
6 31324603 rs200186034 CT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000234745|ENST00000481849|Transcript|upstream_gene_variant|||||||2021|-1||||
6 3264526 rs200271803 CT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000180822|ENST00000454610|Transcript|intron_variant&feature_truncation||||||||1||||
6 3264524 rs34894854 C CT NS=1;AC=2;AN=2;CSQ=T|ENSG00000180822|ENST00000454610|Transcript|intron_variant&feature_elongation||||||||1||||
5 149374881 rs200200374 TG T NS=1;AC=2;AN=2;CSQ=-|ENSG00000164296|ENST00000532987|Transcript|downstream_gene_variant|||||||4521|-1||||
5 149374879 rs3832324 CT C NS=1;AC=2;AN=2;CSQ=-|ENSG00000164296|ENST00000532987|Transcript|downstream_gene_variant|||||||4523|-1||||
5 140242451 rs11321479 GC G NS=1;AC=2;AN=2;CSQ=-|ENSG00000204970|ENST00000504120|Transcript|intron_variant&feature_truncation||||||||1||||
5 131324250 rs3043838 C CTG NS=1;AC=2;AN=2;CSQ=TG|ENSG00000164398|ENST00000489047|Transcript|upstream_gene_variant|||||||467|-1||||
4 187112348 rs201551129 G GT NS=1;AC=2;AN=2;CSQ=T|ENSG00000145476|ENST00000378802|Transcript|upstream_gene_variant|||||||325|1||||
3 100053684 rs34835464 C CG NS=1;AC=2;AN=2;CSQ=G|ENSG00000114021|ENST00000497785|Transcript|frameshift_variant&feature_elongation|30-31|32-33|11|||||1|POSITION:0.040251572327044|||HC
3 98110406 rs144759043 G GA NS=1;AC=2;AN=2;CSQ=A|ENSG00000251088|ENST00000508616|Transcript|intron_variant&nc_transcript_variant&feature_elongation||||||||1||||
3 98073591 rs11288615 TA T NS=1;AC=2;AN=2;CSQ=-|ENSG00000251088|ENST00000508616|Transcript|intron_variant&nc_transcript_variant&feature_truncation||||||||1||||
3 56591278 rs150150392 T TGGGGTAAGCA NS=1;AC=2;AN=2;CSQ=GGGGTAAGCA|ENSG00000180376|ENST00000469966|Transcript|upstream_gene_variant|||||||456|1||||
3 44540791 rs201582604 TTC T NS=1;AC=2;AN=2;CSQ=-|ENSG00000178917|ENST00000489411|Transcript|downstream_gene_variant|||||||222|-1||||
2 185802209 rs200158920 A AAC NS=1;AC=2;AN=2;CSQ=AC|ENSG00000170396|ENST00000302277|Transcript|frameshift_variant&feature_elongation|2680-2681|2086-2087|696|||||1|POSITION:0.57465564738292|||HC
2 120704165 . TG T NS=1;AC=2;AN=2;CSQ=-|ENSG00000088179|ENST00000544261|Transcript|intron_variant&feature_truncation||||||||1||||
2 120704164 rs3214717 A AT NS=1;AC=2;AN=2;CSQ=T|ENSG00000088179|ENST00000544261|Transcript|intron_variant&feature_elongation||||||||1||||
2 105953938 . G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000272994|ENST00000610036|Transcript|upstream_gene_variant|||||||6|-1||||
2 24387178 rs3214485 G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000266118|ENST00000584973|Transcript|frameshift_variant&NMD_transcript_variant&feature_elongation|601-602|512-513|171|||||1||||
1 248525638 rs34079073 CA C NS=1;AC=2;AN=2;CSQ=-|ENSG00000196944|ENST00000366475|Transcript|frameshift_variant&feature_truncation|757|757|253|||||1|POSITION:0.723018147086915|SINGLE_EXON||HC
1 245133549 rs10536649 GGC G NS=1;AC=2;AN=2;CSQ=-|ENSG00000272195|ENST00000607453|Transcript|frameshift_variant&feature_truncation|840-841|826-827|276|||||-1|POSITION:0.75941230486685|SINGLE_EXON||HC
1 220603312 . TGA T NS=1;AC=2;AN=2;CSQ=-|ENSG00000224867|ENST00000416839|Transcript|frameshift_variant&feature_truncation|125-126|125-126|42|||||-1|POSITION:0.823529411764706|||HC
1 220603310 rs10553491 TGTGA T NS=1;AC=2;AN=2;CSQ=-|ENSG00000224867|ENST00000416839|Transcript|frameshift_variant&feature_truncation|125-128|125-128|42-43|||||-1|POSITION:0.836601307189543|||HC
1 203023239 . T TG NS=1;AC=2;AN=2;CSQ=G|ENSG00000143847|ENST00000600447|Transcript|intron_variant&nc_transcript_variant&feature_elongation||||||||1||||
1 54605319 rs79249469 G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000256407|ENST00000311841|Transcript|downstream_gene_variant|||||||3745|-1||||
X 153151284 chrX:153151284:D GT G NS=1;AC=2;AN=2;CSQ=-|ENSG00000198910|ENST00000464967|Transcript|intron_variant&nc_transcript_variant&feature_truncation||||||||-1||||
X 153151280 chrX:153151280:I G GCC NS=1;AC=2;AN=2;CSQ=CC|ENSG00000198910|ENST00000464967|Transcript|intron_variant&nc_transcript_variant&feature_elongation||||||||-1||||
X 153149708 chrX:153149708:I G GC NS=1;AC=2;AN=2;CSQ=C|ENSG00000198910|ENST00000464967|Transcript|intron_variant&nc_transcript_variant&feature_elongation||||||||-1||||
X 153149707 chrX:153149707:I C CG NS=1;AC=2;AN=2;CSQ=G|ENSG00000198910|ENST00000464967|Transcript|intron_variant&nc_transcript_variant&feature_elongation||||||||-1||||
X 122336600 chrX:122336600:I T TG NS=1;AC=2;AN=2;CSQ=G|ENSG00000125675|ENST00000542149|Transcript|intron_variant&feature_elongation||||||||1||||
X 104464281 chrX:104464281:D TC T NS=1;AC=2;AN=2;CSQ=-|ENSG00000189108|ENST00000344799|Transcript|intron_variant&feature_truncation||||||||1||||
X 104464276 chrX:104464276:D CA C NS=1;AC=2;AN=2;CSQ=-|ENSG00000189108|ENST00000344799|Transcript|intron_variant&feature_truncation||||||||1||||

I don't really feel like going into these in detail, but what are the themes? I used http://www.biomart.org/ and searched against GO and MIM

 condition gene p


Day-16 - more imputation

posted Oct 24, 2014, 1:58 PM by Colm O'Dushlaine   [ updated Oct 26, 2014, 12:31 PM ]

I heard about dna.land from twitter. Looked interesting so I uploaded my genotypes. As Yaniv Erlich suggested, it did indeed take <1hr or so. Great! I downloaded the imputed VCF file.

What should I do now? Well NCBI have some cool VCF files that I could check against my own. I like to use BCFTOOLS for this.

Download the data and prepare:
  wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_20140929.vcf.gz
  gunzip clinvar_20140929.vcf.gz
  bgzip clinvar_20140929.vcf
  tabix -f clinvar_20140929.vcf.gz

Prepare my data:
  gunzip cod_imputed.vcf.gz
  bgzip cod_imputed.vcf
  tabix -f cod_imputed.vcf.gz

Annotate using BCFTOOLS:
  bcft annotate -a clinvar_20140929.vcf.gz -c CLNSIG cod_imputed.vcf.gz | grep 'CLNSIG' 

The results are below. I list only CLNSIG with values 4 (Likely pathogenic), 5 (Pathogenic), 6 (drug response), 7 (histocompatibility):

10 62813496 rs73271714 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
10 71725422 rs145738242 C T 0 PASS NS=1;CLNSIG=5|5 GT:GL 0/0:0,0,-10
10 73250935 rs115286496 G A 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
11 64223094 rs192072447 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
11 64804789 rs187832727 G A 0 PASS NS=1;CLNSIG=5|5 GT:GL 0/0:0,0,-10
12 120999522 rs150043680 C T 0 PASS NS=1;CLNSIG=4 GT:GL 0/0:0,0,-10
15 63062598 rs184722137 C T 0 PASS NS=1;CLNSIG=4 GT:GL 0/0:0,0,-10
16 1447052 rs148313356 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
17 70175284 rs80265682 G A 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
18 3457606 rs4468717 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
2 88583430 rs186826541 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
20 3912532 rs147714242 T C 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
21 42388524 rs2837902 G A 0 PASS NS=1;CLNSIG=5 GT:GL 0/1:-10,0,0
3 30688511 rs57037798 G A 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
3 126498141 rs190419091 G A 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
6 118558946 rs73525883 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
7 107919079 rs12667131 A G 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
7 117590404 rs12532066 C T 0 PASS NS=1;CLNSIG=4 GT:GL 0/1:-10,0,0
8 71244656 rs188407588 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
9 6589230 rs121964976 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10
9 131506217 rs7868318 G A 0 PASS NS=1;CLNSIG=5 GT:GL 1/1:-10,0,-10
                      X 71529785 rs188822811 C T 0 PASS NS=1;CLNSIG=5 GT:GL 0/0:0,0,-10

So it looks like I have 3 pathogenic variants, 2 heterozygotes and 1 homozygote. 

rs2837902 lies between DSCAM and BACE2, conceivably in promoter regions of both.
rs12532066 lies upstream of CTTNBP2, conceivably in the promoter region.
rs7868318 lies in the intron of ZER1.

Some of the associated diseases look scary but would need to take with a pinch of salt I think. The main point of this exercise was to demonstrate how you can add useful additional information to your genotypes and look for additional variants that might not be explicitly reported by 23andMe (or accessory software like Promethease).

By the way, having the imputed VCF file is great. You could upload to SeattleSeq for instance and get some more annotations.

You could obviously just make a BED file from the imputed VCF file, and cross-reference that with other BED files of variants/regions of interest (using something like BEDTOOLS).

  zcat cod_imputed.vcf.gz | grep -v '^#' | grep -P "0/1|1/1|1/0" | gawk '{print "chr"$1"\t"$2"\t"$2"\t"$3 }' > cod.bed

Download flagged SNPs from dbSNP 141 (clinical significance - snp141Flagged). Got this from the UCSC genome browser.

  bedtools intersect -a cod.bed -b flagged_snps.dbsnp141.bed -wo | gawk '$10 != "unknown" && $10 != "" && $10 != "" && $10 != "" && $10 != "" && $10 != "" && $10 != "" && $10 != ""   .... didn't get around to this, but worth a look at another time..

However, overall, I think something like VEP followed by LOFTEE would make more sense though. That, my friends, will be in my next installment...


posted Sep 5, 2013, 1:05 PM by Colm O'Dushlaine

Check out this awesome post by my colleague Giulio Genovese: http://apol1.blogspot.com/2013/08/impute-apoe-and-apol1-with-23andme.html 
What he does is provide some neat tricks to rapidly get information on variants not covered by a dataset.

Day-15 - More CNVs

posted Oct 24, 2011, 4:36 PM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:54 AM ]

I found a list of CNVs reported by the Wellcome trust a while ago (http://www.wtccc.org.uk/wtcccplus_cnv/supplemental.shtml). This study was interesting, even though it didn't find anything amazing. The last line of the abstract says it all really (!) "We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases"
....nevertheless, I decided to check if I have an SNPs that are (a) rare in HapMap CEU and (b) tag CNVs. 

Very simply I obtained minor allele frequency for my genotypes from HapMap and just filtered <=10%. I then cross-references with the CNV table listed here. I took care to use the bestLocalIlluminaTagRSID column because this is the platform my data is on. Clearly this is a quick and dirty analysis. For example, I could have expanded to a list of imputed genotypes perhaps. Anyway, long story short I found just 3 genes found within CNVs tagged by rare genotypes in my genome:
          Gene CTGLF3 chr 10 51418083 51418083 within CNVR4732.1 chr10:51403541-51447456 loss rs17178655
                 - Putative GTPase-activating protein .. low expression everywhere, so probably not interesting

          Gene OR52N5 chr 11 5755465 5755465 within CNVR5049.1 chr11:5741435-5765861 gain rs1453415
                - Olfactory receptor, we know these vary in all sorts of ways

          Gene CHST5 chr 16 74119928 74119928 within WTCCC1_181 chr16:74096937-74132230 unknown rs11149841
                - Catalyzes the transfer of sulfate to position 6 of non-reducing N-acetylglucosamine (GlcNAc) residues and O-linked sugars of mucin-type acceptors.
                - not sure about this. It doesn't interact with anything according to http://string-db.org. Maybe a connection with poor eyesight, http://www.ncbi.nlm.nih.gov/pubmed/21440637 http://www.ncbi.nlm.nih.gov/pubmed/16938851 (though I might be clutching at straws here). Some interesting pharmacogenetic possibilities also, http://www.ncbi.nlm.nih.gov/pubmed/11829137

That's all I could find! Besides, this is not an ideal CNV dataset given the lack of any convincing disease association.

Day-14 - CNVs

posted Jul 22, 2011, 7:54 AM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:57 AM ]

I am keen to look at my CNVs. Unfortunately, it isn't that easy. Ideally, I'd like my beadstudio file from 23andme and then I would run Kai Wang's PennCNV to call CNVs. I would then look up pubmed and the database of genomic variants to get a feel for the potentially damaging effects of any CNVs I have.
...but I don't have the files. There's a great article at http://www.genomesunzipped.org/2010/08/dude-where-are-my-copy-number-variants.php that goes through 2 interesting methods to detect CNVs from array data. I'm in the process of convincing my parents to do 23andme, which would allow me one way to look for CNVs. I was also intrigued by the imputation method described at the above link. In the end, however, I just did a really quick-and-dirty analysis using an April 2010 Conrad et al. article. The table I used is here. It summarizes CNVs highly correlated with SNPs and likely to be the causal variant. So I pulled these from the table and searched them against my genome, taking care to check strandedness (SNPedia).     

- e.g. pull out SNPs from my genome
  grep -P "rs10492927|rs11809207|..." mygenome
- look at the genotypes (not all SNPs were in there)
- how frequent in a Caucasian population? Does this tell me anything about which of my SNPs, if any, are likely to tag deleterious CNVs?

In general, most of my genotypes were common in HapMap CEU. So even though these are associated with some traits, they don't appear to be under major selective constraint. However, three SNPs that I have were rare in HapMap CEU - rs9291683, rs3129934 and rs2705293. These are associated with bone mineral density, multiple sclerosis and Schizophrenia respectively. My genotypes have a frequency of about 20, 5 and 15% respectively. So the multiple sclerosis variant seems to be potentially the most damaging. (Interesting that I have strong bones though!) The reported gene is HLA-DRB1. The risk increase is 3.3 fold, based on SNPedia. Now, based on 23andme data, I have 0.2% risk of MS, which is ~the same as the average, albeit with a slight decrease in risk of 0.77x (0.34 per 100 on average, 0.2 with my genotype). Note that this is coming from 2 SNPs - one in IL7RA and one in HLA-DRB1. Even though it's the same gene, the references are different. The 23andme results are pretty good and have a 4-star rating, with multiple reference sources from large populations. The SNPedia reference here was for work done on a reasonably small cohort of 242 individuals, so less likely to be as credible (though they did replicated their findings).

So it's nice to get stuck into you data in this way and to consider some of the literature that might not be directly apparent when you first look at your results. I have protective and risk variants for MS, some in the same gene. This is interesting, but given the odds ratios represents only very modest effects on risk.

Day-13 - PLINK annotation

posted Jul 20, 2011, 6:56 AM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:57 AM ]

I use PLINK alot in my day to day work. It's and excellent tool for analysis and manipulation of genomewide association datasets. I had the idea of using its annotation feature to see what I could find out. The website provides a pre-computed SNP attributes file for dbSNP v129 (I re-did this for 132 but there isn't that much difference for gene annotations). The method is described on this page. What I wanted to do is ask the following: for SNPs in my genome that are rare in the HapMap Caucasian population (as a reference), highlight any genes with interesting functional mutations. Even though any findings might not have any disease associations, I think it's a useful exercise. In addition, you can use the method to annotate however you like, e.g. other functional annotations, conserved elements, sites under selection etc.

I'm just going the process as a series of steps:

# Download the data and prepare my own
 wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/snp129.attrib.gz
 wget http://pngu.mgh.harvard.edu/~purcell/plink/dist/glist-hg18
 grep '^rs' ../genome_Colm_O_Dushlaine_Full_20110124134637.txt | gawk '{print $2,$1,$3,"1",$4 }' > cod.snps

# manual edit - add header - CHR         SNP        BP         P

# Annotate genes, dbSNP annotations and subset on rare SNPs (1%) (the snps.1pc.tmp file I had prepared earlier, pulling out SNP genotypes that I had that were <1% in frequency in CEU)
 plink --annotate cod.snps attrib=snp129.attrib.gz ranges=glist-hg18 snps=snps.1pc.tmp --out cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu

# The functional annotations cover the following, missense being the most frequent
 gawk '{print $2 }' snp129.attrib | sort | uniq -c
    348 =frameshift
 203933 =missense
  30874 =MISSENSE
   7682 =nonsense
    849 =NONSENSE
    963 =splice
    297 =SPLICE
== 244765

# Look at functional categories of interest that have gene annotations. Note: these results are first-pass and the snp129.attrib file includes annotations based on a SNP being in LD with a functional SNP, i.e. it may not necessarily be the true SNP
 grep -P "nonsense|NONSENSE|frameshift|FRAMESHIFT|SPLICE|splice" cod.snps_vs_hg18_vs_snp129_vs_1pc_ceu.annot | grep '('
1 rs969788 86689444 1 AG CLCA2(0)|=missense|=nonsense
5 rs1423414 41179513 1 AA C6(0)|=missense|=nonsense
5 rs4957377 41228756 1 CC C6(0)|=missense|=nonsense
5 rs4957152 41270160 1 CC C6(0)|=missense|=nonsense
5 rs2004385 41286190 1 AA C6(0)|=missense|=nonsense
5 rs7724199 156922535 1 AA ADAM19(0)|=missense|=nonsense
5 rs7719224 156931918 1 TT ADAM19(0)|=missense|=nonsense
5 rs11134822 156933880 1 TT ADAM19(0)|=missense|=nonsense
6 rs3778638 31200103 1 AA PSORS1C1(0)|=missense|=nonsense
6 rs2233945 31215340 1 AA PSORS1C1(0)|=missense|=nonsense
6 rs12364 31218765 1 AA CCHCR1(0)|=missense|=nonsense
6 rs9263739 31219335 1 TT CCHCR1(0)|=missense|=nonsense
6 rs9263740 31219379 1 CC CCHCR1(0)|=missense|=nonsense
6 rs130077 31230309 1 AA CCHCR1(0)|=missense|=nonsense
6 rs9263785 31233802 1 GG CCHCR1(0)|=missense|=nonsense
6 rs9263794 31237998 1 GG TCF19(0)|=missense|=nonsense
6 rs1044870 31238800 1 TT TCF19(0)|=missense|=nonsense
6 rs9263796 31240862 1 TT POU5F1(0)|=missense|=nonsense
6 rs9263800 31242578 1 AA POU5F1(0)|=missense|=nonsense
7 rs596572 4174875 1 TT SDK1(0)|=missense|=nonsense
7 rs623667 4177807 1 AA SDK1(0)|=missense|=nonsense
8 rs3213604 17770696 1 AT FGL1(0)|=frameshift|=missense
11 rs4910844 5733060 1 TT OR52N4(0)|=NONSENSE|=missense
11 rs10838637 5766053 1 GG OR52N1(0)|=missense|=nonsense
11 rs10769224 5766249 1 TT OR52N1(0)|=MISSENSE|=nonsense
11 rs10742787 5766322 1 TT OR52N1(0)|=MISSENSE|=nonsense
13 rs11568656 94512943 1 GT ABCC4(0)|=frameshift
13 rs7997839 94514336 1 AG ABCC4(0)|=frameshift

CLCA2: calcium channel: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CLCA2
C6: immune-related: http://www.genecards.org/cgi-bin/carddisp.pl?gene=C6
ADAM19: lots of roles, including neurogenesis (explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=ADAM19
PSORS1C1: unclear: http://www.genecards.org/cgi-bin/carddisp.pl?gene=PSORS1C1
CCHCR1: keratinocyte: http://www.genecards.org/cgi-bin/carddisp.pl?gene=CCHCR1
TCF19: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=TCF19
POU5F1: transcription factor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=POU5F1
SDK1: cell adhesion protein that guides axonal terminals to specific synapses in developing neurons (definitely explains my insanity!): http://www.genecards.org/cgi-bin/carddisp.pl?gene=SDK1
FGL1: hepatocyte mitogenic activity: http://www.genecards.org/cgi-bin/carddisp.pl?gene=FGL1
OR52N4: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N4
OR52N1: olfactory receptor: http://www.genecards.org/cgi-bin/carddisp.pl?gene=OR52N1
ABCC4: organic anion pump relevant to cellular detoxification http://www.genecards.org/cgi-bin/carddisp.pl?gene=ABCC4

Day-12 - Promethease revisited

posted Jul 14, 2011, 1:00 PM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:56 AM ]

I mentioned this on day 5. Pretty good for Windows users but slow. I wrote a script that emulates the process and runs quite quickly on UNIX. 23andme genotypes are all on the forward strand and the script checks that if the SNPedia genotypes are on the minus strand the alleles are reverse complemented. Note: Promethease does this also. So, in summary:

- loop over all 23andme SNPs
- check if any genotype information in SNPedia
- [if yes] ensure genotypes are matched for strand and reverse complement if necessary
- [if yes] parse out disease/association annotation from wiki data

I've set the script up to only give me an output when there is something interesting for my genotype. Example of output is:

 snp_and_genotype strand magnitude disease_annotations
 rs1000113(C;C) plus na [ normal ]
 rs10008492(C;C) plus na [ normal ]
 rs10033464(G;G) plus 0 }} [ norm ]
 rs1004819(C;T) minus na [ 1.5x_risk ]
 rs10050860(C;C) plus 0  [ normal_risk ]
 rs10086908(T;T) plus 0  [ normal_risk ]
 rs10090154(C;C) plus 0  [ normal ]
 rs1010(A;G) minus na [ 1.75x_risk_of_MI ]
 rs1012729(A;A) plus na [ normal ]
 rs10134944(C;C) plus 0  [ normal ]
 rs1015362(A;G) minus na [ 2-4x_higher_risk_of_sun_sensitivity_if_part_of_risk_haplotype ]

This is crude, but a quick look:

 grep -vP "common|normal|norm|None" genome_Colm_O_Dushlaine_Full_20110124134637.txt.out | gawk '$2 != "?" { print }' | more

Most interesting here are:
  • rs11983225(T;T), http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs11983225 shows that T/T is ad a 0% frequency in CEU. It seems to be associated with a 7x reduced likelihood of responding to certain antidepressants. So that's not good if ever I need them! Also http://www.snpedia.com/index.php/Rs7787082(G;G)
  • rs1570360(A;A): 3x increased risk of sudden infant death syndrome. I'm 30 so ok I think
  • Confirmation of many phenotypes (taste bitter, blue eyes etc.) reported by 23andme
  • Some cardiac and stroke-related SNPs. These are mostly common and I would expect this given the incidence in the Irish population
  • rs2270641(G;G), which is pretty rare in CEU and increased Schizophrenia rist 3.7 fold. That study was small, however and we now know that alot of SNPs are implicated
  • rs2834167(A;A): 2.67x_increased_risk_for_systemic_sclerosis. Might have something to do with my Raynaud's-like symptoms
  • rs4606(C:C): complex association with anxiety disorders. I definitely have anxiety, but this association is not a simple one
  • Some variants associated with neuroblastoma. This is a condition generally restricted to children (~2% of cases in adults (>18yrs))
  • rs6596075(C:C): 2x risk of Crohn's disease
  • rs7442295(A;A): increased risk of hyperuracemia, but this is common in Europeans, see http://www.snpedia.com/index.php/Rs7442295
  • rs762551(A;A): CYP1A2*1F_homozygote;_Fast_metabolizer (or Caffeine). Definitely true!
Overall, many of these variants are well-covered by 23andme but it is certainly nice to do something like this as it makes the sources more transparent I think. You can go straight to SNPedia and look at the papers referenced. It also has a nice simple digestion of the implicated variants.

So this approach works well I think, though a very much stripped down version of Promethease. Apologies to Mike Cariaso for pestering him so much when I was working out how to query SNPedia properly. Get in touch if you want the script, or download it here. You are free to use it in a non-commercial context.

Day-11 - population genetics

posted Jul 14, 2011, 12:47 PM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:55 AM ]

My colleagues - Jim Wilson and Angelika Kritz - did this for me a few months ago. It's a cool service they provide as part of ethnoancestry.com. As expected, it confirms my status as European and the proportion of my genome within runs of homozygosity doesn't appear to be too high, which is nice to know. Check out the attachment for details.

Day-10 - NHGRI GWAS catalogue

posted Jul 13, 2011, 7:44 AM by Colm O'Dushlaine   [ updated Oct 15, 2014, 9:54 AM ]

There is another nice resource out there called the catalog of published genome-wide association studies, available at http://www.genome.gov/gwastudies/. It has it's limitations, but is a nice list of associations that we can be pretty confident about. So what I decided to do here was to compare my 23&me results with this list. 

I first made a "light" version of the download (www.genome.gov/admin/gwascatalog.txt), selecting fields of interest - positions, gene-names, control frequency:

Adiposity       13      80959207        SPRY2   SPRY2 - ARF4P4  rs534870-A      rs534870        0.68
Adiposity       16      53816275        FTO     FTO     rs8050136-C     rs8050136       0.60
Schizophrenia   19      42066279        ATP5SL, CEACAM21        PLEKHA3P1 - CEACAM21    rs4803480-A     rs4803480       0.13

I wanted some Caucasion minor allele frequency data also (get_maf.pl is attached)

 wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/hapmapSnpsCEU.txt.gz
 gunzip hapmapSnpsCEU.txt.gz
 perl get_maf.pl > ceu.maf
 grep -v arning ceu.maf > ceu.maf.clean

I then ran parse.pl (attached) to make a file called gwascatalogue.res.txt

 perl parse_gwas_catalogue.pl gwascatalog.txt.lite

I wanted SNPs that were rare. The file contains control frequency and CEU frequency. Here, I'm pulling out SNPs with frequency <2.5% in CEU (I think this is more accurate than the frequencies listed in the studies).

 gawk -F "\t" '$11 <= 0.025 {print  }' gwascatalogue.res.txt

disease chr     pos     geneP..down     snp_allelesnp   risk_all_cont   cod_geno        dose    ceu_maf
Type 2 diabetes 9       10430602        PTPRD   PTPRD   rs649891-C      rs649891        0.35    CC      2       0.0167
Optic disc parameters   1       92077097        CDC7, TGFBR3    RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       NR      GG      2       0.0167
Parkinson's disease     17      44828931        MAPT    NSF     rs199533-C      rs199533        0.78    AA      0       0.0167
Non-alcoholic fatty liver disease histology (AST)       9       78425925        Intergenic      OSTF1 - PCSK5   rs12344488-A    rs12344488      0.07    AA      2       0.0169
Optic disc parameters   1       92077097        CDC7,TGFBR3     RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       0.18    GG      2       0.0167
Optic disc size (disc)  1       92077097        HSP90B3P        RPL39P13 - HSP90B3P     rs1192415-A     rs1192415       NR      GG      0       0.0167
Ankylosing spondylitis  5       96129512        ERAP1   ERAP1   rs27434-A       rs27434 0.23    AA      2       0.0167
Parkinson's disease     17      44828931        NSF     NSF     rs199533-C      rs199533        0.83    AA      0       0.0167
Parkinson's disease     17      43719143        MAPT,  C17orf69, KIAA1267, LOC644246, IMP5      C17orf69        rs393152-A      rs393152        0.82    GG      0       0.0167
Obesity (extreme)       10      37982097        ZNF248  MTRNR2L7 - TLK2P2       rs7474896-T     rs7474896       0.14    TT      2       0.0167
Rheumatoid arthritis    6       138002637       TNFAIP3, OLIG3  OLIG3 - TNFAIP3 rs10499194-C    rs10499194      0.71    TT      0       0.0167

The e.g. 'CC' (3rd last column) are my genotypes and the 'dose' field means how many copies of the risk allele do I have. So, of the above results, I think the following are most interesting:

Type 2 diabetes 9       10430602        PTPRD   PTPRD   rs649891-C      rs649891        0.35    CC      2       0.0167
Optic disc parameters   1       92077097        CDC7, TGFBR3    RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       NR      GG      2       0.0167
Non-alcoholic fatty liver disease histology (AST)       9       78425925        Intergenic      OSTF1 - PCSK5   rs12344488-A    rs12344488      0.07    AA      2       0.0169
Optic disc parameters   1       92077097        CDC7,TGFBR3     RPL39P13 - HSP90B3P     rs1192415-G     rs1192415       0.18    GG      2       0.0167
Ankylosing spondylitis  5       96129512        ERAP1   ERAP1   rs27434-A       rs27434 0.23    AA      2       0.0167
Obesity (extreme)       10      37982097        ZNF248  MTRNR2L7 - TLK2P2       rs7474896-T     rs7474896       0.14    TT      2       0.0167

I have bad eyesight (very bad!) so I'm intrigued by the optic disc parameters findings! The obesity/T2D etc. are not so interesting as common and complex genetic bases. 

Day-09 - OMIM

posted Jul 11, 2011, 9:10 AM by Colm O'Dushlaine   [ updated Oct 15, 2014, 10:02 AM ]

Can we identify any SNPs, predicted to be damaging, in OMIM? This would point to any Mendelian traits/disorders I might have. There is a file in dbSNP that is useful for this - ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/OmimVarLocusIdSNP.bcp.gz. The file links omim ids to SNP ids. 


  wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/organism_data/OmimVarLocusIdSNP.bcp.gz
 gunzip OmimVarLocusIdSNP.bcp.gz
 cut -f 1,9 OmimVarLocusIdSNP.bcp | gawk '{print "rs"$2,$1 }' | sort > snps_2_omim.txt

Then, sort my missense SNPs and join both lists. None of my rare missense match, so I checked the larger list of all 476 missense SNPs:

 join all_missense.sorted.txt snps_2_omim.txt 

You can then just look up the OMIM entries listed, e.g. http://omim.org/entry/607751 

rs10246939 607751 <-- This polymorphism, in conjunction with other SNPs in the gene, give rise to the ability to taste or not taste phenylthiocarbamide
rs1049550 612388 <--  strongly associated with sarcoidosis
rs1126809 601800 <-- association with skin sensitivity to sun (p = 7.1 x 10(-13)) and blue versus green eye color (p = 4.6 x 10(-21)).
rs1126809 606933 <-- blue eye colour
rs1154510 609695 <-- HAWKINSINURIA (autosomal dominant inborn error of metabolism) (elevated levels of blood tyrosine and massive excretion of tyrosine derivatives into urine)
rs2108622 122700 <-- resistance to warfarin
rs2108622 604426 <-- resistance to warfarin
rs288326 605083 <-- OSTEOARTHRITIS SUSCEPTIBILITY 1 (strong risk factor for primary osteoarthritis of the hip in females)
rs3775291 603029 <-- protective against the development of geographic atrophy or advanced dry age-related macular degeneration
rs3775291 603075 <-- associated with protection from progression to geographic atrophy in patients with age-related macular degeneration
rs4673 608508 <-- cardiovascular disease related
rs4917 138680 <-- an increased risk for leanness (OR, 1.90; p = 0.027)
rs602662 182100 <-- ...genetic factors affecting circulating vitamin B12 levels and identified rs602662 in the FUT2 gene
rs602662 612542 <-- vit B12
rs7080536 603924 <-- CAROTID STENOSIS, SUSCEPTIBILITY TO (basically a heart disease risk factor)
rs7133914 609007 <-- Decreased susceptibility to Parkinson Disease
rs7308720 609007 <-- Decreased susceptibility to Parkinson Disease

So, overall I think some interesting stuff in there, although alot of this is covered nicely by 23&me already.

1-10 of 18