23andMe

(2016/01/28)

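23andMe, the largest genetic testing service, returns your results as visualizations and also lets you download your own genetic data. The data is in a format close to that of a SNP array. This post describes how to reanalyze it.

Taking a look at the "raw data"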

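After logging in to 23andMe, you will find a Tools -> Download Raw Data link in the top tab bar. From there you can download all of your genotype data at once. The SNP annotations are apparently updated from time to time (a change log is posted), and at the time of writing the data updated on 2015/07/22 appears to be the latest version. Only the latest data is available for download. I signed up before 2015/07/22, so I also have the earlier data, and the data did indeed change with the 2015 update.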

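The file starts with header lines commented out with #, followed by one line per SNP giving the SNP ID, chromosome, position, and genotype. The header says that this data is fine to use for research, educational, and informational purposes only, and is not suitable for clinical or other purposes.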

# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been
# individually validated for accuracy. As such, this data is suitable only for research,
# educational, and informational use and not for medical or other use.
# Below is a text version of your data. Fields are TAB-separated
# Each line corresponds to a single SNP. For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
# More information on reference human assembly build 37 (aka Annotation Release 104):
# http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606
# rsid chromosome position genotype
rs12564807 1 734462 AA
rs3131972 1 752721 GG
rs148828841 1 760998 CC
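
As a quick way to confirm the layout described above, here is a minimal awk sketch that skips the # header lines and counts SNPs per chromosome. It assumes the fields really are tab-separated, as the file header states. The function name count_by_chrom is only an illustrative choice, not something 23andMe or PLINK provides.

```shell
# Count SNPs per chromosome in a 23andMe raw data file.
# Assumption: '#' header lines, then tab-separated
# rsid, chromosome, position, genotype (as the header says).
count_by_chrom() {
    awk -F '\t' '
        /^#/ { next }                 # skip the commented header lines
        NF == 4 { n[$2]++ }           # tally by the chromosome column
        END { for (c in n) print c "\t" n[c] }
    ' "$1" | LC_ALL=C sort            # awk array order is unspecified, so sort
}
```

It could be called as count_by_chrom genome_23andMe.txt, where the file name is whatever you saved the download as.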

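Other notes on the data format

- The fields are tab-separated.
- There is no SNP annotation file, but the reference sequence is said to be based on GRCh37.
- Some SNP IDs are rs numbers, and others are 23andMe's own IDs beginning with "i".
- In the genotype column, -- means missing, D means deletion, and I means insertion.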
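
The points above can be checked against your own file with a small awk sketch like the one below. It counts rs IDs, internal "i" IDs, no-calls, and indel calls. The function name summarize_calls and the output labels are made up for this illustration, and the code assumes tab-separated fields with the genotype in the fourth column.

```shell
# Summarize ID types and special genotype codes in a 23andMe raw data file.
# Assumption: tab-separated columns rsid, chromosome, position, genotype.
summarize_calls() {
    awk -F '\t' '
        /^#/        { next }          # skip the commented header lines
        $1 ~ /^rs/  { rs++ }          # dbSNP rs numbers
        $1 ~ /^i/   { internal++ }    # internal IDs starting with "i"
        $4 ~ /-/    { nocall++ }      # "--" (or "-") marks a missing call
        $4 ~ /[DI]/ { indel++ }       # D = deletion, I = insertion
        END {
            printf "rs_ids=%d internal_ids=%d no_calls=%d indel_calls=%d\n",
                   rs, internal, nocall, indel
        }
    ' "$1"
}
```

Running summarize_calls genome_23andMe.txt prints a single summary line, which is handy to compare against what PLINK reports later.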

Working with PLINK

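The file is awkward to work with as is, so we convert it to a common format. PLINK, a software package for human genetic analysis, can read 23andMe data directly as of version 1.9: the --23file option takes a 23andMe file as input.

After the file name come [family ID], [within family ID (individual ID)], sex (1 = male, 2 = female; -i infers it from the heterozygosity of the X and Y chromosomes), the case/control designation, and [Paternal ID] [Maternal ID]. These arguments can be omitted.
For the output, the usual options for converting to ped or tped format, such as --recode --transpose, are also available.

For details on how to use PLINK, see the PLINK manual.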

For example, running

$ plink --23file genome_23andMe.txt myfamily01 me01 2 1 --out mydata

produces a .bed file (together with the accompanying .bim and .fam files). There are 610,544 variants, of which 15,170 are indels, and the total genotyping rate is a decent 0.993075.
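
One way to cross-check these counts is to look at the .bim file directly. The sketch below counts all variants and those whose allele columns contain D or I. It assumes PLINK carries the D/I codes over into columns 5 and 6 of the .bim file. The function name bim_summary is hypothetical.

```shell
# Count variants and indels in a PLINK .bim file.
# .bim columns: chromosome, variant ID, cM, position, allele 1, allele 2.
# Assumption: indels appear as single-character D / I allele codes.
bim_summary() {
    awk '
        { total++ }
        $5 ~ /^[DI]$/ || $6 ~ /^[DI]$/ { indel++ }   # either allele is D or I
        END { printf "variants=%d indels=%d\n", total, indel }
    ' "$1"
}
```

With the output prefix used above, this would be run as bim_summary mydata.bim.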

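Next, extract only the single nucleotide polymorphisms that have rs numbers (excluding indels). There are several ways to do this, so pick whichever you like.

First, pull out only the SNPs that have rs numbers,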

$ awk '$2~/^rs/ {print $2}' mydata.bim > rs_list.txt

then use PLINK to extract the target SNPs:

$ plink --bfile mydata --make-bed --out mydata --snps-only no-DI --extract rs_list.txt

This narrows the set down to 552,068 variants, and the genotyping rate improves to 0.995918.

From here, the plan is to compare this with other SNP data (to be continued).
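
To confirm that the filtering did what was intended, a small check on the resulting .bim file might look like the sketch below. It reports how many variants either lack an rs ID or have an allele code other than A, C, G, T, or 0. The 0 is allowed on the assumption that PLINK writes it for the unobserved allele when a single sample is homozygous. The function name check_filtered_bim is hypothetical.

```shell
# Report variants in a .bim file that are not rs-numbered plain SNPs.
# .bim columns: chromosome, variant ID, cM, position, allele 1, allele 2.
# Assumption: "0" is an acceptable placeholder for an unobserved allele.
check_filtered_bim() {
    awk '
        $2 !~ /^rs/ || $5 !~ /^[ACGT0]$/ || $6 !~ /^[ACGT0]$/ { bad++ }
        END { printf "unexpected=%d\n", bad }
    ' "$1"
}
```

If the extraction worked, check_filtered_bim mydata.bim should print unexpected=0.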
