Big Data Analytics in Genomics

Genomic data have grown explosively in the past few years. To date, more than 500K gene expression profiles are available in public databases (e.g., NCBI Gene Expression Omnibus). The Encyclopedia of DNA Elements (ENCODE) Consortium has generated vast amounts of annotation data using next-generation sequencing, including gene expression (RNA-seq) and transcription factor binding sites (ChIP-seq). As of September 2012, more than 1,600 ENCODE data sets from 147 cell lines had been produced. Meanwhile, thousands of genome-wide association studies (GWAS) have directly genotyped millions of SNP markers to study the genetic bases of complex diseases, and more than 10,000 loci have been reported to be associated with at least one disease. Whole genome sequencing aims to detect all genetic variants directly, and it is rapidly becoming a primary tool for characterizing the genetic bases of human diseases.

In 2010, Yang et al. showed that 45% of the variance in human height can be explained by considering all genotyped common SNPs together. This result suggests that most of the "missing heritability" is not missing but remains hidden in the genome: because sample sizes are limited, the individual effects of many genetic markers are too weak to reach genome-wide significance, and thus those risk variants remain undiscovered. Similar genetic architectures have since been found for many other complex diseases, such as psychiatric disorders: the phenotype is affected by many genetic variants with small or moderate effects, a property referred to as "polygenicity". The polygenicity of complex diseases is further supported by recent GWAS with larger sample sizes, in which more associated common SNPs with moderate effects have been identified (e.g., GWAS data from 34,840 patients and 114,981 healthy controls were analyzed to understand the genetic architecture of type 2 diabetes). However, recruiting large samples can be expensive and time-consuming, so identifying these hidden risk variants remains very challenging. Given the wealth of genomic data now available, statistical methods can help by borrowing relevant information from two sources:

  • Shared information in multiple GWAS: Accumulating evidence suggests that different complex human traits are genetically correlated, i.e., multiple diseases share common genetic risk factors, a phenomenon known as "pleiotropy". Based on a systematic analysis of published GWAS, 16.9% of genes and 4.6% of SNPs have been reported to show pleiotropic effects.
  • Data enrichment with functional annotation: SNPs are not equally important, and functionally annotated genetic variants show a consistent pattern of enrichment. Associated SNPs are more likely to be expression quantitative trait loci (eQTLs); for example, SNPs in genes preferentially expressed in the central nervous system have been shown to be more important in psychiatric disorders. The ENCODE Project Consortium reported that 12% of disease-associated SNPs overlap transcription factor binding regions and 34% overlap DNase I hypersensitive sites.
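
To make the notion of annotation enrichment concrete, the sketch below computes a fold enrichment and a one-sided hypergeometric p-value for the overlap between associated SNPs and an annotation category. It is only an illustration under stated assumptions: the function names and all counts are hypothetical toy values, not figures from ENCODE or any GWAS, and the hypergeometric test treats SNPs as independent (it ignores linkage disequilibrium), which real analyses would need to account for.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Ratio of the annotated fraction among associated SNPs (k / n)
    to the annotated fraction among all SNPs (K / N)."""
    # Computed as (k * N) / (n * K) to avoid rounding in the two fractions.
    return (k * N) / (n * K)


def hypergeom_sf(k, n, K, N):
    """P(X >= k) when n SNPs are drawn without replacement from N SNPs,
    of which K carry the annotation (one-sided enrichment test)."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical toy counts, chosen only for illustration.
N_SNPS = 1000        # all genotyped SNPs
N_ANNOTATED = 100    # SNPs falling in the annotation category
N_ASSOCIATED = 50    # SNPs associated with the trait
N_OVERLAP = 15       # associated SNPs that are also annotated

fold = fold_enrichment(N_OVERLAP, N_ASSOCIATED, N_ANNOTATED, N_SNPS)
p_value = hypergeom_sf(N_OVERLAP, N_ASSOCIATED, N_ANNOTATED, N_SNPS)

print(f"fold enrichment: {fold:.2f}")
print(f"one-sided hypergeometric p-value: {p_value:.3g}")
```

A fold enrichment above 1 with a small p-value indicates that the annotation is over-represented among associated SNPs, which is the kind of signal an integrative method can use to prioritize variants whose individual effects are too weak to detect on their own.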

Clearly, there is a great need for statistically rigorous and computationally efficient methods to integrate genomic data. Such methods allow biomedical researchers to make the most efficient use of the vast amounts of valuable data that have been generated to dissect the genetics of complex diseases. The methods developed here are also broadly applicable to many other disciplines where diverse, rich, and multiscale data are available to address challenging scientific problems.

See more relevant information in our recent paper