Genomic data have grown explosively over the past few years. To date, more than 500,000 gene expression profiles are available in public databases (e.g., NCBI Gene Expression Omnibus). The Encyclopedia of DNA Elements (ENCODE) Consortium has used next-generation sequencing to generate vast amounts of annotation data, including gene expression (RNA-seq) and transcription factor binding sites (ChIP-seq). As of September 2012, more than 1,600 ENCODE data sets from 147 cell lines had been produced. Meanwhile, thousands of genome-wide association studies (GWAS) have genotyped millions of SNP markers to study the genetic basis of complex diseases, and more than 10,000 loci have been reported to be associated with at least one disease. Whole-genome sequencing aims to detect all genetic variants directly, and it is rapidly becoming a primary tool for characterizing the genetic basis of human diseases.
In 2010, Yang et al. showed that 45% of the variance in human height can be explained by considering all genotyped common SNPs together. This result suggests that most of the "missing heritability" is not missing but remains hidden in the genome: because sample sizes are limited, many genetic markers have individual effects that are too weak to reach genome-wide significance, and so the corresponding risk variants remain undiscovered. Similar genetic architectures have since been found for many other complex diseases, such as psychiatric disorders. In each case the phenotype is affected by many genetic variants with small or moderate effects, a pattern referred to as "polygenicity". The polygenicity of complex diseases is further supported by recent GWAS with larger sample sizes, which have identified more associated common SNPs with moderate effects (e.g., GWAS data from 34,840 patients and 114,981 healthy controls were analyzed to understand the genetic architecture of type 2 diabetes). However, recruiting large samples can be expensive and time-consuming, so identifying these hidden risk variants remains very challenging. With the wealth of genomic data now available, statistical methods can help by borrowing information from relevant data sources.
Clearly, there is a great need to develop statistically rigorous and computationally efficient methods for integrating genomic data. Such methods would allow biomedical researchers to make the most efficient use of the vast amounts of valuable data that have been generated to dissect the genetics of complex diseases. The methods developed here are also broadly applicable to many other disciplines in which diverse, rich, and multiscale data are available to address challenging scientific problems.
For more information, see our recent paper.