To accomplish our goal, we use summary statistics from genome-wide association studies (GWAS), which search the genome for small variations called single nucleotide polymorphisms (SNPs) that occur more frequently in people with a particular disease than in people without it.
In our research, we conducted a meta-analysis of summary statistics from five studies of type 2 diabetes (T2D). We analyzed over 15 million SNPs and found that certain regions of the genome contribute more to T2D than others.
Specifically, our study found that certain regions of chromosomes 6, 10, and 11 are associated with T2D. Genes that appear to be associated with T2D, based on their very low p-values, include TCF7L2, CDKAL1, and KCNQ1. In particular, "TCF7L2 is a transcription factor influencing the transcription of several genes, thereby exerting a large variety of functions within the cell and it was named a prime suspect in causing T2D" (11). "CDKAL1 is a protein coding gene and is suspected to be correlated with T2D" (8). Finally, "KCNQ1 is a protein that supports the creation of potassium channels and is also believed to be correlated with T2D" (9).
According to the Mayo Clinic, “T2D is a chronic condition that affects the way your body metabolizes sugar, an important source of fuel for your body.
With T2D, your body either resists the effects of insulin, a hormone that regulates the movement of sugar into your cells, or doesn't produce enough insulin to maintain normal glucose levels.
T2D used to be known as adult-onset diabetes, but today more children are being diagnosed with the disorder, probably due to the rise in childhood obesity. There's no cure for T2D, but losing weight, eating well and exercising can help manage the disease. If diet and exercise aren't enough to manage your blood sugar well, you may also need diabetes medications or insulin therapy.
As one of the leading causes of death in the nation, T2D has more than 30 million cases" (10). We hope our work can contribute to the search for a cure for T2D.
"Genome-wide association studies (GWAS) is a relatively new way for scientists to identify genes involved in human disease. This method searches the genome for small variations, called single nucleotide polymorphisms or SNPs (pronounced “snips”), that occur more frequently in people with a particular disease than in people without the disease. Each study can look at hundreds or thousands of SNPs at the same time. Researchers use data from this type of study to pinpoint genes that may contribute to a person’s risk of developing a certain disease.
Because GWAS examine SNPs across the genome, they represent a promising way to study complex, common diseases in which many genetic variations contribute to a person’s risk. This approach has already identified SNPs related to several complex conditions including diabetes, heart abnormalities, Parkinson disease, and Crohn disease. Researchers hope that future GWAS will identify more SNPs associated with chronic diseases, as well as variations that affect a person’s response to certain drugs and influence interactions between a person’s genes and the environment" (12).
To accomplish our goal, we use summary statistics data from GWAS. These summary statistics contain information such as the variant ID, the chromosome, the position, the alleles (including the effect allele), the allele frequency, and the p-values. From these data we can determine which areas of the genome are associated with T2D, and a meta-analysis across several studies can potentially reveal more. As a result, our analysis will help further T2D research.
According to the Meta-Analysis website, "Meta-analysis is the statistical procedure for combining data from multiple studies. When the treatment effect (or effect size) is consistent from one study to the next, meta-analysis can be used to identify this common effect. When the effect varies from one study to the next, meta-analysis may be used to identify the reason for the variation" (13).
Some previous studies on T2D have shown that T2D is caused by environmental factors as well as hereditary components. GWAS have been able to identify multiple genes that are related to T2D. Although many genes that are associated with T2D have been identified, “they only account for a small portion of the observed heritability of T2D. The estimated heritability of T2D ranges from twenty to eighty percent. This estimate has been created by studying a variety of population, families, and twin-based studies. The risk of obtaining T2D, if you have one parent that has it, is forty percent; while if a person has two parents that has T2D, they have a seventy percent chance of obtaining T2D. Even though this seems that genes have a large portion to do with T2D, it might also be caused a lot by environmental factors" (14).
Various approaches have been used to identify genes related to diabetes. One type of study used to investigate how genes relate to diabetes is the GWAS. "These studies identify genetic loci that appear to be linked to a trait. Some of these studies focused on genes that affect obesity and by association obesity causes T2D. Genes found in these studies were TCF7L2 which is a transcription factor, HHEX which is a transcription factor, SLC30A8 which facilitates the accumulation of zinc, CDKN2A/B which makes instructions for making several proteins, and IGF2BP2 which is related with metabolism and variation. These genes were associated with T2D. All these different genes were found to influence whether a person has T2D. Other studies have found that there is correlation between T2D and regions of chromosome 20" (14). We conducted our analysis using GWAS summary statistics.
The data we used are the summary statistics from GWAS of T2D. Summary statistics capture the results of a GWAS for the particular trait being investigated. The data contain information such as the variant ID, the chromosome, the position, the alleles (including the effect allele), and the p-values. From the summary statistics of five different studies, we were able to determine which areas of the genome are associated with T2D and to look for more such areas using a meta-analysis across the studies.
Specifically, the five studies we used were Dupuis's study on new genetic loci implicated in fasting glucose homeostasis and their impact on T2D risk; Saxena's genome-wide association analysis to identify loci for T2D and triglyceride levels; Wood's and Wojcik's genetic analyses of diverse populations to discover loci for T2D; and Bonas-Guarch's re-analysis of public genetic data with chromosome X and its association with T2D.
To get our data, we used the FTP server at http://ftp.ebi.ac.uk/pub/databases/ to access different studies on T2D. We created a data ingestion pipeline to take in the data from the GWAS Catalog. We then used the Python library pandas to read the data into tables, cleaned the data by removing null columns and values, and made sure the column names were consistent across all the studies.
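The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the header aliases in COLUMN_ALIASES, the shared column names, and the three-row inline table are all hypothetical stand-ins for whatever each study's file really contains.

```python
import io

import pandas as pd

# Hypothetical mapping from study-specific headers to one shared schema.
COLUMN_ALIASES = {
    "variant_id": "variant_id", "rsid": "variant_id", "snp": "variant_id",
    "chromosome": "chromosome", "chr": "chromosome",
    "base_pair_location": "position", "pos": "position",
    "effect_allele": "effect_allele",
    "other_allele": "other_allele",
    "p_value": "p_value", "pval": "p_value", "p": "p_value",
}


def clean_summary_stats(raw: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with consistent column names and no null columns or rows."""
    df = raw.copy()
    df.columns = [c.strip().lower() for c in df.columns]
    df = df.dropna(axis="columns", how="all")   # drop columns that are entirely null
    df = df.rename(columns=COLUMN_ALIASES)      # harmonize the header names
    shared = [c for c in dict.fromkeys(COLUMN_ALIASES.values()) if c in df.columns]
    df = df[shared].dropna(axis="index", how="any")  # drop rows with any null value
    return df.reset_index(drop=True)


# Invented example data: one empty column and one row with a missing p-value.
sample = io.StringIO(
    "SNP\tCHR\tPOS\teffect_allele\tother_allele\tPval\tnotes\n"
    "rs1\t10\t1000\tT\tC\t1e-9\t\n"
    "rs2\t6\t2000\tA\tG\t\t\n"
    "rs3\t11\t3000\tG\tA\t0.04\t\n"
)
clean = clean_summary_stats(pd.read_csv(sample, sep="\t"))
print(clean)
```

Applying the same function to every study gives tables with identical column names, which is what makes the later steps (plotting and exporting for the meta-analysis) interchangeable across studies.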
After obtaining and cleaning the data, we analyzed the individual summary statistics to see which areas of the genome seem to affect T2D. We plotted a p-value histogram for each study, using the pandas histogram function, to see the distribution of its p-values. We then drew a Manhattan plot for each study, using pandas and matplotlib, to see which areas of the genome are related to T2D in that study. After investigating the individual studies, we performed a meta-analysis of all the summary statistics together to see whether we could discover more areas of the genome that affect T2D. Because a meta-analysis pools evidence from several studies, it can detect associations that no single study has enough power to find. To conduct our meta-analysis, we used METAL, software created at the University of Michigan. We exported our cleaned data to txt files to use as input to METAL, which combines the individual studies into a single meta-analysis.
“METAL is a tool for meta-analysis genome-wide association scans. METAL can combine either (a) test statistics and standard errors or (b) p-values across studies (taking sample size and direction of effect into account). METAL analysis is a convenient alternative to a direct analysis of merged data from multiple studies. It is especially appropriate when data from the individual studies cannot be analyzed together because of differences in ethnicity, phenotype distribution, gender or constraints in sharing of individual level data imposed. Meta-analysis results in little or no loss of efficiency compared to analysis of a combined datasets including data from all individual studies. First, for each marker, a reference allele is selected and a z-statistic characterizing the evidence for association is calculated. The z-statistic summarizes the magnitude and the direction of effect relative to the reference allele and all studies are aligned to the same reference allele. Next, an overall z-statistic and p-value are then calculated from a weighted sum of the individual statistics. Weights are proportional to the square-root of the number of individuals examined in each sample and selected such that the squared weights sum to 1.0. For samples that contain related individuals, a smaller ‘effective’ sample size may be used, but simulations suggest that modest changes in the effective sample size have very little impact on the final p-value” (3).
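To make the weighting scheme in the quotation concrete, here is a small sketch of a sample-size-weighted z-score combination for a single marker. It is our own illustration of the idea, not METAL's code: the function names are hypothetical, it uses only the Python standard library, and it assumes each study reports a two-sided p-value plus an effect direction (+1 or -1) already aligned to the same reference allele.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def p_to_signed_z(p_value: float, direction: int) -> float:
    """Convert a two-sided p-value and an effect direction (+1/-1) to a z-score."""
    magnitude = -_STD_NORMAL.inv_cdf(p_value / 2.0)
    return math.copysign(magnitude, direction)


def weighted_z_meta(p_values, directions, sample_sizes):
    """Combine one marker's per-study results into an overall z and p-value."""
    total_n = sum(sample_sizes)
    # Weights proportional to sqrt(N), scaled so the squared weights sum to 1.
    weights = [math.sqrt(n / total_n) for n in sample_sizes]
    z_scores = [p_to_signed_z(p, d) for p, d in zip(p_values, directions)]
    z_meta = sum(w * z for w, z in zip(weights, z_scores))
    p_meta = math.erfc(abs(z_meta) / math.sqrt(2.0))  # two-sided p-value
    return z_meta, p_meta


# Two equally sized studies that agree in direction reinforce each other...
z_same, p_same = weighted_z_meta([0.01, 0.01], [1, 1], [1000, 1000])
# ...while two that disagree cancel out.
z_opp, p_opp = weighted_z_meta([0.01, 0.01], [1, -1], [1000, 1000])
print(z_same, p_same)
print(z_opp, p_opp)
```

In the first call the combined p-value is smaller than either study's own p-value, which is the sense in which combining studies can reveal associations that individual studies miss.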
After running METAL, we obtained a .tbl file containing information such as the marker ID, p-value, and z-score. We used pandas to read it into a table. We then used the Python package SciPy to draw a Q-Q plot of the p-values from METAL to see how they were distributed. Next, we used pandas to rejoin the marker IDs with their chromosomes and used matplotlib to draw a Manhattan plot showing which areas of the genome, with all the studies combined, are related to T2D. We also plotted another p-value histogram with pandas to see the distribution.
This is the Manhattan plot of the final result. We can see that there are regions on chromosomes 6, 10, and 11 that are associated with T2D. The higher a dot, the smaller the SNP's p-value and the stronger its association with T2D. The SNPs considered significant are the dots above the dotted line.
This is the Q-Q plot of the final meta-analysis result. It shows that the distribution is right-skewed. There are many low p-values, which means that a good number of SNPs are correlated with T2D.
This is the meta-analysis p-value histogram. There is a large spike in the values furthest to the left. The p-value histogram is anti-conservative, which indicates that a high number of SNPs are considered to be correlated with T2D.
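For readers who want to see how the points of such a Q-Q plot can be computed, here is a minimal sketch. It follows a common GWAS convention (observed -log10 p-values against the quantiles expected if every p-value were uniform), which is an assumption on our part and may differ from the exact plot SciPy produced for us; the function name qq_points and the example values are hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values for a Q-Q plot.

    Expected values are the quantiles of a uniform distribution, i.e. what
    the p-values would look like if no SNP were associated with the trait.
    """
    observed = sorted(p_values)  # smallest p-value first
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# P-values that are exactly uniform fall on the diagonal (x == y).
null_points = qq_points([0.125, 0.375, 0.625, 0.875])

# Replacing the smallest one with a very low p-value lifts that point
# above the diagonal, which is the pattern an associated SNP produces.
enriched_points = qq_points([1e-8, 0.375, 0.625, 0.875])
print(null_points[0], enriched_points[0])
```

Points that rise above the diagonal at the upper right of the plot correspond to SNPs whose p-values are lower than chance alone would predict.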
First, we created a p-value histogram for each study to see its distribution. The x-axis of each histogram shows the p-values and the y-axis shows the count: the higher a bar, the more SNPs in the study have a p-value in that range. From the individual p-value histograms, we find that the p-values of the five studies are either anti-conservative or uniformly distributed. An anti-conservative histogram indicates that the p-values are well behaved and that many SNPs are determined to be correlated with T2D. A uniform histogram means that few SNPs are determined to be correlated with T2D. The summary statistics from Saxena, Wojcik, and Wood are anti-conservative, while those from Dupuis and Bonas-Guarch are uniform.
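The difference between a uniform and an anti-conservative histogram can be illustrated with a short sketch. The helper names, the choice of 20 bins, and the synthetic p-values below are all hypothetical; we judged our real histograms by eye rather than with a numeric rule like this one.

```python
def p_value_histogram(p_values, bins=20):
    """Count p-values in equal-width bins covering the range 0 to 1."""
    counts = [0] * bins
    for p in p_values:
        index = min(int(p * bins), bins - 1)  # p == 1.0 goes in the last bin
        counts[index] += 1
    return counts


def leftmost_enrichment(counts):
    """Ratio of the first bin's count to the count expected under uniformity."""
    expected = sum(counts) / len(counts)
    return counts[0] / expected


# Synthetic, evenly spread p-values: every bar has the same height.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
uniform_counts = p_value_histogram(uniform_p)

# Adding many very small p-values creates the spike on the far left.
spiked_counts = p_value_histogram(uniform_p + [1e-6] * 200)

print(leftmost_enrichment(uniform_counts))
print(leftmost_enrichment(spiked_counts))
```

A ratio close to 1 corresponds to a flat histogram, while a ratio well above 1 corresponds to the left-hand spike that we describe as anti-conservative.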
Then we made a Manhattan plot for each study to see which SNPs were considered associated with T2D. The x-axis of the Manhattan plot is the chromosome and the y-axis is the negative log10 of the p-value. Each dot is a SNP, and the higher the dot, the more strongly that SNP is associated with T2D. From the individual Manhattan plots, we see that a study either indicates no areas of the genome that affect T2D or indicates regions on chromosomes 6, 10, and 11. The Saxena, Wood, and Bonas-Guarch studies indicate regions on chromosomes 6, 10, and 11 that are correlated with T2D, while the Dupuis and Wojcik studies indicate no specific regions that contribute more to T2D.
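The coordinates behind a Manhattan plot can be computed as in the sketch below. It is an illustration under stated assumptions rather than our plotting code: the function name and the four example records are hypothetical, and the threshold of 5e-8 is the conventional genome-wide significance level, which we assume here because the text above does not state the value of the dotted line.

```python
import math
from collections import defaultdict

# Assumption: the conventional genome-wide significance threshold.
GENOME_WIDE_SIGNIFICANCE = 5e-8


def manhattan_coordinates(records):
    """Turn (chromosome, position, p_value) records into plot coordinates.

    Chromosomes are laid end to end along the x-axis, and the y-axis is
    -log10 of the p-value. Returns a list of (x, y, chromosome) tuples.
    """
    records = list(records)
    chrom_length = defaultdict(int)
    for chrom, pos, _ in records:
        chrom_length[chrom] = max(chrom_length[chrom], pos)

    offsets, running = {}, 0
    for chrom in sorted(chrom_length):
        offsets[chrom] = running          # where this chromosome starts on x
        running += chrom_length[chrom]

    return [(offsets[chrom] + pos, -math.log10(p), chrom)
            for chrom, pos, p in records]


# Invented records: (chromosome, base-pair position, p-value).
records = [(6, 100, 1e-9), (6, 500, 0.2), (10, 50, 1e-12), (11, 300, 0.5)]
points = manhattan_coordinates(records)

threshold_y = -math.log10(GENOME_WIDE_SIGNIFICANCE)
significant = [chrom for _, y, chrom in points if y > threshold_y]
print(points)
print(significant)
```

The x and y values can then be passed to matplotlib's scatter function, with a horizontal line drawn at threshold_y to play the role of the dotted line.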
From the meta-analysis p-value histogram, we see a large spike in the values furthest to the left. The histogram is anti-conservative, which indicates that a high number of SNPs are considered correlated with T2D. Our Q-Q plot of the meta-analysis result shows the same pattern as the p-value histogram: the distribution is right-skewed, with many low p-values, meaning that a good number of SNPs are correlated with T2D.
From the Manhattan plot of the meta-analysis, we found that regions of chromosomes 6, 10, and 11 are correlated with T2D, with chromosome 10 being the most significant. This reflects the individual studies we used in our analysis. We also investigated which genes are related to T2D: we entered the variants with very low p-values into the Ensembl website to determine the gene names.
The genes with some of the lowest p-values that we found to be correlated with T2D are TCF7L2, CDKAL1, and KCNQ1. This is consistent with previous studies. TCF7L2 is "a transcription factor influencing the transcription of several genes thereby exerting a large variety of functions within the cell and it was named a prime suspect in causing T2D" (1). TCF7L2 is a “gene that encodes a high mobility group (HMG) box-containing transcription factor that plays a key role in the Wnt signaling pathway. The protein has been implicated in blood glucose homeostasis” (11). "The protein encoded by CDKAL1 is a member of the methylthiotransferase family. The function of this CDKAL1 is not known. GWAS have linked single nucleotide polymorphisms in an intron of this gene with susceptibility to T2D" (8). KCNQ1 "belongs to a large family of genes that provide instructions for making potassium channels. These channels, which transport positively charged atoms (ions) of potassium out of cells, play key roles in a cell's ability to generate and transmit electrical signals. The specific function of a potassium channel depends on its protein components and its location in the body. KCNQ1 is related to familial atrial fibrillation, Jervell and Lange-Nielsen syndrome, Romano-Ward syndrome, Short QT syndrome, and Gestational diabetes" (9).
From this, scientists can focus on certain genes to better understand T2D. As a result of our research, we were able to find regions of the genome that are correlated with T2D. The genes we found to be correlated with T2D align with previous studies, except that we did not find any regions of chromosome 20 to be correlated with T2D. This is most likely because we used different data.
We recognize that because our analysis was conducted on only five studies, our estimates may differ from the true population values.
For the p-value histograms and the other plots, there might be a more informative way of showing the results.
We did not conduct our own GWAS due to time and financial limitations.
We also acknowledge that many other factors, including environmental and physiological ones, contribute to T2D. Therefore, our analysis results are not deterministic.
In conclusion, using a meta-analysis of GWAS summary statistics, we found that regions of chromosomes 6, 10, and 11 are correlated with T2D. We also found that certain genes, such as TCF7L2, CDKAL1, and KCNQ1, are associated with T2D.
Our results align with some previous studies and their findings, except that we did not find any regions of chromosome 20 to be correlated with T2D.
Finally, we hope that our analysis results, along with this website we created, can help scientists better understand and analyze T2D and help a general audience gain insight into T2D.
1. Hattersley, Andrew T. “Prime Suspect: the TCF7L2 Gene and Type 2 Diabetes Risk.” The Journal of Clinical Investigation, American Society for Clinical Investigation, Aug. 2007, www.ncbi.nlm.nih.gov/pmc/articles/PMC1934573/.
2. A Q-Q Plot Dissection Kit, seankross.com/2016/02/29/A-Q-Q-Plot-Dissection-Kit.html.
3. “METAL Documentation.” METAL Documentation - Genome Analysis Wiki, genome.sph.umich.edu/wiki/METAL_Documentation.
4. Ensembl Genome Browser 100, uswest.ensembl.org/biomart/martview/4bab1f774e54945288f179bbc924d7b1.
5. Morton, Newton E. “Linkage Disequilibrium Maps and Association Mapping.” The Journal of Clinical Investigation, American Society for Clinical Investigation, June 2005, www.ncbi.nlm.nih.gov/pmc/articles/PMC1137007/.
6. Permutt, Marshall Alan, et al. “Searching for Type 2 Diabetes Genes on Chromosome 20.” Diabetes, American Diabetes Association, 1 Dec. 2002, diabetes.diabetesjournals.org/content/51/suppl_3/S308.
7. Burdett, Tony, et al. GWAS Catalog, www.ebi.ac.uk/gwas/downloads/summary-statistics.
8. Database, Gene. “CDKAL1 Gene (Protein Coding).” GeneCards, www.genecards.org/cgi-bin/carddisp.pl?gene=CDKAL1.
9. Database, Gene. “KCNQ1 Gene (Protein Coding).” GeneCards, www.genecards.org/cgi-bin/carddisp.pl?gene=KCNQ1.
10. “Type 2 Diabetes.” Mayo Clinic, Mayo Foundation for Medical Education and Research, 9 Jan. 2019, www.mayoclinic.org/diseases-conditions/type-2-diabetes/symptoms-causes/syc-20351193.
11. McKusick, Victor A. TRANSCRIPTION FACTOR 7-LIKE 2; TCF7L2. 7 Jan. 1998, www.omim.org/entry/602228.
12. “What Are Genome-Wide Association Studies? - Genetics Home Reference - NIH.” U.S. National Library of Medicine, National Institutes of Health, ghr.nlm.nih.gov/primer/genomicresearch/gwastudies.
13. “Comprehensive Meta.” Analysis, 1 Jan. 2020, www.meta-analysis.com/pages/why_do.php?cart=.
14. Ali, Omar. “Genetics of Type 2 Diabetes.” World Journal of Diabetes, Baishideng Publishing Group Co., Limited, 15 Aug. 2013, www.ncbi.nlm.nih.gov/pmc/articles/PMC3746083/.