Pathway enrichment algorithms

 
Pathway enrichment analysis tests in iPAVS

The rationale behind an enrichment analysis (gene-set, pathway, etc.) is to compute whether the overlap between the focus
list and the gene set is statistically significant, i.e. the confidence that the overlap between the lists is not due to chance. Several methods can be
used to compute the significance of this overlap.

The following are the methods used by iPAVS:
  • Fisher's Exact Test (with optional Bonferroni correction for multiple comparisons)
  • Binomial Proportions Test
  • Parametric Analysis of Gene Set Enrichment (PAGE)
  • Gene Ontology Overrepresentation using GOSTAT


Overview of p-value calculation for pathway enrichment analysis

The p-value associated with a pathway in enrichment analysis is the probability that an association at least as strong as the one observed between the selected genes from the experimental list and the pathway would arise by random chance alone. The smaller the p-value, the less likely it is that the association is random. In pathway analysis, p-values below 0.05 are generally taken to indicate that the association is statistically significant, so the null hypothesis that there is no association between the genes of interest and the pathway can be rejected.

NOTE: A large p-value (i.e. not statistically significant) does not ALWAYS mean that the association between a pathway and the genes is biologically irrelevant, and a small p-value does not always mean that it is relevant. Further investigation that considers all supporting evidence is necessary to understand the biological importance.


Fisher's Exact Test

Fisher's exact test is a statistical significance test used in the analysis of contingency tables, typically where sample sizes are small. It is computationally efficient, and its probabilities are computed from the hypergeometric distribution. Fisher's exact test is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact only in the limit as the sample size grows to infinity, as with many statistical tests. Fisher's exact test is therefore considered appropriate for pathway analysis, because it works well with both large and small result sets (<10). For more details, please refer to the Wiki page [1].

The test is most commonly applied to 2x2 contingency tables (matrices). To compute the p-value of the test, all tables with the same row and column totals are ordered by a criterion that measures dependence, and the probabilities of the tables that are at least as extreme as the observed one are summed.

The table below is an example of the contingency table used to calculate the p-value for pathway analysis.

                                      Molecules associated   Molecules not associated   Total # in rows
                                      with a pathway         with a pathway
Input list focus molecules (DEGs)     a                      b                          a + b
Non-focus molecules (non-DEGs)        c                      d                          c + d
Total # in column                     a + c                  b + d                      n


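For fixed row and column totals, the probability of observing a given table is given by Fisher as the hypergeometric probability:

$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

where n = a + b + c + d. The p-value of the test is the sum of these probabilities over all tables that are at least as extreme as the observed one.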
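As an illustration, the calculation can be sketched in Python using only the standard library. This is a minimal sketch, not the iPAVS implementation: the function names are hypothetical, and it assumes a one-sided (enrichment) test in which "at least as extreme" means a or more focus molecules in the pathway, with the row and column totals held fixed.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_enrichment_p(a, b, c, d):
    """One-sided p-value: probability of a or more focus molecules in the pathway.

    a, b, c, d are the cells of the contingency table described above.
    """
    row1 = a + b          # focus molecules (fixed row total)
    row2 = c + d          # non-focus molecules (fixed row total)
    col1 = a + c          # molecules in the pathway (fixed column total)
    total = 0.0
    # Walk through every table that keeps the margins and has k >= a.
    for k in range(a, min(row1, col1) + 1):
        kb = row1 - k
        kc = col1 - k
        kd = row2 - kc
        total += table_probability(k, kb, kc, kd)
    return total


if __name__ == "__main__":
    # Hypothetical counts: 3 of 4 focus molecules fall in the pathway.
    print(fisher_enrichment_p(3, 1, 1, 3))
```

A two-sided variant would instead sum the probabilities of all tables whose probability does not exceed that of the observed table.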




The multiple testing correction


Bonferroni multiple test correction is implemented in iPAVS.

Algorithm
1. Set the family-wise significance level, alpha.
2. Compute the p-value p for each pathway and sort the p-values in ascending order, at positions i = 1...m.
3. Let n_pathways be the number of pathways investigated.
4. Find PI, the last position at which the inequality p < alpha / n_pathways holds.
5. For positions i = 1 to PI, where p < alpha / n_pathways, reject the null hypothesis and report the adjusted p-value Ap = n_pathways * p. For the remaining positions, use the original p-value p.

For more details, please refer to the wiki.
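The correction can be sketched in a few lines of Python. This is a minimal sketch of the textbook Bonferroni adjustment, not the iPAVS code: the function name is hypothetical, every p-value is adjusted, and the adjusted value is capped at 1, which may differ from how iPAVS reports pathways that do not pass the threshold.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (p, adjusted_p, significant) for each input p-value.

    A pathway is significant when p < alpha / n_pathways, which is the same
    as adjusted_p < alpha where adjusted_p = n_pathways * p.
    """
    n_pathways = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * n_pathways)   # cap at 1 so it stays a probability
        significant = p < alpha / n_pathways
        results.append((p, adjusted, significant))
    return results


if __name__ == "__main__":
    # Hypothetical p-values for three pathways.
    for p, adjusted, significant in bonferroni([0.001, 0.02, 0.04]):
        print(p, adjusted, significant)
```

With three pathways and alpha = 0.05, the per-pathway threshold is 0.05 / 3, so only the first of these hypothetical p-values remains significant.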

Review of multiple test correction:
The more tests we perform on a set of data, the more likely we are to commit a Type I error (i.e. to reject the null hypothesis when it is true). So the more pathways are analyzed, the greater the chance of observing a false-positive result.
This is a consequence of the logic of hypothesis testing: we reject the null hypothesis if we witness a rare event. But the larger the number of tests, the easier it is to find rare events, and therefore the easier it is to make the mistake of thinking that there is an effect when there is none. This problem is called the inflation of the alpha level [6].
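The inflation can be made concrete under the simplifying assumption that the tests are independent: the chance of at least one false positive among m tests, each run at level alpha, is then 1 - (1 - alpha)^m. The short sketch below (function name hypothetical) evaluates this for a few values of m.

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one Type I error in m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for m in (1, 10, 100):
        print(m, round(family_wise_error_rate(0.05, m), 3))
```

At alpha = 0.05 the rate is about 0.40 for 10 tests and above 0.99 for 100 tests. Pathways that share genes are not independent, so these figures are only indicative.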

Several methods have been proposed to protect against committing a Type I error. One strategy is to correct the alpha level when performing multiple tests. Making the alpha level smaller (more stringent) produces fewer errors. Methods such as the Bonferroni correction adjust the p-value by, for example, multiplying each p-value by the number of comparisons made (the number of pathways investigated) and using the modified p-values to indicate significance.

NOTE:
Be warned, however, that such corrections may also make it harder to detect real effects [6]. Because many comparisons can be made using microarray, deep sequencing, or proteomics data, the resulting alpha level may be too stringent, making a statistically significant p-value almost impossible to attain, particularly where p-values have been obtained from a limited number of permutations. Such corrections may also be inappropriate for large pathways (maps) or pathways that overlap with others, because the pathways are not independent of each other.

Binomial Proportions Test

Whereas the Fisher exact test computes the significance of the overlap between two lists of genes exactly, the binomial proportions and Chi-squared tests are approximations. The significance of the overlap between a pathway and the input focus list is given by a Z-score, which is computed for each gene in the input focus list using a binomial proportions test as below [2]. A Z-score of 0 indicates no enrichment.

The higher the Z-score, the greater the significance.

Math
a = Intersection between the input list focus molecules and the pathway molecules
b = Intersection between the input list focus molecules and the total of molecules in all pathways
c = Intersection between the pathway molecules and the background list or genome/proteome (in the case of a genome/proteome, this is the number of molecules in the pathway itself)
d = Molecules in the background list, or total molecules in the organism (genome, proteome, etc.)
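The formula itself is not reproduced in this text. As an illustration only, the sketch below assumes the one-sample binomial proportions form z = (a/b - c/d) / sqrt((c/d)(1 - c/d) / b), which compares the observed proportion a/b with the background proportion c/d and matches the statement that a score of 0 means no enrichment. The function name and the example counts are hypothetical.

```python
from math import sqrt


def binomial_proportion_z(a, b, c, d):
    """Z-score comparing the observed proportion a/b with the expected c/d.

    Assumed form (not confirmed by the source text):
        z = (a/b - c/d) / sqrt((c/d) * (1 - c/d) / b)
    """
    observed = a / b
    expected = c / d
    standard_error = sqrt(expected * (1.0 - expected) / b)
    return (observed - expected) / standard_error


if __name__ == "__main__":
    # Hypothetical counts: 10 of 100 focus molecules are in a pathway that
    # holds 50 of the 1000 background molecules.
    print(binomial_proportion_z(10, 100, 50, 1000))
```

When a/b equals c/d the score is 0, and it grows as the focus list becomes more concentrated in the pathway than the background would suggest.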


Parametric Analysis of Gene Set Enrichment (PAGE) analysis 

New gene set analysis (GSA) algorithms have been proposed and are commonly employed, such as Generally Applicable Gene-set Enrichment (GAGE) [5], PAGE [3], and Gene Set Enrichment Analysis (GSEA). At present we have implemented PAGE in iPAVS.

PAGE is a modified gene set enrichment analysis method based on a parametric statistical model. Unlike Fisher's exact test, PAGE uses the normal distribution for statistical inference, and thus requires less computation than other methods such as GSEA. PAGE has been successfully used in several studies to detect significantly changed gene sets from microarray data and to compare multiple microarray data sets. For more details on PAGE, please refer to the manuscript [3].
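As an illustration of the parametric idea, the sketch below computes a PAGE-style Z-score following the form described by Kim and Volsky [3]: Z = (Sm - mu) * sqrt(m) / sigma, where mu and sigma are the mean and standard deviation of the fold changes of all genes and Sm is the mean fold change of the m genes in the gene set. The function name, the use of the population standard deviation, the two-sided p-value, and the example data are assumptions made for this sketch, not details of the iPAVS implementation.

```python
from math import erfc, sqrt
from statistics import mean, pstdev


def page_z_score(all_fold_changes, gene_set_fold_changes):
    """Return (z, two_sided_p) for one gene set.

    all_fold_changes: fold-change values of every gene in the experiment.
    gene_set_fold_changes: fold-change values of the genes in the set.
    """
    mu = mean(all_fold_changes)
    sigma = pstdev(all_fold_changes)      # assumption: population SD
    m = len(gene_set_fold_changes)
    s_m = mean(gene_set_fold_changes)
    z = (s_m - mu) * sqrt(m) / sigma
    # Two-sided p-value from the standard normal distribution.
    p = erfc(abs(z) / sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical log fold changes for five genes, two of them in the set.
    z, p = page_z_score([-2, -1, 0, 1, 2], [1, 2])
    print(z, p)
```

Because the score relies on the normal approximation, it is cheap to compute for many gene sets, but the approximation is weaker for very small gene sets.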


Gene Ontology Overrepresentation using GOSTAT
 
(NOTE: disabled currently)
 
GOstat is a tool for finding statistically overrepresented Gene Ontologies (GO) within a group of genes. It uses GO annotations and computes which GO terms are overrepresented in the input list focus molecules. This results in a list of GO terms sorted by their specificity. In iPAVS, the gene list for each enriched pathway is submitted, the Fisher's score is computed, and the top 10 GO terms in each category (Function, Component and Biological Process) are reported.

For more details about the GOstat [4] package, consult its online manual.


In the near future, the following alternative pathway analysis algorithms will be implemented in iPAVS:

  • Topology-based analysis methods






References
  1. Fisher's Exact Test Wiki: http://en.wikipedia.org/wiki/Fisher's_exact_test
  2. Berger, S. I., J. M. Posner, et al. (2007). "Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases." BMC Bioinformatics 8: 372.
  3. Kim, S. Y. and D. J. Volsky (2005). "PAGE: parametric analysis of gene set enrichment." BMC Bioinformatics 6: 144.
  4. Beissbarth, T. and T. P. Speed (2004). "GOstat: find statistically overrepresented Gene Ontologies within a group of genes." Bioinformatics 20(9): 1464-1465.
  5. Luo, W., M. S. Friedman, et al. (2009). "GAGE: generally applicable gene set enrichment for pathway analysis." BMC Bioinformatics 10: 161.
  6. Abdi, H. (2007). "Bonferroni and Sidak corrections for multiple comparisons." In N. J. Salkind (ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks, California: 103-107.
  7. Osier, M. V., H. Zhao, et al. (2004). "Handling multiple testing while interpreting microarrays with the Gene Ontology Database." BMC Bioinformatics 5: 124.

