Pathway enrichment analysis tests in iPAVS
The rationale behind an enrichment analysis (gene set, pathway, etc.) is to test whether the overlap between the focus
list and the gene set is statistically significant, i.e. to quantify the confidence that the overlap between the two lists is not due to chance.
Several methods can be used to compute the significance of this overlap.
The following are a few of those used by iPAVS.
Overview of p-value calculation for pathway enrichment analysis
The p-value associated with a pathway in enrichment analysis is the probability of observing an association at least as strong as the one seen between the selected genes from the experimental list and the pathway if that association were due to random chance alone. The smaller the p-value, the less likely it is that the association is random. In pathway analysis, p-values below 0.05 are generally taken to indicate that the association is statistically significant, so that the null hypothesis of no association between the genes of interest and the pathway can be rejected.
NOTE: A large p-value (i.e. not statistically significant) does not always mean that the association between a pathway and the genes is biologically irrelevant, and a small p-value does not always mean that it is relevant. Further investigation considering all supporting evidence is necessary to understand the biological importance.
Fisher's Exact Test
Fisher's exact test is a statistical significance test used in the analysis of contingency tables, particularly where sample sizes are small. It is computationally efficient, and the probability of each table is given by the hypergeometric distribution. Fisher's exact test is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact only in the limit as the sample size grows to infinity, as with many statistical tests. Fisher's exact test is therefore considered appropriate for pathway analysis, because it works well with both large and small (<10) result sets. For more details please refer to the Wikipedia page [1].
The test is most commonly applied to 2x2 contingency tables (matrices). To compute the p-value of the test, all tables with the same row and column totals must be ordered by some criterion that measures dependence, and the probabilities of the tables at least as extreme as the observed one are summed.
The table below is an example used to calculate the p-value for pathway analysis.
The probability (p-value) is computed as given by Fisher.
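As a rough illustration of this calculation (a sketch, not the iPAVS implementation), the following Python snippet computes a one-sided Fisher's exact p-value for the overlap between a focus list and a pathway using only the standard library. The function name and the four parameter names are assumptions made for this example; the one-sided p-value is the hypergeometric probability of seeing an overlap at least as large as the observed one.

```python
from math import comb


def fisher_one_sided(overlap, focus_size, pathway_size, background_size):
    """One-sided Fisher's exact p-value: P(X >= overlap).

    overlap         -- focus-list genes that are in the pathway (hypothetical name)
    focus_size      -- genes in the focus list
    pathway_size    -- genes in the pathway
    background_size -- genes in the background (e.g. the genome)
    """
    # Number of ways to draw a focus list of this size from the background.
    total = comb(background_size, focus_size)
    max_overlap = min(focus_size, pathway_size)
    # Sum the hypergeometric probabilities of all overlaps at least
    # as large as the observed one.
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, focus_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Made-up numbers: 8 of 50 focus genes fall in a 100-gene pathway,
    # against a background of 20000 genes.
    print(fisher_one_sided(8, 50, 100, 20000))
```

Exact integer arithmetic is used for the binomial coefficients, so the only rounding happens in the final division.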
Multiple testing correction
Bonferroni multiple test correction is implemented in iPAVS.
Algorithm
1. Get the family-wise significance level = alpha.
2. Find the p-value for each pathway = p, and order the p-values in positions i = 1...n.
3. Number of pathways investigated = n_pathways.
4. PI = position of the first p-value for which the following inequality holds (p < alpha/n_pathways).
5. Adjust the p-values for i = 1 to the position PI as below:
If p < alpha/n_pathways then reject the null hypothesis for that pathway and
set the adjusted p-value (Ap) = n_pathways*p
else use the original p-value p.
For more details please refer to wiki.
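The correction can be sketched in a few lines of Python. This is the textbook form of the Bonferroni correction, not necessarily what iPAVS runs: it multiplies every p-value by the number of pathways (capped at 1), whereas the steps above only adjust the p-values that pass the threshold. The function name and return format are assumptions made for this example.

```python
def bonferroni_correct(p_values, alpha=0.05):
    """Return (adjusted p-values, significance flags) for a list of p-values.

    Textbook Bonferroni: a pathway is significant when p < alpha / n_pathways,
    which is equivalent to adjusted p = n_pathways * p < alpha.
    """
    n_pathways = len(p_values)
    if n_pathways == 0:
        return [], []
    threshold = alpha / n_pathways
    # Adjusted p-values are capped at 1 because they are probabilities.
    adjusted = [min(1.0, p * n_pathways) for p in p_values]
    significant = [p < threshold for p in p_values]
    return adjusted, significant


if __name__ == "__main__":
    # Made-up p-values for three pathways.
    adjusted, significant = bonferroni_correct([0.001, 0.02, 0.5], alpha=0.05)
    for adj, sig in zip(adjusted, significant):
        print(adj, sig)
```

With three pathways and alpha = 0.05 the per-pathway threshold is 0.05/3, so a raw p-value of 0.02 is no longer significant after correction.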
Review of multiple test correction:
The more tests we perform on a set of data, the more likely we are to commit a Type I error (i.e. the greater the chance of rejecting the null hypothesis when it is true). So the more pathways are analyzed, the greater the chance of observing a false-positive result at a given alpha level [6].
Binomial Proportions Test
The Fisher exact test computes the significance of the overlap between two lists of genes exactly, whereas the binomial proportions and Chi-squared tests are approximations. The significance of the overlap between a pathway and the input focus list is given by a Z-score, which is computed for each pathway using a binomial proportions test from the quantities defined below [2]. A Z-score of 0 indicates no enrichment.
The higher the Z-score, the greater the significance.
a = number of molecules shared by the input focus list and the pathway
b = number of molecules shared by the input focus list and all pathways combined
c = number of molecules shared by the pathway and the background list or genome/proteome (for a genome/proteome background this is the number of molecules in the pathway itself)
d = number of molecules in the background list, or total molecules in the organism (genome, proteome, etc.)
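One way to turn a, b, c and d into a Z-score is the textbook two-proportion z-statistic with a pooled variance estimate, sketched below in Python. Whether iPAVS uses this pooled form or another variant of the binomial proportions test is an assumption here; the function name is hypothetical. The sketch compares the proportion a/b seen in the focus list with the proportion c/d seen in the background.

```python
from math import sqrt


def binomial_proportions_z(a, b, c, d):
    """Two-proportion z-score comparing a/b (focus list) with c/d (background).

    Uses the pooled-proportion standard error; this is an assumed form,
    not a confirmed description of the iPAVS calculation.
    """
    p_focus = a / b
    p_background = c / d
    pooled = (a + c) / (b + d)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / b + 1 / d))
    if standard_error == 0:
        # Both proportions are 0 or both are 1: no evidence either way.
        return 0.0
    return (p_focus - p_background) / standard_error


if __name__ == "__main__":
    # Made-up counts: 10 of 50 focus molecules are in the pathway,
    # against 100 of 1000 background molecules.
    print(binomial_proportions_z(10, 50, 100, 1000))
```

When the two proportions are equal the score is 0, matching the statement that a Z-score of 0 indicates no enrichment; a positive score means the pathway is over-represented in the focus list.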
Parametric Analysis of Gene Set Enrichment (PAGE) analysis
New gene set analysis (GSA) algorithms have been proposed and are now commonly employed, such as Generally Applicable Gene-set Enrichment (GAGE), PAGE and Gene Set Enrichment Analysis (GSEA) [3]. At present, PAGE is implemented in iPAVS.
PAGE is a modified gene set enrichment analysis method based on a parametric statistical model. PAGE uses the normal distribution for statistical inference and thus requires less computation than other methods such as GSEA. PAGE has been successfully used in several studies to detect significantly changed gene sets from microarray data and to compare multiple microarray data sets. For more details on PAGE please refer to the PAGE manuscript.
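The parametric idea behind PAGE can be sketched as follows, assuming the commonly described form of the statistic: the mean fold change of a gene set is compared with the mean of all genes, scaled by the standard deviation of all genes and the square root of the gene set size, and the resulting Z-score is converted to a p-value with the normal distribution. The function name, the use of the population standard deviation, and the two-sided p-value are assumptions for this example, not details taken from iPAVS.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def page_z_score(all_fold_changes, gene_set_fold_changes):
    """Return (z, two-sided p-value) for one gene set.

    all_fold_changes      -- fold-change values of every gene in the data set
    gene_set_fold_changes -- fold-change values of the genes in the gene set
    """
    mu = mean(all_fold_changes)        # mean over all genes
    sigma = pstdev(all_fold_changes)   # standard deviation over all genes (assumed population SD)
    m = len(gene_set_fold_changes)     # gene set size
    set_mean = mean(gene_set_fold_changes)
    z = (set_mean - mu) * sqrt(m) / sigma
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Made-up log fold changes for five genes; the gene set holds two of them.
    z, p = page_z_score([-2, -1, 0, 1, 2], [1, 2])
    print(z, p)
```

Because the statistic needs only means and one standard deviation, no permutation of sample labels or gene labels is required, which is where the lower computational cost comes from.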
Gene Ontology Overrepresentation using GOSTAT
(NOTE: currently disabled) GOSTAT is a tool to find statistically overrepresented Gene Ontology (GO) terms within a group of genes. It uses GO annotations and computes which GO terms are overrepresented in the input list of focus molecules. This results in a list of GO terms sorted by their specificity. In iPAVS, the gene list for each enriched pathway is submitted, the Fisher's score is computed, and the top 10 GO terms in each category (Molecular Function, Cellular Component and Biological Process) are reported.
For more details about the GOSTAT [4] package, consult its online manual.
In the near future, the following alternative pathway analysis algorithms will be implemented in iPAVS:
Topology based analysis methods
Reference