Manual

Launch

      • For microarray data analysis
          • Double-click the launcher icon
              • (Windows) "run-microarray.vbs" file
              • (Mac) "launcher-microarray.sh.command" file
              • (Linux) "launcher-microarray.desktop" file
      • For RNA-seq data analysis
          • Double-click the launcher icon
              • (Windows) "run-RNAseq.vbs" file
              • (Mac) "launcher-RNAseq.sh.command" file
              • (Linux) "launcher-RNAseq.desktop" file
      • iGEAK sometimes takes ~20 seconds to start. Please be patient and DO NOT start another job.
      • If you launch iGEAK from a USB flash drive or an external HDD, startup can take more than 20 seconds. Please be patient.

Interactive Consoles

iGEAK uses several interactive control widgets (sliders, radio buttons, text/area inputs, select boxes, etc.) to give users a convenient way to adjust parameters quickly and intuitively. For example, an expression heatmap in the "Heatmap" tab can be easily resized, recolored, and re-clustered just by moving sliders. A mouse is sufficient in most cases, but you may want to use the mouse AND keyboard for fine adjustment.

Mouse Control Only

      • Point to a slider knob ("circle") in the console
      • Click and hold the left mouse button
      • Move the slider to the left or right
      • Release the mouse button

Mouse + Keyboard Control (Fine-Adjustment)

      • Point to a slider knob ("circle")
      • Click the left mouse button
      • Use the UP (or RIGHT) arrow key to increase the value, or the DOWN (or LEFT) arrow key to decrease it
      • You can use the Shift/Control key to select multiple targets.

A Current Pipeline Implemented in iGEAK

Each page tab in the iGEAK interface corresponds to one task (e.g. "GSEA"). You can initiate and update a task just by clicking its tab, and you can switch between tasks and update parameters with a few mouse clicks.

"Introduction"

      • This tab provides a brief introduction to iGEAK.
      • This tab is not associated with any task.
      • You can find the list of pre-installed packages and the iGEAK session information here.

"Data Upload"

Two species are currently supported: (1) Human and (2) Mouse. The three well-formatted input files described on the "Input Data" page are required to run iGEAK.


For microarray data:

Your data matrix should already be summarized and normalized (e.g. by the RMA (Robust Multi-array Average) method). Please check the sample boxplot to confirm that your data matrix is properly normalized. The mean processed signals represent the normalized, background-corrected mean signal intensities, and the mean values should be properly aligned across samples.

Open data folders and upload (#1) an annotation file, (#2) a sample-group definition file, then (#3) a log2-transformed mRNA expression (microarray) or a raw count (RNA-seq) matrix.

You can restrict the analysis to probesets with gene symbols [#4], and you can also remove "sub-optimal" Affymetrix probesets [#5] from the downstream analyses. In most cases, Affymetrix probesets (U133 and similar platforms) whose IDs do not end in a plain "_at" are sub-optimal.

      • All probe sets have one of the following two extensions:
          • _at : anti-sense target (most probe sets on the array)
          • _st : sense target (only some control probes are in sense orientation on the array)
      • A few probe sets are designated as follows:
          • _i : reduced number of pairs in the probe set.
      • Some probe sets represent more than one gene or EST:
          • _s_at : designates probe sets that share common probes among multiple transcripts from different genes.
          • _a_at : designates probe sets that recognize multiple alternative transcripts from the same gene (on HG-U133 these probe sets have an "_s" suffix).
          • _x_at : designates probe sets where it was not possible to select either a unique probe set or a probe set with identical probes among multiple transcripts. Rules for cross-hybridization were dropped. Therefore, these probe sets may cross-hybridize in an unpredictable manner with other sequences.
          • _g_at : similar genes, also unique probe sets elsewhere on the array.
          • _f_at : similarity rules dropped, probe set will recognize more than one gene.
          • _i_at : designates sequences for which there are fewer than the required numbers of unique probes specified in the design.
          • _b_at : all probe selection rules were ignored. Withdrawn from GenBank.
          • _l_at : sequence represented by more than 20 probe pairs.
          • _r_ : designates sequences for which it was not possible to pick a full set of unique probes using Affymetrix' probe selection rules.
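The suffix rules above can be turned into a simple filter. The following is a minimal sketch, not iGEAK's actual implementation: it assumes that "sub-optimal" means any probeset ID that carries a letter-coded suffix (such as _s_at or _x_at) or that does not end in "_at" at all. The function names and the example IDs are for illustration only.

```python
import re

# Assumption: an ID is "sub-optimal" if it ends in a letter-coded suffix
# such as _s_at, _x_at, _a_at, or if it does not end in "_at" at all.
_CODED_SUFFIX = re.compile(r"_[a-z]_(at|st)$")


def is_suboptimal(probeset_id: str) -> bool:
    """Return True for IDs that are not plain "_at" probesets."""
    if _CODED_SUFFIX.search(probeset_id):
        return True
    return not probeset_id.endswith("_at")


def keep_optimal(probeset_ids):
    """Keep only plain "_at" probesets, preserving the input order."""
    return [p for p in probeset_ids if not is_suboptimal(p)]


if __name__ == "__main__":
    example_ids = ["1053_at", "1007_s_at", "1552256_a_at", "121_at"]
    print(keep_optimal(example_ids))
```

Whether a given probeset should really be dropped is a judgment call, so a filter like this is best treated as a first pass before checking individual probesets.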

You may check the quality of each probeset on the GeneAnnot server: https://genecards.weizmann.ac.il/geneannot/index.shtml

Once all 3 input files are uploaded, [#6] choose the groups of samples (at least two) you want to analyze and move them from the left panel to the right panel. Finally, click the "Submit" button [#7]. This action subsets the original gene expression matrix for the iGEAK engine.

For RNA-seq data:

You upload a raw count matrix. iGEAK normalizes these counts on the fly using edgeR's TMM (Trimmed Mean of M-values) normalization method (Robinson et al., 2010). You can choose one of two differentially expressed gene (DEG) prediction methods: edgeR or voom-limma. You can filter out lowly expressed genes by changing the "minimum CPM values" and/or "minimum sample size" settings [#6]. The normalized gene expression matrix is displayed below the raw count matrix.
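To illustrate what a CPM-based low-expression filter does, here is a minimal sketch. It assumes the common rule "keep a gene if its CPM reaches the minimum in at least the minimum number of samples" and uses plain library sizes rather than TMM-scaled ones, so it is not necessarily the exact rule iGEAK applies. The function and parameter names are hypothetical.

```python
def counts_per_million(counts):
    """Convert a {gene: [count per sample]} table to CPM.

    Assumes every row has the same number of samples and that no sample
    has a total count of zero.
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM is >= min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return {
        gene: counts[gene]
        for gene, values in cpm.items()
        if sum(v >= min_cpm for v in values) >= min_samples
    }


if __name__ == "__main__":
    # Toy data: three samples, each with a library size of one million reads.
    raw = {
        "GeneA": [500, 600, 550],
        "GeneB": [0, 1, 0],
        "GeneC": [999500, 999399, 999450],
    }
    print(sorted(filter_low_expression(raw, min_cpm=1.0, min_samples=2)))
```

Raising either threshold removes more genes; a stricter filter reduces noise from lowly expressed genes at the cost of possibly discarding genes that are expressed in only a few samples.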

"PCA"

You can quickly check whether there are outliers among your samples. This tab provides a principal component analysis (PCA) plot and a sample-correlation plot.

If you decide to remove some samples, edit your sample-group definition file (metadata) and reload the updated file.

PCA (Principal Component Analysis) Plot

      • You can get sample information by selecting an area
          • Place your cursor over the plot
          • Click and hold the mouse button, then drag to create a rectangular area
          • You will then get the sample names and their first two principal component scores.
      • The plot can be downloaded as an image file. Place your cursor over the plot, click the right mouse button, and save it (PNG format).

Correlation Plot

    • The correlation among samples can be inspected using this (Pearson) correlation plot. You can easily adjust the plot size, font size, tree height, and border color using the consoles.
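For readers who want to see what the plot is built from, the sketch below computes pairwise Pearson correlations between samples in plain Python. It is only an illustration of the statistic; the data layout (a dictionary mapping sample names to expression values) and the function names are assumptions, not part of iGEAK.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant value lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(samples):
    """Return {sample: {sample: r}} for every pair of samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Toy expression values for three samples over four genes.
    toy = {"s1": [1, 2, 3, 4], "s2": [2, 4, 6, 8], "s3": [4, 3, 2, 1]}
    for name, row in correlation_matrix(toy).items():
        print(name, {k: round(v, 3) for k, v in row.items()})
```

A sample whose correlations with the other members of its own group are clearly lower than the rest is a candidate outlier.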

"Multi-group"

You are probably interested in only a subset of the genes in your list. Copy and paste, or type, your genes of interest (gene symbols, case-sensitive) into the text area to get the updated gene expression matrix, heatmap, and boxplots for them.

You can choose a parametric (ANOVA & post-hoc pairwise Tukey's test) or non-parametric (Kruskal-Wallis & post-hoc Mann-Whitney U-test) test based on (1) the Shapiro-Wilk normality test and (2) the group dispersion test.

If the Shapiro-Wilk test p-value is > 0.05, you may choose the parametric tests (ANOVA & Tukey's test), since your data do not appear to violate the normality assumption. The parametric tests can also perform well with continuous data that are slightly non-normal if each group's sample size is > 15 and you have 2-9 groups in total.

You may choose the non-parametric tests (Kruskal-Wallis & post-hoc pairwise Mann-Whitney U-test) if your data violate the normality assumption and/or you have a very small sample size, but the data for all groups have the same dispersion. If your groups have a different dispersion, the non-parametric tests might not provide valid results.
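The guidance above can be summarized as a small decision helper. This is only a sketch of the rules of thumb as written here, with hypothetical function and argument names; it cannot judge whether non-normality is "slight", and it is not a substitute for inspecting your data.

```python
def suggest_test(shapiro_p, equal_dispersion, min_group_size, n_groups, alpha=0.05):
    """Suggest a test family from the rules of thumb in this section.

    shapiro_p        : p-value of the Shapiro-Wilk normality test
    equal_dispersion : True if the group dispersion test found no difference
    min_group_size   : size of the smallest group
    n_groups         : number of groups being compared
    """
    if shapiro_p > alpha:
        # No evidence against normality.
        return "parametric (ANOVA & Tukey)"
    if min_group_size > 15 and 2 <= n_groups <= 9:
        # Assumption: the departure from normality is only slight.
        return "parametric (ANOVA & Tukey)"
    if equal_dispersion:
        return "non-parametric (Kruskal-Wallis & Mann-Whitney U)"
    return "undetermined: non-parametric results may not be valid"


if __name__ == "__main__":
    print(suggest_test(0.30, True, 5, 3))
    print(suggest_test(0.01, True, 5, 3))
    print(suggest_test(0.01, False, 5, 3))
```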

"Two-group"

    • In most cases, you may want to compare two groups (e.g. control vs. treated) to analyze differentially expressed genes (DEGs). All downstream analyses will rely on your choice of two groups.
    • Now you can choose two groups for detecting differentially expressed genes. Once two groups are selected, the original gene expression matrix is instantly updated.

"DEG" (two-group)

      • iGEAK uses the limma method to detect differentially expressed genes. To narrow down your targets, you can filter them by (1) fold change, (2) p-value (or adjusted p-value / FDR), or (3) both.
      • iGEAK uses the R/Bioconductor limma package (Smyth, 2005) for microarray data and limma/voom (Law et al., 2014) for gene-level RNA-seq data.
      • In the RNA-seq pipeline, the uploaded raw count matrix is normalized using the TMM method implemented in the edgeR package (Robinson et al., 2010) during the “DEG” prediction process.
      • Differentially expressed genes are visualized using a heatmap (“Heatmap” tab) and a volcano plot (“VolcanoPlot” tab). When a gene set of interest is submitted, these genes’ expression levels are also displayed as boxplot/beeswarm plots.
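The three filtering options (fold change, p-value, or both) can be sketched as follows. This is an illustrative example, not iGEAK's code: it assumes each result row carries a log2 fold change and a (possibly adjusted) p-value, and the field names, mode names, and default thresholds are hypothetical.

```python
import math


def filter_degs(results, min_fold_change=2.0, max_p=0.05, mode="both"):
    """Filter DEG results by fold change, p-value, or both.

    results: list of dicts with keys "gene", "log2fc", and "p".
    mode   : "fold_change", "p_value", or "both".
    """
    min_abs_log2fc = math.log2(min_fold_change)
    kept = []
    for row in results:
        passes_fc = abs(row["log2fc"]) >= min_abs_log2fc
        passes_p = row["p"] <= max_p
        if mode == "fold_change":
            passed = passes_fc
        elif mode == "p_value":
            passed = passes_p
        elif mode == "both":
            passed = passes_fc and passes_p
        else:
            raise ValueError(f"unknown mode: {mode}")
        if passed:
            direction = "up" if row["log2fc"] > 0 else "down"
            kept.append({**row, "direction": direction})
    return kept


if __name__ == "__main__":
    toy = [
        {"gene": "G1", "log2fc": 2.0, "p": 0.001},
        {"gene": "G2", "log2fc": -1.5, "p": 0.2},
        {"gene": "G3", "log2fc": 0.3, "p": 0.01},
        {"gene": "G4", "log2fc": 0.1, "p": 0.9},
    ]
    for mode in ("fold_change", "p_value", "both"):
        print(mode, [r["gene"] for r in filter_degs(toy, mode=mode)])
```

Filtering on both criteria is the strictest option; filtering on fold change alone keeps large but statistically unsupported changes, which is why the two are usually combined.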

References

      1. Law,C.W. et al. (2014) voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol., 15, R29.
      2. Robinson,M.D. et al. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139–40.
      3. Smyth,G.K. (2005) Limma: linear models for microarray data. Bioinforma. Comput. Biol. Solut. Using R Bioconductor, 397–420.

"VolcanoPlot" & "Heatmap" (two-group)

Volcano Plot

      • The volcano plot provides an easy way to visualize DEGs and some outliers. Red and blue dots represent up- and down-regulated genes that pass your filters from the previous step. All other genes are shown as black dots.
      • This plot is interactive. To see detailed information on the genes in a certain area, draw a rectangular box over them. The details will be displayed in the shaded text box.

Heatmap

      • A heatmap is a convenient way to visualize clusters of genes with similar expression patterns. iGEAK provides an interactive console for building the heatmap. You can adjust the size, colors, font sizes, and scaling (z-score, or log2 fold change compared to your choice of "control" group) using the interactive consoles.
      • You are probably interested in only a subset of the genes in your list. Copy and paste your gene list into the text area to get the updated gene expression matrix, heatmap, and boxplots for them. Please do NOT type genes directly in the box, because this will slow your system down.
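The two heatmap scaling options can be illustrated for a single gene (one row of the matrix). This is a minimal sketch under two assumptions: the input values are already log2-transformed, and the "control" reference is the mean of the control group's values. The function names are hypothetical.

```python
import statistics


def row_zscores(values):
    """Scale one gene's values to z-scores (sample SD; row must not be constant)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def log2_fold_vs_control(values, groups, control):
    """Subtract the control-group mean from each log2 expression value."""
    control_values = [v for v, g in zip(values, groups) if g == control]
    control_mean = statistics.fmean(control_values)
    return [v - control_mean for v in values]


if __name__ == "__main__":
    log2_expr = [5.0, 7.0, 9.0, 10.0]
    groups = ["control", "control", "treated", "treated"]
    print(row_zscores(log2_expr))
    print(log2_fold_vs_control(log2_expr, groups, "control"))
```

Z-scores emphasize the pattern across samples regardless of a gene's absolute level, while the log2 fold option keeps the magnitude of change relative to the chosen control group.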

"PPI" (two-group)

      • This tab provides a (tentative) PPI and transcription-control network visualization tool based on the visNetwork package (https://github.com/datastorm-open/visNetwork). Undirected PPI information is extracted from BioGRID (https://thebiogrid.org, v3.4).
      • Transcription factors (TFs) and their (conserved) target genes can also be visualized. The backbone PPI network is extended by adding TFs ("star"-shaped nodes) and genes that have transcription factor binding sites (TFBS) for those TFs within 2000 bp of their transcription start sites (TSS). This information was extracted from MSigDB's C3 dataset (http://software.broadinstitute.org/gsea/msigdb). These gene sets contain genes that share a cis-regulatory motif that is conserved across the human, mouse, rat, and dog genomes. The motifs are catalogued (Xie et al. 2005) and represent known or likely regulatory elements in promoters and 3'-UTRs.

Internal Enrichment Analyses

iGEAK provides two independent pathway analyses based on the Reactome database (http://www.reactome.org). The “ReactomePA” (Reactome-Based Pathway Analysis) tab launches a tweaked version of the ReactomePA analysis (Yu and He, 2016).

References

      1. REACTOME database: http://www.reactome.org/
      2. Yu,G. and He,Q.-Y. (2016) ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Mol. Biosyst., 12, 477–479.

"ORA" (two-group)

The "ORA" tab provides a simple Over-Representation Analysis (ORA) function based on Reactome.
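As background, over-representation analysis is commonly scored with a one-sided hypergeometric test: given a gene universe, a pathway, and a list of selected genes, how likely is an overlap at least as large as the one observed? The sketch below shows that calculation in plain Python. It is an illustration of the general technique, not a description of the exact statistic or multiple-testing correction used in this tab, and the function name is hypothetical.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value, P(overlap >= n_overlap).

    n_universe : number of genes in the background
    n_pathway  : number of background genes annotated to the pathway
    n_selected : number of selected genes (e.g. DEGs) in the background
    n_overlap  : number of selected genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 genes in total, 4 in the pathway, 3 selected, 2 shared.
    print(ora_p_value(10, 4, 3, 2))
```

A small p-value means the pathway contains more of the selected genes than expected by chance; in practice the p-values are then adjusted across all tested pathways.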

"GSEA" (two-group)

The “GSEA” tab provides a simplified GSEA algorithm (http://software.broadinstitute.org/gsea) as implemented in the ReactomePA package. The Reactome database is used as the reference gene set database. The whole analysis can be slow.
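To give a feel for what GSEA measures, the sketch below computes an unweighted running-sum enrichment score for one gene set over a ranked gene list. This is a simplified illustration only: it assumes the genes are already ranked (for example by a differential expression statistic), it does not weight hits by the ranking statistic, and it omits the permutation step used to assess significance. It is not the implementation used by this tab, and the function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up at genes in the set and down at
    genes outside it; return the largest deviation from zero (signed).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    step_up = 1.0 / n_hits
    step_down = 1.0 / n_miss
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += step_up if is_hit else -step_down
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]  # most up-regulated first
    print(enrichment_score(ranked, {"a", "b"}))
    print(enrichment_score(ranked, {"e", "f"}))
```

A score near +1 means the set's genes cluster at the top of the ranking, and a score near -1 means they cluster at the bottom.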

"Broad-GSEA" (two-group)

If you prefer to use the Broad GSEA program (http://software.broadinstitute.org/gsea), download the three GSEA input files from the "Broad-GSEA" tab.

"VennDiagram"

      • This tab provides a tool to generate reconfigurable 2-, 3-, 4- and 5-way Venn diagrams. This tool is separate from the iGEAK pipeline.
      • It is more than just a Venn diagram: you can easily extract the set elements from it and download them as a table.
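Extracting the elements of each Venn region amounts to grouping every element by the exact combination of sets that contain it. The sketch below shows one way to do this in plain Python; the function name and the example gene lists are for illustration only and are not taken from iGEAK.

```python
def venn_regions(sets):
    """Group elements by the exact combination of sets that contain them.

    sets: {set_name: set_of_elements}
    Returns {tuple_of_set_names: set_of_elements}, one entry per non-empty region.
    """
    regions = {}
    all_elements = set().union(*sets.values())
    for element in all_elements:
        members = tuple(sorted(name for name, s in sets.items() if element in s))
        regions.setdefault(members, set()).add(element)
    return regions


if __name__ == "__main__":
    gene_lists = {
        "A": {"TP53", "MYC", "EGFR"},
        "B": {"MYC", "EGFR", "KRAS"},
        "C": {"EGFR", "BRCA1"},
    }
    for members, genes in sorted(venn_regions(gene_lists).items()):
        print(" & ".join(members), sorted(genes))
```

Each key names one region of the diagram, so the result maps directly onto a table with one row per region.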

"Orthologs" (renamed from "Symbol Conversion")

Human gene symbols are all upper-case, whereas mouse symbols are lower-case except for the first character.

You may use the following Excel function for quick conversion.

    • The PROPER() function does NOT work properly if a given human gene symbol contains number(s) in the middle.
        • e.g. =PROPER("ABC2DE") returns "Abc2De", not "Abc2de"
    • Instead, please combine the UPPER(), LOWER(), LEFT(), RIGHT(), and LEN() functions as follows:
        • e.g. =UPPER(LEFT("ABC2DE",1))&LOWER(RIGHT("ABC2DE",LEN("ABC2DE")-1))
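The same case conversion can be written in Python, where the pitfall is the same: str.title(), like PROPER(), capitalizes the letter that follows a digit. The sketch below is only an equivalent of the Excel formula above; the function name is hypothetical, and it changes letter case without looking up orthologs.

```python
def human_to_mouse_case(symbol: str) -> str:
    """Upper-case the first character and lower-case the rest."""
    return symbol[:1].upper() + symbol[1:].lower()


if __name__ == "__main__":
    print("ABC2DE".title())               # behaves like PROPER()
    print(human_to_mouse_case("ABC2DE"))  # behaves like the combined formula
```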

However, this approach only works when the human and mouse symbols are the same. In many cases, they are different.

I recommend using the HUGO Gene Nomenclature Committee (HGNC)'s HCOP web service to find the correct orthologous genes between human and mouse.

However, the easiest way to convert human/mouse gene symbols is to use iGEAK's symbolConversion tool. iGEAK currently uses Ensembl v92 to retrieve human/mouse gene orthologs.

Kwangmin Choi @ Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA