Input Format

Input Files

iGEAK uses simple tab-delimited text files (file extension: .csv) as input and output files.

      • iGEAK uses 3 input files. You need to prepare:
          1. an annotation file,
          2. a sample group definition file (metadata),
          3. a gene expression (for microarray) or raw count (for RNA-seq) data matrix.
      • You may ask a bioinformatician to make the input files, but they are easy to prepare yourself.

Common mistake

Please check that no hidden "space" characters are included in the columns or headers (especially after a gene ID/name, sample name, or group name). This is the most common mistake when input files are prepared with spreadsheet programs (e.g. Excel).
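The check described above can also be done with a short script. The following is a minimal Python sketch, not part of iGEAK itself: the function names `find_padded_cells` and `strip_cells` and the sample content are hypothetical, chosen only for illustration. It reports cells that carry leading or trailing whitespace in tab-delimited text and returns a cleaned copy.

```python
import csv
import io


def find_padded_cells(text: str) -> list[tuple[int, int, str]]:
    """List (row, column, value) for cells with stray whitespace (1-based positions)."""
    hits = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for r, row in enumerate(reader, start=1):
        for c, cell in enumerate(row, start=1):
            if cell != cell.strip():
                hits.append((r, c, cell))
    return hits


def strip_cells(text: str) -> str:
    """Return the same tab-delimited text with every cell stripped of outer whitespace."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()


# Hypothetical metadata content with a trailing space after "S1" and "control".
sample = "SampleID\tGroup\nS1 \tcontrol \nS2\ttreated\n"
print(find_padded_cells(sample))
print(strip_cells(sample), end="")
```

Running the check before upload is cheaper than debugging mismatched sample or gene names later, because a trailing space is invisible in most spreadsheet views.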

Annotation File

      • An annotation file is a simple two-column table.
      • The first column holds "unique" probe IDs (microarray) or gene IDs (or symbols) (RNA-seq), and the second column always holds the matching gene symbol.
      • Users can add alternative group definitions in multiple columns and choose one of them during the first step ("Data Upload").
      • Gene symbols are not stable and often change. If you want to use new symbols mapped to the unique IDs in the first column, you may use DAVID's conversion tool [link].
      • Example (microarray)

Microarray (Affymetrix probeset ID)

RNA-seq (Ensembl Gene ID)
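Because the first column of the annotation file must contain unique IDs, it can help to verify this before uploading. The sketch below is a hedged Python illustration, not iGEAK code: the function name `check_annotation`, the `has_header` switch, and the placeholder IDs (`probe_1`, `GENE_A`, ...) are assumptions made for the example. It also assumes the file starts with one header row, which the description above does not state.

```python
import csv
import io
from collections import Counter


def check_annotation(text: str, has_header: bool = True) -> list[str]:
    """Return a list of human-readable problems found in tab-delimited annotation text."""
    problems = []
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if has_header:
        rows = rows[1:]  # assumption: the first row is a header, not data
    for n, row in enumerate(rows, start=1):
        if len(row) < 2:
            problems.append(f"row {n}: expected an ID and a gene symbol")
    # IDs in the first column must be unique.
    counts = Counter(row[0] for row in rows)
    for identifier, count in counts.items():
        if count > 1:
            problems.append(f"duplicate ID: {identifier}")
    return problems


# Placeholder content; real files would hold probe IDs or gene IDs.
annotation = "ID\tSymbol\nprobe_1\tGENE_A\nprobe_2\tGENE_B\nprobe_1\tGENE_C\n"
print(check_annotation(annotation))
```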

Group-Definition File (metadata)

      • The group definition file is a simple multi-column table.
      • The first column holds sample IDs and all other columns hold sample groups. Users can add alternative group definitions in multiple columns and choose one of them during the first step ("Data Upload").

Microarray

RNA-seq
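To make the layout concrete, here is a small Python sketch of how a group definition file with several alternative groupings could be read. It is only an illustration under stated assumptions: the function `read_groups`, the column names (`SampleID`, `Treatment`, `Timepoint`), and the sample values are hypothetical, and the sketch assumes the first row is a header naming each column.

```python
import csv
import io


def read_groups(text: str, group_column: str) -> dict[str, str]:
    """Map each sample ID (first column) to its group in the chosen column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    fieldnames = reader.fieldnames or []
    if group_column not in fieldnames[1:]:
        raise KeyError(f"no group column named {group_column!r}")
    sample_column = fieldnames[0]  # the first column holds sample IDs
    return {row[sample_column]: row[group_column] for row in reader}


# Hypothetical metadata with two alternative group definitions.
metadata = (
    "SampleID\tTreatment\tTimepoint\n"
    "S1\tcontrol\tday0\n"
    "S2\tdrug\tday0\n"
    "S3\tdrug\tday7\n"
)
print(read_groups(metadata, "Treatment"))
print(read_groups(metadata, "Timepoint"))
```

Each extra column is simply another way of partitioning the same samples, which is why only one of them is selected for a given analysis.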

Gene Expression Matrix

      • iGEAK-microarray uses a tab-delimited, normalized (e.g. by RMA) expression matrix (txt/csv format) as its input file. If you want to prepare input files from raw CEL files, you may try ArrayAnalysis.org (http://arrayanalysis.org).
      • iGEAK-RNAseq uses a raw count matrix generated by read-counting programs such as featureCounts or HTSeq-count. If you want to prepare input files from scratch (FASTQ or BAM), you may try the Galaxy platform (https://usegalaxy.org).
          • In gene-level counting for a differential gene expression study, the first column holds gene symbols, and these symbols are the unique identifiers of the raw count matrix. The raw count matrix is normalized using the TMM method during the differential gene expression prediction step.
      • The first column holds unique identifiers such as probeset IDs. These IDs should be the same as the unique IDs in your annotation file.
      • The first row is a header containing sample IDs. These IDs should be the same as the sample IDs in your metadata file.

Microarray: log2-normalized gene expression matrix

RNA-seq: raw gene count matrix
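The two matching rules above (row IDs against the annotation file, header sample IDs against the metadata file) can be checked together. The following Python sketch is an illustration only, not iGEAK's own validation: `read_rows`, `check_consistency`, and all IDs and values are hypothetical. It assumes that every file starts with one header row and that the first header cell of the matrix labels the ID column.

```python
import csv
import io


def read_rows(text: str) -> list[list[str]]:
    """Parse tab-delimited text into a list of non-empty rows."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def check_consistency(matrix: str, annotation: str, metadata: str) -> dict[str, list[str]]:
    """Report matrix IDs and sample IDs that have no match in the other two files."""
    matrix_rows = read_rows(matrix)
    # Assumption: first header cell labels the ID column; the rest are sample IDs.
    matrix_samples = matrix_rows[0][1:]
    matrix_ids = [row[0] for row in matrix_rows[1:]]
    # Assumption: annotation and metadata each start with one header row.
    annotation_ids = {row[0] for row in read_rows(annotation)[1:]}
    metadata_samples = {row[0] for row in read_rows(metadata)[1:]}
    return {
        "ids_missing_from_annotation": [i for i in matrix_ids if i not in annotation_ids],
        "samples_missing_from_metadata": [s for s in matrix_samples if s not in metadata_samples],
    }


# Placeholder inputs: probe_9 and S3 are deliberately left unmatched.
matrix = "ID\tS1\tS2\tS3\nprobe_1\t7.1\t6.9\t7.4\nprobe_2\t5.0\t5.2\t4.8\nprobe_9\t3.3\t3.1\t3.0\n"
annotation = "ID\tSymbol\nprobe_1\tGENE_A\nprobe_2\tGENE_B\n"
metadata = "SampleID\tGroup\nS1\tcontrol\nS2\ttreated\n"
print(check_consistency(matrix, annotation, metadata))
```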

Output Files

GEATPbox exports a data table as a tab-delimited text file with a .csv extension. Microsoft Excel doesn't open these files correctly by default, so if you use MS Excel, please use the following simple steps to open them.

      1. Open a new Excel window
      2. Choose the Data tab
      3. Choose the From Text option
      4. Choose your *.csv file
      5. Choose the Delimited radio box, then click Next
      6. Choose Tab

Please check the following links for detailed information (Excel 2007, 2010, 2013, 2016).

Kwangmin Choi @ Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA