Microarray Pre-processing Pipeline

This pre-processing pipeline starts with Affymetrix CEL files and delivers data ready for downstream analysis. It is implemented entirely in R using standard libraries available from the Comprehensive R Archive Network (cran.r-project.org) and Bioconductor (www.bioconductor.org). The one exception is the ComBat library: the version used here can be found at (http://sites.google.com/site/plaisier/microarray-qc-pipeline/ComBat.R), and the original unmodified library can be found at (http://statistics.byu.edu/johnson/ComBat/Download_files/ComBat.R). The pipeline goes through the steps outlined in Figure 1. First it reads in the CEL files and masks out the mis-targeted and nonspecific probes using an alternate CDF file. Then background subtraction and quantile normalization are applied using the justRMA function from the affy package of Bioconductor (the code can be modified to use your preferred methods). The normalized data are then used to calculate present and absent calls with the panp library, which identifies the genes that can be considered accurately measured on the array and expressed in the sample. The present genes are then corrected for any known batch effects using the ComBat library.

A few manual clean-up steps follow. I typically convert the resulting file from the Microsoft Excel spreadsheet format to a comma-separated values (csv) file. Due to an issue in ComBat there will be two columns of ProbeIds in the Microsoft Excel spreadsheet file, and I usually just delete one of them. Then I remove the '.CEL' from the end of the sample names in the first row of the preprocessed data file so that they match the names in the phenotype file. The data are then ready for analysis.
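The manual clean-up (dropping the duplicated ProbeId column and removing '.CEL' from the sample names) could also be scripted. The following is a minimal Python sketch and is not part of the attached files. The function name `clean_preprocessed` is hypothetical, and the sketch assumes that the table has already been read into rows, that the first column holds the ProbeIds, and that the ComBat duplicate appears as a second column identical to the first.

```python
def clean_preprocessed(rows):
    """Drop a duplicated ProbeId column and strip '.CEL' from sample names.

    rows: list of lists of strings, header row first, ProbeIds in column 0.
    Returns a new list of rows; the input is not modified.
    """
    header, body = rows[0], rows[1:]

    # Assumption: the duplicate is column 1 repeating column 0 in every data row.
    duplicated = (
        bool(body)
        and len(header) > 2
        and all(len(r) > 1 and r[0] == r[1] for r in body)
    )
    if duplicated:
        header = [header[0]] + header[2:]
        body = [[r[0]] + r[2:] for r in body]

    def strip_cel(name):
        # Remove a trailing '.CEL' (any letter case) so that sample names
        # match the names used in the phenotype file.
        return name[:-4] if name.upper().endswith(".CEL") else name

    header = [header[0]] + [strip_cel(h) for h in header[1:]]
    return [header] + body


example = [
    ["ProbeId", "ProbeId", "sample1.CEL", "sample2.CEL"],
    ["1007_s_at", "1007_s_at", "7.1", "6.9"],
]
cleaned = clean_preprocessed(example)
```

The rows can come from `csv.reader` and be written back with `csv.writer`. Checking that the two columns really are identical before dropping one guards against deleting a sample column by mistake.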

Figure 1.  Outline of pre-processing pipeline.

Three files are attached below. The 'preprocessing.R' file contains the R code for the pre-processing pipeline; the libraries it loads must be installed first. The 'ComBat.R' file is a slightly modified version of the ComBat.R script that writes the prior.plot to a PDF so that the pipeline can be automated. The last file, 'u133plus2mskProbes.csv', has a count of the probes in each probeset, which is used to exclude all probesets with fewer than 7 probes (use your judgement and a plot of the distribution of probes per probeset to choose this number for your particular array platform, or use mine if you are using the Affy U133+ 2.0).
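To illustrate how the probe counts are used, here is a minimal Python sketch of the exclusion step. In the pipeline itself this filter is applied inside 'preprocessing.R'; the sketch only shows the logic. The function names are hypothetical, and the two-column layout (probeset ID, then probe count, with an optional header line) is an assumption about 'u133plus2mskProbes.csv', not something confirmed here.

```python
import csv

MIN_PROBES = 7  # threshold quoted in the text for the Affy U133+ 2.0


def probesets_to_keep(count_lines, min_probes=MIN_PROBES):
    """Return the set of probeset IDs that have at least min_probes probes.

    count_lines: iterable of CSV lines, assumed to be "probeset_id,probe_count".
    """
    keep = set()
    for row in csv.reader(count_lines):
        if len(row) < 2:
            continue
        try:
            n_probes = int(row[1])
        except ValueError:
            continue  # header line or malformed count
        if n_probes >= min_probes:
            keep.add(row[0])
    return keep


def filter_expression(rows, keep):
    """Keep the header row plus the data rows whose probeset ID is in keep."""
    return [rows[0]] + [r for r in rows[1:] if r[0] in keep]


counts = ["probeset,n_probes", "1007_s_at,11", "1053_at,6", "117_at,7"]
keep = probesets_to_keep(counts)
```

Passing an open file handle instead of the example list works the same way, since `csv.reader` accepts any iterable of lines.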

Making Probes per Probeset File:
  1. Download the appropriate masking probetab text file from (http://masker.nci.nih.gov/ev/bychip.html); the file used for the U133+ 2.0 array is ‘U133Plus2msk_probe_tab.gz’.
  2. Decompress the file with gunzip.
  3. Modify the Python script ‘probesPerProbeset.py’ to read this file, and change the output file name.
  4. Run the script to produce the output file with the number of probes per probeset.
  5. Determine a threshold for the number of probes per probeset. I used R and plotted the distribution to make this choice.
  6. Set this threshold in the ‘preprocessing.R’ script.
You will then have to modify the working directories in the files (the 'setwd("your_directory_here")' lines) for the code to work correctly.
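For readers without access to ‘probesPerProbeset.py’, steps 3 to 5 of the list above can be sketched as follows. This is not the attached script, only a hypothetical illustration of the same idea. It assumes that the decompressed probetab file is tab-delimited with one probe per line, a header line first, and the probeset ID in the first column. All function names are made up for the example.

```python
import csv
from collections import Counter


def count_probes(lines, has_header=True):
    """Count probes per probeset from tab-delimited probetab lines.

    Assumes one probe per line with the probeset ID in the first column.
    """
    counts = Counter()
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the column names
    for row in reader:
        if row and row[0]:
            counts[row[0]] += 1
    return counts


def size_distribution(counts):
    """Map probes-per-probeset to the number of probesets of that size."""
    return dict(sorted(Counter(counts.values()).items()))


def write_counts(counts, out_path):
    """Write 'probeset,n_probes' rows to a CSV file."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["probeset", "n_probes"])
        for probeset, n_probes in sorted(counts.items()):
            writer.writerow([probeset, n_probes])


# Hypothetical usage with the decompressed file from step 2:
#   with open("U133Plus2msk_probe_tab") as fh:
#       counts = count_probes(fh)
#   write_counts(counts, "probes_per_probeset.csv")
demo_lines = [
    "Probe Set Name\tProbe X\tProbe Y",
    "a_at\t1\t2",
    "a_at\t3\t4",
    "b_at\t5\t6",
]
demo_counts = count_probes(demo_lines)
demo_dist = size_distribution(demo_counts)
```

Printing or plotting the result of `size_distribution` gives the distribution mentioned in step 5, from which a threshold can be chosen.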
Chris Plaisier, May 26, 2009, 6:53 PM
Chris Plaisier, Sep 15, 2009, 5:06 PM
Chris Plaisier, Jul 20, 2009, 2:07 PM
Chris Plaisier, Jul 17, 2009, 3:48 PM