mCUDA-MEME

Introduction

mCUDA-MEME is an ultrafast, scalable motif discovery algorithm for multiple GPUs, based on the MEME algorithm (version 4.4.0) and implemented using a hybrid combination of the CUDA, MPI and OpenMP parallel programming models. It is a further extension of CUDA-MEME with respect to accuracy and speed, and has been tested on a GPU cluster with eight compute nodes and two Fermi-based Tesla S2050 (and Tesla-based Tesla S1070) quad-GPU computing systems, running Linux with the MPICH2 library. The experimental results showed that the algorithm scales well with respect to both dataset size and the number of GPUs. At present, the OOPS and ZOOPS models are supported, which are sufficient for most motif discovery applications. The algorithm has been used in CompleteMOTIFs, a motif discovery platform developed to help biologists find novel as well as known motifs in their peak datasets from transcription factor binding experiments such as ChIP-seq and ChIP-chip. In addition, the algorithm has been incorporated into the NVIDIA Tesla Bio Workbench and deployed on NIH Biowulf.

Download

  • CUDA-MEME 3.0.16 (10/2015)
    1. Fixed a bug in the EM stage that caused a segmentation fault when using multiple threads.
  • CUDA-MEME 3.0.15 (10/2013)
    1. Every sequence must be at least as long as the maximum motif width; otherwise, the program prints an error message and stops. As a temporary workaround, users can remove the shorter sequences from the input or use a smaller motif width. We will attempt to fix this shortcoming as soon as possible.
  • CUDA-MEME 3.0.13 (04/2013)
    1. Removed the dependence on the CUDA SDK, so the code can be compiled directly with the CUDA 5.0 toolkit.
    2. Automatically detects the maximum number of resident threads per multiprocessor on the GPU.
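
The workaround noted for version 3.0.15, removing sequences shorter than the maximum motif width, can be scripted. The following is a minimal sketch that assumes the input is in FASTA format; the helper name filter_short_fasta and the file names in the usage line are hypothetical and not part of CUDA-MEME.

```shell
# Keep only FASTA records whose sequence has at least MIN_LEN residues.
# filter_short_fasta is a hypothetical helper, not a CUDA-MEME command.
# Reads FASTA on standard input and writes the kept records to standard
# output, with each sequence joined onto a single line.
filter_short_fasta() {
    min_len=$1
    awk -v min="$min_len" '
        # Print the buffered record if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
        { seq = seq $0 }                               # join wrapped sequence lines
        END { emit() }                                 # handle the last record
    '
}
```

For example, "filter_short_fasta 50 < dataset_file > filtered_file" would keep only sequences of at least 50 residues, where 50 stands in for the maximum motif width you intend to use.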

Usage

mCUDA-MEME, a further extension of CUDA-MEME in terms of sensitivity and speed, enables users to accelerate motif finding with a single GPU or multiple GPUs. Multi-GPU support is provided by an MPI-based design. Two Makefiles are available in the directory: Makefile.gpu builds a binary that runs on a single GPU and does not need an MPI library, while Makefile.mgpu builds a binary for a GPU cluster and requires an installed MPI library. mCUDA-MEME has been tested and works well with the MPICH/MPICH2 MPI libraries; it has not been verified with the OpenMPI library.

1. Prepare

(1) Install the CUDA SDK and Toolkit, version 2.0 or higher;
(2) Install MPICH/MPICH2. If you use Makefile.gpu, you do not need to install an MPI library;
(3) If the tool "convert" (installed in /usr/bin/), which converts EPS to PNG format, is not installed on your system, you might need to download and install ImageMagick first (http://www.imagemagick.org/script/download.php). To use an alternative tool, change the value of the macro "CONVERT_PATH" in the config.h file in the src directory and recompile the code. If no such conversion tool exists, you need to manually convert the EPS files in the output directory (meme_out, by default) to PNG files.

2. Download

(1) Download cuda-meme-vxxx.tar.gz and unpack the file.

3. Modify the Makefile

(1) Modify the Makefile (in the src subdirectory) according to the compute capability of your CUDA-enabled graphics hardware.
(2) Before compiling the program, check the compute capability of your GPU device. If your GPU is a Fermi, change the flag to "-arch sm_20"; if it has compute capability 1.3, change it to "-arch sm_13"; if 1.2, change it to "-arch sm_12"; and if 1.1, change it to "-arch sm_11". This is very important for CUDA-MEME to run CORRECTLY and FAST.
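
The mapping in step (2) can be written down as a small lookup. This is only a sketch: arch_flag is a hypothetical helper name, and it covers only the compute capabilities listed above, treating any 2.x capability as Fermi.

```shell
# Map a CUDA compute capability to the nvcc flag named in step (2).
# arch_flag is a hypothetical helper; capabilities not listed in this
# document are rejected rather than guessed.
arch_flag() {
    case "$1" in
        2.*) echo "-arch sm_20" ;;   # Fermi
        1.3) echo "-arch sm_13" ;;
        1.2) echo "-arch sm_12" ;;
        1.1) echo "-arch sm_11" ;;
        *)   echo "no -arch value listed for compute capability $1" >&2
             return 1 ;;
    esac
}

arch_flag 1.3   # prints: -arch sm_13
```

The printed value is what you would then put into the Makefile by hand.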

4. Run the make command.

(1) Run "make -f Makefile.gpu" to generate the release binary "cuda-meme" in the directory; run "make -f Makefile.gpu clean" to clean up all generated objects and the executable binary.

(2) Run "make -f Makefile.mgpu" to generate the release binary "mcuda-meme" in the directory; run "make -f Makefile.mgpu clean" to clean up all generated objects and the executable binary.

5. Execute the program.

(1) If the environment variable "MEME_ETC_DIR" is defined, CUDA-MEME searches for resources in the directory specified by this environment variable; otherwise, it uses the "etc" subdirectory of the current working directory. If the resources are stored in the directory mydir/etc, users can set the environment variable with the command "export MEME_ETC_DIR=mydir/etc" before launching CUDA-MEME.
(2) When running "cuda-meme", i.e. without MPI support, typical usages are:
    (a) ./cuda-meme dataset_file -dna -mod oops
    (b) ./cuda-meme dataset_file -protein -mod oops -nmotifs 3
    (c) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp

Note: when more than one GPU device is installed in your host, you can use the option "-gpu_num" to specify which GPU is used. The first GPU has index 0, the second has index 1, and so on. If the option is not specified, the first GPU is used. You can refer to the GPU information printed by CUDA-MEME to determine which GPU to use. Typical usages:
    (a) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -gpu_num 0
    (b) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -gpu_num 1

The EM step has been optimized for multi-core CPUs using OpenMP. You can use the option "-num_threads" to specify the total number of threads used for this step. If it is not specified, CUDA-MEME automatically sets the number of threads to the number of available CPU cores. Typical usages:
    (a) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -gpu_num 0 -num_threads 4
    (b) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -num_threads 4

(3) When running "mcuda-meme", i.e. with MPI support, typical usages are:
    (a) mpirun -np 2 ./mcuda-meme dataset_file -dna -mod zoops
    (b) mpirun -machinefile hostfile -np 4 ./mcuda-meme dataset_file -dna -mod zoops

When running on a GPU cluster, you must make sure that the number of MPI processes running on a node is not more than the number of GPU devices available on that node. This constraint can be enforced with a hostfile; for MPICH, each line of the hostfile names a node and the number of processes it may run (two for a node that contains two GPUs).
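
A hostfile of this kind could be generated as sketched below, assuming the MPICH "host:count" line format. The helper name write_hostfile and the host names node01 and node02 are placeholders and do not come from the CUDA-MEME distribution.

```shell
# Write an MPICH-style hostfile with one "host:count" line per node.
# write_hostfile is a hypothetical helper; the host names passed to it
# must be replaced by the names of your own cluster nodes.
#   $1      path of the hostfile to write
#   $2      number of GPUs (and thus MPI processes) per node
#   $3...   node names
write_hostfile() {
    out=$1
    gpus_per_node=$2
    shift 2
    : > "$out"                      # create or truncate the file
    for host in "$@"; do
        printf '%s:%s\n' "$host" "$gpus_per_node" >> "$out"
    done
}
```

Calling "write_hostfile hostfile 2 node01 node02" writes the two lines "node01:2" and "node02:2", which would suit a run with four processes such as usage (b) above.
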
It is possible that some processes will be idle because they are not assigned any task. In this case, "-----Process # will be idel-----" is printed on the screen for each process # that is not used.

CUDA-MEME assigns GPU devices to the MPI processes automatically, so the option "-gpu_num" is disabled for "mcuda-meme". For the EM step and the "Get the log p-value of a weighted log-likelihood ratio" step, "mcuda-meme" uses two threads by default; users can change this number according to the power of their multi-core CPUs. Note that when calculating the starting points, each MPI process has two threads: one for score computing and the other for alignments. So, to achieve the highest performance, we recommend (but do not require) that the number of GPUs in a host be <= the number of CPU cores / 2.
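
The two limits above (at most one MPI process per GPU, and preferably no more GPUs in use than CPU cores / 2) can be combined into a rough per-node process count. This is one possible reading of the recommendation, not something mcuda-meme computes itself, and procs_per_node is a hypothetical helper name.

```shell
# Suggest how many MPI processes to start on one node.
#   $1  number of GPUs in the node (hard upper limit)
#   $2  number of CPU cores in the node
# The result is min(GPUs, cores / 2), but at least 1 when a GPU exists,
# because the cores / 2 rule is a recommendation rather than a requirement.
procs_per_node() {
    gpus=$1
    cores=$2
    limit=$((cores / 2))
    if [ "$limit" -lt 1 ]; then limit=1; fi
    if [ "$gpus" -lt "$limit" ]; then limit=$gpus; fi
    echo "$limit"
}
```

For a node with 2 GPUs and 8 cores this suggests 2 processes; for a node with 4 GPUs and 4 cores it also suggests 2, leaving two GPUs unused in exchange for two threads per process.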

6. Important notes:

CUDA-MEME is memory efficient on GPU devices. The peak GPU device memory is approximately 480 * max_seq_length * 4 * 16 bytes. When max_seq_length exceeds 64K bases, the slower but more memory-efficient substring-level parallelization is used instead of the default, faster sequence-level parallelization. In general, max_seq_length is less than 16K for ChIP-seq sequences, so the peak GPU device memory is at most about 480 MB. To benefit from the fast sequence-level parallelization, users are advised to split long sequences (> 64K bases) into several segments, overlapping consecutive segments by some bases (e.g. 100 bases, depending on the maximal motif length).

Because of its hybrid computing design, CUDA-MEME consumes more host memory to store intermediate results. We recommend more than 8 GB of host memory for datasets of more than 2000 sequences with an average length of 200 to 400 bp.
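
The GPU memory estimate and the splitting advice above can be checked with a little shell arithmetic. This is a sketch under stated assumptions: "64K" and "16K" are taken to mean 64 * 1024 and 16 * 1024 bases, and the helper names peak_gpu_bytes, parallel_mode and segment_ranges are hypothetical, not CUDA-MEME options.

```shell
# Estimated peak GPU memory in bytes: 480 * max_seq_length * 4 * 16.
peak_gpu_bytes() {
    echo $((480 * $1 * 4 * 16))
}

# Assumption: "64K bases" means 64 * 1024 = 65536 bases.
LONG_SEQ_LIMIT=65536

# Report which parallelization the text says is used for a given
# maximum sequence length.
parallel_mode() {
    if [ "$1" -gt "$LONG_SEQ_LIMIT" ]; then
        echo "substring-level"
    else
        echo "sequence-level"
    fi
}

# Print "start end" pairs (0-based, end exclusive) that cut a sequence of
# TOTAL bases into segments of at most SEG bases, with consecutive
# segments sharing OVERLAP bases.
#   $1 TOTAL   $2 SEG   $3 OVERLAP
segment_ranges() {
    total=$1
    seg=$2
    overlap=$3
    if [ "$overlap" -ge "$seg" ]; then
        echo "overlap must be smaller than the segment length" >&2
        return 1
    fi
    start=0
    while [ "$start" -lt "$total" ]; do
        end=$((start + seg))
        if [ "$end" -gt "$total" ]; then end=$total; fi
        echo "$start $end"
        if [ "$end" -eq "$total" ]; then break; fi
        start=$((start + seg - overlap))
    done
}

peak_gpu_bytes 16384   # prints: 503316480
```

The printed value, 503316480 bytes, is 480 * 1024 * 1024, which matches the figure of about 480 MB for 16K-base sequences. "segment_ranges 100000 65536 100" prints "0 65536" and "65436 100000", i.e. two segments that overlap by 100 bases and each stay within the 64K limit.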

Citation

  1. Yongchao Liu, Bertil Schmidt, Weiguo Liu, Douglas L. Maskell: "CUDA-MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units". Pattern Recognition Letters, 2010, 31(14): 2170-2177
  2. Yongchao Liu, Bertil Schmidt, Douglas L. Maskell: "An ultrafast scalable many-core motif discovery algorithm for multiple GPUs". 10th IEEE International Workshop on High Performance Computational Biology (HiCOMB 2011), 2011, 428-434

Contact

If you have any suggestions or questions, please contact Liu Yongchao (Email: yliu860 (at) gatech (dot) edu).