mCUDA-MEME is an ultrafast, scalable motif discovery algorithm based on the MEME (version 4.4.0) algorithm for multiple GPUs, using a hybrid combination of the CUDA, MPI and OpenMP parallel programming models. This algorithm is a further extension of CUDA-MEME with respect to accuracy and speed, and has been tested on a GPU cluster with eight compute nodes and two Fermi-based Tesla S2050 (and Tesla-based Tesla S1070) quad-GPU computing systems, running the Linux OS with the MPICH2 library. The experimental results showed that our algorithm scales well with respect to both dataset size and the number of GPUs. At present, the OOPS and ZOOPS models are supported, which are sufficient for most motif discovery applications. This algorithm has been used in CompeteMOTIFs, a motif discovery platform developed to help biologists find novel as well as known motifs in their peak datasets from transcription factor binding experiments such as ChIP-seq and ChIP-chip. In addition, this algorithm has been incorporated into the NVIDIA Tesla Bio Workbench and deployed on NIH Biowulf.
mCUDA-MEME, a further extension of CUDA-MEME in terms of sensitivity and speed, enables users to use a single GPU or multiple GPUs to accelerate motif finding. An MPI-based design provides the support for multiple GPUs. Two Makefiles are available in the directory: Makefile.gpu compiles and produces a binary for a single GPU with no need for an MPI library; Makefile.mgpu compiles and produces a binary for a GPU cluster, which requires an MPI library to be installed. I have tested that mCUDA-MEME works well with the MPICH/MPICH2 MPI libraries, but I am not sure about the OpenMPI library.
(1) Install the CUDA SDK and Toolkit, version 2.0 or higher;
(2) Install MPICH/MPICH2. If you only use Makefile.gpu, there is no need to install an MPI library;
(3) If the tool "convert" (usually installed in /usr/bin/), which changes EPS to PNG format, is not available on your system, you might need to download and install ImageMagick first (http://www.imagemagick.org/script/download.php). You can specify an alternative tool by changing the value of the macro "CONVERT_PATH" in the config.h file in the src directory, and then recompiling the code. If no conversion tool exists, you will need to manually convert the EPS files in the output directory (meme_out, by default) to PNG files.
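If you do need to convert manually, a minimal sketch using ImageMagick's "convert" is shown below (it assumes the default "meme_out" output directory; adjust the path if you chose a different one):

```shell
# Convert every EPS logo in the MEME output directory to PNG.
for eps in meme_out/*.eps; do
    [ -e "$eps" ] || continue          # skip if no EPS files are present
    convert "$eps" "${eps%.eps}.png"   # foo.eps -> foo.png
done
```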
(2) Before compiling the program, please check the compute capability of your GPU device. If your GPU is Fermi-based (compute capability 2.0), change to "-arch sm_20"; if it is capability 1.3, change to "-arch sm_13"; if it is capability 1.2, change to "-arch sm_12"; and if it is capability 1.1, change to "-arch sm_11". This is very important for the CORRECT and FAST running of CUDA-MEME.
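As an illustration, the nvcc flags in the Makefile might be adjusted as below for a Fermi GPU (the variable name NVCCFLAGS is a placeholder; edit whichever variable your copy of Makefile.gpu/Makefile.mgpu actually uses to pass flags to nvcc):

```make
# Illustrative only: pick exactly one -arch value matching your GPU.
NVCCFLAGS += -arch sm_20    # Fermi (compute capability 2.0)
# For capability 1.3 use: -arch sm_13
# For capability 1.2 use: -arch sm_12
# For capability 1.1 use: -arch sm_11
```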
(1) run the "make -f Makefile.gpu" command to generate the executable "cuda-meme" in the directory; run "make -f Makefile.gpu clean" to clean up all generated objects and the executable binary.
(2) run the "make -f Makefile.mgpu" command to generate the executable "mcuda-meme" in the directory; run "make -f Makefile.mgpu clean" to clean up all generated objects and the executable binary.
(2) When running "cuda-meme", i.e. without MPI support. Typical usages:
(a) ./cuda-meme dataset_file -dna -mod oops
(b) ./cuda-meme dataset_file -protein -mod oops -nmotifs 3
(c) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp
Note: when more than one GPU device is installed in your host, you can use the option "-gpu_num" to specify the GPU to use. The first GPU is indexed 0, the second is 1, and so on. If not specified, the first GPU is used. You can refer to the GPU information printed out by CUDA-MEME to determine which GPU to use. Typical usages:
(a) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -gpu_num 0
(b) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -gpu_num 1
The EM step has been optimized for multi-core CPUs using OpenMP. You can use the option "-num_threads" to specify the total number of threads used for this step. If not specified, CUDA-MEME will automatically set the number of threads to the number of available CPU cores. Typical usages:
(a) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -gpu_num 0 -num_threads 4
(b) ./cuda-meme dataset_file -maxsize 500000 -dna -mod zoops -revcomp -num_threads 4
(3) When running "mcuda-meme", i.e. with MPI support. Typical usages:
(a) mpirun -np 2 ./mcuda-meme dataset_file -dna -mod zoops
(b) mpirun -machinefile hostfile -np 4 ./mcuda-meme dataset_file -dna -mod zoops
When running on a GPU cluster, you must make sure that the number of MPI processes running on a node does not exceed the number of available GPU devices on that node. This constraint can be enforced using a hostfile. An example hostfile for MPICH is as follows, where each node contains two GPUs:
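A sketch of such a hostfile is shown below (the hostnames node01..node04 are placeholders for your own cluster nodes; the ":2" suffix caps each node at two MPI processes, matching its two GPUs):

```
# MPICH hostfile: at most two processes per node (one per GPU)
node01:2
node02:2
node03:2
node04:2
```

It would then be passed to mpirun via the -machinefile option, as in the usage examples above.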
Note that it is possible that some processes will be idle because they are not assigned any task. In this case, "-----Process # will be idle-----" is printed on the screen for each process # that is not used. mCUDA-MEME is able to assign GPU devices automatically to each MPI process; accordingly, the option "-gpu_num" is disabled for "mcuda-meme". For the EM and "get the log p-value of a weighted log-likelihood ratio" steps, "mcuda-meme" uses two threads by default; users can adjust this according to the power of their multi-core CPUs. Note that when calculating the starting point, each MPI process has two threads: one for score computing, and the other for alignments. So, to achieve the highest performance, we recommend (though do not require) that the number of GPUs in a host <= the number of CPU cores / 2;
Liu Yongchao (Email: yliu860 (at) gatech (dot) edu),