run example

#test MPI GPU code

#suppose there are two GPUs in the machine. 
#suppose we place the dataset "matlab_test20110219" and the code "plsqr2_iccs_ncsa" in the same directory
[hhuang1@seismic34 temp]$ ls
matlab_test20110219         plsqr2_iccs_ncsa  run_example.txt
[hhuang1@seismic34 temp]$ cd plsqr2_iccs_ncsa/

#compile MPI GPU code
[hhuang1@seismic34 plsqr2_iccs_ncsa]$ make clean;make gpu
rm -f *.o 
mpicc  -O3 -g -c main.c  -o main.o
main.c: In function ‘MyGPTLpr’:
main.c:89: warning: comparison is always false due to limited range of data type

mpicc  -O3 -g -lm  -c lsqr.c  -o lsqr.o
gcc: -lm: linker input file unused because linking not done
mpicc  -O3 -g -c FileIO.c  -o FileIO.o
mpicc  -O3 -g -c aprod.c  -o aprod.o
mpicc  -O3 -g -c loadbalance.c  -o loadbalance.o
mpicc  -O3 -g -c plsqr3.c  -o plsqr3.o
mpicc  -O3 -g -c CPUSpMV.c  -o CPUSpMV.o
mpicc  -O3 -g -DHAVE_MPI -DHAVE_GETTIMEOFDAY -c tool/gptl.c  -o gptl.o
tool/gptl.c: In function ‘GPTLpr_summary’:
tool/gptl.c:1944: warning: cast from pointer to integer of different size
mpicc  -O3 -g -DHAVE_MPI -DHAVE_GETTIMEOFDAY -c tool/threadutil.c  -o threadutil
mpicc  -O3 -g -DHAVE_MPI -DHAVE_GETTIMEOFDAY -c tool/GPTLutil.c  -o GPTLutil.o
mpicc  -O3 -g -c tridiagonal.c  -o tridiagonal.o
/usr/local/cuda-4.1.28//bin/nvcc -g -I/usr/local/cuda-4.1.28//include -lcudart -lcutil -lcublas -lcudpp -lcusparse -arch=sm_20     -c  -o lsqrcuda.o
warning: expression has no effect
warning: expression has no effect

mpicc  -O3 -lm -g -L/usr/local/cuda-4.1.28//lib64  -lcudart  -lcusparse -lcublas -use_fast_math   -o PLSQR_UWCS_GPU main.o lsqr.o FileIO.o aprod.o loadbalance.o plsqr3.o CPUSpMV.o gptl.o threadutil.o GPTLutil.o tridiagonal.o lsqrcuda.o

#run the dataset; ../matlab_test20110219/config3.0_matlab is the dataset configuration file
[hhuang1@seismic34 plsqr2_iccs_ncsa]$ mpiexec -np 2 ./PLSQR_UWCS_GPU  ../matlab_test20110219/config3.0_matlab
------------------------Single Precision ------------------------PLSQR2 GPU version---------------------------
Using damping---------------------------

rank=0, PID=3516 on seismic34 
rank=1, PID=3517 on seismic34 
0: SETcudaDeviceMapHost!
----------GPU initialized, rank:0, PPN=3, device:0 of 7 GPUs on host:seismic34. Using CUSPARSE Library
0: maxThreadsPerBlock=1024 multiProcessorCount=14 maxThreadsPerMultiProcessor=1536 max_threads=21504
0: CUSparse version = 4010 
1: SETcudaDeviceMapHost!
----------GPU initialized, rank:1, PPN=3, device:1 of 7 GPUs on host:seismic34. Using CUSPARSE Library
1: maxThreadsPerBlock=1024 multiProcessorCount=14 maxThreadsPerMultiProcessor=1536 max_threads=21504
0: dirpath=../matlab_test20110219/, inputType=false, lsqr_iteration=4, asciiDataStatistic=kernel_statistic, binaryDataStatistic=NULL, vectorBInfo=../matlab_test20110219/vector 
0: damping_path=../matlab_test20110219/, damping_info=damping_statistic 
0: dataFileName[0]=kernel_01, number of nonzero per row=10
0: dataFileName[1]=kernel_02, number of nonzero per row=10
0: dataFileName[2]=kernel_03, number of nonzero per row=10
0: dataFileName[3]=kernel_04, number of nonzero per row=10
0 processing damping matrix...
0, dampingFileName=damping_01, dampingFileNumRow=5, dampingFileNumNonzero=5
1, dampingFileName=damping_02, dampingFileNumRow=5, dampingFileNumNonzero=5
 end of processing damping matrix...

whole matrix(including damping) number of row is 14 

procRank=0: sizeOfprocDistributedDataFileSet[0]=2, countOfprocDistributedDataFileSet[0]=20 :
0 2

procRank=1: sizeOfprocDistributedDataFileSet[1]=2, countOfprocDistributedDataFileSet[1]=20 :
1 3
sumOfIndex=4, sumOfCount=40, sum=40, average=200: perProcSizeOfDisDataFileSet_k=2, numNonzero_perProcDisDataFileSet_k=20 
0: kernel file set:0 2
1: The number of data file is 4

------------------finish load balancing,----------------------------------
1: perProcSizeOfDisDataFileSet_k=2, numNonzero_perProcDisDataFileSet_k=20 
1: kernel file set:1 3
1:--------------------------Data loading (kernel and damping) in all processes starts...-------------------------------
0: loadMatrixCSR_ascii is loading damping file: ../matlab_test20110219//damping_01 
c1!=2 lineNO:4
0:--------------------Data loading in all processes completed!------------------whole_matrix_numCol=10
0/2: perProcData->totalNumOfRows=14, perProcData->totalNumOfColumns=10 
1: loadMatrixCSR_ascii is loading damping file: ../matlab_test20110219//damping_02 
c1!=2 lineNO:9
1/2: perProcData->totalNumOfRows=14, perProcData->totalNumOfColumns=10 

Starting plsqr algorithm...procRank=1

Starting plsqr algorithm...procRank=0
0: convert to matrix in CSR to CSC format begins 
0: convert matrix in CSR to CSC format completes 
Parallel LSQR time=0 s
Finishing plsqr 2 algorithm =========== procRank=1
Finishing plsqr 2 algorithm =========== procRank=0

Please open result.txt and p_resultx.txt to check the solution!

mkdir: cannot create directory `NP2': File exists
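
#note on the load balancing output above: rank 0 received kernel files 0 and 2,
#rank 1 received kernel files 1 and 3, and each rank ended up with a count of 20.
#The sketch below is NOT taken from loadbalance.c; it is a hypothetical
#round-robin assignment that happens to reproduce the printed schedule. The
#function name and the per-file weight of 10 are assumptions chosen so that the
#totals match the log; the real code may use a different strategy.

```python
def distribute_round_robin(counts, num_procs):
    """Assign file index i to rank i % num_procs.

    Returns (file_sets, totals): the file indices given to each rank and the
    summed weight per rank. Hypothetical helper, not part of plsqr2_iccs_ncsa.
    """
    file_sets = [[] for _ in range(num_procs)]
    totals = [0] * num_procs
    for index, count in enumerate(counts):
        rank = index % num_procs
        file_sets[rank].append(index)
        totals[rank] += count
    return file_sets, totals


# Four kernel files, two MPI ranks; a weight of 10 per file is assumed.
file_sets, totals = distribute_round_robin([10, 10, 10, 10], 2)
print(file_sets)  # [[0, 2], [1, 3]]
print(totals)     # [20, 20]
```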

#check the result: p_resultx_itn=4_NProc=2.txt is the result file
[hhuang1@seismic34 plsqr2_iccs_ncsa]$ ls *.txt
LoadBalanceScheduleWithOriginalRowIndex.txt  p_resultx_itn=4_NProc=2.txt
p_resultx_itn=4_NProc=2_nonZero.txt          result.txt

#compare the result with MATLAB's result and check that they are the same
[hhuang1@seismic34 plsqr2_iccs_ncsa]$ more p_resultx_itn=4_NProc=2.txt ../matlab_test20110219/result
x[0]= 6.977329e-02
x[1]= 9.979825e-02
x[2]= 9.673197e-02
x[3]= 2.021925e-01
x[4]= -2.895069e-02
x[5]= 7.391235e-02
x[6]= 9.616780e-02
x[7]= 1.532301e-01
x[8]= -1.261227e-02
x[9]= 1.566834e-01
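
#instead of comparing by eye, the two files could be compared numerically.
#The sketch below is not part of plsqr2_iccs_ncsa. It assumes each line holds
#one value, either as "x[i]= value" (the format shown above) or as a bare
#number; the layout of ../matlab_test20110219/result is an assumption. The
#names parse_solution and max_abs_diff and the 1e-6 tolerance are hypothetical.
#The reference text in the example is a stand-in that repeats the first three
#PLSQR values in plain decimal form; it is not real MATLAB output.

```python
import re

# Matches a decimal number with an optional exponent, e.g. -2.895069e-02
_FLOAT = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_solution(text):
    """Return the list of values found in text, one value per non-empty line."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Use the part after '=' when present so the index in "x[3]" is ignored.
        payload = line.split("=", 1)[1] if "=" in line else line
        match = _FLOAT.search(payload)
        if match:
            values.append(float(match.group()))
    return values


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("solutions have different lengths")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


plsqr_text = "x[0]= 6.977329e-02\nx[1]= 9.979825e-02\nx[2]= 9.673197e-02\n"
reference_text = "0.06977329\n0.09979825\n0.09673197\n"  # stand-in, see note above

diff = max_abs_diff(parse_solution(plsqr_text), parse_solution(reference_text))
print("match" if diff <= 1e-6 else "mismatch")
```

#to use it on the real files, read each file with open(path).read() and pass
#the text to parse_solution; a loose tolerance is reasonable here because the
#run above is single precision.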