Watch this YouTube Video [1] to understand the concepts of programming.
Programming is the process of taking an algorithm and encoding it into a notation that can be converted or compiled into instructions (strings of 0s and 1s) that a computer understands.
The modules in Pioneer/Markov RH8 are different; please visit this site for details. For example, for MPI (Message Passing Interface), check the available OpenMPI versions using:
module spider OpenMPI
Also, there is no PGI module for OpenACC; use the NVHPC module instead:
module load NVHPC
Copy all the required files from /usr/local/doc/BOOTCAMP, or get them from GitHub (https://github.com/sxg125/Basic-Programming), and cd to the bootcamp directory:
cp -r /usr/local/doc/BOOTCAMP/bootcamp .
cd bootcamp
OR
git clone https://github.com/sxg125/Basic-Programming
cd Basic-Programming
Programming languages (C/C++/C#, FORTRAN, Java, Python, Perl, Matlab, Mathematica, R, PHP, Scala, and Ruby, to name a few) each have their own syntax (the structure or grammar of the statements) and semantics (the meaning of the statements).
for (i = 1; i <= 10; i++) printf ("%d",i); // syntax for C
for i = 1:10,disp(i),end % syntax for Matlab
Both statements have the same semantics: print the integers 1 to 10.
To make statements more human readable, all programming languages provide a certain degree of freedom in layout. The Matlab statement can be rewritten as:
for i = 1:10
disp(i)
end
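For comparison, the same semantics can be written in Python, one of the other languages listed above. This is only a minimal sketch for illustration and is not one of the bootcamp files:

```python
# Print the integers 1 to 10, one per line.
# range(1, 11) starts at 1 and stops before 11.
for i in range(1, 11):
    print(i)
```

The syntax differs again (no semicolons, braces, or "end"; the loop body is marked by indentation), but the meaning is unchanged.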
An algorithm describes the solution to a problem in terms of the data needed to represent the problem instance and the steps needed to get the result. An algorithm is usually represented as a flowchart diagram, as shown in Fig. 1, and is applicable to all programming languages.
Fig. 1 Flowchart Diagram to add two numbers
When you are logged in to HPC, you are in a bash shell programming environment. It can represent both the process and the data. You can implement addition operations in this programming environment following the flowchart in Fig. 1.
Request a compute node to program on. For Markov, use the Account (-A) and Partition (-p) flags appropriately.
srun --pty bash
Type each statement below and press Enter. This is the interactive method of programming, using the Command Line Interface/Interpreter (CLI). You can copy and paste instead. You can skip # and the text that follows it on a line; these are comments. Comments help the programmer understand the semantics easily.
#!/bin/bash
a=5 # declare and read
b=10
result=$(($a + $b)) # sum operation
echo $result # display a sum
You should get 15. Here a, b, and result are the placeholders for the data, i.e. the variables, and the equal sign (=), plus sign (+), and echo are the commands, built-in functions, or operations of bash. The details of how a and b are assigned the values 5 and 10 with =, added with +, and the result displayed with echo are hidden from users; this is called data abstraction. Users interact with the interface specified by the Abstract Data Type (ADT), as shown in Fig. 2.
Fig. 2:Data Abstraction
You can combine the bash statements into a single line to get the same result.
echo $((5 + 10))
OR
expr 5 + 10
A more convenient way of programming is to write all those commands in a file, which is called the script or source file. Check the content of the script "add.sh".
cat add.sh
Get the same result as before by running the script
sh add.sh
Insert a space between the variable names and the equal sign in the "add.sh" file, as shown below, and run the script.
...
a =5
b =10
...
You will see a bunch of errors:
add.sh: line 2: a: command not found
add.sh: line 3: b: command not found
add.sh: line 4: + : syntax error: operand expected (error token is "+ ")
With the space, the bash environment treats the first word, “a”, as a command, but there is no such command in bash. The “+” operation expects two integer values, but they were never assigned because the assignments failed. Although bash is an interpreter rather than a compiler, these errors play the same role as compiler errors: the notation cannot be converted into instructions that the computer understands, so bash reports specific errors, with line numbers, for the user to correct. Here, there are errors in line 2, line 3, and line 4. Correcting the first two errors also takes care of the third.
Let’s try the same addition with the Matlab, R, and Python programming languages.
Load the Matlab, R, and Python modules
module load matlab
module load R
module load python
Type the following to open the Matlab Command Prompt
matlab -nodisplay
Type the following in the prompt and get the same result. Then exit
>> 10 + 5
ans =
15
>> quit
Type the following to open the Python Command Prompt
python3
Type the following in the prompt and get the same result. Then exit.
>>> 10 + 5
15
>>> quit()
Type the following to open the R Command Prompt
R
Type the following in the prompt and get the same result. Then exit.
> 5 + 10
[1] 15
> q()
Check the Python, Matlab, and R scripts. The file extension is optional for a bash script, but Python, Matlab, R, and C need the proper file extensions: .py, .m, .R, and .c respectively.
cat add.py
cat add.m
cat add.R
Run the script and get the same result.
python3 add.py
matlab -nodisplay -r add
R CMD BATCH add.R
Note the -nodisplay and -r (run) flags. Also, for Matlab the script name after -r is given without the .m extension. The output from the R script is written to add.Rout.
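The contents of the scripts are not reproduced here. As a rough idea of what such a script might look like, the following is a hypothetical Python sketch that follows the flowchart in Fig. 1; the function name add is an assumption, and the actual add.py in the bootcamp directory may differ:

```python
# Hypothetical sketch of an add.py-style script; the bootcamp file may differ.

def add(a, b):
    """Return the sum of two numbers (the processing step of the flowchart)."""
    return a + b


a = 5                # declare and read
b = 10
result = add(a, b)   # sum operation
print(result)        # display the sum
```

Unlike bash, Python accepts spaces around the equal sign in an assignment.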
Language Spectrum
See the language spectrum [2] in Fig. 3 showing the levels of language from lower (red) to higher (green).
Fig. 3: Language Spectrum.
Programming languages like Python and Matlab are high-level languages because they place many abstraction layers on top of the processor. They are also called interpreted languages because an interpreter reads and executes the original code. Low-level languages like C and Fortran allow direct access to registers and memory locations, so they have superb performance. They are called compiled languages because a compiler translates the code into instructions specific to the target machine, known as machine code. Assembly language uses operation codes, called opcodes, such as MOV and ADD for register-level operations.
The choice of programming language depends on the problem at hand. If you are writing a kernel, an operating system, or firmware for microcontrollers, a high-level language is not a practical choice. On the other hand, although low-level languages can do just about anything, you would not want to use one to write a web framework.
Check the C equivalent code for addition.
cat add.c
output:
#include <stdio.h> //directive to include function declarations and macro definitions
int main()
{
int a = 5;
int b = 10;
int result = a + b;
printf("result=%d",result);
return 0;
}
You need to compile the C code (compiled language) first to get the executable
gcc -o add add.c
It creates the executable "add". Without the -o flag, gcc generates "a.out" by default. For details, refer to the HPC guides on Compiling & Linking and Debugging Segmentation Fault.
Execute the code to get the same result.
./add
Note: ./ implies that you are running the executable from the current directory.
Let’s trigger a compiler warning by deleting the directive line "#include <stdio.h>" and compiling again.
add.c: In function ‘main’:
add.c:8:5: warning: incompatible implicit declaration of built-in function ‘printf’
     printf("Result=%d\n",result);
     ^
Note: printf is declared in the header file "stdio.h"
The basic elements of all programming languages are data types, variables, logic, loops, branches, and functions. Locate them in the example Matlab script, primeNumbers.m. The program produces the count of prime numbers between lower bound and upper bound.
cat primeNumbers.m
output:
% function that takes two input variables, lower and upper, and store the count in the output variable total.
function [total time] = prime(lower, upper)
total = 0; % By default data type of variable in Matlab is double
% start a timer to benchmark the main loop
ticID = tic;
for i = lower : upper % for loop: check primality for each integer value in the range
    isprime = 1; % TRUE
    if i <= 1 % Conditional logic to check if i is smaller than or equal to 1
        isprime = 0; % branch to this statement if it is TRUE
    elseif i == 2
        isprime = 1; % TRUE
    else
        for j = 2 : i-1
            if ( mod (i, j) == 0 )
                isprime = 0; %FALSE
            end
        end
    end %if
    if isprime == 1
        total = total + 1;
    end
end %for
% stop the timer
time = toc(ticID)
The equivalent code in C is primeNumbers.c, and in Python it is primeNumbers.py.
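The Python version is not listed here. The following is a minimal sketch of how the same elements (variables, a function, loops, branches, and a timer) might look in Python. The function name count_primes and the small upper bound are assumptions chosen to keep the run short, and the actual primeNumbers.py may be written differently:

```python
import time


def count_primes(lower, upper):
    """Count the primes in [lower, upper] by trial division."""
    total = 0
    for i in range(lower, upper + 1):      # loop over each integer in the range
        if i <= 1:                         # branch: 0, 1, and negatives are not prime
            continue
        is_prime = True
        for j in range(2, i):              # try every candidate divisor below i
            if i % j == 0:
                is_prime = False
                break                      # stop at the first divisor found
        if is_prime:
            total += 1
    return total


start = time.perf_counter()                # start a timer to benchmark the loop
total = count_primes(2, 1000)              # small range, for illustration only
elapsed_time = time.perf_counter() - start # stop the timer
print("Number of Prime Numbers = ", total)
print("Execution Time = ", elapsed_time)
```

Unlike the Matlab listing, this sketch leaves the inner loop as soon as a divisor is found, as the C MPI version shown later does.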
In the python script (primeNumbers.py), create an error by replacing print with Print at the end:
...
print("Number of Prime Numbers = ",total)
Print("Execution Time = ", elapsed_time)
Run the python code
python3 primeNumbers.py
output:
('Number of Prime Numbers = ', 12251)
Traceback (most recent call last):
  File "primeNumbers.py", line 21, in <module>
    Print("Execution Time = ", elapsed_time)
NameError: name 'Print' is not defined
So, despite the error, the interpreter executes the statement that prints the number of prime numbers. The call Print(...) is syntactically valid, so the failure (a NameError) appears only when that line is reached at run time.
In a compiled language, the code must be free of compile errors before the executable can be created. In Python, the instructions are first converted into bytecode, or p-code (portable code), and the interpreter then executes one bytecode instruction at a time. Bytecodes are compact numeric codes, constants, and references that encode the result of parsing and semantic analysis. This allows much better performance than direct interpretation of source code, as in bash, where the interpreter processes one statement after another and spends much of its time on lexical analysis, parsing, and launching the programs called. As a result, if errors (Echo $b in add.sh and Print b in add.py) are introduced after the addition, the bash script still prints the result but the Python script does not, as shown below.
sh add.sh
15
add.sh: line 6: Echo: command not found # Echo instead of echo
python add.py
  File "add.py", line 5
    Print b
          ^
SyntaxError: invalid syntax # Print instead of print
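The difference between the two Python failures (NameError in primeNumbers.py, SyntaxError in add.py) can be reproduced with Python's built-in compile() and exec() functions. This is a minimal sketch; the helper name run_source and the two short source strings are made up for illustration:

```python
def run_source(source):
    """Compile a snippet, then execute it, and report how far it got."""
    namespace = {}
    try:
        # The whole source is parsed before any statement runs.
        code = compile(source, "<snippet>", "exec")
    except SyntaxError:
        return "SyntaxError", namespace    # nothing was executed
    try:
        exec(code, namespace)              # statements now run in order
    except NameError:
        return "NameError", namespace      # earlier statements already ran
    return "ok", namespace


# Misspelled function call: valid syntax, fails only when that line runs.
status, ns = run_source("result = 5 + 10\nPrint(result)\n")
print(status, ns.get("result"))   # NameError 15

# Misspelled Python 2 style print statement: rejected while parsing.
status, ns = run_source("result = 5 + 10\nPrint result\n")
print(status, ns.get("result"))   # SyntaxError None
```

In the first case the addition has already been carried out when the error is raised; in the second case the parser rejects the source, so not even the first line runs.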
More than one processor can be applied to the code by dividing the task among the processors. Using the OpenMP interface, we can easily parallelize the serial C code by including pragma directives. Check the file "primeNumbersOmp.c", which has additional "omp" pragmas before the for loop as shown:
/* Each thread has its own private copies of i, j, and isprime.
Modification made on them are not visible to other threads.
So, each thread sees only the part of iterations i.e.
the integer value assigned to it to test the primality. Each thread modifies its own copy of isprime.
The shared variable upper is visible to all threads
and there is no need to create local copies.
*/
#pragma omp parallel shared (upper) private (i,j,isprime)
/*
Here, each thread calculates its own private copy of the output variable total
The partial values of total from each thread are combined (summation) on exit
*/
#pragma omp for reduction (+ : total)
for (i = lower; i<= upper; i++)
{
isprime = 1;
for ( j=2; j < i; j++)
{
if (i % j == 0)
...
...
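The private-copy and reduction pattern described in the comments above can be mimicked in Python for illustration. This is only a sketch of the pattern, not a translation of primeNumbersOmp.c: the names is_prime, count_chunk, and parallel_count are assumptions, and threads in standard CPython builds share one interpreter lock, so no speedup should be expected from it:

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(i):
    """Trial division, as in the serial code."""
    if i < 2:
        return False
    for j in range(2, i):
        if i % j == 0:
            return False
    return True


def count_chunk(bounds):
    """Count primes in [start, stop) using only local (private) variables."""
    start, stop = bounds
    return sum(1 for i in range(start, stop) if is_prime(i))


def parallel_count(lower, upper, workers=4):
    # Split [lower, upper] into contiguous chunks, at most one per worker.
    step = max(1, (upper - lower + workers) // workers)
    chunks = [(s, min(s + step, upper + 1))
              for s in range(lower, upper + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(count_chunk, chunks))
    # Combining the partial totals plays the role of reduction (+ : total).
    return sum(partial_totals)


print(parallel_count(2, 1000))   # 168
```

Each worker only reads the shared bounds and returns its own partial count, so no worker modifies a variable that another worker uses.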
The code can also be parallelized using Message Passing Interface (MPI) libraries.
It is also possible to engage the hundreds or thousands of cores of Graphics Processing Unit (GPU) cards for the computationally intensive part of the code. The OpenACC interface lets you take advantage of GPUs by using "acc" pragmas. CUDA programming is more powerful but has a steep learning curve. Check the file "primeNumbersAcc.c", which has the pragma before the for loop as shown below:
...
#pragma acc kernels
for (i = lower; i<= upper; i++)
...
Note that incorrect use of pragmas can not only degrade performance but also produce wrong results. A hybrid implementation with both OpenMP and OpenACC is also possible to boost performance. Please visit the HPC OpenACC Guide for details.
For MPI implementation, check the source file "primeNumbersMpi.c".
cat primeNumbersMpi.c
output:
#include<stdio.h>
#include "mpi.h"
#include <stdlib.h>
#include <assert.h>
int main(int argc, char *argv[])
{
int lower,upper,total,i,j,isprime;
int rank, size;
int local_total = 0;
int global_total = 0;
double time_initial,time_current,time;
//Add in MPI startup Routines
// Launch the MPI processes in each node
MPI_Init(&argc, &argv);
//Initialize the time
time_initial = MPI_Wtime();
// Get the rank (id) of this process; the master process has rank 0
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
//Get the number of processes launched by MPI
MPI_Comm_size(MPI_COMM_WORLD, &size);
// Partial count of Prime Numbers
local_total = 0;
lower = 2;
upper = 131072;
//Broadcast the upper limit i.e. that copy is shared among the processors
MPI_Bcast ( &upper, 1, MPI_INT, 0, MPI_COMM_WORLD );
for (i = lower + rank; i<= upper; i=i+size)
{
isprime = 1;
// MPI_Bcast ( &upper, 1, MPI_INT, 0, MPI_COMM_WORLD );
for ( j=2; j<i; j++)
{
if (i % j == 0)
{
isprime = 0;
break;
}
}
local_total+=isprime;
}
//Summation operation: the local totals (local_total) calculated in each process are combined into the global total
MPI_Reduce(&local_total,&global_total,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
time_current=MPI_Wtime();
time = time_current - time_initial;
if (rank == 0) {
printf("Total Prime Numbers = %d\n",global_total);
printf("ElapsedTime=%.3f\n",time);
}
// Blocks until all the processes have reached this routine
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
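The loop "for (i = lower + rank; i <= upper; i = i + size)" deals the integers out to the processes in round-robin fashion, and MPI_Reduce with MPI_SUM adds the local totals. The following Python sketch simulates that decomposition serially, without any MPI library; the function names and the small upper bound are assumptions made for illustration:

```python
def is_prime(i):
    """Trial division, as in the C code."""
    if i < 2:
        return False
    for j in range(2, i):
        if i % j == 0:
            return False
    return True


def local_count(rank, size, lower, upper):
    """Count primes among lower+rank, lower+rank+size, ... up to upper."""
    return sum(1 for i in range(lower + rank, upper + 1, size) if is_prime(i))


def simulated_reduce(size, lower=2, upper=1000):
    """Compute every rank's local total, then sum them like MPI_SUM."""
    local_totals = [local_count(rank, size, lower, upper)
                    for rank in range(size)]
    return local_totals, sum(local_totals)


local_totals, global_total = simulated_reduce(size=4)
print(local_totals)
print(global_total)   # 168
```

With lower = 2 and 4 processes, ranks 0 and 2 receive only even numbers, so nearly all primes are counted by ranks 1 and 3. The global total is still correct, but the work is unevenly divided.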
Matlab (MATrix LABoratory), a numerical computing environment, provides different flavors of parallelism, including GPUs. The simplest is achieved by replacing the "for" loop with a "parfor" loop (which cannot be nested), as in the primeNumbersPar.m file.
...
% for i = lower : upper
parfor i = lower : upper
...
For MDCS (Matlab Distributed Computing Server) and GPU, visit HPC Guide to Matlab.
Benchmarking helps to evaluate the performance of a program compared to the standard benchmark results.
Notes:
Update the slurm script file with the correct partition (-p) and account (-A). Check HPC Resource View for details.
The modules might have changed. Check Module system for details. For example, for OpenACC, the PGI module is not available, so use the NVHPC module.
Let's compare the execution time for the primeNumbers code/script in different languages and for various flavors of parallelism in Matlab and C programming with this benchmark. Use the SLURM job script (runPrime.slurm), which itself uses the bash environment.
Check the job file.
cat runPrime.slurm
output:
#!/bin/bash
#SBATCH -J Benchmarks
#SBATCH --time=24:00:00
#SBATCH -N 1
#SBATCH -c 4
##SBATCH -n 4
#SBATCH --mem=5g
#SBATCH -p gpu -C gpu2080 --gres=gpu:1 # use the proper partition and account for the class
echo "The job is running in $SLURM_NODELIST"
NPROCS=$(( $SLURM_NNODES * $SLURM_CPUS_PER_TASK ))
#Copy the script and other input files to the scratch directory and change directory
cp primeNumbers.c primeNumbers.m primeNumbers.sh primeNumbers.py primeNumbersPar.m primeNumbersOmp.c primeNumbersMpi.c primeNumbersAcc.c $PFSDIR
cd $PFSDIR
# Compile C program
gcc -o prime primeNumbers.c
# Compile C program with OpenMP
gcc -o primeOmp -fopenmp primeNumbersOmp.c
# Compile C program with MPI
module load OpenMPI
mpicc -o primeMpi primeNumbersMpi.c
# Compiling C Program with OpenACC
module load NVHPC/23.1-CUDA-12.0.0
nvc -Minfo=all -acc -gpu=cc75 primeNumbersAcc.c -o primeAcc
#Execute
echo "running serial ..."
time ./prime
echo "running Python ..."
python3 primeNumbers.py
echo "running parallel OpenMP ..."
export OMP_NUM_THREADS=$NPROCS
echo "Number of Threads = $NPROCS"
time ./primeOmp
echo "running parallel MPI"
mpirun ./primeMpi
echo "running in GPUs using OpenACC ..."
time ./primeAcc
#MATLAB
#Load MATLAB module
module load matlab
#MATLAB Preference Setting
matlab_prefdir="/tmp/$USER/matlab/`hostname`_PID$$"
test -d $matlab_prefdir || mkdir -p $matlab_prefdir
export MATLAB_PREFDIR="$matlab_prefdir"
#RUN MATLAB script
echo "Running Matlab Serial ..."
matlab -singleCompThread -nodisplay -r 'primeNumbers(2,131072)'
echo "Running Matlab Parallel parfor ..."
matlab -singleCompThread -nodisplay -r 'primeNumbersPar(2,131072)'
# quit
echo "running Bash ..."
time ./primeNumbers.sh
cp -r * $SLURM_SUBMIT_DIR
Submit the job:
sbatch runPrime.slurm
Check the partial output while the job is running. Press Ctrl + C to stop following the output file. You can also use the "cat" command.
tail -f slurm-<jobid>.out
For an MPI job, we need to request tasks with -n 4 instead of -c 4. Let's request an interactive session on the same gpu partition:
srun -p gpu -N 2 -n 4 --pty bash
Compile the MPI code "primeNumbersMpi.c":
mpicc -o primeMpi primeNumbersMpi.c
Run the executable
mpirun ./primeMpi
output:
Total Prime Numbers = 12251
ElapsedTime=1.591
Performance Table - Serial, parallel (4 processors), and GPU.
(Note: Matlab Parfor may take longer in the first run)
Note the superb performance of the C programming language compared to MATLAB. Also, we can't expect a 4-fold speedup from employing 4 processors, because thread management adds scheduling and communication overheads. Bash shell scripting has the worst performance because it interprets one statement at a time. Bash is therefore recommended for simple scripts; it is excellent at pipe operations. Perl excels at text analysis, while Python is a popular general-purpose language with a large, active user community.
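The comparison between serial and parallel runs is often summarized as speedup (serial time divided by parallel time) and efficiency (speedup divided by the number of processors). The sketch below computes both; the timings in it are hypothetical placeholders, not measurements from this benchmark, and should be replaced with the times from your own job output:

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Fraction of the ideal speedup achieved (1.0 means perfect scaling)."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical timings in seconds; NOT results from the benchmark above.
t_serial, t_parallel, processors = 8.0, 2.5, 4

print(f"speedup    = {speedup(t_serial, t_parallel):.2f}")                 # 3.20
print(f"efficiency = {efficiency(t_serial, t_parallel, processors):.2f}")  # 0.80
```

An efficiency below 1.0 reflects the scheduling and communication overheads mentioned above.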
References:
[1] What is Programming (Khan Academy) - YouTube Video
[2] Language Spectrum: http://www.codecommit.com/blog/java/defining-high-mid-and-low-level-languages
[3] Benchmark: https://people.sc.fsu.edu/~jburkardt/c_src/prime_openmp/prime_openmp.html
[4] Prime Mpi: https://people.sc.fsu.edu/~jburkardt/c_src/prime_mpi/prime_mpi.html
[5] GitHub MPI Tutorial: https://github.com/wesleykendall/mpitutorial