Basic Programming

Watch this YouTube Video [1] to understand the concepts of programming.

Programming is the process of taking an algorithm and encoding it into a notation that can be converted or compiled into instructions (strings of 0s and 1s) that a machine or computer understands.

Important Notes

The modules on Pioneer/Markov RH8 are different; please visit this site. For example, for MPI (Message Passing Interface), check the available OpenMPI versions using:
- module spider OpenMPI
Also, there is no PGI module for OpenACC; use the NVHPC module instead:
- module load NVHPC

Copy all the required files from /usr/local/doc/BOOTCAMP, or get them from GitHub (https://github.com/sxg125/Basic-Programming), and cd to the resulting directory:

cp -r /usr/local/doc/BOOTCAMP/bootcamp .

cd bootcamp

git clone https://github.com/sxg125/Basic-Programming

cd Basic-Programming

Syntax and Semantics

Programming languages (C/C++/C#, FORTRAN, Java, Python, Perl, Matlab, Mathematica, R, PHP, Scala, and Ruby, to name a few) each have their own syntax (the structure or grammar of the statements) and semantics (the meaning of the statements).

for (i = 1; i <= 10; i++) printf ("%d",i); // syntax for C

for i = 1:10,disp(i),end % syntax for Matlab

The semantics of both statements is the same: print the integers 1 to 10.

To make statements more human readable, all programming languages provide a certain degree of freedom in layout. The Matlab statement can be rewritten as:

for i = 1:10

disp(i)

end
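For comparison, the same semantics can be expressed in a third syntax. The Python sketch below is an illustration added here, not one of the bootcamp files; the function name one_to_ten is hypothetical.

```python
def one_to_ten():
    """Return the integers 1 through 10, the sequence the C and Matlab loops print."""
    # range(1, 11) stops before 11, so it yields 1..10 inclusive
    return list(range(1, 11))


# Same meaning as the C and Matlab loops: visit each value and display it
for i in one_to_ten():
    print(i)
```

The loop body is marked by indentation in Python, where C uses a statement terminator (;) and Matlab uses the end keyword.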

Algorithm

An algorithm describes the solution to a problem in terms of the data needed to represent the problem instance and the steps required to get the result. An algorithm is usually represented as a flowchart, as shown in Fig. 1, and is independent of any particular programming language.

Fig. 1: Flowchart diagram to add two numbers
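As a minimal sketch, the steps of an add-two-numbers flowchart (read the inputs, compute the sum, display the result) might look like this in Python. The function name add_two_numbers is hypothetical, and the values 5 and 10 are the ones used in this tutorial's bash example.

```python
def add_two_numbers(a, b):
    """Process step of the flowchart: compute the sum of the two inputs."""
    result = a + b
    return result


# Input step: the data that represents this problem instance
a = 5
b = 10

# Output step: display the result
print(add_two_numbers(a, b))  # prints 15
```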

Programming Environment

When you are logged in to HPC, you are in a bash shell programming environment. It can represent both the process and the data. You can implement addition operations in this programming environment following the flowchart in Fig. 1.

Request a compute node to work on. For Markov, set the account (-A) and partition (-p) flags appropriately.

srun --pty bash

Type each statement below and press Enter. This is the interactive method of programming, using the Command Line Interface/Interpreter (CLI). You can copy and paste instead. You can skip the # character and anything that follows it on a line: these are comments, which help the programmer understand the semantics easily.

#!/bin/bash

a=5 # declare and read

b=10

result=$(($a + $b)) # sum operation

echo $result # display a sum

You should get 15. Here a, b, and result are the variables, the placeholders for the data; the equal sign (=), plus sign (+), and echo are the operators and built-in commands of bash. The details of how a and b are assigned the values 5 and 10 with =, added with +, and the result displayed with echo are hidden from the user; this is called data abstraction. The user interacts only with the interface specified by the Abstract Data Type (ADT), as shown in Fig. 2.

Fig. 2: Data Abstraction
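To make the idea of an interface that hides its implementation concrete, here is a small Python sketch. The class IntegerBox and its methods are hypothetical names invented for this illustration; storing the value as a decimal string is a deliberately odd choice, made only to show that callers never depend on the storage.

```python
class IntegerBox:
    """A tiny abstract data type: callers use add() and show() only."""

    def __init__(self, value=0):
        # Hidden representation (an arbitrary choice for this sketch):
        # the integer is kept as a decimal string.
        self._digits = str(int(value))

    def add(self, other):
        """Return a new IntegerBox holding the sum of the two boxes."""
        return IntegerBox(int(self._digits) + int(other._digits))

    def show(self):
        """Return the stored value as an int (the 'display' operation)."""
        return int(self._digits)


a = IntegerBox(5)
b = IntegerBox(10)
print(a.add(b).show())  # prints 15
```

If the hidden representation were later changed to a plain int, code that uses only add() and show() would keep working unchanged.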

You can combine the bash statements into a single command to get the same result.

echo $((5 + 10))

or

expr 5 + 10

A more convenient way of programming is to write all those commands in a file, which is called a script or source file. Check the content of the script "add.sh".

cat add.sh

Get the same result as before by running the script

sh add.sh

Insert a space between the variable names and the equal sign in the "add.sh" file, as shown below, and run the script.

...

a =5

b =10

...

You will see a bunch of errors:

add.sh: line 2: a: command not found

add.sh: line 3: b: command not found

add.sh: line 4: + : syntax error: operand expected (error token is "+ ")

With the space introduced, bash treats the first word “a” as a command, but there is no such command. The “+” operation expects two integer values, but they were never assigned because the assignments failed. These errors play the role that compiler errors play in a compiled language: the interpreter cannot turn the notation into instructions that the computer understands, so it reports each error with its line number to help the user correct it. Here, there are errors in line 2, line 3, and line 4. Correcting the first two also takes care of the third.

Let’s try the same addition with the Matlab, R, and Python programming languages.

Load the Matlab, R, and Python modules

module load matlab

module load R

module load python

Type the following to open the Matlab Command Prompt

matlab -nodisplay

Type the following in the prompt and get the same result. Then exit

>> 10 + 5

ans =

    15

>> quit

Type the following to open the Python Command Prompt

python3

Type the following in the prompt and get the same result. Then exit.

>>> 10 + 5

15

>>> quit()

Type the following to open the R Command Prompt

R

Type the following in the prompt and get the same result. Then exit.

> 5 + 10

[1] 15

> q()

Check the Python, Matlab, and R scripts. The file extension is optional for a bash script, but Python, Matlab, R, and C source files need the proper file extensions: .py, .m, .R, and .c, respectively.

cat add.py

cat add.m

cat add.R

Run the scripts and get the same result.

python3 add.py

matlab -nodisplay -r add

R CMD BATCH add.R

Note the -nodisplay and -r (run) flags. Also, for Matlab the script name after -r is given without the .m extension. The output from the R script is written to add.Rout.

Language Spectrum

See the language spectrum [2] in Fig. 3 showing the levels of language from lower (red) to higher (green).

Fig. 3: Language Spectrum.

Programming languages like Python and Matlab are high-level languages because they place many abstraction layers above the processor. They are also called interpreted languages because an interpreter reads and executes the original code. Low-level languages like C and Fortran allow more direct access to registers and memory locations, so they offer superb performance. They are called compiled languages because a compiler translates the code into instructions specific to the target machine, known as machine code. Assembly language uses operation codes (opcodes) such as MOV and ADD for register-level operations.

The choice of programming language depends on the problem at hand. If you are writing a kernel, an operating system, or firmware for microcontrollers, a high-level language is generally not up to the job. On the other hand, you would not want to use a low-level language to write a web framework, even though low-level languages can do just about anything.

Check the C equivalent code for addition.

cat add.c

output:

#include <stdio.h> //directive to include function declarations and macro definitions

int main()

{

int a = 5;

int b = 10;

int result = a + b;

printf("result=%d",result);

return 0;

}

You need to compile the C code (compiled language) first to get the executable

gcc -o add add.c

It creates the executable "add". In the absence of the -o flag, gcc generates "a.out" by default. For details, refer to the HPC guides on Compiling & Linking and Debugging Segmentation Fault.

Execute the code to get the same result.

./add

Note: ./ implies that you are running the executable from the current directory.

Let’s trigger a compiler warning by deleting the directive line "#include <stdio.h>" and compiling again.

add.c: In function ‘main’:
add.c:8:5: warning: incompatible implicit declaration of built-in function ‘printf’
     printf("Result=%d\n",result);
     ^

Note: printf is declared in the header file "stdio.h"

Basic Elements

The basic elements of all programming languages are data types, variables, logic, loops, branches, and functions. Locate them in the example Matlab script, primeNumbers.m. The program counts the prime numbers between a lower bound and an upper bound.

cat primeNumbers.m

output:

% function that takes two input variables, lower and upper, and store the count in the output variable total.

function [total time] = prime(lower, upper)
total = 0; % By default data type of variable in Matlab is double
% start a timer to benchmark the main loop
ticID = tic;
for i = lower : upper % for loop: check primality for each integer value in the range
    isprime = 1; % TRUE
    if i <= 1 % Conditional logic to check if i is smaller than or equal to 1
        isprime = 0; % branch to this statement if it is TRUE
    elseif i == 2
        isprime = 1; % TRUE
    else
        for j = 2 : i-1
            if ( mod (i, j) == 0 )
                isprime = 0; %FALSE
            end
        end
    end %if
    if isprime == 1
        total = total + 1;
    end
end %for
% stop the timer
time = toc(ticID)

The equivalent code in C is primeNumbers.c and in Python is primeNumbers.py.
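For readers without the files at hand, the following is a minimal Python sketch of the same trial-division count. It is not the repository's primeNumbers.py; the function name count_primes is hypothetical, and the early break is an optimization that the Matlab listing above does not have. A small upper bound is used so that the sketch runs quickly.

```python
import time


def count_primes(lower, upper):
    """Count the primes in [lower, upper] by trial division."""
    total = 0
    for i in range(lower, upper + 1):      # loop over every candidate
        is_prime = i > 1                   # logic: 0 and 1 are not prime
        for j in range(2, i):              # try every possible divisor
            if i % j == 0:                 # branch: a divisor was found
                is_prime = False
                break                      # stop early (not in the Matlab listing)
        if is_prime:
            total += 1
    return total


start = time.perf_counter()
total = count_primes(2, 1000)
elapsed_time = time.perf_counter() - start
print("Number of Prime Numbers =", total)   # 168 primes up to 1000
print("Execution Time =", elapsed_time)
```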

In the python script (primeNumbers.py), create an error by replacing print with Print at the end:

...

print("Number of Prime Numbers = ",total)

Print("Execution Time = ", elapsed_time)

Run the python code

python3 primeNumbers.py

output:

('Number of Prime Numbers = ', 12251)
Traceback (most recent call last):
  File "primeNumbers.py", line 21, in <module>
    Print("Execution Time = ", elapsed_time)
NameError: name 'Print' is not defined

So, despite the error on the last line, the interpreter still executes the statement that prints the Number of Prime Numbers. The misspelled Print is a valid identifier, so the mistake is only detected at run time, when the name is looked up (a NameError).

In a compiled language, the code needs to be error free to create the executable. In Python, the instructions are first converted into bytecode, or p-code (portable code), and the interpreter then executes one bytecode instruction at a time. The bytecodes are compact numeric codes, constants, and references that encode the result of parsing and semantic analysis. This allows much better performance than direct interpretation of source code as in bash, where the interpreter interprets one statement after another and much of the time is spent on lexical analysis, parsing, and launching the programs called. Because the whole Python file must be parsed before any of it runs, a true syntax error prevents every statement from executing. So, the bash shell script (add.sh) prints the result but the Python script (add.py) does not if the errors (Echo $b and Print b) are introduced after the addition, as shown below.

sh add.sh
15
add.sh: line 6: Echo: command not found # Echo instead of echo

python add.py
  File "add.py", line 5
    Print b
          ^
SyntaxError: invalid syntax # Print instead of print
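The difference between the two Python failures (a NameError after some output versus a SyntaxError with no output at all) can be reproduced in a few lines. This is a sketch using Python's built-in compile() and exec(); the helper name run_source and the two source strings are made up for this illustration.

```python
import contextlib
import io


def run_source(source):
    """Compile, then execute `source`; return (printed_output, error_name_or_None)."""
    try:
        # The whole source is parsed into bytecode before anything runs
        code = compile(source, "<example>", "exec")
    except SyntaxError as err:
        return "", type(err).__name__
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})                 # run the bytecode, one statement after another
    except Exception as err:
        return buffer.getvalue(), type(err).__name__
    return buffer.getvalue(), None


runtime_case = 'print(5 + 10)\nPrint("done")\n'   # misspelled name, valid syntax
syntax_case = 'print(5 + 10)\nPrint b\n'          # invalid syntax on line 2

print(run_source(runtime_case))  # ('15\n', 'NameError')
print(run_source(syntax_case))   # ('', 'SyntaxError')
```

In the first case the sum is printed before the bad name is reached; in the second case compilation fails, so nothing is executed.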

Parallelizing Code

More than one processor can be applied to the code, dividing the task among the processors. Using the OpenMP interface, we can easily parallelize the serial C code by including pragma directives. Check the file "primeNumbersOmp.c", which has additional "omp" pragmas before the for loop, as shown:

/* Each thread has its own private copies of i, j, and isprime.

Modifications made to them are not visible to other threads.

So, each thread sees only its part of the iterations, i.e.

the integer values assigned to it to test for primality. Each thread modifies its own copy of isprime.

The shared variable upper is visible to all threads

and there is no need to create local copies. */

#pragma omp parallel shared (upper) private (i,j,isprime)

/* Here, each thread calculates its own private copy of the output variable total.

The partial values of total from each thread are combined (summation) on exit. */

#pragma omp for reduction (+ : total)

for (i = lower; i<= upper; i++)

{

isprime = 1;

for ( j=2; j < i; j++)

{

if (i % j == 0)

...
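The pattern behind the two pragmas (split the iterations, keep a private partial total per worker, then sum the partial totals) can be sketched in Python with a thread pool. This illustrates the data-sharing pattern only: it is not OpenMP, the names is_prime, count_chunk, and parallel_count are hypothetical, and standard CPython threads do not normally speed up CPU-bound loops like this one.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(i):
    """Trial division, as in the C loop."""
    if i < 2:
        return False
    for j in range(2, i):
        if i % j == 0:
            return False
    return True


def count_chunk(bounds):
    """Worker: uses only local (private) variables for its sub-range."""
    start, stop = bounds
    partial = 0                            # private partial total
    for i in range(start, stop):
        if is_prime(i):
            partial += 1
    return partial


def parallel_count(lower, upper, workers=4):
    """Split [lower, upper] into contiguous chunks and sum the partial totals."""
    span = upper - lower + 1
    step = max(1, -(-span // workers))     # ceiling division, at least 1
    chunks = [(s, min(s + step, upper + 1)) for s in range(lower, upper + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    return sum(partials)                   # plays the role of reduction (+ : total)


print(parallel_count(2, 1000))  # prints 168
```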

The code can also be parallelized using Message Passing Interface (MPI) libraries.

It is also possible to engage the hundreds or thousands of cores on Graphics Processing Unit (GPU) cards for the computationally intensive part of the code. The OpenACC interface lets you take advantage of GPUs by using "acc" pragmas. CUDA programming is more powerful but has a steep learning curve. Check the file "primeNumbersAcc.c", which has the pragma before the for loop, as shown below:

...

#pragma acc kernels

for (i = lower; i<= upper; i++)

...

Note that incorrect use of pragmas can not only degrade performance but also produce wrong results. A hybrid implementation with both OpenMP and OpenACC is also possible to boost performance. Please visit the HPC OpenACC Guide for details.

For MPI implementation, check the source file "primeNumbersMpi.c".

cat primeNumbersMpi.c

output:

#include<stdio.h>

#include "mpi.h"

#include <stdlib.h>

#include <assert.h>

int main(int argc, char *argv[])

{

int lower,upper,total,i,j,isprime;

int rank, size;

int local_total = 0;

int global_total = 0;

double time_initial,time_current,time;

//Add in MPI startup Routines

// Launch the MPI processes in each node

MPI_Init(&argc, &argv);

//Initialize the time

time_initial = MPI_Wtime();

// Get the rank (id) of this process; the master process has rank 0

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

//Get the number of processes launched by MPI

MPI_Comm_size(MPI_COMM_WORLD, &size);

// Partial count of Prime Numbers

local_total = 0;

lower = 2;

upper = 131072;

//Broadcast the upper limit i.e. that copy is shared among the processors

MPI_Bcast ( &upper, 1, MPI_INT, 0, MPI_COMM_WORLD );

for (i = lower + rank; i<= upper; i=i+size)

{

isprime = 1;

// MPI_Bcast ( &upper, 1, MPI_INT, 0, MPI_COMM_WORLD );

for ( j=2; j<i; j++)

{

if (i % j == 0)

{

isprime = 0;

break;

}

}

local_total+=isprime;

}

//Summation operation: the local sum (local_total) calculated in each process is combined into the global sum (global_total)

MPI_Reduce(&local_total,&global_total,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);

time_current=MPI_Wtime();

time = time_current - time_initial;

if (rank == 0) {

printf("Total Prime Numbers = %d\n",global_total);

printf("ElapsedTime=%.3f\n",time);

}

// Blocks until all the processes have reached this routine

MPI_Barrier(MPI_COMM_WORLD);

MPI_Finalize();

return 0;

}
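The loop "for (i = lower + rank; i <= upper; i = i + size)" hands out the integers to the processes in round-robin order. The Python sketch below simulates that distribution in a single process, without MPI, to show what each rank would count and how the sum plays the role of MPI_Reduce with MPI_SUM. The function names are hypothetical, and a small upper bound is used so it runs quickly.

```python
def is_prime(i):
    """Trial division, as in the C loop."""
    if i < 2:
        return False
    for j in range(2, i):
        if i % j == 0:
            return False
    return True


def local_count(rank, size, lower, upper):
    """What one simulated rank computes: i = lower + rank, lower + rank + size, ..."""
    return sum(1 for i in range(lower + rank, upper + 1, size) if is_prime(i))


def simulated_reduce(size, lower, upper):
    """Compute every rank's local total, then combine them by summation."""
    local_totals = [local_count(rank, size, lower, upper) for rank in range(size)]
    return local_totals, sum(local_totals)


local_totals, global_total = simulated_reduce(4, 2, 1000)
print(local_totals, global_total)
```

With 4 ranks and lower = 2, ranks 0 and 2 only ever receive even numbers, so rank 0 finds just the prime 2 and rank 2 finds none; the round-robin split divides the integers evenly but not the primes.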

Matlab (MATrix LABoratory), a numerical computing environment, provides different flavors of parallelism, including GPUs. The simplest can be achieved by replacing the "for" loop with a "parfor" loop (not nested), as done in the primeNumbersPar.m file.

...

% for i = lower : upper

parfor i = lower : upper

...

For MDCS (Matlab Distributed Computing Server) and GPU, visit HPC Guide to Matlab.

Benchmarking

Benchmarking helps to evaluate the performance of a program compared to the standard benchmark results.
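The measurements in this section rely on wall-clock timers (the shell's time command, MPI_Wtime, and Matlab's tic/toc). A comparable timer in Python might look like the sketch below; the helper name benchmark and the choice of keeping the best of three runs are assumptions made for this illustration.

```python
import time


def benchmark(func, *args, repeats=3):
    """Run func(*args) `repeats` times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()        # high-resolution wall-clock timer
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


result, seconds = benchmark(sum, range(100000))
print(result, f"{seconds:.6f} s")
```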

Notes:

- Update the Slurm script file with the correct partition (-p) and account (-A). Check HPC Resource View for details.
- The modules might have changed. Check Module system for details. For example, for OpenACC, the PGI module is not available, so use the NVHPC module.

Let's compare the execution time for the primeNumbers code/script in different languages and for various flavors of parallelism in Matlab and C programming with this benchmark. Use the SLURM job script (runPrime.slurm), which itself uses the bash environment.

Check the job file.

cat runPrime.slurm

output:

#!/bin/bash

#SBATCH -J Benchmarks

#SBATCH --time=24:00:00

#SBATCH -N 1

#SBATCH -c 4

##SBATCH -n 4

#SBATCH --mem=5g

#SBATCH -p gpu -C gpu2080 --gres=gpu:1 # use the proper partition and account for the class

echo "The job is running in $SLURM_NODELIST"

NPROCS=$(( $SLURM_NNODES * $SLURM_CPUS_PER_TASK ))

#Copy the script and other input files to the scratch directory and change directory

cp primeNumbers.c primeNumbers.m primeNumbers.sh primeNumbers.py primeNumbersPar.m primeNumbersOmp.c primeNumbersMpi.c primeNumbersAcc.c $PFSDIR

cd $PFSDIR

# Compile C program

gcc -o prime primeNumbers.c

# Compile C program with OpenMP

gcc -o primeOmp -fopenmp primeNumbersOmp.c

# Compile C program with MPI

module load OpenMPI

mpicc -o primeMpi primeNumbersMpi.c

# Compiling C Program with OpenACC

module load NVHPC/23.1-CUDA-12.0.0

nvc -Minfo=all -acc -gpu=cc75 primeNumbersAcc.c -o primeAcc

#Execute

echo "running serial ..."

time ./prime

echo "running Python ..."

python3 primeNumbers.py

echo "running parallel OpenMP ..."

export OMP_NUM_THREADS=$NPROCS

echo "Number of Threads = $NPROCS"

time ./primeOmp

echo "running parallel MPI"

mpirun ./primeMpi

echo "running in GPUs using OpenACC ..."

time ./primeAcc

#MATLAB

#Load MATLAB module

module load matlab

#MATLAB Preference Setting

matlab_prefdir="/tmp/$USER/matlab/`hostname`_PID$$"

test -d $matlab_prefdir || mkdir -p $matlab_prefdir

export MATLAB_PREFDIR="$matlab_prefdir"

#RUN MATLAB script

echo "Running Matlab Serial ..."

matlab -singleCompThread -nodisplay -r 'primeNumbers(2,131072)'

echo "Running Matlab Parallel parfor ..."

matlab -singleCompThread -nodisplay -r 'primeNumbersPar(2,131072)'

# quit

echo "running Bash ..."

time ./primeNumbers.sh

cp -r * $SLURM_SUBMIT_DIR

Submit the job:

sbatch runPrime.slurm

Check the partial output while the job is running. To stop following the output, press Ctrl + C (this does not cancel the job). You can also use the "cat" command.

tail -f slurm-<jobid>.out

For an MPI job, we need to request tasks with -n 4 instead of -c 4. Let's request the same gpu partition:

srun -p gpu -N 2 -n 4 --pty bash

Compile the MPI code "primeNumbersMpi.c":

mpicc -o primeMpi primeNumbersMpi.c

Run the executable

mpirun ./primeMpi

output:

Total Prime Numbers = 12251

ElapsedTime=1.591

Performance Table - Serial, parallel (4 processors), and GPU.

(Note: Matlab Parfor may take longer in the first run)

Note the superb performance of the C programming language compared to MATLAB. Also, we can't expect a 4 times speedup by employing 4 processors: there are scheduling and communication overheads in thread management. Bash shell scripting has the worst performance because each statement is interpreted one at a time. The bash shell is, therefore, recommended for simple scripts; it is excellent at pipe operations. Perl excels at text analysis, while Python is a more general-purpose, popular language with a larger active user community.
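Speedup and parallel efficiency can be computed from any pair of measured times. The sketch below uses made-up timings (8.0 s serial, 2.5 s on 4 processors) purely for illustration; they are not measurements from this benchmark, and the function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical timings, for illustration only
t_serial = 8.0
t_parallel = 2.5
processors = 4

print(speedup(t_serial, t_parallel))                 # 3.2, below the ideal 4
print(efficiency(t_serial, t_parallel, processors))  # 0.8
```

An efficiency below 1 reflects the overheads mentioned above.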

References:

[1] What is Programming (Khan Academy) - YouTube Video

[2] Language Spectrum: http://www.codecommit.com/blog/java/defining-high-mid-and-low-level-languages

[3] Benchmark: https://people.sc.fsu.edu/~jburkardt/c_src/prime_openmp/prime_openmp.html

[4] Prime Mpi: https://people.sc.fsu.edu/~jburkardt/c_src/prime_mpi/prime_mpi.html

[5] GitHub MPI Tutorial: https://github.com/wesleykendall/mpitutorial
