Algorithms 

This page provides a series of coding solutions for some common problems encountered in C/C++ CUDA GPU-based programming. They address mostly simple problems, but ones for which solutions are surprisingly hard to find online. Additionally, the page contains code for some of the algorithmic innovations I have developed during my academic career.

Algorithmic innovations

Compound Interpolated Search (CIS)

The CIS algorithm was designed in-house during my Ph.D. to quickly search for groups within grouped data. It has the additional benefit of being a multithreaded search algorithm built from two traditional search techniques: sequential search and interpolated search. The schematic diagram below is from our paper CATE: A fast and scalable CUDA implementation to conduct highly parallelized evolutionary tests on large scale genomic data, for which it was originally designed. If this code is of help to you, please cite the paper above; it also provides a more in-depth explanation of the code architecture.

The complete code snippet can be found in CATE's GitHub in the functions.cu file under the function compound_interpolationSearch.

Here we take an overview look at the code. First, we weight the search query against the range of the search space to estimate its likely position, using a modification of the standard interpolation search algorithm.

int pos = start + ((((double)(end - start) / (high_Value - low_Value)) * (start_pos - low_Value)));

We also use a modified if condition that can account for ranges rather than single values.

if ((start_pos >= low_Value_atpos && start_pos <= high_Value_atpos) || (start_pos <= low_Value_atpos && stop_pos <= high_Value_atpos) || (start_pos <= low_Value_atpos && stop_pos >= high_Value_atpos))

If this condition is satisfied, we have found the rough starting point in the search space that satisfies our query.

Afterward, two sequential searches can be used to cover the rest of the search space, as illustrated in the sketch below.
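
The sketch below is a minimal, single-threaded illustration of the idea and not the CATE implementation itself (see compound_interpolationSearch in functions.cu for that). It assumes the data are records sorted by a start coordinate and simplifies the query to a single position; the structure and variable names are placeholders.

#include <vector>
using namespace std;

// Minimal sketch of a compound interpolated search (illustrative, not the CATE code).
// Records are sorted by their start coordinate; each record covers [start, stop].
struct Record
{
    int start;
    int stop;
};

// Returns the index of the record whose range contains query_pos, or -1 if none does.
int compound_interpolated_search_sketch(const vector<Record> &records, int query_pos)
{
    int start = 0;
    int end = (int)records.size() - 1;
    if (end < 0)
        return -1;

    int low_Value = records[start].start;
    int high_Value = records[end].start;
    if (query_pos < low_Value)
        return -1;

    // Interpolation step: weight the query against the range of the search space
    // to jump close to the region that should contain it.
    int pos = start;
    if (high_Value > low_Value)
    {
        pos = start + (int)(((double)(end - start) / (high_Value - low_Value)) * (query_pos - low_Value));
        if (pos > end)
            pos = end;
    }

    // Range-aware check at the interpolated position.
    if (query_pos >= records[pos].start && query_pos <= records[pos].stop)
        return pos;

    // Two sequential searches from the interpolated position, one scanning forward
    // and one scanning backward, cover the rest of the search space.
    for (int i = pos + 1; i <= end; i++)
        if (query_pos >= records[i].start && query_pos <= records[i].stop)
            return i;
    for (int i = pos - 1; i >= start; i--)
        if (query_pos >= records[i].start && query_pos <= records[i].stop)
            return i;

    return -1;
}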

Code solutions

GPU-based CUDA Negative Binomial distribution

int failures = 0;
int successes = 0;

// cuda_progeny_distribution_parameters_Array[1] holds the target number of successes
// and [2] holds the per-trial success probability; localState is the thread's curandState.
while (successes < cuda_progeny_distribution_parameters_Array[1])
{
    float rand_num = curand_uniform(&localState);

    if (rand_num < cuda_progeny_distribution_parameters_Array[2])
    {
        successes++;
    }
    else
    {
        failures++;
    }
}
// failures now holds the Negative Binomial draw (failures before reaching the target successes).
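
As a usage sketch, the loop above can sit inside a grid-stride kernel along the following lines; the kernel name, the pre-seeded curandState array, and the output array are illustrative assumptions, not part of the original code.

#include <curand_kernel.h>

// One Negative Binomial draw per thread (illustrative wrapper around the loop above).
__global__ void negative_binomial_draws(curandState *states, float *parameters, int *draws, int num_draws)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while (tid < num_draws)
    {
        curandState localState = states[tid];

        int failures = 0;
        int successes = 0;

        // parameters[1] = target number of successes, parameters[2] = success probability.
        while (successes < parameters[1])
        {
            float rand_num = curand_uniform(&localState);

            if (rand_num < parameters[2])
            {
                successes++;
            }
            else
            {
                failures++;
            }
        }

        draws[tid] = failures;
        states[tid] = localState;

        tid += blockDim.x * gridDim.x;
    }
}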

GPU-based CUDA Gamma distribution

// Gamma(shape, scale) draw built as the sum of cuda_shape exponential draws.
// Note that this sum-of-exponentials construction is exact only for integer shape values.
float sum = 0.0f;

for (int j = 0; j < cuda_shape; ++j)
{
    sum += generateExponential(&localState, 1.0f / cuda_scale);
}

result = sum;

// Exponential draw with rate lambda via inverse transform sampling;
// curand_uniform returns a value in (0, 1], so logf(u) is always defined.
__device__ float generateExponential(curandState *state, float lambda)
{
    float u = curand_uniform(state);
    return -logf(u) / lambda;
}
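
Assembled into one compilable unit, the pieces might look as follows; the kernel name and its parameters are illustrative assumptions. Note that generateExponential has to be defined (or at least declared) before the kernel that calls it.

#include <curand_kernel.h>

__device__ float generateExponential(curandState *state, float lambda)
{
    float u = curand_uniform(state);
    return -logf(u) / lambda;
}

// One Gamma(cuda_shape, cuda_scale) draw per thread, for integer shape values.
__global__ void gamma_draws(curandState *states, int cuda_shape, float cuda_scale, float *results, int num_draws)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while (tid < num_draws)
    {
        curandState localState = states[tid];

        float sum = 0.0f;
        for (int j = 0; j < cuda_shape; ++j)
        {
            sum += generateExponential(&localState, 1.0f / cuda_scale);
        }

        results[tid] = sum;
        states[tid] = localState;

        tid += blockDim.x * gridDim.x;
    }
}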

GPU CUDA process multiple lines of strings using a 1-D Array

vector<string> total_lines;
int line_Count = total_lines.size();

// Concatenate all lines into a single string and record the start index of each line,
// so that individual lines can be located inside the flat character array on the GPU.
string lines_Concat = "";
int *site_Index = (int *)malloc((line_Count + 1) * sizeof(int));
site_Index[0] = 0;

for (int i = 0; i < line_Count; i++)
{
    lines_Concat.append(total_lines[i]);
    site_Index[i + 1] = site_Index[i] + total_lines[i].size();
}

char *full_Char;
full_Char = (char *)malloc((lines_Concat.size() + 1) * sizeof(char));
strcpy(full_Char, lines_Concat.c_str());
total_lines.clear();

char *cuda_full_Char;
cudaMallocManaged(&cuda_full_Char, (lines_Concat.size() + 1) * sizeof(char));

int *cuda_site_Index;
cudaMallocManaged(&cuda_site_Index, (line_Count + 1) * sizeof(int));

cudaMemcpy(cuda_full_Char, full_Char, (lines_Concat.size() + 1) * sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(cuda_site_Index, site_Index, (line_Count + 1) * sizeof(int), cudaMemcpyHostToDevice);


__global__ void cuda_process(char *cuda_full_Char, int *cuda_site_Index, int line_Count)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Grid-stride loop: each thread handles one line per iteration.
    while (tid < line_Count)
    {
        int line_Start = cuda_site_Index[tid];
        int line_End = cuda_site_Index[tid + 1];

        // process the line information, e.g. by walking its characters in cuda_full_Char
        int i = line_Start;
        while (i < line_End)
        {
            i++;
        }

        tid += blockDim.x * gridDim.x;
    }
}
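
A host-side launch of the kernel above might look like the following; the block and grid sizes are illustrative choices rather than values from the original code.

int threads_per_Block = 256;                                  // illustrative choice
int blocks = (line_Count + threads_per_Block - 1) / threads_per_Block;

cuda_process<<<blocks, threads_per_Block>>>(cuda_full_Char, cuda_site_Index, line_Count);
cudaDeviceSynchronize();                                      // wait for the kernel to complete

cudaFree(cuda_full_Char);
cudaFree(cuda_site_Index);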

GPU Create CUDA 2-D Arrays

// The outer array of row pointers is managed memory so it can be filled on the host;
// each row itself is allocated in device memory.
float **CUDA_2D_array;
cudaMallocManaged(&CUDA_2D_array, rows * sizeof(float *));

for (int row = 0; row < rows; row++)
{
    cudaMalloc((void **)&(CUDA_2D_array[row]), columns * sizeof(float));
}

// Free each row before freeing the array of row pointers.
for (int row = 0; row < rows; row++)
{
    cudaFree(CUDA_2D_array[row]);
}

cudaFree(CUDA_2D_array);
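
A device kernel can then index the structure with ordinary two-dimensional syntax; the zero-filling kernel below is purely an illustration and not part of the original code.

__global__ void fill_Zero(float **CUDA_2D_array, int rows, int columns)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // One cell per thread, grid-stride loop.
    while (tid < rows * columns)
    {
        int row = tid / columns;
        int column = tid % columns;

        CUDA_2D_array[row][column] = 0.0f;

        tid += blockDim.x * gridDim.x;
    }
}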

GPU Copy to CUDA 2-D Arrays

float **cuda_array_2D;
cudaMallocManaged(&cuda_array_2D, rows * sizeof(float *));

// Allocate each device row and copy the corresponding host row (cpu_array_2D) into it.
for (int row = 0; row < rows; row++)
{
    cudaMalloc((void **)&(cuda_array_2D[row]), columns * sizeof(float));
    cudaMemcpy(cuda_array_2D[row], cpu_array_2D[row], columns * sizeof(float), cudaMemcpyHostToDevice);
}

for (int row = 0; row < rows; row++)
{
    cudaFree(cuda_array_2D[row]);
}

cudaFree(cuda_array_2D);
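
The reverse transfer, device back to host, follows the same row-by-row pattern; the snippet below assumes cpu_array_2D was allocated with the same rows and columns.

// Copy each device row back into the corresponding host row.
for (int row = 0; row < rows; row++)
{
    cudaMemcpy(cpu_array_2D[row], cuda_array_2D[row], columns * sizeof(float), cudaMemcpyDeviceToHost);
}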