This page provides a series of coding solutions to common problems encountered in C/C++ CUDA GPU-based programming. Most of these solutions are simple, but finding them online can be surprisingly tricky. The page also includes code for some of my algorithmic innovations developed during my academic career.
The CIS algorithm was designed in-house during my Ph.D. to quickly search for groups within grouped data. It has the additional benefit of being a multithreaded search algorithm built from two traditional search techniques: sequential search and interpolated search. The schematic diagram below is from our paper CATE: A fast and scalable CUDA implementation to conduct highly parallelized evolutionary tests on large scale genomic data, for which the algorithm was originally designed. If this code is of help to you, please cite the above paper; it also contains a more in-depth explanation of the code architecture.
The complete code snippet can be found in CATE's GitHub repository, in the functions.cu file, under the function compound_interpolationSearch.
Here we will take an overview look at the code. First, we weight the search query based on the range of the search space, using a modification of the standard interpolation search algorithm.
// Estimate a starting index by weighting the query position (start_pos) against the value range of the search space
int pos = start + ((((double)(end - start) / (high_Value - low_Value)) * (start_pos - low_Value)));
We also use a modified IF condition that can account for ranges.
// Satisfied when the query range (start_pos to stop_pos) starts within, reaches into, or completely spans the range stored at the estimated position
if ((start_pos >= low_Value_atpos && start_pos <= high_Value_atpos) || (start_pos <= low_Value_atpos && stop_pos <= high_Value_atpos) || (start_pos <= low_Value_atpos && stop_pos >= high_Value_atpos))
If the condition is satisfied, we have found a rough starting point in the search space that matches our query. Afterward, we can use two sequential searches to cover the rest of the search space on either side of this starting point.
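For illustration, here is a minimal, single-threaded sketch of the overall strategy. The Entry struct and all names below are placeholders rather than CATE's actual data structures, and the real implementation runs the two sequential scans on separate threads.

#include <vector>

// Placeholder record: each entry spans the coordinate range [low, high]; entries are assumed sorted by position
struct Entry
{
    int low;
    int high;
};

// Sketch of a compound interpolation search for all entries overlapping the query [start_pos, stop_pos]
std::vector<int> compound_interpolation_search_sketch(const std::vector<Entry> &entries, int start_pos, int stop_pos)
{
    std::vector<int> hits;
    if (entries.empty())
        return hits;

    int start = 0;
    int end = (int)entries.size() - 1;
    int low_Value = entries[start].low;
    int high_Value = entries[end].high;

    // Interpolated estimate of the starting position (same weighting as the snippet above)
    int pos = start;
    if (high_Value > low_Value)
        pos = start + (int)(((double)(end - start) / (high_Value - low_Value)) * (start_pos - low_Value));
    if (pos < start)
        pos = start;
    if (pos > end)
        pos = end;

    // Two sequential searches outward from the estimate; the multithreaded version runs these in parallel
    // and terminates each scan once it moves past the region of interest
    for (int i = pos; i >= start; i--)
    {
        if (start_pos <= entries[i].high && stop_pos >= entries[i].low)
            hits.push_back(i);
    }
    for (int i = pos + 1; i <= end; i++)
    {
        if (start_pos <= entries[i].high && stop_pos >= entries[i].low)
            hits.push_back(i);
    }

    return hits;
}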
Another common requirement is drawing random variates from specific distributions inside a CUDA kernel using cuRAND. The snippet below counts failed Bernoulli trials until the required number of successes is reached, which yields a negative binomial draw: element [1] of the parameter array holds the target number of successes and element [2] the per-trial success probability, and the sample is the value accumulated in failures.

// Count Bernoulli failures until the target number of successes is reached (negative binomial sampling)
int failures = 0;
int successes = 0;
while (successes < cuda_progeny_distribution_parameters_Array[1])
{
    // curand_uniform returns a value in (0, 1]
    float rand_num = curand_uniform(&localState);
    if (rand_num < cuda_progeny_distribution_parameters_Array[2])
    {
        successes++;
    }
    else
    {
        failures++;
    }
}
A gamma variate with an integer shape parameter can be generated in the same way by summing independent exponential draws. Here cuda_shape is the shape and cuda_scale the scale of the gamma distribution; generateExponential, shown below, converts a uniform draw into an exponential variate via inverse transform sampling.

// Sum cuda_shape independent exponential draws with mean cuda_scale to obtain a gamma(shape, scale) variate
float sum = 0.0f;
for (int j = 0; j < cuda_shape; ++j)
{
    sum += generateExponential(&localState, 1.0f / cuda_scale);
}
result = sum;

// Inverse transform sampling: convert a uniform (0, 1] draw into an exponential variate with rate lambda
__device__ float generateExponential(curandState *state, float lambda)
{
    float u = curand_uniform(state);
    return -logf(u) / lambda;
}
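Both snippets assume a per-thread curandState named localState that has already been initialized. A minimal sketch of that setup is shown below; the kernel name, seed handling, and launch arrangement are placeholders rather than code from the original project.

#include <curand_kernel.h>

// Hypothetical setup kernel: give every thread its own RNG state (placeholder names)
__global__ void setup_states(curandState *states, unsigned long long seed, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
    {
        // Same seed, different sequence number per thread, so each thread gets an independent stream
        curand_init(seed, tid, 0, &states[tid]);
    }
}

// Inside the sampling kernel, each thread would copy its state into a local variable,
// draw from it as in the snippets above, and write the state back when finished:
//     curandState localState = states[tid];
//     ... sampling code ...
//     states[tid] = localState;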
A frequent task is passing a vector of strings to the GPU. Since device code cannot use vector<string> directly, the lines are concatenated into a single char array, and a companion index array records where each line starts within it.

// total_lines is assumed to be filled elsewhere (for example, by reading a file)
vector<string> total_lines;
int line_Count = total_lines.size();

// Concatenate all lines into one string and record the start offset of each line
string lines_Concat = "";
int site_Index[line_Count + 1];
site_Index[0] = 0;
for (int i = 0; i < line_Count; i++)
{
    lines_Concat.append(total_lines[i]);
    // Start of line i + 1 is the start of line i plus the length of line i
    site_Index[i + 1] = site_Index[i] + total_lines[i].size();
}

// Copy the concatenated string into a plain C char array
char *full_Char;
full_Char = (char *)malloc((lines_Concat.size() + 1) * sizeof(char));
strcpy(full_Char, lines_Concat.c_str());
total_lines.clear();

// Allocate device-accessible memory and copy both the characters and the index array across
char *cuda_full_Char;
cudaMallocManaged(&cuda_full_Char, (lines_Concat.size() + 1) * sizeof(char));
int *cuda_site_Index;
cudaMallocManaged(&cuda_site_Index, (line_Count + 1) * sizeof(int));
cudaMemcpy(cuda_full_Char, full_Char, (lines_Concat.size() + 1) * sizeof(char), cudaMemcpyHostToDevice);
cudaMemcpy(cuda_site_Index, site_Index, (line_Count + 1) * sizeof(int), cudaMemcpyHostToDevice);
// Kernel: each thread processes one or more lines using a grid-stride loop
__global__ void cuda_process(char *cuda_full_Char, int *cuda_site_Index, int line_Count)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < line_Count)
    {
        // The index array gives the span of this thread's line within the concatenated char array
        int line_Start = cuda_site_Index[tid];
        int line_End = cuda_site_Index[tid + 1];

        // process the line information character by character
        int i = line_Start;
        while (i < line_End)
        {
            // e.g., read cuda_full_Char[i] here
            i++;
        }

        // Grid-stride: jump ahead by the total number of threads in the grid
        tid += blockDim.x * gridDim.x;
    }
}
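For completeness, a launch of the kernel above might look like the following; the block and grid sizes are illustrative choices, not values from the original code.

// Illustrative launch configuration (values are placeholders)
int threads_per_block = 256;
int blocks = (line_Count + threads_per_block - 1) / threads_per_block;

cuda_process<<<blocks, threads_per_block>>>(cuda_full_Char, cuda_site_Index, line_Count);
cudaDeviceSynchronize();

// Release the device and host buffers once processing is complete
cudaFree(cuda_full_Char);
cudaFree(cuda_site_Index);
free(full_Char);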
A 2D array on the GPU can be built as an array of row pointers. The pointer array itself is allocated with cudaMallocManaged, so the host can write the row pointers into it, while each individual row is allocated on the device with cudaMalloc.

// Allocate the array of row pointers, then allocate each row on the device
float **CUDA_2D_array;
cudaMallocManaged(&CUDA_2D_array, rows * sizeof(float *));
for (int row = 0; row < rows; row++)
{
    cudaMalloc((void **)&(CUDA_2D_array[row]), columns * sizeof(float));
}

// Free each row first, then the array of row pointers
for (int i = 0; i < rows; i++)
{
    cudaFree(CUDA_2D_array[i]);
}
cudaFree(CUDA_2D_array);
To copy an existing host-side 2D array (here cpu_array_2D) onto the device, the same pattern is used, with a row-by-row cudaMemcpy after each row is allocated.

float **cuda_array_2D;
cudaMallocManaged(&cuda_array_2D, rows * sizeof(float *));
for (int row = 0; row < rows; row++)
{
    // Allocate the row on the device and copy the corresponding host row into it
    cudaMalloc((void **)&(cuda_array_2D[row]), columns * sizeof(float));
    cudaMemcpy(cuda_array_2D[row], cpu_array_2D[row], columns * sizeof(float), cudaMemcpyHostToDevice);
}

// Clean up: free each row, then the pointer array
for (int row = 0; row < rows; row++)
{
    cudaFree(cuda_array_2D[row]);
}
cudaFree(cuda_array_2D);
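A kernel can then index the structure with the usual two-subscript syntax. The sketch below is illustrative only; the kernel name and its scaling operation are placeholders, not part of the original code.

// Hypothetical kernel: scale every element of the 2D array (placeholder name)
__global__ void scale_2D(float **cuda_array_2D, int rows, int columns, float factor)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < rows * columns)
    {
        int row = tid / columns;
        int column = tid % columns;
        cuda_array_2D[row][column] *= factor;
        tid += blockDim.x * gridDim.x;
    }
}

Note that because each row is a separate cudaMalloc allocation, the rows are not contiguous in memory; flattening the data into a single 1D allocation is often faster, but the pointer-of-pointers form above keeps the familiar [row][column] indexing.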