The general prototype implementation needs to be designed with the following concepts:
What is the total problem space which needs to be examined?
What will be the current input set and how can one determine the problem space from the inputs?
Given the current problem space, how many threads are to be launched and what will be the size of the thread blocks?
What instruction set will each thread follow, and how much of the problem space will each thread examine?
What will be the desired outputs which will be returned to the host?
First Example: Visit every unique three index combination of an array
There are N choose 3 possibilities, and for an example N of 1000 there would be 166,167,000 unique combinations of indices. In the combination case order does not matter, unlike the permutation example.
On the CPU one would simply use a triple nested loop to implement this problem. But you do not need to visit an arrangement more than once, so it makes send to limit the loop like this:
CPU 3 combo
for(int i=0;i<N;i++){
for(int j=0;j<i;j++){
for(int k=0;k<j;k++){
...do stuff
}
}
}
For the GPU implementation the challenge then becomes how to limit the number of 3 index combinations so that there is no repetition.