Generating and Evaluating all Permutations of an Array using CUDA

Code written and copyright by Oleg Konings 2014.

The author would like to thank Norbert Juffa for his assistance during the code optimization process.

"Sometimes you need to examine every possibility..."

The Problem:

How to evaluate all N! permutations of an array against a test function in the least amount of time.

Why this matters:

While for many brute-force/exhaustive-search problems there exist better (quicker) solutions than examining each possibility, that is not always the case.

In situations where there is no other approach, the implementation of the algorithm and the hardware that executes it are the determining factors in the overall running time.

Possible 'real world' applications:

  • To determine the optimal configuration of items whose functionality and utility depend on their order and position (or geometric placement) relative to other items

  • To find the most pleasing sequence of musical notes

  • To determine how many valid words can be made from a set of distinct letters

  • The subset of directed acyclic graph problems which do not have pseudo-polynomial solutions

What existing library functions can evaluate all permutations of an array?:

The well-respected C++ STL offers the next_permutation() function, which can be used to generate all permutations of an array on a CPU.

http://www.cplusplus.com/reference/algorithm/next_permutation/
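As a quick illustration, here is the standard enumeration idiom (a minimal sketch, separate from the measured routine shown later):

#include <algorithm>

void enumerate_all_orderings(){
    int a[3] = {0, 1, 2};                          // must start sorted to visit all 3! = 6 orderings
    do{
        // ... evaluate the current ordering in a[] here ...
    }while(std::next_permutation(a, a + 3));       // returns false after the final (descending) ordering
}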

What are the limitations of the std::next_permutation() function?:

Even on an overclocked CPU, generating and evaluating all arrangements of 13 array elements (13! = 6,227,020,800 arrangements) can take over 2 minutes, and once you get past 15 elements a CPU implementation will take a very long time (days) to complete.
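For a sense of the growth involved (these values also appear in the H_F factorial table in the source below):

13! = 6,227,020,800
14! = 87,178,291,200
15! = 1,307,674,368,000

Each added element multiplies the search space, and therefore the running time, by roughly the new element count.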

Can GPUs solve this problem more quickly? If so how?:

GPUs can have thousands of threads concurrently examine different sections of the problem space at the same time, rather than generating and evaluating permutations in serial fashion like next_permutation().

Even though the clock speed of a GPU is 1/3 to 1/4 that of a high-end CPU, GPUs make up for that through SIMT (Single Instruction, Multiple Threads):

http://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
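As a rough, nominal-spec comparison (not a measurement): the GTX 980 exposes 2048 CUDA cores at roughly 1.1-1.2 GHz, versus 4 cores / 8 hardware threads at 4.5 GHz on the i7, so for regular, independent work of this kind the GPU has vastly more raw arithmetic throughput available, even after discounting for the simplicity of each individual GPU core.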

For these tests I will use two typical gaming rigs:

  • A gaming desktop PC, with a single desktop Nvidia GeForce GTX 980 GPU and a liquid-cooled, overclocked Intel i7 desktop CPU (4.5 GHz)

  • A gaming laptop PC, with a single mobile Nvidia GeForce GTX 980M GPU and a factory-overclocked Intel i7 mobile CPU (3.3 GHz)

  • For the CPU implementation, C++ is used on Windows 8.1 x64, compiled with full O3 optimizations.

  • For the GPU implementation, Nvidia's CUDA GPU programming language is used in conjunction with C++.

  • CUDA version 6.5 is used with the most recent graphics drivers.

  • The GTX 980 GPU will NOT be overclocked and will run at its default settings.

The test evaluation function:

In this deliberately designed test, the inputs are:

  • an initial floating-point starting number,

  • a target floating-point number,

  • an array of size N (13 in our basic test) comprised of N distinct floating-point numbers

  • The Big-O complexity of this algorithm is O(N!*N + K), where N is the number of elements and K is a constant factor related to the factorial decomposition process

A test function was created which is designed to ensure that only one permutation has an optimal value for a given set of inputs.

In this case each value of the array is applied to the starting number in this fashion:

current_value = current_value + Array[perm_index[i]] / (i + 1 + perm_index[i])

Again, this test function is somewhat contrived, but it is designed so that only one permutation maps to the optimal value for this specific input set.

  • The goal is to determine which permutation of the input array produces (from the test function) a final value with the smallest absolute difference from the target value.
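Written out directly in code, the recurrence above looks like this (a minimal sketch; the full measured CPU routine follows):

float evaluate_permutation(const float start, const float *vals, const int *perm, const int n){
    float cur = start;
    for(int i = n - 1; i >= 0; i--){               // same traversal order as the CPU code below
        cur += vals[perm[i]] / float((i + 1) + perm[i]);
    }
    return cur;                                    // the caller compares fabs(cur - target)
}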

So, in the end, the CPU-based permutation generation and evaluation routine in C++ looks like this:

CPU code

void check_perm_on_lovely_cpu(const int num_elem, const float num, const float target, float &dif, int &num_good, const float *nums, int *perm){
    int *Arr = (int*)malloc(num_elem*sizeof(int));
    for(int i = 0; i < num_elem; i++){
        Arr[i] = i;                                  // start from the identity permutation
    }
    float cur_num = num, cur_dif = 999999999999.9f;
    do{
        cur_num = num;
        for(int i = num_elem - 1; i >= 0; i--){      // apply the test function for this ordering
            cur_num = cur_num + nums[Arr[i]] / (float((i + 1) + Arr[i]));
        }
        cur_dif = fabs(cur_num - target);
        if(cur_dif < dif){                           // new best: remember the value and the permutation
            dif = cur_dif;
            num_good = 1;
            memcpy(perm, Arr, num_elem*sizeof(int));
        }else if(cur_dif == dif){
            num_good++;
        }
    }while(next_permutation(Arr, Arr + num_elem));   // lexicographic successor; false after the last ordering
    free(Arr);
}
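A minimal driver for the routine above might look like this (H_denoms and the start/target constants come from the full listing below; dif must begin at a large sentinel value):

int perm[13];
float dif = 99999999.9f;                           // sentinel, analogous to MAX_DIF_VAL in the full source
int num_good = 0;
check_perm_on_lovely_cpu(13, -7919.02f, 111.49317f, dif, num_good, H_denoms, perm);
// dif now holds the best |result - target|, and perm the winning permutation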

The CUDA GPU implementation (in general terms):

The proposed GPU approach:

  • Generate every 64-bit value from 0 to N!-1, each mapped to a distinct permutation, then convert that number to its unique 0-based array arrangement.

  • This process is referred to as 'factorial decomposition', and it is more effectively implemented on a GPU than the alternative approach used by next_permutation().

  • Each active thread examines a range of values and decomposes each into its corresponding permutation arrangement.

  • Because there are very few serial dependencies between threads, they operate mostly independently, which is required for an effective parallel implementation of an algorithm (see the sketch below).
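As a concrete illustration of factorial decomposition, here is a minimal host-side sketch of the unranking step (a simplified, unoptimized analogue of the _cpu_derive2() helper in the full source; the function name and the linear-scan removal are illustrative choices, not the optimized GPU code):

#include <cstdint>
#include <vector>

std::vector<int> unrank_permutation(uint64_t rank, int n){
    std::vector<uint64_t> fact(n, 1);              // fact[i] = i!
    for(int i = 1; i < n; i++) fact[i] = fact[i - 1] * uint64_t(i);
    std::vector<int> avail(n);                     // element indices not yet placed
    for(int i = 0; i < n; i++) avail[i] = i;
    std::vector<int> perm(n);
    for(int d = n - 1; d >= 0; d--){               // peel off factorial-base digits, high to low
        uint64_t c = rank / fact[d];               // choose the c-th smallest remaining element
        rank -= c * fact[d];
        perm[d] = avail[c];
        avail.erase(avail.begin() + c);            // remove it from the pool
    }
    return perm;
}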

Example case of 13 array elements:

Primary CUDA GPU kernel launch:

      • 47,508 thread blocks of 256 threads are launched in the first kernel, with each thread in a block generating and evaluating exactly 512 distinct permutations.

      • Each thread block uses __shared__ memory to collect the 'block-local' optimal value and the permutation responsible for that value.

      • Upon completion of the block's work, those values are copied to global memory for later examination.
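The arithmetic behind those launch numbers works out as a useful sanity check:

permutations per block  = 256 threads * 512 per thread = 131,072
full blocks             = floor(6,227,020,800 / 131,072) = 47,508
covered by first kernel = 47,508 * 131,072 = 6,226,968,576
left for cleanup kernel = 6,227,020,800 - 6,226,968,576 = 52,224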

Secondary 'cleanup' CUDA GPU kernel launch:

      • One 256-thread block will first evaluate any remaining permutations which were not covered by the first launch

      • Upon finishing the remainder of the work, the threads collect, compare and evaluate the per-block best results left in global memory by the previous launch

      • The best result value and its corresponding permutation are distilled within that thread block and copied back to global memory

      • Results are then copied back to host memory
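For readers unfamiliar with the reduction pattern inside these kernels, here is a compact sketch of the per-block minimum reduction (values only; the real kernels below additionally carry an int2 permutation payload alongside the value, and use the same pre-CUDA-9 __shfl() intrinsic as the source):

__device__ float block_min_reduce(float v){
    const int lane = threadIdx.x % 32;
    __shared__ float warp_best[8];                 // 256 threads / 32 lanes per warp
    for(int off = 16; off > 0; off >>= 1){         // wrap-around shuffle reduction within the warp
        float other = __shfl(v, lane + off);       // out-of-range source lanes wrap modulo the warp width
        v = fminf(v, other);
    }
    if(lane == 0) warp_best[threadIdx.x >> 5] = v;
    __syncthreads();
    if(threadIdx.x == 0){                          // thread 0 combines the eight per-warp results
        for(int w = 1; w < 8; w++) v = fminf(v, warp_best[w]);
    }
    return v;                                      // the final minimum is valid in thread 0 only
}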

See below for the project source code, but keep in mind that it is a 'proof of concept' and not intended for general public use.

UPDATE 3/28/2015: 2-GPU permutation generation/evaluation implementation (GTX 980 x2) for 14 array elements

UPDATE 4/1/2015: Increased both single- and multi-GPU performance by 25%. Below is the time for a dual GTX 980 run with the new implementation. 14! generation/evaluation is now theoretically possible in 11 seconds, and a single Titan X completes 13! in under 1.4 seconds; see the second chart down.

Will update the code soon; all listed times other than multi-GPU are out of date.

UPDATE 10/23/2015: Posted updated times for EVGA GTX Titan X in TCC mode and Dual GTX Titan X 14! times.

2-GPU solution (GTX Titan X x2) for generation and evaluation of all 14! permutations of the array

Will generate and evaluate all 87,178,291,200 permutations of a 14-element array

(14!*14+reduction constant factor)

Starting GPU testing 14!:

Multi-GPU implementation

GPU #0=GeForce GTX TITAN X

GPU #1=GeForce GTX TITAN X

GPU timing: 11.287 seconds.

ans0= 8776.32, permutation number 51789820077

ans1= 8738.38, permutation number 28318741677

GPU answer is 8738.38

Permutation as determined by OK CUDA implementation is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8604.89

Using idx # 8 ,input value= -1111.2, current working return value= -8657.8

Using idx # 1 ,input value= -333.145, current working return value= -8683.43

Using idx # 6 ,input value= -27.79, current working return value= -8685.07

Using idx # 12 ,input value= -42.0099, current working return value= -8686.98

Using idx # 11 ,input value= -1.57, current working return value= -8687.05

Using idx # 9 ,input value= 0.90003, current working return value= -8687

Using idx # 13 ,input value= 3.12354, current working return value= -8686.84

Using idx # 5 ,input value= 2.47, current working return value= -8686.62

Using idx # 10 ,input value= 10.1235, current working return value= -8685.95

Using idx # 7 ,input value= 8.888, current working return value= -8685.14

Using idx # 2 ,input value= 7.1119, current working return value= -8683.71

Using idx # 3 ,input value= 127.001, current working return value= -8658.31

Using idx # 0 ,input value= 31.4234, current working return value= -8626.89

Absolute difference(-8626.89-111.493)= 8738.38

I would be willing to share the new, improved multi-GPU implementation, but you will have to contact me directly.

Sample output runs of both CPU and GPU implementations with verification of results:

Results on Gaming Desktop PC with single GTX Titan X GPU and 4.5 GHz i7 CPU:

Output Desktop CPU vs GPU 13 Elements

The results will be validated by CPU std::next_permutation(), and the performance difference between CUDA and CPU implementations will be compared.

Running an overclocked 4.5 GHz CPU version via STL next_permutation:

4.5 Ghz i7 CPU timing 13!: 128.703 seconds.

CPU answer is: 8783.86, number of permutations which map to that optimal value= 1

Permutation as determined by std::next_permutation() is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Starting GPU testing:

Will evaluate 6227020800 permutations of array and return an optimal permutation and the optimal value associated with that permutation.

num_blx= 47508, adj_size= 1

Testing 13! version.

GPU timing: 1.237 seconds.

GPU answer is: 8783.86

Permutation as determined by OK CUDA implementation is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86

Results on Gaming Laptop PC with GTX 980M and 3.3 GHz i7:

Laptop CPU vs mobile GPU (old code)

Using single laptop GPU GeForce GTX 980M

NOTE: code optimized for single GTX 980 only!

Starting value= -7919.02 , goal value= 111.493, number of floating point values in array= 13

Objective: To minimize the absolute difference between a processed starting value and the target value.

The order in which the values are fed into the test function produces a set of distinct results dependent on order

In this test case there should only be one optimal value and one corresponding permutation of inputs which generate that value.

The results will be validated by CPU std::next_permutation(), and the performance difference between CUDA and CPU implementations will be compared.

3.3 Ghz notebook i7 CPU timing 13!: 167.324 seconds.

CPU answer is: 8783.86, number of permutations which map to that optimal value= 1

Permutation as determined by std::next_permutation() is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Starting GPU testing:

Will evaluate 6227020800 permutations of array and return an optimal permutation and the optimal value associated with that permutation.

num_blx= 47508, adj_size= 1

Testing 13! version.

mobile GPU timing 13!: 2.811 seconds.

GPU answer is: 8783.86

Permutation as determined by OK CUDA implementation is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86


If there is something faster than this, let me know, as I think that as of 4/7/2015 this is it.

/*

Permutation Generation and Evaluation Code written by Oleg J Konings with optimization assistance from Norbert Juffa

contact: okonings@tripleringtech.com

*/

#include <algorithm>

#include <iostream>

#include <utility>

#include <cstdlib>

#include <cstdio>

#include <cstring>

#include <string>

#include <cmath>

#include <vector>

#include <ctime>

#include <cuda.h>

#include <math_functions.h>

#include <vector_types.h>

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <Windows.h>

#include <MMSystem.h>

#pragma comment(lib, "winmm.lib")

using namespace std;

typedef long long ll;

#define all(c) (c).begin(),(c).end()

typedef pair<int, int> Pii;

typedef vector<int> Vi;

#define _DTH cudaMemcpyDeviceToHost

#define _DTD cudaMemcpyDeviceToDevice

#define _HTD cudaMemcpyHostToDevice

#define THREADS 256

#define MEGA 1307674368000LL

#define MAX_DIF_VAL 99999999.9f

#define NUM_ELEMENTS 13

#define NUM_ELEM NUM_ELEMENTS

#define DO_TEST 1

const int multi_gpu=0;

const int blockSize_13=65536<<1;

const int blockSize_14=65536<<1;

struct D_denoms_14_local{

float denoms[14];

};

inline int get_adj_size(const long long num_elem){

return 1;

}

inline int get_dynamic_block_size(const int adj_size, const int blkSize){ return (1 << (adj_size - 1))*blkSize; }

inline float host_get_F_val(float cur_num, float numer, float denom){ return cur_num + numer / denom; }

bool InitMMTimer(UINT &wTimerRes); //by reference, so the caller sees the timer period actually set

void DestroyMMTimer(UINT wTimerRes, bool init);

long long fact_val(long long cur){ return cur <= 1LL ? 1LL : (cur*fact_val(cur - 1LL)); }

void check_perm_on_lovely_cpu(const int num_elem, const float num, const float target, float &dif, int &num_good, const float *nums, int *perm);

void _cpu_derive2(const long long num, vector<int> &V, int digits);

//for testing full version and verification of answers on host

const long long H_F[16] = { 1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600LL, 6227020800LL, 87178291200LL, 1307674368000LL };

const float H_denoms[15] = { 31.4234f, -333.145f, 7.1119f, 127.001f, -12345.67f, 2.47f, -27.79f, 8.888f, -1111.199992f, 0.90003f, 10.123456f, -1.57f, -42.00988f, 3.12354f,-12.7f};

//GPU info for full version

//__constant__ long long D_F[16] = { 1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600LL, 6227020800LL, 87178291200LL, 1307674368000LL };

//__constant__ float D_denoms[15] = { 31.4234f, -333.145f, 7.1119f, 127.001f, -12345.67f, 2.47f, -27.79f, 8.888f, -1111.199992f, 0.90003f, 10.123456f, -1.57f, -42.00988f, 3.12354f,-12.7f };

__device__ __forceinline__ float device_get_F_val(float cur_num, float numer, float denom){ return cur_num + numer / denom; }

//CUDA kernel prototypes for first and second step of evaluating permutations of array and returning the an optimal permutation associated with the optimal value

template<int blockWork>

__global__ void _gpu_perm_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits);

__global__ void _gpu_perm_last_step_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox);

template<int blockWork>

__global__ void _gpu_perm_14_split(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits,

const long long off_multi);

template<int blockWork>

__global__ void _gpu_perm_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits);

__global__ void _gpu_perm_last_step_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox);

int main(){

int device_ordinal = 0; //index of the GPU to query

cudaDeviceProp deviceProp;

cudaError_t err = cudaGetDeviceProperties(&deviceProp, device_ordinal);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

const bool can_do_on_this_gpu = (deviceProp.major > 3) || (deviceProp.major == 3 && deviceProp.minor >= 5); //requires compute capability 3.5 or higher

string ss = can_do_on_this_gpu ? "Capable!\n" : "Not Sufficient compute capability!\n";

std::cout << ss;

if (!can_do_on_this_gpu)return 0;//

const int WDDM_Timeout_set = deviceProp.kernelExecTimeoutEnabled;

if (WDDM_Timeout_set != 0){

std::cout << "\nCurrent OS setting has kernel timeout limit! Cannot run application!\n";

return 0;

}

std::cout << "\nUsing single GPU " << deviceProp.name << '\n';

//std::cout << "\nNOTE: code optimized for single GTX 980 only!\n";

DWORD startTime = 0, endTime = 0, GPUtime = 0, CPUtime = 0;

UINT wTimerRes = 0;

bool init = false;

const int num_elements = NUM_ELEM;

const float start = -7919.02f;

const float target = 111.49317f;

std::cout << "\nStarting value= " << start << " , goal value= " << target << ", number of floating point values in array= " << num_elements << '\n';

std::cout << "Objective: To minimize the absolute difference between a processed starting value and the target value.\n";

std::cout << "The order in which the values are fed into the test function do produce a set of distinct results dependant on order\n";

std::cout << "In this test case there should only be one optimal value and one corresponding permutation of inputs which generate that value.\n";

std::cout << "The results will be validated by CPU std::next_permutation(), and the performance difference between CUDA and CPU implementations will be compared\n";

std::cout<<"NOTE: CPU version may take a long time to finish!\n";

int *result_perm = (int*)malloc(NUM_ELEM*sizeof(int));

memset(result_perm, 0, NUM_ELEM*sizeof(int));

float dif = MAX_DIF_VAL;

int num_good = 0;

float h_val_start = start;

/*

std::cout << "Running overclocked 4.5 GHz CPU version via STL next_permutation: \n";

wTimerRes = 0;

init = InitMMTimer(wTimerRes);

startTime = timeGetTime();

check_perm_on_lovely_cpu(num_elements, start, target, dif, num_good, H_denoms, result_perm);

endTime = timeGetTime();

CPUtime = endTime - startTime;

std::cout << "4.5 Ghz i7 CPU timing: " << double(CPUtime) / 1000. << " seconds.\n";

std::cout << "CPU answer is: " << dif << ", number of permutations which map to that optimal value= " << num_good << '\n';

std::cout << "\nPermutation as determined by stl::next_permutation() is as follows:\n";

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << result_perm[i] << " ,input value= " << H_denoms[result_perm[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[result_perm[i]], float((i + 1) + result_perm[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[result_perm[i]], float((i + 1) + result_perm[i]));

}

std::cout << '\n';

*/

if (DO_TEST){

std::cout << "\nStarting GPU testing:\n";

if (multi_gpu){//test with permutation evaluation, optimization,scan,reduction etc..

std::cout<<"\nMulti-GPU implementation\n";

int deviceCount;

err=cudaGetDeviceCount(&deviceCount);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

int device;

for(device=0;device<deviceCount;device++){

err= cudaGetDeviceProperties(&deviceProp,device);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout << "\nGPU #"<<device<<"=" << deviceProp.name << '\n';

}

const long long overall_result_space=H_F[NUM_ELEMENTS];

const long long result_space=43589145600LL;

const int adj_size = get_adj_size(result_space);

const int temp_blocks_sz = get_dynamic_block_size(adj_size, blockSize_14);

const int num_blx = int(result_space / long long(temp_blocks_sz));

std::cout << "\nnum_blx= " << num_blx << ", adj_size= " << adj_size << '\n';

const long long rem_start0 = (long long)num_blx*(long long)temp_blocks_sz; //first permutation index not covered by device 1's full blocks

const long long rem_start1 = result_space + rem_start0; //same boundary, offset into device 0's half of the range

std::cout<<"\nrem_start0= "<<rem_start0<<", rem_start1= "<<rem_start1<<'\n';

const unsigned int num_bytes_ans = num_blx*sizeof(float);

const unsigned int num_bytes_perm = num_blx*sizeof(int2);

D_denoms_14_local Test_eval;

for(int i=0;i<14;i++){

Test_eval.denoms[i]=H_denoms[i];

}

float GPUans0 = MAX_DIF_VAL,GPUans1 = MAX_DIF_VAL;

long long GPU_perm_number= 0LL;

int2 perm_mask_split0 = { 0 };

int2 perm_mask_split1 = { 0 };

//0

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

float *ans_val0;

int2 *perm_val0;

err = cudaMalloc((void **)&ans_val0, num_bytes_ans);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMalloc((void **)&perm_val0, num_bytes_perm);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemset(ans_val0, 0, num_blx*sizeof(int));

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//1

err=cudaSetDevice(1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

float *ans_val1;

int2 *perm_val1;

err = cudaMalloc((void **)&ans_val1, num_bytes_ans);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMalloc((void **)&perm_val1, num_bytes_perm);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemset(ans_val1, 0, num_blx*sizeof(int));

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//start with device 1

wTimerRes = 0;

init = InitMMTimer(wTimerRes);

startTime = timeGetTime();

_gpu_perm_14<blockSize_14><<<num_blx, THREADS >>>(ans_val1, perm_val1,Test_eval, start, target, NUM_ELEMENTS);

//0

//now more powerful device 0

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

_gpu_perm_14_split<blockSize_14><<<num_blx, THREADS >>>(ans_val0, perm_val0,Test_eval, start, target, NUM_ELEMENTS,result_space);

_gpu_perm_last_step_14<<< 1, THREADS >>>(ans_val0, perm_val0,Test_eval, start, target,overall_result_space, num_elements, rem_start1, num_blx);

err = cudaMemcpy(&GPUans0, ans_val0, sizeof(float), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&perm_mask_split0, perm_val0, sizeof(int2), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//1

//now that answers are on host go back to device 1

err=cudaSetDevice(1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

_gpu_perm_last_step_14<<< 1, THREADS >>>(ans_val1, perm_val1,Test_eval, start, target,result_space, num_elements, rem_start0, num_blx);

err = cudaMemcpy(&GPUans1, ans_val1, sizeof(float), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&perm_mask_split1, perm_val1, sizeof(int2), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

endTime = timeGetTime();

GPUtime = endTime - startTime;

std::cout << "GPU timing: " << double(GPUtime) / 1000.0 << " seconds.\n";

//still device #1, free memory

err=cudaFree(ans_val1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err=cudaFree(perm_val1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout<<"\nFree memory from device 1\n";

//0

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err=cudaFree(ans_val0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err=cudaFree(perm_val0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout<<"\nFree memory from device 0\n";

vector<int> V(NUM_ELEMENTS, 0);

std::cout<<"\nans0= "<<GPUans0<<", permutation number "<< *reinterpret_cast<long long *>(&perm_mask_split0)<<'\n';

std::cout<<"\nans1= "<<GPUans1<<", permutation number "<< *reinterpret_cast<long long *>(&perm_mask_split1)<<'\n';

if(GPUans0<GPUans1){

std::cout<<"\nGPU answer is "<<GPUans0<<'\n';

GPU_perm_number = *reinterpret_cast<long long *>(&perm_mask_split0);

_cpu_derive2(GPU_perm_number, V, NUM_ELEMENTS);

std::cout << "\nPermutation as determined by OK CUDA implementation is as follows:\n";

h_val_start = start;

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << V[i] << " ,input value= " << H_denoms[V[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i]));

}

std::cout << "\nAbsolute difference(" << h_val_start << "-" << target << ")= " << fabs(h_val_start - target) << '\n';

std::cout << '\n';

}else{

std::cout<<"\nGPU answer is "<<GPUans1<<'\n';

GPU_perm_number = *reinterpret_cast<long long *>(&perm_mask_split1);

_cpu_derive2(GPU_perm_number, V, NUM_ELEMENTS);

std::cout << "\nPermutation as determined by OK CUDA implementation is as follows:\n";

h_val_start = start;

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << V[i] << " ,input value= " << H_denoms[V[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i]));

}

std::cout << "\nAbsolute difference(" << h_val_start << "-" << target << ")= " << fabs(h_val_start - target) << '\n';

std::cout << '\n';

}

std::cout << "\nReseting Device 0!\n";

err = cudaDeviceReset();

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//1

err=cudaSetDevice(1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout << "\nReseting Device 1!\n";

err = cudaDeviceReset();

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//0, leave with 0 set

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

}else if(num_elements==13){

const long long result_space = H_F[NUM_ELEMENTS];

cout << "\nWill evaluate " << result_space << " permutations of array and return an optimal permutation and the optimal value associated with that permutation.\n";

const int adj_size = get_adj_size(result_space);

const int temp_blocks_sz = get_dynamic_block_size(adj_size, blockSize_13);

const int num_blx = int(result_space / long long(temp_blocks_sz));

std::cout << "\nnum_blx= " << num_blx << ", adj_size= " << adj_size << '\n';

const long long rem_start = result_space - (result_space - long long(num_blx)*long long(temp_blocks_sz));

D_denoms_14_local Test_eval;

for(int i=0;i<14;i++){

Test_eval.denoms[i]=H_denoms[i];

}

float GPUans = MAX_DIF_VAL;

long long GPU_perm_number = 0LL;

std::cout << "\nTesting 13! version.\n";

int2 perm_mask_split = { 0 };

float *ans_val;

int2 *perm_val;

const unsigned int num_bytes_ans = num_blx*sizeof(float);

const unsigned int num_bytes_perm = num_blx*sizeof(int2);

err = cudaMalloc((void **)&ans_val, num_bytes_ans);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMalloc((void **)&perm_val, num_bytes_perm);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemset(ans_val, 0, num_blx*sizeof(int));

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

wTimerRes = 0;

init = InitMMTimer(wTimerRes);

startTime = timeGetTime();


_gpu_perm_13<blockSize_13><<<num_blx, THREADS >>>(ans_val, perm_val,Test_eval, start, target, NUM_ELEMENTS);


err = cudaDeviceSynchronize(); //cudaThreadSynchronize() is deprecated; same behavior

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

_gpu_perm_last_step_13<<< 1, THREADS >>>(ans_val, perm_val,Test_eval, start, target, result_space, num_elements, rem_start, num_blx);

err = cudaDeviceSynchronize();

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&GPUans, ans_val, sizeof(float), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&perm_mask_split, perm_val, sizeof(int2), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

endTime = timeGetTime();

GPUtime = endTime - startTime;

std::cout << "GPU timing: " << double(GPUtime) / 1000.0 << " seconds.\n";

std::cout << "GPU answer is: " << GPUans << '\n';

GPU_perm_number = *reinterpret_cast<long long *>(&perm_mask_split);

DestroyMMTimer(wTimerRes, init);

err = cudaFree(ans_val);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaFree(perm_val);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

vector<int> V(NUM_ELEMENTS, 0);

_cpu_derive2(GPU_perm_number, V, NUM_ELEMENTS);

std::cout << "\nPermutation as determined by OK CUDA implementation is as follows:\n";

h_val_start = start;

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << V[i] << " ,input value= " << H_denoms[V[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i]));

}

std::cout << "\nAbsolute difference(" << h_val_start << "-" << target << ")= " << fabs(h_val_start - target) << '\n';

std::cout << '\n';

}else if(0){ //placeholder branch for other element counts (never taken)

}

}

free(result_perm);

return 0;

}

bool InitMMTimer(UINT &wTimerRes){ //by reference, so DestroyMMTimer() later receives the same period passed to timeBeginPeriod()

TIMECAPS tc;

if (timeGetDevCaps(&tc, sizeof(TIMECAPS)) != TIMERR_NOERROR) { return false; }

wTimerRes = min(max(tc.wPeriodMin, 1), tc.wPeriodMax);

timeBeginPeriod(wTimerRes);

return true;

}

void DestroyMMTimer(UINT wTimerRes, bool init){

if (init)

timeEndPeriod(wTimerRes);

}

void _cpu_derive2(const long long num, vector<int> &V, const int digits){

long long tnum = num, c;

int B[19] = { 0 };

for (int i = 0; i<digits; i++){ B[i] = i; }

for (int d = digits - 1; d >= 0; d--){

c = long long(d);

while (c*H_F[d]>tnum){ --c; }

if (d == digits - 1){

V[d] = B[int(c)];

B[int(c)] = -1;

}

else{

int cc = 0;

for (int ii = 0; ii<digits; ii++){

if (B[ii] != -1){

if (cc == int(c)){

V[d] = B[ii];

B[ii] = -1;

break;

}

else{

cc++;

}

}

}

}

tnum -= c*H_F[d];

}

}

void check_perm_on_lovely_cpu(const int num_elem, const float num, const float target, float &dif, int &num_good, const float *nums, int *perm){

int *Arr = (int*)malloc(num_elem*sizeof(int));

for (int i = 0; i<num_elem; i++){

Arr[i] = i;

}

float cur_num = num, cur_dif = 999999999999.9f;

do{

cur_num = num;

for (int i = num_elem - 1; i >= 0; i--){

cur_num = cur_num + nums[Arr[i]] / (float((i + 1) + Arr[i]));

}

cur_dif = fabs(cur_num - target);

if (cur_dif<dif){

dif = cur_dif;

num_good = 1;

memcpy(perm, Arr, num_elem*sizeof(int));

}

else if (cur_dif == dif){

num_good++;

}

} while (next_permutation(Arr, Arr + num_elem));

free(Arr);

}

///////////////////

template<int blockWork>

__global__ void _gpu_perm_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits){

const long long offset = long long(threadIdx.x) + long long(blockIdx.x)*long long(blockWork);

const int reps = blockWork >> 8;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

__shared__ float D_denoms[15];

if(threadIdx.x<14){

D_denoms[threadIdx.x]=Test_vals.denoms[threadIdx.x];

}

__syncthreads();

int ii, idx, c;

float value = MAX_DIF_VAL, cur_val;

unsigned int B, cc;

int2 mask_as_int2, t2;

long long tnum;

for (ii = 0; ii<reps; ii++){

tnum = offset + long long(ii*THREADS);

cur_val = initial_val;

B = 0;

//13

c=int(tnum/6227020800LL);

cur_val = device_get_F_val(cur_val, D_denoms[c], (14.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*6227020800LL;

//12

t2.y=c=int(tnum/479001600LL);

cc = ~B;

while(c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (13.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,D_denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now we have the value associated with the current permutation; check whether it is the best this thread has seen so far, and cache the value with the permutation

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + long long(ii*THREADS);

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

}//end main loop

//reduce

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[blockIdx.x] = cur_val;

perm_val[blockIdx.x] = t2;

}

}

__global__ void _gpu_perm_last_step_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox){

const long long offset = long long(threadIdx.x) + rem_start;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

unsigned int B, cc;

int ii = 1, idx, c;

float value = MAX_DIF_VAL, cur_val;

int2 mask_as_int2, t2;

long long tnum, adj = 0LL;

for (; (offset + adj)<bound; ii++){

tnum = offset + adj;

cur_val = initial_val;

B = 0;

//13

c=int(tnum/6227020800LL);

cur_val = device_get_F_val(cur_val, Test_vals.denoms[c], (14.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*6227020800LL;

//12

t2.y=c=int(tnum/479001600LL);

cc = ~B;

while(c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (13.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,Test_vals.denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now we have the value associated with the current permutation; check whether it is the best this thread has seen so far, and cache the value with the permutation

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + adj;

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

adj = (long long(ii) << 8LL);

}//end main loop

//go through the other blocks' optimal results, compare with the current best, and cache the best value and the permutation associated with that value

adj = 0LL;

for (ii = 1; (threadIdx.x + int(adj))<num_blox; ii++){

idx = (threadIdx.x + int(adj));

if (ans_val[idx]<value){

value = ans_val[idx];

mask_as_int2 = perm_val[idx];

}

adj = (long long(ii) << 8LL);

}

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[0] = cur_val;

perm_val[0] = t2;

}

}

template<int blockWork>

__global__ void _gpu_perm_14_split(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits,

const long long off_multi){

const long long offset = off_multi+(long long(threadIdx.x) + long long(blockIdx.x)*long long(blockWork));

const int reps = blockWork >> 8;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

__shared__ float D_denoms[15];

if(threadIdx.x<14){

D_denoms[threadIdx.x]=Test_vals.denoms[threadIdx.x];

}

__syncthreads();

int ii, idx, c;

float value = MAX_DIF_VAL, cur_val;

unsigned int B, cc;

int2 mask_as_int2, t2;

long long tnum;

for (ii = 0; ii<reps; ii++){

tnum = offset + long long(ii*THREADS);

cur_val = initial_val;

B = 0;

//13

c=int(tnum/6227020800LL);

cur_val = device_get_F_val(cur_val, D_denoms[c], (14.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*6227020800LL;

//12

t2.y=c=int(tnum/479001600LL);

cc = ~B;

while(c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (13.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,D_denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now we have the value associated with the current permutation; check whether it is the best this thread has seen so far, and cache the value with the permutation

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + long long(ii*THREADS);

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

}//end main loop

//reduce

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[blockIdx.x] = cur_val;

perm_val[blockIdx.x] = t2;

}

}

template<int blockWork>

__global__ void _gpu_perm_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits){

const long long offset = long long(threadIdx.x) + long long(blockIdx.x)*long long(blockWork);

const int reps = blockWork >> 8;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

__shared__ float D_denoms[15];

if(threadIdx.x<14){

D_denoms[threadIdx.x]=Test_vals.denoms[threadIdx.x];

}

__syncthreads();

int ii, idx, c;

float value = MAX_DIF_VAL, cur_val;

unsigned int B, cc;

int2 mask_as_int2, t2;

long long tnum;

for (ii = 0; ii<reps; ii++){

tnum = offset + long long(ii*THREADS);

cur_val = initial_val;

B = 0;

//12

c=int(tnum/479001600LL);

cur_val = device_get_F_val(cur_val, D_denoms[c], (13.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,D_denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now have the value for the current permutation; check whether it is the best this thread has seen so far, and if so cache the value together with the permutation rank

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + (long long)(ii*THREADS);
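// pack the 64-bit rank into an int2 so it can travel through the 32-bit
// warp shuffles used in the reduction below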

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

}//end main loop

//warp-level reduction: each __shfl step pulls a candidate from a lane 16, 8, 4, 2, then 1 away, halving the field each time so lane 0 ends up with the warp minimum

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}
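// lane 0 of each warp now holds that warp's minimum; the first lane of each
// of the 8 warps publishes it to shared memory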

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[blockIdx.x] = cur_val;

perm_val[blockIdx.x] = t2;

}

}

__global__ void _gpu_perm_last_step_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox){
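// Single-block cleanup kernel: its 256 threads sweep the leftover ranks
// [rem_start, bound) that did not fill a whole block's worth of work, then
// fold in the per-block bests already written by _gpu_perm_13 and leave the
// final answer in ans_val[0] / perm_val[0].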

const long long offset = (long long)threadIdx.x + rem_start;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

unsigned int B, cc;

int ii = 1, idx, c;

float value = MAX_DIF_VAL, cur_val;

int2 mask_as_int2, t2;

long long tnum, adj = 0LL;

for (; (offset + adj)<bound; ii++){

tnum = offset + adj;

cur_val = initial_val;

B = 0;

//12

c=int(tnum/479001600LL);

cur_val = device_get_F_val(cur_val, Test_vals.denoms[c], (13.0f + float(c)));

B |= (1 << c);

tnum -= (long long)c*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= (long long)t2.y*39916800LL;

//the remaining rank now fits in 32 bits, so the rest of the decomposition uses int arithmetic

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,Test_vals.denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now have the value for the current permutation; check whether it is the best this thread has seen so far, and if so cache the value together with the permutation rank

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + adj;

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

adj = ((long long)ii << 8);

}//end main loop

//go through the per-block optimal results, compare to the current best, and cache the best value and the permutation associated with that value

adj = 0LL;

for (ii = 1; (threadIdx.x + int(adj))<num_blox; ii++){

idx = (threadIdx.x + int(adj));

if (ans_val[idx]<value){

value = ans_val[idx];

mask_as_int2 = perm_val[idx];

}

adj = ((long long)ii << 8);

}
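// warp-level shuffle reduction over the surviving candidates, identical to
// the reduction at the end of _gpu_perm_13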

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[0] = cur_val;

perm_val[0] = t2;

}

}

Other work done on this topic:

Shao Voon Wong also wrote a CUDA project which only generates the permutations, with no built-in evaluation step.

In general we both used the factorial decomposition method, but with very different implementations of that method.
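For readers unfamiliar with the method, here is a minimal host-side sketch (my illustration, not code taken from either project) that converts a permutation rank in [0, N!) into the corresponding permutation of 0..N-1. The kernels above perform the same decomposition, but keep the pool of unused elements as a bitmask and walk it with __ffs:

#include <cstdint>
#include <vector>

std::vector<int> rank_to_permutation(std::uint64_t rank, int n){
    std::vector<std::uint64_t> fact(n, 1);               // fact[i] = i!
    for (int i = 1; i < n; ++i) fact[i] = fact[i-1]*std::uint64_t(i);
    std::vector<int> pool;                               // still-unused elements
    for (int i = 0; i < n; ++i) pool.push_back(i);
    std::vector<int> perm;
    for (int i = n - 1; i >= 0; --i){
        int c = int(rank/fact[i]);                       // factorial digit i
        rank -= std::uint64_t(c)*fact[i];
        perm.push_back(pool[c]);                         // take the c-th unused element
        pool.erase(pool.begin() + c);
    }
    return perm;
}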

Comparing my current implementation to his there are these significant differences:

    • The OK version has been tested up to 16 elements (N = 16), while Shao Voon Wong's version was not able to handle N > 12 without major modifications

    • The OK version uses very little global memory and runs at 100% compute occupancy on the GPU

    • Shao Voon Wong's version only generates permutations, with no evaluation step; ultimately, the only reason to generate permutations is to evaluate them

    • The OK version, even with the additional N evaluation steps per permutation, is several times faster due to CUDA code optimizations

FAQ:

For 'brute force' problems, how much of a performance difference is there between a high-end CPU and a GPU?

For this specific problem the difference is only ~100x in running time when comparing a single 1.1 GHz GPU to a single 4.5 GHz CPU core.

Some other combinatorial problems can have as much as a 400x difference, such as this older project of mine: Optimal Triangle

What about an OpenCL version?

Feel free to write one, as this general approach should work in OpenCL. Please let me know your running times on equivalent data sets for comparison.

Your CPU implementation is only single-core, how would a multi-core implementation perform?

In my experience this is not quite as easy to implement as you might think, and there are additional complications which would not necessarily result in an implementation that cuts the running time linearly per thread.

I did massively overclock the CPU to 4.5 GHz, which is not typical. Even if you were able to get 4 threads running at 4.5 GHz on this problem, the GPU implementation would still outperform it by a factor of over 20x.

But by all means try to make one and let me know the results!
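If you do, one plausible way to split the work is to seed each thread's starting permutation by rank (using something like the rank_to_permutation sketch above) and then advance with std::next_permutation. A rough sketch, assuming the evaluation function is thread-safe:

#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

void search_slice(std::uint64_t first, std::uint64_t count, int n){
    std::vector<int> a = rank_to_permutation(first, n);  // this thread's starting point
    for (std::uint64_t i = 0; i < count; ++i){
        // ...evaluate a[] here and track this thread's best result...
        std::next_permutation(a.begin(), a.end());
    }
}

void search_all(std::uint64_t total, int n, unsigned T){
    std::vector<std::thread> workers;
    const std::uint64_t per = total/T;
    for (unsigned t = 0; t < T; ++t)                     // last thread takes the remainder
        workers.emplace_back(search_slice, t*per, (t + 1 == T) ? total - t*per : per, n);
    for (auto& w : workers) w.join();
}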

Which GPUs will successfully run the code you posted?

In order of performance: GTX Titan X, GTX 980ti, GTX 980, Quadro M6000, GTX 780ti, GTX Titan Black, Tesla K40, Quadro K6000, GTX 980m, GTX Titan, Tesla K20x

Could you split this problem between different GPUs to decrease the running time and increase 'N'?

Yes, this particular problem can be split across multiple GPUs with a 'merge' phase at the end. Stay tuned for an example using 4 GPUs to evaluate all 18! permutations.
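As a rough outline (hypothetical host code; the kernel would need an extra base-rank parameter, and streams/error checks are omitted):

// partition the N! ranks evenly across the available devices
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
const long long total = 6227020800LL;                    // 13! as an example
const long long per_dev = (total + num_devices - 1)/num_devices;
for (int d = 0; d < num_devices; ++d){
    cudaSetDevice(d);
    // launch a variant of _gpu_perm_13 over ranks [d*per_dev, min((d+1)*per_dev, total))
    // with the same blockWork/THREADS configuration as the single-GPU case
}
// finally copy each device's best (difference, rank) pair back to the host
// and keep the overall minimum -- the 'merge' phase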

Why bother with GPUs when Quantum computers will be available to solve such problems?

Keep in mind the GPU I used costs $500 USD and is 'plug-and-play'. While quantum computers may be able to brute-force such problems more quickly, their cost and scarcity put them out of reach for most people.

Also, GPUs are more versatile: they can be used for fast sorting, FFTs, image processing, linear algebra subroutines, N-body and Monte Carlo simulations, and VIDEO GAMES!

More similar problems:

Optimal Polygon

Magic Square


Other brute force with ranking


NEXT TIME: 4 GTX 980 for 18! permutation evaluations

