Generating and Evaluating all Permutations of an Array using CUDA

Code written and copyright by Oleg Konings 2014.

The author would like to thank Norbert Juffa for his assistance during the code optimization process.

"Sometimes you need to examine every possibility..."

The Problem:

How to evaluate all N! permutations of an array against a test function in the least amount of time.

Why this matters:

While for many brute-force/exhaustive-search problems there exist better (quicker) solutions than examining each possibility, that is not always the case.

In situations where there is no other approach, the implementation of the algorithm and the hardware that executes it are the determining factors in the overall running time.

Possible 'real world' applications:

  • To determine the optimal configuration of items whose functionality and utility depend on their order and position (or geometric placement) relative to other items

  • To find the most pleasing sequence of musical notes

  • To determine how many valid words can be made from a set of distinct letters

  • The subset of directed acyclic graph problems which do not have pseudo-polynomial solutions

What existing library functions can evaluate all permutations of an array?:

The well-respected C++ STL offers the next_permutation() function, which can be used to generate all permutations of an array on a CPU.

http://www.cplusplus.com/reference/algorithm/next_permutation/
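As a quick illustration, here is the standard enumeration idiom (a minimal sketch, separate from the measured routine shown later):

#include <algorithm>

void enumerate_all_orderings(){
    int a[3] = {0, 1, 2};                          // must start sorted to visit all 3! = 6 orderings
    do{
        // ... evaluate the current ordering in a[] here ...
    }while(std::next_permutation(a, a + 3));       // returns false after the final (descending) ordering
}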

What are the limitations of the std::next_permutation() function?:

Even on an overclocked CPU, generating and evaluating all arrangements of 13 array elements (13! = 6,227,020,800 arrangements) can take over 2 minutes, and once you get past 15 elements a CPU implementation will take a very long time (days) to complete.
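For a sense of the growth involved (these values also appear in the H_F factorial table in the source below):

13! = 6,227,020,800
14! = 87,178,291,200
15! = 1,307,674,368,000

Each added element multiplies the search space, and therefore the running time, by roughly the new element count.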

Can GPUs solve this problem more quickly? If so how?:

GPUs can have thousands of threads concurrently examine different sections of the problem space at the same time, rather than generating and evaluating permutations in serial fashion like next_permutation().

Even though the clock speed of a GPU is 1/3 to 1/4 that of a high-end CPU, GPUs make up for that through SIMT (Single Instruction, Multiple Threads):

http://en.wikipedia.org/wiki/Single_instruction,_multiple_threads
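As a rough, nominal-spec comparison (not a measurement): the GTX 980 exposes 2048 CUDA cores at roughly 1.1-1.2 GHz, versus 4 cores / 8 hardware threads at 4.5 GHz on the i7, so for regular, independent work of this kind the GPU has vastly more raw arithmetic throughput available, even after discounting for the simplicity of each individual GPU core.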

For these tests I will use two typical gaming rigs:

  • A gaming desktop PC, with a single desktop Nvidia GeForce GTX 980 GPU and a liquid-cooled, overclocked Intel i7 desktop CPU (4.5 GHz)

  • A gaming laptop PC, with a single mobile Nvidia GeForce GTX 980M GPU and a factory-overclocked Intel i7 mobile CPU (3.3 GHz)

  • For the CPU implementation, C++ is used on Windows 8.1 x64, compiled with full O3 optimizations.

  • For the GPU implementation, Nvidia's CUDA GPU programming language is used in conjunction with C++.

  • CUDA version 6.5 is used with the most recent graphics drivers.

  • The GTX 980 GPU will NOT be overclocked and will run at its default settings.

The test evaluation function:

In this deliberately designed test, the inputs are:

  • an initial floating-point starting number,

  • a target floating-point number,

  • an array of size N (13 in our basic test) comprised of N distinct floating-point numbers

  • The Big-O complexity of this algorithm is O(N!*N + K), where N is the number of elements and K is a constant factor related to the factorial decomposition process

A test function was created which is designed to ensure that only one permutation has an optimal value for a given set of inputs.

In this case each value of the array is applied to the starting number in this fashion:

current_value = current_value + Array[perm_index[i]] / (i + 1 + perm_index[i])

Again, this test function is somewhat contrived, but it is designed so that only one permutation maps to the optimal value for this specific input set.

  • The goal is to determine which permutation of the input array produces (from the test function) a final value with the smallest absolute difference from the target value.
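Written out directly in code, the recurrence above looks like this (a minimal sketch; the full measured CPU routine follows):

float evaluate_permutation(const float start, const float *vals, const int *perm, const int n){
    float cur = start;
    for(int i = n - 1; i >= 0; i--){               // same traversal order as the CPU code below
        cur += vals[perm[i]] / float((i + 1) + perm[i]);
    }
    return cur;                                    // the caller compares fabs(cur - target)
}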

So, in the end, the CPU-based permutation generation and evaluation routine in C++ looks like this:

CPU code

void check_perm_on_lovely_cpu(const int num_elem, const float num, const float target, float &dif, int &num_good, const float *nums, int *perm){
    int *Arr = (int*)malloc(num_elem*sizeof(int));
    for(int i = 0; i < num_elem; i++){
        Arr[i] = i;                                  // start from the identity permutation
    }
    float cur_num = num, cur_dif = 999999999999.9f;
    do{
        cur_num = num;
        for(int i = num_elem - 1; i >= 0; i--){      // apply the test function for this ordering
            cur_num = cur_num + nums[Arr[i]] / (float((i + 1) + Arr[i]));
        }
        cur_dif = fabs(cur_num - target);
        if(cur_dif < dif){                           // new best: remember the value and the permutation
            dif = cur_dif;
            num_good = 1;
            memcpy(perm, Arr, num_elem*sizeof(int));
        }else if(cur_dif == dif){
            num_good++;
        }
    }while(next_permutation(Arr, Arr + num_elem));   // lexicographic successor; false after the last ordering
    free(Arr);
}
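A minimal driver for the routine above might look like this (H_denoms and the start/target constants come from the full listing below; dif must begin at a large sentinel value):

int perm[13];
float dif = 99999999.9f;                           // sentinel, analogous to MAX_DIF_VAL in the full source
int num_good = 0;
check_perm_on_lovely_cpu(13, -7919.02f, 111.49317f, dif, num_good, H_denoms, perm);
// dif now holds the best |result - target|, and perm the winning permutation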

The CUDA GPU implementation (in general terms):

The proposed GPU approach:

  • Generate every 64-bit value from 0 to N!-1, each mapped to a distinct permutation, then convert that number to its unique 0-based array arrangement.

  • This process is referred to as 'factorial decomposition', and it is more effectively implemented on a GPU than the alternative approach used by next_permutation().

  • Each active thread examines a range of values and decomposes each into its corresponding permutation arrangement.

  • Because there are very few serial dependencies between threads, they operate mostly independently, which is required for an effective parallel implementation of an algorithm (see the sketch below).
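As a concrete illustration of factorial decomposition, here is a minimal host-side sketch of the unranking step (a simplified, unoptimized analogue of the _cpu_derive2() helper in the full source; the function name and the linear-scan removal are illustrative choices, not the optimized GPU code):

#include <cstdint>
#include <vector>

std::vector<int> unrank_permutation(uint64_t rank, int n){
    std::vector<uint64_t> fact(n, 1);              // fact[i] = i!
    for(int i = 1; i < n; i++) fact[i] = fact[i - 1] * uint64_t(i);
    std::vector<int> avail(n);                     // element indices not yet placed
    for(int i = 0; i < n; i++) avail[i] = i;
    std::vector<int> perm(n);
    for(int d = n - 1; d >= 0; d--){               // peel off factorial-base digits, high to low
        uint64_t c = rank / fact[d];               // choose the c-th smallest remaining element
        rank -= c * fact[d];
        perm[d] = avail[c];
        avail.erase(avail.begin() + c);            // remove it from the pool
    }
    return perm;
}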

Example case of 13 array elements:

Primary CUDA GPU kernel launch:

      • 47,508 thread blocks of 256 threads are launched in the first kernel, with each thread in a block generating and evaluating exactly 512 distinct permutations.

      • Each thread block uses __shared__ memory to collect the 'block-local' optimal value and the permutation responsible for that value.

      • Upon completion of the block's work, those values are copied to global memory for later examination.
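The arithmetic behind those launch numbers works out as a useful sanity check:

permutations per block  = 256 threads * 512 per thread = 131,072
full blocks             = floor(6,227,020,800 / 131,072) = 47,508
covered by first kernel = 47,508 * 131,072 = 6,226,968,576
left for cleanup kernel = 6,227,020,800 - 6,226,968,576 = 52,224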

Secondary 'cleanup' CUDA GPU kernel launch:

      • One 256-thread block will first evaluate any remaining permutations which were not covered by the first launch

      • Upon finishing the remainder of the work, the threads collect, compare and evaluate the per-block best results left in global memory by the previous launch

      • The best result value and its corresponding permutation are distilled within that thread block and copied back to global memory

      • Results are then copied back to host memory
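For readers unfamiliar with the reduction pattern inside these kernels, here is a compact sketch of the per-block minimum reduction (values only; the real kernels below additionally carry an int2 permutation payload alongside the value, and use the same pre-CUDA-9 __shfl() intrinsic as the source):

__device__ float block_min_reduce(float v){
    const int lane = threadIdx.x % 32;
    __shared__ float warp_best[8];                 // 256 threads / 32 lanes per warp
    for(int off = 16; off > 0; off >>= 1){         // wrap-around shuffle reduction within the warp
        float other = __shfl(v, lane + off);       // out-of-range source lanes wrap modulo the warp width
        v = fminf(v, other);
    }
    if(lane == 0) warp_best[threadIdx.x >> 5] = v;
    __syncthreads();
    if(threadIdx.x == 0){                          // thread 0 combines the eight per-warp results
        for(int w = 1; w < 8; w++) v = fminf(v, warp_best[w]);
    }
    return v;                                      // the final minimum is valid in thread 0 only
}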

See below for the project source code, but keep in mind that it is a 'proof of concept' and not intended for general public use.

UPDATE 3/28/2015: 2-GPU permutation generation/evaluation implementation (GTX 980 x2) for 14 array elements

UPDATE 4/1/2015: Increased both single- and multi-GPU performance by 25%. Below is the time for a dual GTX 980 run with the new implementation. 14! generation/evaluation is now theoretically possible in 11 seconds, and a single Titan X completes 13! in under 1.4 seconds; see the second chart down.

Will update the code soon; all listed times other than multi-GPU are out of date.

UPDATE 10/23/2015: Posted updated times for EVGA GTX Titan X in TCC mode and Dual GTX Titan X 14! times.

2-GPU solution (GTX Titan X x2) for generation and evaluation of all 14! permutations of the array

Will generate and evaluate all 87,178,291,200 permutations of a 14-element array

(14!*14+reduction constant factor)

Starting GPU testing 14!:

Multi-GPU implementation

GPU #0=GeForce GTX TITAN X

GPU #1=GeForce GTX TITAN X

GPU timing: 11.287 seconds.

ans0= 8776.32, permutation number 51789820077

ans1= 8738.38, permutation number 28318741677

GPU answer is 8738.38

Permutation as determined by OK CUDA implementation is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8604.89

Using idx # 8 ,input value= -1111.2, current working return value= -8657.8

Using idx # 1 ,input value= -333.145, current working return value= -8683.43

Using idx # 6 ,input value= -27.79, current working return value= -8685.07

Using idx # 12 ,input value= -42.0099, current working return value= -8686.98

Using idx # 11 ,input value= -1.57, current working return value= -8687.05

Using idx # 9 ,input value= 0.90003, current working return value= -8687

Using idx # 13 ,input value= 3.12354, current working return value= -8686.84

Using idx # 5 ,input value= 2.47, current working return value= -8686.62

Using idx # 10 ,input value= 10.1235, current working return value= -8685.95

Using idx # 7 ,input value= 8.888, current working return value= -8685.14

Using idx # 2 ,input value= 7.1119, current working return value= -8683.71

Using idx # 3 ,input value= 127.001, current working return value= -8658.31

Using idx # 0 ,input value= 31.4234, current working return value= -8626.89

Absolute difference(-8626.89-111.493)= 8738.38

I would be willing to share the new, improved multi-GPU implementation, but you will have to contact me directly.

Sample output runs of both CPU and GPU implementations with verification of results:

Results on Gaming Desktop PC with single GTX Titan X GPU and 4.5 GHz i7 CPU:

Output Desktop CPU vs GPU 13 Elements

The results will be validated by CPU std::next_permutation(), and the performance difference between CUDA and CPU implementations will be compared.

Running an overclocked 4.5 GHz CPU version via STL next_permutation:

4.5 Ghz i7 CPU timing 13!: 128.703 seconds.

CPU answer is: 8783.86, number of permutations which map to that optimal value= 1

Permutation as determined by std::next_permutation() is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Starting GPU testing:

Will evaluate 6227020800 permutations of array and return an optimal permutation and the optimal value associated with that permutation.

num_blx= 47508, adj_size= 1

Testing 13! version.

GPU timing: 1.237 seconds.

GPU answer is: 8783.86

Permutation as determined by OK CUDA implementation is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86

Results on Gaming Laptop PC with GTX 980M and 3.3 GHz i7:

Laptop CPU vs mobile GPU (old code)

Using single laptop GPU GeForce GTX 980M

NOTE: code optimized for single GTX 980 only!

Starting value= -7919.02 , goal value= 111.493, number of floating point values in array= 13

Objective: To minimize the absolute difference between a processed starting value and the target value.

The order in which the values are fed into the test function produces a set of distinct results dependent on order

In this test case there should only be one optimal value and one corresponding permutation of inputs which generate that value.

The results will be validated by CPU std::next_permutation(), and the performance difference between CUDA and CPU implementations will be compared.

3.3 Ghz notebook i7 CPU timing 13!: 167.324 seconds.

CPU answer is: 8783.86, number of permutations which map to that optimal value= 1

Permutation as determined by std::next_permutation() is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Starting GPU testing:

Will evaluate 6227020800 permutations of array and return an optimal permutation and the optimal value associated with that permutation.

num_blx= 47508, adj_size= 1

Testing 13! version.

mobile GPU timing 13!: 2.811 seconds.

GPU answer is: 8783.86

Permutation as determined by OK CUDA implementation is as follows:

Start value= -7919.02

Using idx # 4 ,input value= -12345.7, current working return value= -8645.24

Using idx # 8 ,input value= -1111.2, current working return value= -8700.8

Using idx # 1 ,input value= -333.145, current working return value= -8728.56

Using idx # 6 ,input value= -27.79, current working return value= -8730.29

Using idx # 12 ,input value= -42.0099, current working return value= -8732.29

Using idx # 11 ,input value= -1.57, current working return value= -8732.38

Using idx # 9 ,input value= 0.90003, current working return value= -8732.32

Using idx # 5 ,input value= 2.47, current working return value= -8732.1

Using idx # 10 ,input value= 10.1235, current working return value= -8731.42

Using idx # 7 ,input value= 8.888, current working return value= -8730.61

Using idx # 2 ,input value= 7.1119, current working return value= -8729.19

Using idx # 3 ,input value= 127.001, current working return value= -8703.79

Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86


If there is something faster than this, let me know, as I think that as of 4/7/2015 this is it.

/*

Permutation Generation and Evaluation Code written by Oleg J Konings with optimization assistance from Norbert Juffa

contact: okonings@tripleringtech.com

*/

#include <algorithm>

#include <iostream>

#include <utility>

#include <cstdlib>

#include <cstdio>

#include <cstring>

#include <string>

#include <cmath>

#include <vector>

#include <ctime>

#include <cuda.h>

#include <math_functions.h>

#include <vector_types.h>

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <Windows.h>

#include <MMSystem.h>

#pragma comment(lib, "winmm.lib")

using namespace std;

typedef long long ll;

#define all(c) (c).begin(),(c).end()

typedef pair<int, int> Pii;

typedef vector<int> Vi;

#define _DTH cudaMemcpyDeviceToHost

#define _DTD cudaMemcpyDeviceToDevice

#define _HTD cudaMemcpyHostToDevice

#define THREADS 256

#define MEGA 1307674368000LL

#define MAX_DIF_VAL 99999999.9f

#define NUM_ELEMENTS 13

#define NUM_ELEM NUM_ELEMENTS

#define DO_TEST 1

const int multi_gpu=0;

const int blockSize_13=65536<<1;

const int blockSize_14=65536<<1;

struct D_denoms_14_local{

float denoms[14];

};

inline int get_adj_size(const long long num_elem){

return 1;

}

inline int get_dynamic_block_size(const int adj_size, const int blkSize){ return (1 << (adj_size - 1))*blkSize; }

inline float host_get_F_val(float cur_num, float numer, float denom){ return cur_num + numer / denom; }

bool InitMMTimer(UINT &wTimerRes); //by reference, so the caller sees the timer period actually set

void DestroyMMTimer(UINT wTimerRes, bool init);

long long fact_val(long long cur){ return cur <= 1LL ? 1LL : (cur*fact_val(cur - 1LL)); }

void check_perm_on_lovely_cpu(const int num_elem, const float num, const float target, float &dif, int &num_good, const float *nums, int *perm);

void _cpu_derive2(const long long num, vector<int> &V, int digits);

//for testing full version and verification of answers on host

const long long H_F[16] = { 1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600LL, 6227020800LL, 87178291200LL, 1307674368000LL };

const float H_denoms[15] = { 31.4234f, -333.145f, 7.1119f, 127.001f, -12345.67f, 2.47f, -27.79f, 8.888f, -1111.199992f, 0.90003f, 10.123456f, -1.57f, -42.00988f, 3.12354f,-12.7f};

//GPU info for full version

//__constant__ long long D_F[16] = { 1, 1, 2, 6, 24, 120, 720, 5040, 40320, 362880, 3628800, 39916800, 479001600LL, 6227020800LL, 87178291200LL, 1307674368000LL };

//__constant__ float D_denoms[15] = { 31.4234f, -333.145f, 7.1119f, 127.001f, -12345.67f, 2.47f, -27.79f, 8.888f, -1111.199992f, 0.90003f, 10.123456f, -1.57f, -42.00988f, 3.12354f,-12.7f };

__device__ __forceinline__ float device_get_F_val(float cur_num, float numer, float denom){ return cur_num + numer / denom; }

//CUDA kernel prototypes for first and second step of evaluating permutations of array and returning the an optimal permutation associated with the optimal value

template<int blockWork>

__global__ void _gpu_perm_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits);

__global__ void _gpu_perm_last_step_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox);

template<int blockWork>

__global__ void _gpu_perm_14_split(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits,

const long long off_multi);

template<int blockWork>

__global__ void _gpu_perm_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits);

__global__ void _gpu_perm_last_step_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox);

int main(){

int device_ordinal = 0; //index of the GPU to query

cudaDeviceProp deviceProp;

cudaError_t err = cudaGetDeviceProperties(&deviceProp, device_ordinal);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

const bool can_do_on_this_gpu = (deviceProp.major > 3) || (deviceProp.major == 3 && deviceProp.minor >= 5); //requires compute capability 3.5 or higher

string ss = can_do_on_this_gpu ? "Capable!\n" : "Not Sufficient compute capability!\n";

std::cout << ss;

if (!can_do_on_this_gpu)return 0;//

const int WDDM_Timeout_set = deviceProp.kernelExecTimeoutEnabled;

if (WDDM_Timeout_set != 0){

std::cout << "\nCurrent OS setting has kernel timeout limit! Cannot run application!\n";

return 0;

}

std::cout << "\nUsing single GPU " << deviceProp.name << '\n';

//std::cout << "\nNOTE: code optimized for single GTX 980 only!\n";

DWORD startTime = 0, endTime = 0, GPUtime = 0, CPUtime = 0;

UINT wTimerRes = 0;

bool init = false;

const int num_elements = NUM_ELEM;

const float start = -7919.02f;

const float target = 111.49317f;

std::cout << "\nStarting value= " << start << " , goal value= " << target << ", number of floating point values in array= " << num_elements << '\n';

std::cout << "Objective: To minimize the absolute difference between a processed starting value and the target value.\n";

std::cout << "The order in which the values are fed into the test function do produce a set of distinct results dependant on order\n";

std::cout << "In this test case there should only be one optimal value and one corresponding permutation of inputs which generate that value.\n";

std::cout << "The results will be validated by CPU std::next_permutation(), and the performance difference between CUDA and CPU implementations will be compared\n";

std::cout<<"NOTE: CPU version may take a long time to finish!\n";

int *result_perm = (int*)malloc(NUM_ELEM*sizeof(int));

memset(result_perm, 0, NUM_ELEM*sizeof(int));

float dif = MAX_DIF_VAL;

int num_good = 0;

float h_val_start = start;

/*

std::cout << "Running overclocked 4.5 GHz CPU version via STL next_permutation: \n";

wTimerRes = 0;

init = InitMMTimer(wTimerRes);

startTime = timeGetTime();

check_perm_on_lovely_cpu(num_elements, start, target, dif, num_good, H_denoms, result_perm);

endTime = timeGetTime();

CPUtime = endTime - startTime;

std::cout << "4.5 Ghz i7 CPU timing: " << double(CPUtime) / 1000. << " seconds.\n";

std::cout << "CPU answer is: " << dif << ", number of permutations which map to that optimal value= " << num_good << '\n';

std::cout << "\nPermutation as determined by stl::next_permutation() is as follows:\n";

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << result_perm[i] << " ,input value= " << H_denoms[result_perm[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[result_perm[i]], float((i + 1) + result_perm[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[result_perm[i]], float((i + 1) + result_perm[i]));

}

std::cout << '\n';

*/

if (DO_TEST){

std::cout << "\nStarting GPU testing:\n";

if (multi_gpu){//test with permutation evaluation, optimization,scan,reduction etc..

std::cout<<"\nMulti-GPU implementation\n";

int deviceCount;

err=cudaGetDeviceCount(&deviceCount);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

int device;

for(device=0;device<deviceCount;device++){

err= cudaGetDeviceProperties(&deviceProp,device);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout << "\nGPU #"<<device<<"=" << deviceProp.name << '\n';

}

const long long overall_result_space=H_F[NUM_ELEMENTS];

const long long result_space=43589145600LL;

const int adj_size = get_adj_size(result_space);

const int temp_blocks_sz = get_dynamic_block_size(adj_size, blockSize_14);

const int num_blx = int(result_space / long long(temp_blocks_sz));

std::cout << "\nnum_blx= " << num_blx << ", adj_size= " << adj_size << '\n';

const long long rem_start0 = (long long)num_blx*(long long)temp_blocks_sz; //first permutation index not covered by device 1's full blocks

const long long rem_start1 = result_space + rem_start0; //same boundary, offset into device 0's half of the range

std::cout<<"\nrem_start0= "<<rem_start0<<", rem_start1= "<<rem_start1<<'\n';

const unsigned int num_bytes_ans = num_blx*sizeof(float);

const unsigned int num_bytes_perm = num_blx*sizeof(int2);

D_denoms_14_local Test_eval;

for(int i=0;i<14;i++){

Test_eval.denoms[i]=H_denoms[i];

}

float GPUans0 = MAX_DIF_VAL,GPUans1 = MAX_DIF_VAL;

long long GPU_perm_number= 0LL;

int2 perm_mask_split0 = { 0 };

int2 perm_mask_split1 = { 0 };

//0

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

float *ans_val0;

int2 *perm_val0;

err = cudaMalloc((void **)&ans_val0, num_bytes_ans);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMalloc((void **)&perm_val0, num_bytes_perm);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemset(ans_val0, 0, num_blx*sizeof(int));

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//1

err=cudaSetDevice(1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

float *ans_val1;

int2 *perm_val1;

err = cudaMalloc((void **)&ans_val1, num_bytes_ans);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMalloc((void **)&perm_val1, num_bytes_perm);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemset(ans_val1, 0, num_blx*sizeof(int));

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//start with device 1

wTimerRes = 0;

init = InitMMTimer(wTimerRes);

startTime = timeGetTime();

_gpu_perm_14<blockSize_14><<<num_blx, THREADS >>>(ans_val1, perm_val1,Test_eval, start, target, NUM_ELEMENTS);

//0

//now more powerful device 0

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

_gpu_perm_14_split<blockSize_14><<<num_blx, THREADS >>>(ans_val0, perm_val0,Test_eval, start, target, NUM_ELEMENTS,result_space);

_gpu_perm_last_step_14<<< 1, THREADS >>>(ans_val0, perm_val0,Test_eval, start, target,overall_result_space, num_elements, rem_start1, num_blx);

err = cudaMemcpy(&GPUans0, ans_val0, sizeof(float), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&perm_mask_split0, perm_val0, sizeof(int2), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//1

//now that answers are on host go back to device 1

err=cudaSetDevice(1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

_gpu_perm_last_step_14<<< 1, THREADS >>>(ans_val1, perm_val1,Test_eval, start, target,result_space, num_elements, rem_start0, num_blx);

err = cudaMemcpy(&GPUans1, ans_val1, sizeof(float), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&perm_mask_split1, perm_val1, sizeof(int2), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

endTime = timeGetTime();

GPUtime = endTime - startTime;

std::cout << "GPU timing: " << double(GPUtime) / 1000.0 << " seconds.\n";

//still device #1, free memory

err=cudaFree(ans_val1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err=cudaFree(perm_val1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout<<"\nFree memory from device 1\n";

//0

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err=cudaFree(ans_val0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err=cudaFree(perm_val0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout<<"\nFree memory from device 0\n";

vector<int> V(NUM_ELEMENTS, 0);

std::cout<<"\nans0= "<<GPUans0<<", permutation number "<< *reinterpret_cast<long long *>(&perm_mask_split0)<<'\n';

std::cout<<"\nans1= "<<GPUans1<<", permutation number "<< *reinterpret_cast<long long *>(&perm_mask_split1)<<'\n';

if(GPUans0<GPUans1){

std::cout<<"\nGPU answer is "<<GPUans0<<'\n';

GPU_perm_number = *reinterpret_cast<long long *>(&perm_mask_split0);

_cpu_derive2(GPU_perm_number, V, NUM_ELEMENTS);

std::cout << "\nPermutation as determined by OK CUDA implementation is as follows:\n";

h_val_start = start;

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << V[i] << " ,input value= " << H_denoms[V[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i]));

}

std::cout << "\nAbsolute difference(" << h_val_start << "-" << target << ")= " << fabs(h_val_start - target) << '\n';

std::cout << '\n';

}else{

std::cout<<"\nGPU answer is "<<GPUans1<<'\n';

GPU_perm_number = *reinterpret_cast<long long *>(&perm_mask_split1);

_cpu_derive2(GPU_perm_number, V, NUM_ELEMENTS);

std::cout << "\nPermutation as determined by OK CUDA implementation is as follows:\n";

h_val_start = start;

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << V[i] << " ,input value= " << H_denoms[V[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i]));

}

std::cout << "\nAbsolute difference(" << h_val_start << "-" << target << ")= " << fabs(h_val_start - target) << '\n';

std::cout << '\n';

}

std::cout << "\nReseting Device 0!\n";

err = cudaDeviceReset();

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//1

err=cudaSetDevice(1);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

std::cout << "\nReseting Device 1!\n";

err = cudaDeviceReset();

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

//0, leave with 0 set

err=cudaSetDevice(0);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

}else if(num_elements==13){

const long long result_space = H_F[NUM_ELEMENTS];

cout << "\nWill evaluate " << result_space << " permutations of array and return an optimal permutation and the optimal value associated with that permutation.\n";

const int adj_size = get_adj_size(result_space);

const int temp_blocks_sz = get_dynamic_block_size(adj_size, blockSize_13);

const int num_blx = int(result_space / long long(temp_blocks_sz));

std::cout << "\nnum_blx= " << num_blx << ", adj_size= " << adj_size << '\n';

const long long rem_start = result_space - (result_space - long long(num_blx)*long long(temp_blocks_sz));

D_denoms_14_local Test_eval;

for(int i=0;i<14;i++){

Test_eval.denoms[i]=H_denoms[i];

}

float GPUans = MAX_DIF_VAL;

long long GPU_perm_number = 0LL;

std::cout << "\nTesting 13! version.\n";

int2 perm_mask_split = { 0 };

float *ans_val;

int2 *perm_val;

const unsigned int num_bytes_ans = num_blx*sizeof(float);

const unsigned int num_bytes_perm = num_blx*sizeof(int2);

err = cudaMalloc((void **)&ans_val, num_bytes_ans);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMalloc((void **)&perm_val, num_bytes_perm);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemset(ans_val, 0, num_blx*sizeof(int));

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

wTimerRes = 0;

init = InitMMTimer(wTimerRes);

startTime = timeGetTime();


_gpu_perm_13<blockSize_13><<<num_blx, THREADS >>>(ans_val, perm_val,Test_eval, start, target, NUM_ELEMENTS);


err = cudaDeviceSynchronize(); //cudaThreadSynchronize() is deprecated; same behavior

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

_gpu_perm_last_step_13<<< 1, THREADS >>>(ans_val, perm_val,Test_eval, start, target, result_space, num_elements, rem_start, num_blx);

err = cudaDeviceSynchronize();

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&GPUans, ans_val, sizeof(float), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaMemcpy(&perm_mask_split, perm_val, sizeof(int2), _DTH);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

endTime = timeGetTime();

GPUtime = endTime - startTime;

std::cout << "GPU timing: " << double(GPUtime) / 1000.0 << " seconds.\n";

std::cout << "GPU answer is: " << GPUans << '\n';

GPU_perm_number = *reinterpret_cast<long long *>(&perm_mask_split);

DestroyMMTimer(wTimerRes, init);

err = cudaFree(ans_val);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

err = cudaFree(perm_val);

if (err != cudaSuccess){ printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }

vector<int> V(NUM_ELEMENTS, 0);

_cpu_derive2(GPU_perm_number, V, NUM_ELEMENTS);

std::cout << "\nPermutation as determined by OK CUDA implementation is as follows:\n";

h_val_start = start;

cout << "Start value= " << start << '\n';

for (int i = num_elements - 1; i >= 0; i--){

std::cout << "Using idx # " << V[i] << " ,input value= " << H_denoms[V[i]] <<

", current working return value= " << host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i])) << '\n';

h_val_start = host_get_F_val(h_val_start, H_denoms[V[i]], float((i + 1) + V[i]));

}

std::cout << "\nAbsolute difference(" << h_val_start << "-" << target << ")= " << fabs(h_val_start - target) << '\n';

std::cout << '\n';

}else if(0){ //placeholder branch for other element counts (never taken)

}

}

free(result_perm);

return 0;

}

bool InitMMTimer(UINT &wTimerRes){ //by reference, so DestroyMMTimer() later receives the same period passed to timeBeginPeriod()

TIMECAPS tc;

if (timeGetDevCaps(&tc, sizeof(TIMECAPS)) != TIMERR_NOERROR) { return false; }

wTimerRes = min(max(tc.wPeriodMin, 1), tc.wPeriodMax);

timeBeginPeriod(wTimerRes);

return true;

}

void DestroyMMTimer(UINT wTimerRes, bool init){

if (init)

timeEndPeriod(wTimerRes);

}

void _cpu_derive2(const long long num, vector<int> &V, const int digits){

long long tnum = num, c;

int B[19] = { 0 };

for (int i = 0; i<digits; i++){ B[i] = i; }

for (int d = digits - 1; d >= 0; d--){

c = long long(d);

while (c*H_F[d]>tnum){ --c; }

if (d == digits - 1){

V[d] = B[int(c)];

B[int(c)] = -1;

}

else{

int cc = 0;

for (int ii = 0; ii<digits; ii++){

if (B[ii] != -1){

if (cc == int(c)){

V[d] = B[ii];

B[ii] = -1;

break;

}

else{

cc++;

}

}

}

}

tnum -= c*H_F[d];

}

}

void check_perm_on_lovely_cpu(const int num_elem, const float num, const float target, float &dif, int &num_good, const float *nums, int *perm){

int *Arr = (int*)malloc(num_elem*sizeof(int));

for (int i = 0; i<num_elem; i++){

Arr[i] = i;

}

float cur_num = num, cur_dif = 999999999999.9f;

do{

cur_num = num;

for (int i = num_elem - 1; i >= 0; i--){

cur_num = cur_num + nums[Arr[i]] / (float((i + 1) + Arr[i]));

}

cur_dif = fabs(cur_num - target);

if (cur_dif<dif){

dif = cur_dif;

num_good = 1;

memcpy(perm, Arr, num_elem*sizeof(int));

}

else if (cur_dif == dif){

num_good++;

}

} while (next_permutation(Arr, Arr + num_elem));

free(Arr);

}

///////////////////

template<int blockWork>

__global__ void _gpu_perm_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits){

const long long offset = long long(threadIdx.x) + long long(blockIdx.x)*long long(blockWork);

const int reps = blockWork >> 8;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

__shared__ float D_denoms[15];

if(threadIdx.x<14){

D_denoms[threadIdx.x]=Test_vals.denoms[threadIdx.x];

}

__syncthreads();

int ii, idx, c;

float value = MAX_DIF_VAL, cur_val;

unsigned int B, cc;

int2 mask_as_int2, t2;

long long tnum;

for (ii = 0; ii<reps; ii++){

tnum = offset + long long(ii*THREADS);

cur_val = initial_val;

B = 0;

//13

c=int(tnum/6227020800LL);

cur_val = device_get_F_val(cur_val, D_denoms[c], (14.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*6227020800LL;

//12

t2.y=c=int(tnum/479001600LL);

cc = ~B;

while(c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (13.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,D_denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now we have the value associated with the current permutation; check whether it is the best this thread has seen so far, and cache the value with the permutation

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + long long(ii*THREADS);

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

}//end main loop

//reduce

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[blockIdx.x] = cur_val;

perm_val[blockIdx.x] = t2;

}

}

__global__ void _gpu_perm_last_step_14(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox){

const long long offset = long long(threadIdx.x) + rem_start;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

unsigned int B, cc;

int ii = 1, idx, c;

float value = MAX_DIF_VAL, cur_val;

int2 mask_as_int2, t2;

long long tnum, adj = 0LL;

for (; (offset + adj)<bound; ii++){

tnum = offset + adj;

cur_val = initial_val;

B = 0;

//13

c=int(tnum/6227020800LL);

cur_val = device_get_F_val(cur_val, Test_vals.denoms[c], (14.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*6227020800LL;

//12

t2.y=c=int(tnum/479001600LL);

cc = ~B;

while(c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (13.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,Test_vals.denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now we have the value associated with the current permutation; check whether it is the best this thread has seen so far, and cache the value with the permutation

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + adj;

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

adj = (long long(ii) << 8LL);

}//end main loop

//go through the other blocks' optimal results, compare with the current best, and cache the best value and the permutation associated with that value

adj = 0LL;

for (ii = 1; (threadIdx.x + int(adj))<num_blox; ii++){

idx = (threadIdx.x + int(adj));

if (ans_val[idx]<value){

value = ans_val[idx];

mask_as_int2 = perm_val[idx];

}

adj = (long long(ii) << 8LL);

}

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[0] = cur_val;

perm_val[0] = t2;

}

}

template<int blockWork>

__global__ void _gpu_perm_14_split(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits,

const long long off_multi){

const long long offset = off_multi+(long long(threadIdx.x) + long long(blockIdx.x)*long long(blockWork));

const int reps = blockWork >> 8;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

__shared__ float D_denoms[15];

if(threadIdx.x<14){

D_denoms[threadIdx.x]=Test_vals.denoms[threadIdx.x];

}

__syncthreads();

int ii, idx, c;

float value = MAX_DIF_VAL, cur_val;

unsigned int B, cc;

int2 mask_as_int2, t2;

long long tnum;

for (ii = 0; ii<reps; ii++){

tnum = offset + long long(ii*THREADS);

cur_val = initial_val;

B = 0;

//13

c=int(tnum/6227020800LL);

cur_val = device_get_F_val(cur_val, D_denoms[c], (14.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*6227020800LL;

//12

t2.y=c=int(tnum/479001600LL);

cc = ~B;

while(c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (13.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,D_denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now we have the value associated with the current permutation; check whether it is the best this thread has seen so far, and cache the value with the permutation

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + long long(ii*THREADS);

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

}//end main loop

//reduce

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[blockIdx.x] = cur_val;

perm_val[blockIdx.x] = t2;

}

}

template<int blockWork>

__global__ void _gpu_perm_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const int digits){

const long long offset = long long(threadIdx.x) + long long(blockIdx.x)*long long(blockWork);

const int reps = blockWork >> 8;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

__shared__ float D_denoms[15];

if(threadIdx.x<14){

D_denoms[threadIdx.x]=Test_vals.denoms[threadIdx.x];

}

__syncthreads();

int ii, idx, c;

float value = MAX_DIF_VAL, cur_val;

unsigned int B, cc;

int2 mask_as_int2, t2;

long long tnum;

for (ii = 0; ii<reps; ii++){

tnum = offset + long long(ii*THREADS);

cur_val = initial_val;

B = 0;

//12

c=int(tnum/479001600LL);

cur_val = device_get_F_val(cur_val, D_denoms[c], (13.0f + float(c)));

B |= (1 << c);

tnum -= long long(c)*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= long long(t2.y)*39916800LL;

//yip

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,D_denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, D_denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now have the value for the current permutation; check whether it is the best this thread has seen so far, and if so cache the value together with the permutation rank

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + (long long)(ii*THREADS);
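// pack the 64-bit rank into an int2 so it can travel through the 32-bit
// warp shuffles used in the reduction below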

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

}//end main loop

//warp-level reduction: each __shfl step pulls a candidate from a lane 16, 8, 4, 2, then 1 away, halving the field each time so lane 0 ends up with the warp minimum

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}
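// lane 0 of each warp now holds that warp's minimum; the first lane of each
// of the 8 warps publishes it to shared memory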

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[blockIdx.x] = cur_val;

perm_val[blockIdx.x] = t2;

}

}

__global__ void _gpu_perm_last_step_13(

float* __restrict__ ans_val,

int2* __restrict__ perm_val,

const D_denoms_14_local Test_vals,

const float initial_val,

const float target,

const long long bound,

const int digits,

const long long rem_start,

const int num_blox){
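// Single-block cleanup kernel: its 256 threads sweep the leftover ranks
// [rem_start, bound) that did not fill a whole block's worth of work, then
// fold in the per-block bests already written by _gpu_perm_13 and leave the
// final answer in ans_val[0] / perm_val[0].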

const long long offset = (long long)threadIdx.x + rem_start;

const int warpIndex = threadIdx.x % 32;

__shared__ float blk_best[8];

__shared__ int2 mask_val[8];

unsigned int B, cc;

int ii = 1, idx, c;

float value = MAX_DIF_VAL, cur_val;

int2 mask_as_int2, t2;

long long tnum, adj = 0LL;

for (; (offset + adj)<bound; ii++){

tnum = offset + adj;

cur_val = initial_val;

B = 0;

//12

c=int(tnum/479001600LL);

cur_val = device_get_F_val(cur_val, Test_vals.denoms[c], (13.0f + float(c)));

B |= (1 << c);

tnum -= (long long)c*479001600LL;

//11

t2.y=c=int(tnum/39916800LL);

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (12.0f + float(idx)));

B |= (1 << idx);

tnum -= (long long)t2.y*39916800LL;

//the remaining rank now fits in 32 bits, so the rest of the decomposition uses int arithmetic

t2.x=int(tnum);

//10

t2.y=c=t2.x/3628800;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (11.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*3628800;

//9

t2.y=c=t2.x/362880;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (10.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*362880;

//8

t2.y=c=t2.x/40320;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (9.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*40320;

//7

t2.y=c=t2.x/5040;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (8.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*5040;

//6

t2.y=c=t2.x/720;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val,Test_vals.denoms[idx], (7.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*720;

//5

t2.y=c=t2.x/120;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (6.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*120;

//4

t2.y=c=t2.x/24;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (5.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*24;

//3

t2.y=c=t2.x/6;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (4.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*6;

//2

t2.y=c=t2.x/2;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (3.0f + float(idx)));

B |= (1 << idx);

t2.x-=t2.y*2;

//1

c=t2.x;

cc = ~B;

while (c-->0){

cc = cc&(cc - 1);

}

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (2.0f + float(idx)));

B |= (1 << idx);

//0

cc = ~B;

idx = __ffs(cc) - 1;

cur_val = device_get_F_val(cur_val, Test_vals.denoms[idx], (1.0f + float(idx)));

B |= (1 << idx);

//now have the value for the current permutation; check whether it is the best this thread has seen so far, and if so cache the value together with the permutation rank

if (fabsf(cur_val - target)<value){

value = fabsf(cur_val - target);

tnum = offset + adj;

mask_as_int2 = *reinterpret_cast<int2 *>(&tnum);

}

adj = ((long long)ii << 8);

}//end main loop

//go through the per-block optimal results, compare to the current best, and cache the best value and the permutation associated with that value

adj = 0LL;

for (ii = 1; (threadIdx.x + int(adj))<num_blox; ii++){

idx = (threadIdx.x + int(adj));

if (ans_val[idx]<value){

value = ans_val[idx];

mask_as_int2 = perm_val[idx];

}

adj = ((long long)ii << 8);

}
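// warp-level shuffle reduction over the surviving candidates, identical to
// the reduction at the end of _gpu_perm_13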

cur_val = __shfl(value, warpIndex + 16);

t2.x = __shfl(mask_as_int2.x, warpIndex + 16);

t2.y = __shfl(mask_as_int2.y, warpIndex + 16);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 8);

t2.x = __shfl(mask_as_int2.x, warpIndex + 8);

t2.y = __shfl(mask_as_int2.y, warpIndex + 8);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 4);

t2.x = __shfl(mask_as_int2.x, warpIndex + 4);

t2.y = __shfl(mask_as_int2.y, warpIndex + 4);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 2);

t2.x = __shfl(mask_as_int2.x, warpIndex + 2);

t2.y = __shfl(mask_as_int2.y, warpIndex + 2);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

cur_val = __shfl(value, warpIndex + 1);

t2.x = __shfl(mask_as_int2.x, warpIndex + 1);

t2.y = __shfl(mask_as_int2.y, warpIndex + 1);

if (cur_val<value){

value = cur_val;

mask_as_int2 = t2;

}

if (warpIndex == 0){

blk_best[threadIdx.x >> 5] = value;

mask_val[threadIdx.x >> 5] = mask_as_int2;

}

__syncthreads();

if (threadIdx.x == 0){

cur_val = blk_best[0];

t2 = mask_val[0];

if (blk_best[1]<cur_val){

cur_val = blk_best[1];

t2 = mask_val[1];

}

if (blk_best[2]<cur_val){

cur_val = blk_best[2];

t2 = mask_val[2];

}

if (blk_best[3]<cur_val){

cur_val = blk_best[3];

t2 = mask_val[3];

}

if (blk_best[4]<cur_val){

cur_val = blk_best[4];

t2 = mask_val[4];

}

if (blk_best[5]<cur_val){

cur_val = blk_best[5];

t2 = mask_val[5];

}

if (blk_best[6]<cur_val){

cur_val = blk_best[6];

t2 = mask_val[6];

}

if (blk_best[7]<cur_val){

cur_val = blk_best[7];

t2 = mask_val[7];

}

ans_val[0] = cur_val;

perm_val[0] = t2;

}

}

Other work done on this topic:

Shao Voon Wong also wrote a CUDA project which only generates the permutations, with no built-in evaluation step.

In general we both used the factorial decomposition method, but with very different implementations of that method.
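For readers unfamiliar with the method, here is a minimal host-side sketch (my illustration, not code taken from either project) that converts a permutation rank in [0, N!) into the corresponding permutation of 0..N-1. The kernels above perform the same decomposition, but keep the pool of unused elements as a bitmask and walk it with __ffs:

#include <cstdint>
#include <vector>

std::vector<int> rank_to_permutation(std::uint64_t rank, int n){
    std::vector<std::uint64_t> fact(n, 1);               // fact[i] = i!
    for (int i = 1; i < n; ++i) fact[i] = fact[i-1]*std::uint64_t(i);
    std::vector<int> pool;                               // still-unused elements
    for (int i = 0; i < n; ++i) pool.push_back(i);
    std::vector<int> perm;
    for (int i = n - 1; i >= 0; --i){
        int c = int(rank/fact[i]);                       // factorial digit i
        rank -= std::uint64_t(c)*fact[i];
        perm.push_back(pool[c]);                         // take the c-th unused element
        pool.erase(pool.begin() + c);
    }
    return perm;
}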

Comparing my current implementation to his there are these significant differences:

    • The OK version has been tested up to 16 elements (N = 16), while Shao Voon Wong's version was not able to handle N > 12 without major modifications

    • The OK version uses very little global memory and runs at 100% compute occupancy on the GPU

    • Shao Voon Wong's version only generates permutations, with no evaluation step; ultimately, the only reason to generate permutations is to evaluate them

    • The OK version, even with the additional N evaluation steps per permutation, is several times faster due to CUDA code optimizations

FAQ:

For 'brute force' problems, how much of a performance difference is there between a high-end CPU and a GPU?

For this specific problem the difference is only ~100x in running time when comparing a single 1.1 GHz GPU to a single 4.5 GHz CPU core.

Some other combinatorial problems can have as much as a 400x difference, such as this older project of mine: Optimal Triangle

What about an OpenCL version?

Feel free to write one, as this general approach should work in OpenCL. Please let me know your running times on equivalent data sets for comparison.

Your CPU implementation is only single-core, how would a multi-core implementation perform?

In my experience this is not quite as easy to implement as you might think, and there are additional complications which would not necessarily result in an implementation that cuts the running time linearly per thread.

I did massively overclock the CPU to 4.5 GHz, which is not typical. Even if you were able to get 4 threads running at 4.5 GHz on this problem, the GPU implementation would still outperform it by a factor of over 20x.

But by all means try to make one and let me know the results!
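If you do, one plausible way to split the work is to seed each thread's starting permutation by rank (using something like the rank_to_permutation sketch above) and then advance with std::next_permutation. A rough sketch, assuming the evaluation function is thread-safe:

#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

void search_slice(std::uint64_t first, std::uint64_t count, int n){
    std::vector<int> a = rank_to_permutation(first, n);  // this thread's starting point
    for (std::uint64_t i = 0; i < count; ++i){
        // ...evaluate a[] here and track this thread's best result...
        std::next_permutation(a.begin(), a.end());
    }
}

void search_all(std::uint64_t total, int n, unsigned T){
    std::vector<std::thread> workers;
    const std::uint64_t per = total/T;
    for (unsigned t = 0; t < T; ++t)                     // last thread takes the remainder
        workers.emplace_back(search_slice, t*per, (t + 1 == T) ? total - t*per : per, n);
    for (auto& w : workers) w.join();
}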

Which GPUs will successfully run the code you posted?

In order of performance: GTX Titan X, GTX 980ti, GTX 980, Quadro M6000, GTX 780ti, GTX Titan Black, Tesla K40, Quadro K6000, GTX 980m, GTX Titan, Tesla K20x

Could you split this problem between different GPUs to decrease the running time and increase 'N'?

Yes, this particular problem can be split across multiple GPUs with a 'merge' phase at the end. Stay tuned for an example using 4 GPUs to evaluate all 18! permutations.
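As a rough outline (hypothetical host code; the kernel would need an extra base-rank parameter, and streams/error checks are omitted):

// partition the N! ranks evenly across the available devices
int num_devices = 0;
cudaGetDeviceCount(&num_devices);
const long long total = 6227020800LL;                    // 13! as an example
const long long per_dev = (total + num_devices - 1)/num_devices;
for (int d = 0; d < num_devices; ++d){
    cudaSetDevice(d);
    // launch a variant of _gpu_perm_13 over ranks [d*per_dev, min((d+1)*per_dev, total))
    // with the same blockWork/THREADS configuration as the single-GPU case
}
// finally copy each device's best (difference, rank) pair back to the host
// and keep the overall minimum -- the 'merge' phase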

Why bother with GPUs when Quantum computers will be available to solve such problems?

Keep in mind the GPU I used costs $500 USD and is 'plug-and-play'. While quantum computers may be able to brute-force such problems more quickly, their cost and scarcity put them out of reach for most people.

Also, GPUs are more versatile: they can be used for fast sorting, FFTs, image processing, linear algebra subroutines, N-body and Monte Carlo simulations, and VIDEO GAMES!

More similar problems:

Optimal Polygon

Magic Square


Other brute force with ranking


NEXT TIME: 4 GTX 980 for 18! permutation evaluations

