Vivante GPU

Fortunately I had chance to access the one of the powefull evaluation board available in the market Freescale i.MX6Q SABRE AUTO.

I could manage to get some of the data required to understand from GPGPU perspective, however I am not able to verify the result myself.

For now we take this information to educate us on the OpenCL and device side and later some time we will rip it for sure!

At a glance....

Freescale i.MX6

The i.MX6 Quad (4 ARM Cortex-A9 cores) and i.MX6 Dual (2 ARM Cortex-A9 cores) have support for OpenCL 1.1 EP (Embedded Profile). (source)Both have a Vivante GC2000 GPU, which has 16 GFLOPS to 24 GFLOPS depending on the source. The GPU cores can be used to run OpenGL 2.0 EP shaders and OpenCL EP kernels.

What the fuck is this terms!? I can hear your inner voice, be calm. huh... that's our Indian ability to read other persons mind voice :)

Drivers

Under the Software & Tools tab of the SABRE-board there are drivers – they have not been tested with other boards, so no guarantees are given.

IMX6_GPU_SDK:

A collection of GPU code samples, for OpenCL the work is still in progress.

You can find it under “Software Development Tools” -> “Snippets, Boot Code, Headers, Monitors, etc.”

IMX_6D_Q_VIVANTE_VDK_<version>_TOOLS:

GPU profiling tools, offline compiler and an emulator with CL support which runs on Windows platforms. Be sure you pick the latest version!

You can find it under “Software Development Tools” -> “IDE – Debug, Compile and Build Tools“.

And this guys who make our life more miserable by providing some outdated tools (I mean LTIB).

OpenCL Info

I managed to run the OpenCL sample that comes with GPU SDK using LTIB.

I will keep the setup procedure aside and concentrate on GPGPU information.

>>>>>>>> ./clinfo Starting...

Available platforms: 1

Platform ID: 0

CL_PLATFORM_NAME : Vivante OpenCL Platform

CL_PLATFORM_PROFILE : EMBEDDED_PROFILE

CL_PLATFORM_VERSION : OpenCL 1.1

CL_PLATFORM_VENDOR : Vivante Corporation

CL_PLATFORM_EXTENSIONS : None

Available devices: 1

Device ID: 0

Device Ptr: 0x2abd14b0

CL_DEVICE_NAME : Vivante OpenCL Device

CL_DEVICE_VENDOR : Vivante Corporation

CL_DEVICE_TYPE : GPU

CL_DEVICE_OPENCL_C_VERSION : OpenCL C 1.1

CL_DEVICE_VENDOR_ID : 0x00564956

CL_DEVICE_PLATFORM : 0x2abd1630

CL_DEVICE_VERSION :OpenCL 1.1

CL_DEVICE_PROFILE : EMBEDDED_PROFILE

CL_DRIVER_VERSION : OpenCL 1.1

CL_DEVICE_MAX_COMPUTE_UNITS : 4

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3

CL_DEVICE_MAX_WORK_ITEM_SIZES[0] : 1024

CL_DEVICE_MAX_WORK_ITEM_SIZES[1] : 1024

CL_DEVICE_MAX_WORK_ITEM_SIZES[2] : 1024

CL_DEVICE_MAX_WORK_GROUP_SIZE : 1024

CL_DEVICE_MAX_CLOCK_FREQUENCY : 500 MHz

CL_DEVICE_IMAGE_SUPPORT : Yes

CL_DEVICE_MAX_READ_IMAGE_ARGS : 8

CL_DEVICE_MAX_WRITE_IMAGE_ARGS : 8

CL_DEVICE_IMAGE2D_MAX_WIDTH : 8192

CL_DEVICE_IMAGE2D_MAX_HEIGHT : 8192

CL_DEVICE_IMAGE3D_MAX_WIDTH : 0

CL_DEVICE_IMAGE3D_MAX_HEIGHT : 0

CL_DEVICE_IMAGE3D_MAX_DEPTH : 0

CL_DEVICE_MAX_SAMPLERS : 8

So, the number of computing devices shows that this OpenCL driver provided by FreeScale only support GPU computing, the four ARM Cortex-A9 cpu cores are not supported, and GC2000 has 4 shaders (compute units), and the clock frequency is 500MHZ, it also support opencl image processing. That is really a good news.

CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store

CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR : 4

CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT : 4

CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT : 4

CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG : 0

CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT : 4

CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE : 0

CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR : 4

CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT : 4

CL_DEVICE_NATIVE_VECTOR_WIDTH_INT : 4

CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG : 0

CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT : 4

CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE : 0

CL_DEVICE_MAX_PARAMETER_SIZE : 256

CL_DEVICE_MEM_BASE_ADDR_ALIGN : 1024

CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE : 128

CL_DEVICE_SINGLE_FP_CONFIG:

CL_FP_DENORM : No

CL_FP_INF_NAN : No

CL_FP_ROUND_TO_NEAREST : No

CL_FP_ROUND_TO_ZERO : Yes

CL_FP_ROUND_TO_INF : No

CL_FP_FMA : No

CL_FP_SOFT_FLOAT : No

CL_DEVICE_ADDRESS_BITS : 32

CL_DEVICE_GLOBAL_MEM_SIZE : 64 MByte

CL_DEVICE_MAX_MEM_ALLOC_SIZE : 32 MByte

CL_DEVICE_GLOBAL_MEM_CACHE_TYPE : Read/Write

CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE : 64

CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : 4096

CL_DEVICE_LOCAL_MEM_SIZE : 1 KByte

CL_DEVICE_LOCAL_MEM_TYPE : Global

CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 4 KByte

CL_DEVICE_MAX_CONSTANT_ARGS : 9

CL_DEVICE_ERROR_CORRECTION_SUPPORT : Yes

CL_DEVICE_QUEUE_PROPERTIES:

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE :Yes

CL_QUEUE_PROFILING_ENABLE : Yes

CL_DEVICE_HOST_UNIFIED_MEMORY : Yes

CL_DEVICE_PROFILING_TIMER_RESOLUTION : 1000

CL_DEVICE_ENDIAN_LITTLE : Yes

CL_DEVICE_AVAILABLE : Yes

CL_DEVICE_COMPILER_AVAILABLE : Yes

CL_DEVICE_EXECUTION_CAPABILITIES :

CL_EXEC_KERNEL : Yes

CL_EXEC_NATIVE_KERNEL : No

these parameters show that gc2000 can deal with char (8bit), short int (16bit), int (32bit), float(32bit), but not long and double.

You can see that the local memory size of GC2000 is extremely small, only 1KBytes, And the global memory cache size is only 64bytes, that sames ridiculous, all these

limitations made the gpu to get its most data directly from global memory, as the FreeScale used a double 64bit AXI bus architecture, it extremely constrained the power of this

GPU.

>>>>>>>> Creating CLInfo context...

Context Properties:

Context Ptr : 0x33302e50

CL_CONTEXT_REFERENCE_COUNT : 1

CL_CONTEXT_NUM_DEVICES : 1

CL_CONTEXT_DEVICES : 0x2abd14b0

CL_CONTEXT_PROPERTIES : 0x00001084

0x2abd1630

0x00000000

>>>>>>>> Creating CLInfo command queue...

Command Queue Properties:

CL_QUEUE_CONTEXT : 0x33302e50

CL_QUEUE_DEVICE : 0x2abd14b0

CL_QUEUE_REFERENCE_COUNT : 1

CL_QUEUE_PROPERTIES:

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE :Yes

CL_QUEUE_PROFILING_ENABLE : Yes

>>>>>>>> Creating CLInfo program...

>>>>>>>> Building CLInfo program...

Program Properties:

CL_PROGRAM_CONTEXT : 0x33302e50

CL_PROGRAM_REFERENCE_COUNT : 1

CL_PROGRAM_NUM_DEVICES : 1

CL_PROGRAM_DEVICES : 0x2abd14b0

CL_PROGRAM_SOURCE : (Size:60)

------------------ BEGIN --------------------

__kernel void hello() { size_t i = get_global_id(0); }

------------------- END ---------------------

CL_PROGRAM_BINARY_SIZES[0]: 232

CL_PROGRAM_BINARIES:

Device Number 0:

------------------ BEGIN --------------------

SHDR CL #global_id i

hello 0

0

------------------- END ---------------------

Program Build Properties:

CL_PROGRAM_BUILD_STATUS: 0

CL_PROGRAM_BUILD_OPTIONS: ""

CL_PROGRAM_BUILD_LOG: ""

>>>>>>>> Creating CLInfo kernel...

Kernel Properties:

CL_KERNEL_FUNCTION_NAME : "hello"

CL_KERNEL_CONTEXT : 0x33302e50

CL_KERNEL_PROGRAM : 0x33300088

CL_KERNEL_NUM_ARGS : 0

CL_KERNEL_REFERENCE_COUNT : 1

Kernel Workgroup Properties:

CL_KERNEL_WORK_GROUP_SIZE : 192

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE : 16

CL_KERNEL_COMPILE_WORK_GROUP_SIZE : 0 0 0

CL_KERNEL_LOCAL_MEM_SIZE : 0

CL_KERNEL_PRIVATE_MEM_SIZE : 0

>>>>>>>> Releasing CLInfo kernel...

>>>>>>>> Releasing CLInfo program...

>>>>>>>> Releasing CLInfo command queue...

>>>>>>>> Releasing CLInfo context...

>>>>>>>> Press <Enter> to quit...

The computing power of the Vivante's GC2000 GPU in i.MX6Q

Theoretical Power

Let's first figure out how fast the GC2000 can compute theoretically.

GC2000 has four compute units, each unit can handle four float MAD (multiply and add) operation, if every unit of GC2000 can make a MAD operation in every clock cycle, and the GPU's maximum frequency is 500MHZ (0.5 GHZ), so GC2000 would have a computing power : 0.5 * 4 * 4 * 2 = 16GFLOPS.

Please don't ask me how the number '2' came in, I have no idea! :) Yeah... you are right its copy and paste work.

May be 2 stands for two mathematical operations Multiply and Add.

We first make a kernel program with no meaningful operations, just computation

between registers of GPU.

Here is one:

__kernel void vadd (__global const float4 *a, __global const float4 *b, __global float4 *c)

{

int gid = get_global_id(0);

int i; this

float4 dd1, dd2, dd3, dd4, dd5;

for(i=0; i<900000; i++)

{

dd3 += dd2*dd5;

dd2 += dd1*dd4;

dd4 += dd3*dd1;

dd1 += dd3*dd2;

dd3 += dd2*dd5;

dd2 += dd1*dd4;

dd4 += dd3*dd1;

dd1 += dd3*dd2;

}

And again please don't ask me how to run this code and get the expected result!

The result is 13.8GFLOPS, that is a rather reasonable figure.

Real power

We used the GPU to compute the multiplication of two 1000x1000 matrix.

The total computation load is 2 * 1000^3=2GFLOPS.

And once again please don't ask me how above calculation is done.

Ideally this should take only 2 / 13.8 = 0.145 seconds.

Here is the kernel program:

__kernel void mmul(const int Ndim, const int Pdim, const int Mdim, __global float4 *A, __global float4 *B, __global float *C)

{

int k, kk, m1, m2;

int i = get_global_id(0);

int j = get_global_id(1);

float4 tmp=0.0;

m1 = i*(Pdim/8); m2=j*(Pdim/8);

for(k=0; k<(Pdim/8); k++, m1++, m2++)

tmp += Atmp1[kk] * tmp2[kk];

C[i*Mdim+j] = tmp.x+tmp.y+tmp.z+tmp.w;

}

The program used 3.3 seconds, that is about 600MFLOPS, or 0.6GFLOPS. (2GLOPS / 3.3 second = 0.6 GLOPS)

For comparison, the same program ran on the X200 laptop with Intel's OpenCL driver, the result is quite surprising: 2.9GFLOPS.

Let's figure out the reason for the difference:

1. First, The Cuo 2 P8600 has two CPU cores, the working frequency is 2.4G, each core has a SSE 128bit floating point processor, it can be regarded as one shader, so P8600 has two shaders working at 2.4G, GC2000 has 4 shaders working at 0.5G, the ratio of computing ability shall be 2*2.4 : 4*0.5 = 2.4 : 1.

2. Secondly, P8600 has 3M L2 cache, and for GC2000, as stated above, the OpenCL can only use 1k local memory and 64byte memory cache.

3. And further, the memory bandwidth greatly influenced the result. Intel's P8600 memory bandwith is 800Mhz 64bit, that is 6.4G. According to I.MX6's datasheet, the double 64bit AXI can run at maximum speed of 264Mhz, the bandwith is about 2.1G. As GC2000 almost has no cache, one GLOPS needs at least one new float number, that is 4 bytes, for a 2.1G bus, the maximum throughput is 500M float numbers, cache can make things a little better, so the test result 600MGLOPS is a very reasonable figure for GC2000.

600MFLOS is not a very bad result. Half of it is enough to drive a 720p display. So, there is plenty of room for general computing.

As you see, the bottleneck of GPU computing is the system bus bandwidth and cache size of the GPU. I do not know why Vivante designed such a small cache for OpenCL. Maybe it is the driver's configuration, or the hardware's limitation. That tremendously limited the power of gc2000.

I hope the engineers of Vivante could make some optimization for their driver and future GPU design, so that their GPU could be better used for general computing.

The test platform is a i.MX6Q TV stick Hi802 with Ubuntu 11.10 linaro and vivance opencl driver , the test platform for Lenovo X200 Laptop is Ubuntu 12.04 with Intel OpenCL driver.

I hope you understood every bit of information, if YES then you are really a geek!

I feel like closing this **** work and watch some * movie (The Pursuit of Happiness) :P

Reference:

http://streamcomputing.eu/knowledge/sdks/vivante-gpu/

https://community.freescale.com/message/320692#320692