OpenCL
output = imrotate(input, angle);
Both input and output are of type uint8; angle is of type double.
For each output pixel, the corresponding input pixel is located from the output pixel's coordinates. In our implementation, 16 uint8 values are packed into
one 4-element uint vector (uint4) to maximize the utilization of the memory bandwidth (CPU-GPU data transfer and GPU device memory access). Although packing
and unpacking add extra computation to the kernel, the overall GPU performance might still be improved.
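
The paragraph above can be illustrated with a minimal CPU-side sketch in plain C; it is not the actual kernel. Everything named here is an assumption made for illustration: the packed16_t struct stands in for OpenCL's uint4, the output has the same size as the input, sampling is nearest-neighbor about the image center, pixels that map outside the input are set to 0, and the host passes in the cosine (c) and sine (s) of the angle. The sign convention of the rotation is also an assumption and may differ from imrotate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL's uint4: four 32-bit words hold 16 uint8 pixels. */
typedef struct { uint32_t s[4]; } packed16_t;

/* Pack 16 consecutive uint8 pixels, lowest byte first within each word. */
static packed16_t pack16(const uint8_t *px) {
    packed16_t v;
    for (int w = 0; w < 4; ++w) {
        v.s[w] = (uint32_t)px[4 * w]
               | (uint32_t)px[4 * w + 1] << 8
               | (uint32_t)px[4 * w + 2] << 16
               | (uint32_t)px[4 * w + 3] << 24;
    }
    return v;
}

/* Unpack a single pixel: pick the vector, then the word, then the byte. */
static uint8_t load_pixel(const packed16_t *img, int index) {
    int lane = index % 16;
    return (uint8_t)(img[index / 16].s[lane / 4] >> (8 * (lane % 4)));
}

/* Round to the nearest integer without needing the math library. */
static int round_nearest(double v) {
    return (int)(v >= 0.0 ? v + 0.5 : v - 0.5);
}

/* For every output pixel, locate the source pixel by inverse mapping,
 * then write the results back 16 pixels (one packed vector) at a time.
 * c and s are the cosine and sine of the rotation angle. */
static void rotate_packed(const packed16_t *in, packed16_t *out,
                          int width, int height, double c, double s) {
    assert(width % 16 == 0); /* assumption: rows split evenly into groups of 16 */
    double cx = (width - 1) / 2.0, cy = (height - 1) / 2.0;
    for (int y = 0; y < height; ++y) {
        for (int g = 0; g < width / 16; ++g) {
            uint8_t px[16];
            for (int lane = 0; lane < 16; ++lane) {
                double dx = (g * 16 + lane) - cx, dy = y - cy;
                int sx = round_nearest(c * dx - s * dy + cx);
                int sy = round_nearest(s * dx + c * dy + cy);
                int inside = sx >= 0 && sx < width && sy >= 0 && sy < height;
                px[lane] = inside ? load_pixel(in, sy * width + sx) : 0;
            }
            out[y * (width / 16) + g] = pack16(px);
        }
    }
}
```

In a GPU version, each work-item would presumably handle one group of 16 output pixels, so the single packed store replaces 16 byte-sized stores at the cost of the shifts and masks shown in pack16 and load_pixel.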