Three key architectural changes for making parallel execution faster:
Basically, perform a single operation using different ALUs in parallel, over different data.
*Cores: One compute element on the GPU, consisting of Fetch/Decode, ALU & Context.
So the mapping is done by assuming a 2-D grid:
Note: X, Y, Z order within a group matters (but not among the groups).
A given function is run on all work items in the given grid. In order to determine which work item
is currently being dealt with, the GPU uses: local_id, group_id, global_id
(all of these are relative to some axis (0, 1, 2, which are basically x, y, z)).
Note: Grids can be 1-, 2- or 3-dimensional.
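To make the relation between the three ids concrete, here is a minimal pure-Python sketch that enumerates them for a small 2-D grid. It only emulates the indexing on the host; `work_item_ids` is a hypothetical helper name, and it assumes every global size is a multiple of the corresponding work-group size.

```python
from itertools import product


def work_item_ids(global_size, local_size):
    """Yield (group_id, local_id, global_id) tuples for an N-D grid.

    Host-side emulation for illustration only; on a device these values
    come from get_group_id, get_local_id and get_global_id.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global_size and local_size must have the same rank")
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("each global size must be a multiple of the local size")
    num_groups = [g // l for g, l in zip(global_size, local_size)]
    for group_id in product(*(range(n) for n in num_groups)):
        for local_id in product(*(range(l) for l in local_size)):
            # Per axis: global_id = group_id * local_size + local_id
            global_id = tuple(
                grp * size + loc
                for grp, size, loc in zip(group_id, local_size, local_id)
            )
            yield group_id, local_id, global_id


# A 4x6 grid split into work groups of 2x3 work items.
ids = list(work_item_ids(global_size=(4, 6), local_size=(2, 3)))
```

Each global id appears exactly once, so a kernel that indexes its data by global id touches every element of the grid once.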
A main advantage of OpenCL is that it comes with RTCG (runtime code generation) out of the box: kernel source is handed over as a string and compiled at run time.
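A rough sketch of what RTCG can look like on the host side: the kernel source is just a string, so it can be specialised at run time before it is compiled. The template, the kernel name `scale` and the helper `make_scale_kernel` are assumptions made up for this illustration, not part of any library.

```python
from string import Template

# OpenCL C source with placeholders that are filled in at run time.
KERNEL_TEMPLATE = Template("""
__kernel void scale(__global ${ctype} *a)
{
    a[get_global_id(0)] *= ${factor};
}
""")


def make_scale_kernel(ctype, factor):
    """Return OpenCL C source specialised for one element type and factor."""
    return KERNEL_TEMPLATE.substitute(ctype=ctype, factor=factor)


source = make_scale_kernel("float", 2)
```

The resulting `source` string would then be passed to `cl.Program(ctx, source)` and built, in the same way as the hand-written kernel in the PyOpenCL example in these notes.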
OpenCL | CUDA |
Grid | Grid |
Work Group | Block |
Work Item | Thread |
__kernel | __global__ |
__global | __device__ |
__local | __shared__ |
__private | local (per-thread, no qualifier) |
image<n>d_t | texture<type, n, ...> |
barrier(CLK_LOCAL_MEM_FENCE) | __syncthreads() |
get_local_id(0/1/2) | threadIdx.x/y/z |
get_group_id(0/1/2) | blockIdx.x/y/z |
get_global_id(0/1/2) | - (reimplement) |
Other terminologies:
Other key points:
import pyopencl as cl
import numpy
# Create a 256^3 element numpy array
a = numpy.random.rand(256**3).astype(numpy.float32)
# Create a context -> a living space for all the OpenCL objects
ctx = cl.create_some_context()
# For the given context, create a command queue -> tied to a device (and thus a platform) in that context.
queue = cl.CommandQueue(ctx)
# For the given context, create a buffer -> ALLOCATE memory space on DEVICE,
# on which we transfer data, that needs to be processed.
a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
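# This will TRANSFER data from host to device USING the command queue (which is literally a queue of commands).
# enqueue_copy(queue, dest, src) replaces the older enqueue_write_buffer, which current PyOpenCL no longer provides.
cl.enqueue_copy(queue, a_dev, a)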
# Write the program for the computation we want to perform:
prg = cl.Program(ctx, """
__kernel void twice(__global float *a)
{ a[get_global_id(0)] *= 2; }
""").build()
# Run the program on the data using the queue
prg.twice(queue, a.shape, (1,), a_dev)
# The call signature is (command queue, grid size = a.shape, workgroup size = (1,), buffer = a_dev)
# Create an empty array with same dimension as that of a
result = numpy.empty_like(a)
# Wait for the result and the transfer
cl.enqueue_copy(queue, result, a_dev).wait()
The indexing and call changes are as follows:
a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2;
prg.twice(queue, a.shape, (256,), a_dev)
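The effect of the work-group version can be checked on the host with a small emulation. This is only a sketch in plain Python: `run_twice` is a hypothetical stand-in for the kernel launch, and it assumes (as OpenCL 1.x requires) that the grid size is a multiple of the work-group size.

```python
def run_twice(a, local_size):
    """Emulate prg.twice(queue, (len(a),), (local_size,), a_dev) on a list."""
    global_size = len(a)
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Same index expression as the kernel body:
            # get_local_id(0) + get_local_size(0) * get_group_id(0)
            i = local_id + local_size * group_id
            a[i] *= 2
    return a


data = [float(i) for i in range(1024)]
result = run_twice(list(data), local_size=256)
```

With 1024 elements and a work-group size of 256 there are four groups, and the manually computed index visits every element exactly once, which is the same thing get_global_id(0) gives directly.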