
PyOpenCL (by Andreas Kloeckner, Part 1)

Introduction:

3 key architectural changes for making parallel execution faster:

Basically: perform a single operation in parallel on different ALUs, each over different data (SIMD).

Mapping data onto infinitely many ALUs:

* Cores: one compute element on the GPU, consisting of Fetch/Decode, ALU & Context.

So the mapping is done by assuming a 2-D grid:

Note: X, Y, Z ordering matters within a work group, but not across work groups.

Addressing:

A given function (the kernel) is run on all work items in the grid. To determine which work item is currently being processed, the GPU provides local_id, group_id and global_id (each taken along an axis 0, 1, 2, i.e. x, y, z).

Note: Grids can be 1,2 or 3 dim.
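
To make the addressing concrete, here is a small sketch (my own illustration, not from the talk) of how the three IDs relate on a 1-D grid with work groups of size 4:

# My own illustration: 8 work items in a 1-D grid, work groups of size 4.
#
# global id:  0 1 2 3 | 4 5 6 7    <- get_global_id(0)
# group id:   0 0 0 0 | 1 1 1 1    <- get_group_id(0)
# local id:   0 1 2 3 | 0 1 2 3    <- get_local_id(0)
#
# In general: get_global_id(0) == get_group_id(0)*get_local_size(0) + get_local_id(0)
for global_id in range(8):
    group_id, local_id = divmod(global_id, 4)
    assert global_id == group_id * 4 + local_id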

Introduction to OpenCL:

The main advantage of OpenCL is that it comes with RTCG (runtime code generation) out of the box.
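
As a sketch of what RTCG makes possible (the kernel name and the scale parameter are my own example, not from the talk): kernel source is just a Python string, so it can be generated and specialized at runtime before being compiled with cl.Program(...).build().

import pyopencl as cl

ctx = cl.create_some_context()

# Generate kernel source at runtime: 'scale' is baked into the source as a
# compile-time constant instead of being passed as a kernel argument.
def build_scaler(ctx, scale):
    src = """
    __kernel void scale_vec(__global float *a)
    { a[get_global_id(0)] *= %(scale)f; }
    """ % {"scale": scale}
    return cl.Program(ctx, src).build()

prg = build_scaler(ctx, 3.0)  # compiled at runtime, specialized to scale = 3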

OpenCL <-> CUDA concept mapping:

OpenCL                         CUDA
------                         ----
Grid                           Grid
Work Group                     Block
Work Item                      Thread
__kernel                       __global__
__global                       __device__
__local                        __shared__
__private                      __local__
imageNd_t                      texture<type, N, ...>
barrier(CLK_LOCAL_MEM_FENCE)   __syncthreads()
get_local_id(0/1/2)            threadIdx.x/y/z
get_group_id(0/1/2)            blockIdx.x/y/z
get_global_id(0/1/2)           (reimplement by hand)
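
To make the dictionary concrete, here is a hedged sketch (my own example, not from the talk) of a kernel that touches several rows of the table, with the CUDA equivalents noted in comments:

kernel_src = """
__kernel void reverse_in_group(__global float *a, __local float *scratch)
{
    int lid = get_local_id(0);      // CUDA: threadIdx.x
    int gid = get_global_id(0);     // CUDA: blockIdx.x*blockDim.x + threadIdx.x (reimplemented)
    int n   = get_local_size(0);    // CUDA: blockDim.x

    scratch[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // CUDA: __syncthreads()
    a[gid] = scratch[n - 1 - lid];  // reverse the elements within each work group
}
"""

From PyOpenCL, the __local scratch buffer would be supplied at launch time as cl.LocalMemory(4 * group_size).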


PyOpenCL Coding:


Initializing and performing the computation:

import pyopencl as cl
import numpy

# Create a 256^3 element numpy array
a = numpy.random.rand(256**3).astype(numpy.float32)

# Create a context -> a living space for all the OpenCL objects
ctx = cl.create_some_context()

# For the given context, create a command queue -> contains details about the device and platform we live on
queue = cl.CommandQueue(ctx)

# For the given context, create a buffer -> ALLOCATE memory on the DEVICE,
# into which we transfer the data that needs to be processed.
a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)

# This will TRANSFER data from host to device USING the command queue (which is literally a queue of commands)
cl.enqueue_copy(queue, a_dev, a)

# Write the program for the computation we want to perform:
prg = cl.Program(ctx, """
	__kernel void twice(__global float *a)
	{ a[get_global_id(0)] *= 2; }
	""").build()

# Run the kernel using the queue: global size = a.shape (one work item per element), work-group size = (1,)
prg.twice(queue, a.shape, (1,), a_dev)

Extracting the Result from the Computation:

# Create an empty array with the same shape and dtype as a
result = numpy.empty_like(a)

# Read the result back to the host and wait for the transfer to finish
cl.enqueue_copy(queue, result, a_dev).wait()
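
A quick sanity check (my addition, not in the original notes): the kernel doubles every element on the device while the host copy of a stays untouched, so result should be exactly twice a:

# Verify the device really doubled every element
assert numpy.allclose(result, 2 * a)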

After proper index generation:

The indexing and call changes are as follows:

a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2;

prg.twice(queue, a.shape, (256,), a_dev)
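
Putting the pieces together, a hedged end-to-end sketch of this final version (256**3 is divisible by the work-group size of 256, so the grid divides evenly):

import numpy
import pyopencl as cl

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Allocate device memory and copy the input over
a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_copy(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

# 256**3 work items in total, 256 per work group
prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_copy(queue, result, a_dev).wait()
assert numpy.allclose(result, 2 * a)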