
Lecture 3 by Andreas Klöckner

Outline:

OpenCL implementation:

« Read the specification published by the Khronos Group for the standard implementation »

Note: If there is an asynchronous data transfer from host to device, make sure the Python GC doesn't free the host buffer before the operation has actually completed.
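A minimal sketch of what this means in PyOpenCL (buffer names and sizes are illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

host_buf = np.arange(1024, dtype=np.float32)
dev_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=host_buf.nbytes)

# Non-blocking copy: when this call returns, the transfer may still be in
# flight, so host_buf must stay referenced (not garbage-collected) until done.
evt = cl.enqueue_copy(queue, dev_buf, host_buf, is_blocking=False)
evt.wait()  # only after this is it safe to drop the last reference to host_buf
```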

Other important stuff:

Explicit & Implicit SIMD:

Implicit is more like the CUDA way; SIMD is implicit in the structure of the workgroup.

whereas,

Explicit is more like the Intel way; the structure, for example a vector data type, explicitly specifies the data and the instructions that work on it.
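A rough sketch of the difference, as two kernel-source strings (illustrative only, not from the lecture):

```python
# Implicit SIMD (CUDA-style): each work-item handles one scalar element;
# vectorization happens implicitly across the work-group / warp.
implicit_src = """
__kernel void scale(__global float *a, float c)
{ a[get_global_id(0)] *= c; }
"""

# Explicit SIMD (Intel/CPU-style): the float4 vector type spells out the data
# width, so each work-item explicitly processes four floats at once.
explicit_src = """
__kernel void scale4(__global float4 *a, float c)
{ a[get_global_id(0)] *= c; }
"""
```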

Kernel Attributes:

```c
__kernel __attribute__ ((...))
void foo( __global float4 *p ) { ... }
```

Autovectorize:

```c
__kernel __attribute__ ((vec_type_hint(float4)))
void foo( __global float4 *p ) { ... }
```
Enforcing workgroup size:

This helps with speed, because the compiler has extra information and can therefore optimize the computation better.

```c
__attribute__ ((reqd_work_group_size(X, Y, Z)))
```
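A small end-to-end sketch of how this could look in PyOpenCL (the kernel, sizes, and names here are made up for illustration, not from the lecture):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Hypothetical kernel: the attribute promises a fixed 64x1x1 work-group size,
# so the compiler can specialize (e.g. barriers, register allocation) for it.
src = """
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void scale(__global float *a, float c)
{ a[get_global_id(0)] *= c; }
"""
prg = cl.Program(ctx, src).build()

a = np.zeros(1024, dtype=np.float32)
a_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)

# The local size passed at launch must match the declared (64, 1, 1).
prg.scale(queue, a.shape, (64,), a_buf, np.float32(2.0))
```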

OpenCL implementation:

Key players (back in 2011):

Nvidia CL implementation:

Key points:

Code Writes Code:

OpenCL generalizes code over many types of devices; that is, one language across different vendors & devices.

Metaprogramming support:

3 possible ways to do it:

How are high-performance codes constructed?

The key is to play to the strengths of each programming environment.

Simple PyOpenCL stuff:

* meant to look and feel like a numpy array
* various other stuff :P

 Example: 

 ```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
numpy_array = np.arange(10, dtype=np.float32)

a_gpu = cl_array.to_device(queue, numpy_array)  # copy host array to device
a_doubled = (2 * a_gpu).get()                   # compute on device, fetch result
```

pyopencl.reduction contains optimizations for reduction operations such as norms, dot products, etc.

For Example:


```python
import numpy
from pyopencl.reduction import ReductionKernel

dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")

x_dot_y = dot(x, y).get()
```

Runtime Code Generation:

Via Substitution:

Example from the elementwise kernel (note: this was written for Python 2!):


```python
source = """
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned lid = get_local_id(0);
        unsigned gsize = get_global_size(0);
        unsigned work_item_start = get_local_size(0) * get_group_id(0);
        %(loop_prep)s;

        for (unsigned i = work_item_start + lid; i < n; i += gsize)
        {
            %(operation)s;
        }
    }
    """ % {
        "arguments": ",".join(arg.declarator() for arg in arguments),
        "operation": operation,
        "name": name,
        "loop_prep": loop_prep,
    }

prg = cl.Program(ctx, source).build()
```

Via mako:

This is a template engine:

```python

from mako.template import Template

tpl = Template("""
    __kernel void add(__global ${ type_name } *tgt /* ... further arguments elided */)
    {
        int idx = get_local_id(0) + ${ local_size };
    }
    """)

rendered_tpl = tpl.render(type_name="float", local_size=local_size)  # ... other parameters elided
```
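The rendered source can then presumably be built like any other kernel source; a one-line sketch, assuming ctx from earlier:

```python
prg = cl.Program(ctx, rendered_tpl).build()
```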

Via AST Generation:

This is basically using syntax tree generation.

« Example too complicated to render :P » « see lecture @ 36:30 »
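Since the lecture's example is too involved to reproduce, here is a toy sketch of the idea (entirely made up, not the lecture's code): build a small expression tree for the per-work-item operation, then emit OpenCL C source from it.

```python
# Toy AST nodes: variables/array references and binary operations.
class Var:
    def __init__(self, name):
        self.name = name
    def emit(self):
        return self.name

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def emit(self):
        return "(%s %s %s)" % (self.left.emit(), self.op, self.right.emit())

# a[i] = b[i] + c[i], built as a tree instead of pasted together as a string
body = BinOp("+", Var("b[i]"), Var("c[i]"))

source = """
__kernel void add(__global float *a,
                  __global const float *b,
                  __global const float *c)
{
    int i = get_global_id(0);
    a[i] = %s;
}
""" % body.emit()
```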

Template metaprogramming:

« mentioned at the end of the lecture »

Parallel Complexity:

* Let T_p be the time on p processors, and S_t the work done in step t (so T_1, the total work, is the sum of the S_t, and T_inf is the number of steps, i.e. the span).

If p > T_1/T_inf, adding more processors doesn't make sense: the speedup T_1/T_p can never exceed the parallelism T_1/T_inf. For example, summing n numbers as a binary tree has T_1 ≈ n and T_inf ≈ log2(n), so more than about n/log2(n) processors buys nothing.