
Lecture 3 by Andreas Klöckner

Outline:

OpenCL implementation:

« Read the specification published by the Khronos Group for the standard implementation »

Note: If there is an asynchronous data transfer from host to device, make sure the Python GC doesn't free the host buffer before the operation has actually completed.
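A minimal sketch of what this means in PyOpenCL (buffer names and sizes are illustrative):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

host_buf = np.arange(1024, dtype=np.float32)
dev_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=host_buf.nbytes)

# Non-blocking copy: when this call returns, the transfer may still be in
# flight, so host_buf must stay referenced (not garbage-collected) until done.
evt = cl.enqueue_copy(queue, dev_buf, host_buf, is_blocking=False)
evt.wait()  # only after this is it safe to drop the last reference to host_buf
```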

Other important stuff:

Explicit & Implicit SIMD:

Implicit is more like the CUDA way; SIMD is implicit in the structure of the workgroup.

whereas,

Explicit is more like the Intel way; the structure, for example a vector data type, explicitly specifies the data and the instructions that work on it.
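A rough sketch of the difference, as two kernel-source strings (illustrative only, not from the lecture):

```python
# Implicit SIMD (CUDA-style): each work-item handles one scalar element;
# vectorization happens implicitly across the work-group / warp.
implicit_src = """
__kernel void scale(__global float *a, float c)
{ a[get_global_id(0)] *= c; }
"""

# Explicit SIMD (Intel/CPU-style): the float4 vector type spells out the data
# width, so each work-item explicitly processes four floats at once.
explicit_src = """
__kernel void scale4(__global float4 *a, float c)
{ a[get_global_id(0)] *= c; }
"""
```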

Kernel Attributes:

```c
__kernel __attribute__ ((...))
void foo( __global float4 *p ) { ... }
```

Autovectorize:

```c
__kernel __attribute__ ((vec_type_hint(float4)))
void foo( __global float4 *p ) { ... }
```
Enforcing workgroup size:

This helps with speed, because the compiler has extra information and can therefore optimize the computation better.

```c
__attribute__ ((reqd_work_group_size(X, Y, Z)))
```
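A small end-to-end sketch of how this could look in PyOpenCL (the kernel, sizes, and names here are made up for illustration, not from the lecture):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Hypothetical kernel: the attribute promises a fixed 64x1x1 work-group size,
# so the compiler can specialize (e.g. barriers, register allocation) for it.
src = """
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void scale(__global float *a, float c)
{ a[get_global_id(0)] *= c; }
"""
prg = cl.Program(ctx, src).build()

a = np.zeros(1024, dtype=np.float32)
a_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)

# The local size passed at launch must match the declared (64, 1, 1).
prg.scale(queue, a.shape, (64,), a_buf, np.float32(2.0))
```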

OpenCL implementation:

Key players (back in 2011):

Nvidia CL implementation:

Key points:

Code Writes Code:

OpenCL generalizes code over many types of devices; that is, one language across different vendors & devices.

Metaprogramming support:

3 possible ways to do it:

How are high-performance codes constructed?

The key is to play to the strengths of each programming environment.

Simple PyOpenCL stuff:

* meant to look and feel like a numpy array
* various other stuff :P

 Example: 

 ```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
numpy_array = np.arange(10, dtype=np.float32)

a_gpu = cl_array.to_device(queue, numpy_array)  # copy host array to device
a_doubled = (2 * a_gpu).get()                   # compute on device, fetch result
```

pyopencl.reduction contains optimizations for reduction operations such as norms, dot products, etc.

For Example:


```python
import numpy
from pyopencl.reduction import ReductionKernel

dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")

x_dot_y = dot(x, y).get()
```

Runtime Code Generation:

Via Substitution:

Example from the elementwise kernel (note: this was written for Python 2!):


```python
source = """
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned lid = get_local_id(0);
        unsigned gsize = get_global_size(0);
        unsigned work_item_start = get_local_size(0) * get_group_id(0);
        %(loop_prep)s;

        for (unsigned i = work_item_start + lid; i < n; i += gsize)
        {
            %(operation)s;
        }
    }
    """ % {
        "arguments": ",".join(arg.declarator() for arg in arguments),
        "operation": operation,
        "name": name,
        "loop_prep": loop_prep,
    }

prg = cl.Program(ctx, source).build()
```

Via mako:

This is a template engine:

```python

from mako.template import Template

tpl = Template("""
    __kernel void add(__global ${ type_name } *tgt /* ... further arguments elided */)
    {
        int idx = get_local_id(0) + ${ local_size };
    }
    """)

rendered_tpl = tpl.render(type_name="float", local_size=local_size)  # ... other parameters elided
```
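The rendered source can then presumably be built like any other kernel source; a one-line sketch, assuming ctx from earlier:

```python
prg = cl.Program(ctx, rendered_tpl).build()
```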

Via AST Generation:

This is basically using syntax tree generation.

« Example too complicated to render :P » « see lecture @ 36:30 »
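Since the lecture's example is too involved to reproduce, here is a toy sketch of the idea (entirely made up, not the lecture's code): build a small expression tree for the per-work-item operation, then emit OpenCL C source from it.

```python
# Toy AST nodes: variables/array references and binary operations.
class Var:
    def __init__(self, name):
        self.name = name
    def emit(self):
        return self.name

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def emit(self):
        return "(%s %s %s)" % (self.left.emit(), self.op, self.right.emit())

# a[i] = b[i] + c[i], built as a tree instead of pasted together as a string
body = BinOp("+", Var("b[i]"), Var("c[i]"))

source = """
__kernel void add(__global float *a,
                  __global const float *b,
                  __global const float *c)
{
    int i = get_global_id(0);
    a[i] = %s;
}
""" % body.emit()
```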

Template metaprogramming:

« mentioned at the end of the lecture »

Parallel Complexity:

* Let T_p be the time on p processors, and S_t the work done in step t (so T_1, the total work, is the sum of the S_t, and T_inf is the number of steps, i.e. the span).

If p > T_1/T_inf, adding more processors doesn't make sense: the speedup T_1/T_p can never exceed the parallelism T_1/T_inf. For example, summing n numbers as a binary tree has T_1 ≈ n and T_inf ≈ log2(n), so more than about n/log2(n) processors buys nothing.