« Read the specification published by the Khronos Group for the standard implementation »
Note: If there is an asynchronous (ASYNC) data transfer from host to device, make sure the Python GC doesn't free the host buffer before the operation has actually completed.
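A minimal sketch of one way to follow the note above: hold a strong reference to the host buffer until the transfer's event has been waited on. Everything here is a pure-Python stand-in so it runs without a device; FakeEvent, HostBuffer and PendingTransfers are hypothetical names, not PyOpenCL classes.

```python
import gc
import weakref


class FakeEvent:
    """Hypothetical stand-in for the event an asynchronous copy returns."""

    def __init__(self):
        self.done = False

    def wait(self):
        # A real event would block here until the device finishes the copy.
        self.done = True


class HostBuffer:
    """Hypothetical stand-in for a host array (it supports weak references)."""

    def __init__(self, nbytes):
        self.data = bytearray(nbytes)


class PendingTransfers:
    """Keeps each host buffer alive until its event has completed."""

    def __init__(self):
        self._pending = []

    def track(self, event, host_buffer):
        self._pending.append((event, host_buffer))
        return event

    def finish(self):
        for event, _host_buffer in self._pending:
            event.wait()
        self._pending.clear()  # only now may the buffers be collected


pending = PendingTransfers()
buf = HostBuffer(1024)
probe = weakref.ref(buf)                 # observes whether buf is still alive
event = pending.track(FakeEvent(), buf)

del buf                                  # drop our own name for the buffer
gc.collect()
alive_during_transfer = probe() is not None   # still held by `pending`

pending.finish()
gc.collect()
alive_after_wait = probe() is not None        # released once the wait is over
```

With real PyOpenCL the same idea applies: keep the host array referenced somewhere until the event returned by the non-blocking transfer has completed.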
Implicit is more like the CUDA way; SIMD is implicit in the structure of the workgroup.
Explicit is more like the Intel way; the code itself, for example through vector data types, explicitly names the data and the instructions that operate on it.
__kernel __attribute__ ((...))
void foo( __global float4 *p ) { ... }
__kernel __attribute__ ((vec_type_hint ( float4 )))
void foo( __global float4 *p ) { ... }
This can help performance: the hint tells the compiler the kernel's preferred vector width, extra information it can use when auto-vectorizing the code.
__attribute__ ((reqd_work_group_size(X, Y, Z)))
Key players (back in 2011):
OpenCL generalizes code over many types of devices; that is, one language across different vendors & devices.
3 possible ways to do it:
The key is to play to the strengths of each programming environment.
* meant to look and feel like a numpy array
* various other stuff :P
import pyopencl.array as cl_array
# Recent PyOpenCL versions take only the queue here; the older API also took ctx.
a_gpu = cl_array.to_device(queue, numpy_array)
a_doubled = (2*a_gpu).get()
contains optimizations for reduction operations such as norms, etc.
For example:
from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")
x_dot_y = dot(x, y).get()
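To make the roles of map_expr, reduce_expr and neutral concrete, here is a pure-Python model of the same pattern. It is only a sketch of the semantics (map each index, then fold the results pairwise starting from the neutral element), not how the kernel is actually scheduled on the device; reduction_sketch is a made-up helper name.

```python
from functools import reduce


def reduction_sketch(map_expr, reduce_expr, neutral, *vectors):
    """Map over every index i, then fold the mapped values starting from neutral."""
    n = len(vectors[0])
    mapped = (map_expr(i, *vectors) for i in range(n))
    return reduce(reduce_expr, mapped, neutral)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Mirrors map_expr="x[i]*y[i]", reduce_expr="a+b", neutral="0" from the notes.
x_dot_y = reduction_sketch(
    lambda i, x, y: x[i] * y[i],   # map step
    lambda a, b: a + b,            # reduce step
    0.0,                           # neutral element of the reduce step
    x, y,
)

# A 2-norm is the same pattern with a different map step and a final square root.
norm_x = reduction_sketch(lambda i, x: x[i] * x[i], lambda a, b: a + b, 0.0, x) ** 0.5
```

The neutral element matters because the device combines partial results in an unspecified grouping, so the reduce step should be associative and the neutral value must not change the result.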
Example from elementwise kernel (NOTE this was for python2!!!):
# Braces and the placement of %(loop_prep)s / %(operations)s are assumed from the dictionary keys below.
source = ("""
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned lid = get_local_id(0);
        unsigned gsize = get_global_size(0);
        unsigned work_item_start = get_local_size(0) * get_group_id(0);
        %(loop_prep)s;
        for (unsigned i = work_item_start + lid; i < n; i += gsize)
        {
            %(operations)s;
        }
    }
    """ % {
        "arguments": ", ".join(arg.declarator() for arg in arguments),
        "operations": operations,
        "name": name,
        "loop_prep": loop_prep,
    })
prg = cl.Program(ctx, source).build()
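To check that the loop in the kernel template above touches every element exactly once, here is a pure-Python model of the indexing (no OpenCL needed). visited_indices and the sizes used are made up for illustration; the arithmetic copies the kernel's lid, gsize and work_item_start.

```python
def visited_indices(n, local_size, num_groups):
    """Return, for each (group_id, lid) work-item, the element indices it processes."""
    gsize = local_size * num_groups                 # get_global_size(0)
    visits = {}
    for group_id in range(num_groups):              # get_group_id(0)
        work_item_start = local_size * group_id     # get_local_size(0) * get_group_id(0)
        for lid in range(local_size):               # get_local_id(0)
            start = work_item_start + lid
            # for (i = start; i < n; i += gsize)
            visits[(group_id, lid)] = list(range(start, n, gsize))
    return visits


visits = visited_indices(n=10, local_size=2, num_groups=2)
all_indices = sorted(i for idxs in visits.values() for i in idxs)
```

Because each work-item strides by the global size, the kernel works for any n, including n larger than the number of work-items launched.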
This is a template engine:
from mako.template import Template
tpl = Template("""
    __kernel void add( __global ${ type_name } *tgt, … )
    { int idx = get_local_id(0) + ${ local_size }; }""")
rendered_tpl = tpl.render(type_name="float", local_size=local_size, … )
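The same substitution idea can be tried without installing Mako. The sketch below swaps in string.Template from the Python standard library as a stand-in; it only replaces ${name} placeholders and has none of Mako's loops or embedded Python expressions. The single-argument kernel signature and the value 64 are assumptions for illustration.

```python
from string import Template

# string.Template is a stdlib stand-in here, NOT Mako: it only substitutes names.
tpl = Template("""
__kernel void add(__global ${type_name} *tgt)
{
    int idx = get_local_id(0) + ${local_size};
}
""")

# The rendered string is ordinary OpenCL C source, ready to hand to a program build.
rendered = tpl.substitute(type_name="float", local_size=64)
```

The point of the technique is that the element type and sizes are baked into the source text at run time, so one template can generate many specialised kernels.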
This is basically using syntax tree generation.
« Example too complicated to render :P » « see lec @ 36:30 »
« mentioned at the end of the lecture »
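*Let the time on P processors be Tp and the work per step be St.
If P > T1/T_inf, adding processors doesn't make sense: T1/T_inf is the available parallelism, so the extra processors would have nothing to do.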
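A small numeric sketch of the P > T1/T_inf rule, using the standard work and span lower bounds Tp >= T1/P and Tp >= T_inf. The figures T1 = 1024 and T_inf = 16 are invented for illustration, and the function names are hypothetical.

```python
def parallelism(t1, t_inf):
    """Available parallelism: total work divided by the critical-path length."""
    return t1 / t_inf


def lower_bound_time(t1, t_inf, p):
    """Best possible Tp on p processors: max of the work bound and the span bound."""
    return max(t1 / p, t_inf)


t1, t_inf = 1024.0, 16.0          # assumed example values
par = parallelism(t1, t_inf)      # processors beyond this count cannot help

# Lower bound on running time for a few processor counts.
bounds = {p: lower_bound_time(t1, t_inf, p) for p in (8, 64, 256)}
```

In this example the bound stops improving once P reaches T1/T_inf = 64: going from 64 to 256 processors leaves it at T_inf.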