Rohit Beriwal

Lecture 2 by Andreas Klöckner

PASI: The challenge of Massive Parallelism.

Outline:

OpenCL Runtime:

In-depth view of OpenCL:

CL consists of two parts:

* Host-side: a C API, exposed by ```libOpenCL.so```; from Python it is accessed via ```import pyopencl as cl```
* Device-side: a dialect of C99

### Everything about dataflow:   

<< UML data flow diagram @3:24 >>

* The main part is the **context**, which is obtained from a device, which in turn belongs to a particular platform.
* For a given context, we create a **command queue**.
* Within the context we create **memory objects** (buffers) to allocate and deallocate data, and **programs** that will run on the GPU; the command queue is used to execute operations on them.

### CL - Components:

#### Platforms:
   
* A platform is a collection of devices from the **same** vendor.
* Multiple platforms can be used from one program via the ICD (Installable Client Driver) mechanism.

#### Compute Devices:
   
* A device needs to have a processor **with** an interface to off-chip memory.
* Anything that fits the programming model qualifies, i.e. anything that can follow the grid-block-thread system to process data.

#### Context:

* For a given device we create a context:

```python
context = cl.Context(devices=None | [dev1, dev2], dev_type=None)
context = cl.create_some_context(interactive=True)
```

* ```dev_type``` is one of ```DEFAULT```, ```ALL```, ```CPU```, ```GPU``` (members of ```cl.device_type```).


* Can span **one or more** devices.
* Created from device type or list of devices
   

#### Command Queue:

* It's a mediator between host and device, since both entities run asynchronously.
* Each command/operation gets queued in the command queue and is eventually executed.
* We can have more than one queue per device.
* By default a queue serializes all the operations submitted to it, so if one queue is not fully utilizing the GPU, another queue can run in parallel.
* Other properties can be set for a given queue, such as enabling ```OUT_OF_ORDER_EXECUTION``` or profiling.

* **One device per queue, but multiple queues per device**
* **Note: the CUDA counterpart of a command queue is called a STREAM**
* CUDA creates the context from the device automatically.

#### Command queue and Events:

* Every time we enqueue something, it returns an EVENT.
* An event is an identifier for the time span of the given operation.
* Hence, for host-device synchronization, an EVENT has the method ```evt.wait()```.
* We can also wait for multiple events at the same time (```cl.wait_for_events([...])```).
* All the entities in the command queue can also wait for each other; this is done with a keyword argument like:

```python
event = cl.enqueue_XXX(queue, wait_for=[evt1, evt2, ....] )
```

#### Capturing Data dependencies: Avoid deadlock:

The wait mechanisms above ensure that different operations can run in parallel without creating a deadlock, because one event can wait for a prior event to complete. Hence "out of order mode" can be enabled for a given queue safely.

Data dependencies are captured as a DAG (directed acyclic graph). This comes in handy when different data streams run on different GPUs and data must be transferred between them, or between MPI ranks.


#### Profiling:

* This is done by enabling profiling for a given queue.
* This results in the EVENT object carrying timestamps:

```python
event.profile.queued, event.profile.submit, event.profile.start, event.profile.end
```

There are also artificial events, MARKERs, which can be enqueued to obtain timestamps at particular points in the queue and so isolate the time of a particular operation.

#### Memory Objects: Buffers:

```python
buf = cl.Buffer(context, flags, size=0, hostbuf=None)
```

Flags (members of ```cl.mem_flags```):

* ```READ_ONLY``` / ```WRITE_ONLY``` / ```READ_WRITE```
* ```{ALLOC, COPY, USE}_HOST_PTR```

#### Programs and Kernels:

```python
prg = cl.Program(context, src)
```

* A program can also be created from binary.
* Kernel arguments (in addition to the above):

Note: explicitly sized scalars are error-prone.

Solution/Better:

```python
kernel.set_scalar_arg_dtypes([numpy.int32, None, numpy.float32])
# Using None for non-scalars
```

#### CL Device Language:

It is C99 with changes such as added address-space and kernel qualifiers (```__global```, ```__local```, ```__kernel```), vector data types, and built-in functions; recursion and function pointers are not allowed.

#### CL vector data types:

* About: types such as ```float2```, ```float4```, ```int8``` (lengths 2, 3, 4, 8, 16).
* Usage: components are accessed via ```.x/.y/.z/.w``` or ```.s0```...```.sF```; arithmetic works componentwise.

#### Concurrency and Synchronization:

Concurrency & synchronization between different levels:

* Intra-group: ```barrier(CLK_{LOCAL, GLOBAL}_MEM_FENCE)``` inside the kernel.
* Inter-group: only at kernel boundaries, i.e. by launching a new kernel.
* CPU - GPU: command queues and events (see above).

Miscellaneous:

* There are also atomic operations (e.g. ```atomic_add``` on global or local memory).

#### Extensions: Future-proof CL:

Two mechanisms:

* Runtime: query the supported extensions, e.g. via ```device.extensions```.
* Device language: enable an extension in the kernel source using:

```c
#pragma OPENCL EXTENSION <name> : enable
```

An important extension is ```cl_khr_fp64```: the Khronos ("khr") extension that enables double-precision floating point.