
Lecture 1 (PyCUDA crash course):

Importing and initializing PyCUDA:

import pycuda.driver as cuda
import pycuda.autoinit   # automatically initializes CUDA and creates a context

Import the CUDA compiler:

from pycuda.compiler import SourceModule

Some data, say a numpy array:

import numpy as np

# Create a numpy array of float32 values
a = np.random.rand(16)
a = a.astype(np.float32)

Memory allocation on GPU:

a_gpu = cuda.mem_alloc(a.nbytes)

Copy data to GPU (htod: Host To Device):

# memcpy_htod(dest, src): destination (device) on the left, source (host) on the right
cuda.memcpy_htod(a_gpu, a)

Write the CUDA C kernel that will be executed on the GPU:

module = SourceModule("""
    __global__ void double_array(float *a){
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        a[idx] *= 2;
    }
    """)

Here:

threadIdx.x is the index of a thread within its block, blockIdx.x is the index of the block within the grid, and blockDim.x is the number of threads per block. Together they give each thread a unique global index idx, so the thread at idx works on element a[idx] (e.g. thread 7 of block 0, with 16 threads per block, gets idx = 0 * 16 + 7 = 7).

Underlying concept:

We can select the number of processing elements by defining the number of threads per block (and blocks per grid) when launching the kernel. Different elements of the input array are then automatically mapped to different threads, giving data parallelism in processing. The sketch below shows how this extends to an array larger than a single block.
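As a rough sketch (not part of the lecture), here is the same idea for an array that does not fit in one block: compute the number of blocks from the array length and add a bounds check in the kernel, since the last block may have spare threads. The length of 1000, the block size of 256, and the extra int n kernel parameter are illustrative choices, not from the original code.

import numpy as np
import pycuda.autoinit            # creates a CUDA context on the default device
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

n = 1000                                    # illustrative size: needs more than one block
a = np.random.rand(n).astype(np.float32)

# Kernel takes the array length and guards the access,
# because the last block may have more threads than remaining elements.
module = SourceModule("""
    __global__ void double_array(float *a, int n){
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            a[idx] *= 2;
    }
    """)
double_array = module.get_function("double_array")

a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

threads_per_block = 256                                        # illustrative choice
blocks = (n + threads_per_block - 1) // threads_per_block      # ceil(1000 / 256) = 4
double_array(a_gpu, np.int32(n),
             block=(threads_per_block, 1, 1), grid=(blocks, 1, 1))

result = np.empty_like(a)
cuda.memcpy_dtoh(result, a_gpu)
assert np.allclose(result, 2 * a)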

Getting the Function from the kernel:

# The string argument is the kernel function name as defined in the SourceModule above
function = module.get_function("double_array")

Calling / Executing the kernel:

# function(<kernel arguments, e.g. GPU memory locations>, block=(x, y, z), grid=(a, b, c))
function(a_gpu, block=(16, 1, 1), grid=(1, 1, 1))

Placeholder for the result:

# Empty array with the same shape and dtype as a
a_final = np.empty_like(a)

Copying data back from the GPU (dtoh: Device To Host):

cuda.memcpy_dtoh(a_final, a_gpu)
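
As a quick sanity check (just a sketch, assuming the steps above ran in order), every element of a_final should now be twice the corresponding element of a:

# Verify on the host that the kernel doubled every element
assert np.allclose(a_final, 2 * a)
print(a_final)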