Introduction to OpenCL
What is OpenCL (personal definition)
Low-level language for high-performance heterogeneous data-parallel computation
- Low-level: manual memory management and parallelization. You choose the global dimensions and allocate data
- Language: a framework with C-like computation kernels. Not really a language
- High-performance: if your algorithm is a good fit for the hardware, e.g., data-parallel
- Heterogeneous: code is portable, but performance is not. Different vendors/versions require different optimizations
- Data-parallel: hardware and software only support data-parallelism. Task-parallel support is limited on GPUs today.
Why Limited Synchronization?
- Scales well in hardware
- Only work-items within a work-group need to communicate
- GPUs run 32-128 work-groups (groups of threads) in parallel
Global Synchronization
- OpenCL only supports global synchronization at the end of a kernel execution
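As an illustration of what kernel-end synchronization means in practice, here is a minimal plain-C sketch (not OpenCL code) of a sum split into two passes. The function names `partial_sums_pass` and `final_sum_pass` are hypothetical; each function stands in for one kernel launch, and the return of the first call plays the role of the global synchronization point. The sum itself is only an example workload chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Pass 1 (stands in for the first kernel launch): each work-group sums
   its own slice of the input. Groups never read each other's results. */
void partial_sums_pass(const float *in, size_t n, size_t group_size,
                       float *partial)
{
    size_t num_groups, g, l;
    assert(group_size > 0);
    num_groups = (n + group_size - 1) / group_size;
    for (g = 0; g < num_groups; ++g) {      /* groups are independent */
        float acc = 0.0f;
        for (l = 0; l < group_size; ++l) {  /* work-items of one group */
            size_t i = g * group_size + l;
            if (i < n) {
                acc += in[i];
            }
        }
        partial[g] = acc;
    }
}

/* Pass 2 (stands in for a second kernel launch): it may combine results
   from all groups only because pass 1 has completely finished. */
float final_sum_pass(const float *partial, size_t num_groups)
{
    float total = 0.0f;
    size_t g;
    for (g = 0; g < num_groups; ++g) {
        total += partial[g];
    }
    return total;
}
```

Calling `partial_sums_pass` and then `final_sum_pass` mirrors enqueueing two kernels back to back: anything that needs data from every work-group is moved into the second launch.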
Utility Functions
- Information about each work-item
- get_global_id(dim)
  - current work-item's ID in a particular dimension
  - example: each work-item processes the element at its own ID
    int id = get_global_id(0);
    data[id] = sin(data[id]);
- get_work_dim()
  - number of global dimensions in use
- get_global_size(dim)
  - number of global work-items in a particular dimension
- get_local_size(dim), get_local_id(dim), get_num_groups(dim), get_group_id(dim)
  - information about the local dimensions
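The relationship between these values can be sketched in plain C. This is not OpenCL code: `WorkItemInfo` and `work_item_info` are hypothetical names that only model what the built-ins would report for one work-item in a 1-D launch, assuming no global offset and a global size that is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of what the built-ins would return for one
   work-item in dimension 0. Not part of the OpenCL API. */
typedef struct {
    size_t global_id;   /* get_global_id(0)   */
    size_t local_id;    /* get_local_id(0)    */
    size_t group_id;    /* get_group_id(0)    */
    size_t global_size; /* get_global_size(0) */
    size_t local_size;  /* get_local_size(0)  */
    size_t num_groups;  /* get_num_groups(0)  */
} WorkItemInfo;

/* Model the IDs of the work-item at position local_id inside work-group
   group_id. Assumes zero global offset and an evenly divided range. */
WorkItemInfo work_item_info(size_t global_size, size_t local_size,
                            size_t group_id, size_t local_id)
{
    WorkItemInfo w;
    assert(local_size > 0 && global_size % local_size == 0);
    assert(group_id < global_size / local_size);
    assert(local_id < local_size);
    w.global_size = global_size;
    w.local_size = local_size;
    w.num_groups = global_size / local_size;
    w.group_id = group_id;
    w.local_id = local_id;
    /* The global ID is the group's starting index plus the local ID. */
    w.global_id = group_id * local_size + local_id;
    return w;
}
```

For example, with 1024 work-items in groups of 64, the work-item with local ID 5 in group 3 has global ID 3 * 64 + 5 = 197.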
What do we need to change to use the GPU?
All we have to do is change the CL_DEVICE_TYPE value we request when we get the device, so that we get a GPU-type device (CL_DEVICE_TYPE_GPU). The rest of the code will then use the GPU.
However, if we have a complicated kernel, it will run on both the CPU and the GPU, but we won't get optimal performance.
To get optimal performance, we should tune the kernel code to the target device, including the number of cores available in the machine.
OpenCL programming model
- Setup
  - Get the devices (and platform)
    Choose which device to use.
  - Create a context (for sharing between devices)
    This chooses where the data computation will be processed.
  - Create command queues (for submitting work)
    Choose how work is assigned to the selected device. The queue is what submits the work.
- Compilation
  - Create a program
  - Build the program (compile)
  - Create kernels
    The created kernels are later assigned to a command queue.
- Create memory objects
- Enqueue writes to copy data to the GPU
  This copies the data from the CPU over to the GPU.
- Set the kernel arguments
- Enqueue kernel executions
  The kernel is passed to the command queue to be processed.
- Enqueue reads to copy data back from the GPU
- Wait for your commands to finish
  But why do we have to wait?
  Because OpenCL is asynchronous.
  When we enqueue a command we have no idea when it will finish (it's just in the queue).
  By explicitly waiting we make sure it is finished before continuing.
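The effect of asynchronous enqueueing can be shown with a toy model in plain C. This is not the OpenCL API: `ToyQueue`, `toy_enqueue`, and `toy_finish` are hypothetical names, and the model deliberately defers all work until the wait, which is the extreme case. A real runtime may start a command at any point between the enqueue and the wait (in OpenCL, for example, a call to clFinish), which is exactly why the host cannot rely on results before waiting.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_CAPACITY 8

/* A command is just a function applied to the data buffer. */
typedef void (*ToyCommand)(float *data, size_t n);

/* Hypothetical stand-in for a command queue. */
typedef struct {
    ToyCommand commands[TOY_QUEUE_CAPACITY];
    size_t count;
} ToyQueue;

void toy_queue_init(ToyQueue *q)
{
    q->count = 0;
}

/* Enqueue only records the command; nothing is executed here. */
void toy_enqueue(ToyQueue *q, ToyCommand cmd)
{
    assert(q->count < TOY_QUEUE_CAPACITY);
    q->commands[q->count++] = cmd;
}

/* Stand-in for waiting: run every pending command in order and
   return only when the queue is empty. */
void toy_finish(ToyQueue *q, float *data, size_t n)
{
    size_t i;
    for (i = 0; i < q->count; ++i) {
        q->commands[i](data, n);
    }
    q->count = 0;
}

/* Two example commands for the toy queue. */
void toy_double(float *data, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        data[i] *= 2.0f;
    }
}

void toy_add_one(float *data, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        data[i] += 1.0f;
    }
}
```

In this model, after enqueueing `toy_double` and `toy_add_one` the buffer still holds its old values; only after `toy_finish` returns is it safe to read the results.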