
Multi-device OpenCL kernel load balancer and pipeliner API for C#. Uses a shared-distributed memory model to keep GPUs updated quickly while running the same kernel on all devices (for simplicity).

tugrul512bit/Cekirdekler


Cekirdekler

A simple C# multi-device GPGPU (OpenCL) compute API with iterative inter-device load balancing and pipelined read/write/compute operations for developers' custom OpenCL kernels.

64-bit only: set "Project Settings -> Build -> Platform Target -> x64", and make sure the Configuration Manager targets x64 as well.

Requires an extra C++ DLL built for 64-bit (x86_64) from https://github.com/tugrul512bit/CekirdeklerCPP, which must be named KutuphaneCL.dll.

The other required DLL is Microsoft's System.Threading.dll (and its XML helper) for .NET 2.0. Alternatively, you can adjust the "using" directives and target .NET 3.5+ in your own project, in which case System.Threading.dll is not needed.

In total, Cekirdekler.dll, KutuphaneCL.dll, and targeting .NET 3.5 should be enough.

Usage: add only Cekirdekler.dll and System.Threading.dll as references to your C# projects. The other files need to exist in the same folder as Cekirdekler.dll or the main project's executable.

This project is being enhanced using ZenHub.

Features

  • Implicit multi-device control: from CPUs to any number of GPUs and accelerators. Control is explicit on the library side for compatibility and performance, but implicit for the client coder, who can concentrate on the OpenCL kernel code. Devices can be selected implicitly for ease of setup or explicitly through detailed device queries.
  • Iterative load balancing between devices: done uniquely for each different compute (explicit control with a user-given compute id). Work loads across multiple devices become fairer with each iteration until the distribution ratio converges. Partitioning the workload completes a kernel with less latency, which suits hot-spot loops and some simple embarrassingly parallel algorithms, and works even better for streaming data with the pipelining option enabled.
  • Pipelining for reads, computes and writes (host-device link): either by the mercy of device drivers or by explicit event-based queue management. Hides the latency of the least time-consuming part (such as writes) behind the most time-consuming part (such as compute). GPUs can run buffer copies and OpenCL kernels concurrently.
  • Pipelining between devices (device-host-device): concurrently runs multiple stages so they overlap in the timeline, gaining an advantage from multiple GPUs (and FPGAs, CPUs) even for kernels of a time-consuming pipeline that are non-separable (because of atomics and low-level optimizations). Each device runs a different kernel, but at the same time as the other devices, and uses double buffers to overlap even the data movements between pipeline stages.
  • Works with different numeric arrays: either C# arrays such as float[], int[], byte[], ... or C++ array wrappers such as ClFloatArray, ClArray<float>, ClByteArray, ClArray<byte>.
  • Automatic buffer copy optimizations per device: if a device shares RAM with the CPU, it uses map/unmap commands instead of read/write to reduce the number of array copies. If that device is also given a C++ wrapper array (such as ClArray<float>), it additionally uses the cl_use_host_ptr flag on the buffer for zero-copy access, aka "streaming". By default, all devices have their own buffers.
  • Two usage styles: the first lets the developer declare all kernel parameters as arrays for a more explicitly readable execution; the second creates the same thing with a much shorter definition, taking fewer lines of code and changing only the necessary flags instead of all of them.
  • Automatic resource disposal: when C++ array wrappers are finalized (out of scope, garbage collected), they release their resources. The dispose method can also be called explicitly by the developer.
  • Uses OpenCL 1.2: built on the C++ bindings from Khronos.org.
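The resource-disposal behavior described above can be sketched as follows. This is a minimal illustration, assuming the wrapper exposes an explicit dispose method as stated; the exact member name and casing may differ in your Cekirdekler version:

```csharp
// Sketch: C++ array wrappers release their native resources when
// garbage-collected, or earlier via an explicit dispose call.
Cekirdekler.ClArrays.ClArray<float> data =
    new Cekirdekler.ClArrays.ClArray<float>(1024); // native-side allocation

// ... use 'data' with a ClNumberCruncher's compute() here ...

data.dispose(); // assumed explicit release; do not use 'data' afterwards
// Without the explicit call, finalization at GC time frees the resources.
```

Calling dispose eagerly keeps device/native memory usage predictable instead of waiting for a nondeterministic garbage-collection pass.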

Documentation

You can see details and a tutorial in the Cekirdekler wiki.

Known Issues

  • For C++ array wrappers such as ClArray<float> there is no out-of-bounds check; don't cross boundaries when indexing into the array.
  • Don't use C++ array wrappers after they are disposed. These checks are omitted to keep array indexing fast.
  • Don't use ClNumberCruncher or Core instances after they are disposed.
  • Pay attention to the "number of array elements used" per workitem in the kernel and how it is given as a parameter to the API's compute() method.
  • Pay attention to the "partial read"/"read"/"write" array copy modifiers when your kernel alters (or reads) the whole array versus just a part of it.
  • No performance output on the first iteration. The load balancer needs at least several iterations to distribute work fairly, and the performance report needs at least 2 iterations before producing console output.

Example

        Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
            Cekirdekler.AcceleratorType.GPU, @"
                __kernel void hello(__global char * arr)
                {
                    printf(""hello world"");
                }
            ");

        Cekirdekler.ClArrays.ClArray<byte> array = new Cekirdekler.ClArrays.ClArray<byte>(1000);
        array.compute(cr, 1, "hello", 1000, 100); // compute id = 1 (used by the load balancer)
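Building on the same call pattern, a kernel that actually writes into the array can be sketched as below. Only the API names shown in the example above are used; the meaning of the last two compute() arguments (here 1000 and 100) mirrors the hello-world call and should be treated as an assumption, not a documented signature:

```csharp
// Hedged sketch: each workitem writes one element of the array.
Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
    Cekirdekler.AcceleratorType.GPU, @"
        __kernel void fill(__global float * arr)
        {
            int i = get_global_id(0);
            arr[i] = 3.14f; // one element per workitem
        }
    ");

Cekirdekler.ClArrays.ClArray<float> array =
    new Cekirdekler.ClArrays.ClArray<float>(1000);

// Same parameter pattern as the hello-world example:
// compute id = 1, kernel name, then the two size arguments.
array.compute(cr, 1, "fill", 1000, 100);
```

Because the load balancer is iterative (see Features), repeating this compute() call with the same compute id lets the work distribution across devices converge over several iterations.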
