Multi-device OpenCL kernel load balancer and pipeliner API for C#. It uses a shared-distributed memory model to keep GPUs updated quickly while running the same kernel on all devices (for simplicity).

tugrul512bit/Cekirdekler

Cekirdekler

A very simple C# multi-device GPGPU (OpenCL) compute API. It offers iterative inter-device load balancing and multi-level pipelining of read/write/compute operations for developers' custom OpenCL kernels.

64-bit only: set "project settings -> build -> platform target -> x64". The configuration manager must be set to x64 as well.

Needs an extra C++ DLL built in 64-bit (x86_64) from https://github.com/tugrul512bit/CekirdeklerCPP, which must be named KutuphaneCL.dll.

The other required DLL is Microsoft's System.Threading.dll, together with its XML helper, for .NET 2.0. Alternatively, adjust the "using" directives and target .NET 3.5+ in your own project, in which case System.Threading.dll is not needed.

In total, Cekirdekler.dll and KutuphaneCL.dll with .NET 3.5 should be enough.

Usage: add only Cekirdekler.dll and System.Threading.dll as references in your C# project. The other files need to be in the same folder as Cekirdekler.dll or the executable of the main project.

This project is being enhanced using ZenHub.

Features

  • Implicit multi-device control: from CPUs to any number of GPUs and accelerators. Device handling is explicit on the library side for compatibility and performance, and implicit for the client coder, so that GPGPU work can concentrate on the OpenCL kernel code. Devices can be selected implicitly for ease of setup or explicitly through a detailed device query.
  • Iterative load balancing between devices: done separately for each distinct compute (explicitly controlled by a user-given compute id). Devices receive increasingly fair workloads until the work distribution ratio converges. Partitioning the workload completes a kernel with less latency, which suits hot-spot loops and simple embarrassingly parallel algorithms. It works even better for streaming data with the pipelining option enabled.
  • Pipelining for reads, computes and writes (host - device link): either left to the device drivers or done with explicit event-based queue management. Hides the latency of the less time-consuming parts (such as writes) behind the most time-consuming part (such as compute). GPUs can run buffer copies and OpenCL kernels concurrently.
  • Pipelining between devices (device - host - device): runs multiple stages concurrently so that they overlap in the timeline, taking advantage of multiple GPUs (and FPGAs, CPUs) even for kernels of a time-consuming pipeline that are non-separable (because of atomics and low-level optimizations). Each device runs a different kernel at the same time as the other devices, and double buffering overlaps even the data movement between pipeline stages.
  • Working with different numeric arrays: either C# arrays like float[], int[], byte[], ... or C++ array wrappers like ClFloatArray, ClArray<float>, ClByteArray, ClArray<byte>.
  • Automatic buffer copy optimizations for devices: if a device shares RAM with the CPU, it uses map/unmap commands (instead of read/write) to reduce the number of array copies. If that device is also given a C++ wrapper array (such as ClArray<float>), it additionally sets the CL_MEM_USE_HOST_PTR flag on the buffer for zero-copy access, also known as "streaming". By default, all devices have their own buffers.
  • Two different usage types: the first lets the developer specify all kernel parameters as arrays explicitly, which makes the execution easier to read. The second builds the same thing from a much shorter definition, taking fewer lines of code and changing only the necessary flags instead of all of them.
  • Automatic resource disposal: when C++ array wrappers are finalized (out of scope, garbage collected), they release their resources. The dispose method can also be called explicitly by the developer.
  • Uses OpenCL 1.2: built on the C++ bindings from Khronos.org.
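
The iterative load balancing idea from the feature list can be pictured with a small standalone sketch. It is only an illustration under stated assumptions, not the library's actual algorithm: the class `LoadBalanceSketch`, the method `Rebalance`, the 50% damping factor and the simulated device speeds are all hypothetical. Each iteration measures how long every device took for its share, then moves the shares halfway toward a throughput-proportional split, keeping each share a multiple of the work-group size.

```csharp
using System;
using System.Linq;

// Hypothetical illustration; not Cekirdekler's internal balancing code.
public static class LoadBalanceSketch
{
    // Computes new per-device work-item counts from the last measured run times.
    // Assumes shares sum to globalRange, globalRange is a multiple of localRange,
    // and there are at least as many work groups as devices.
    public static int[] Rebalance(int[] shares, double[] seconds, int globalRange, int localRange)
    {
        // Measured throughput of each device, in work items per second.
        double[] speed = shares.Select((s, i) => s / seconds[i]).ToArray();
        double totalSpeed = speed.Sum();

        int groups = globalRange / localRange;
        int[] next = new int[shares.Length];
        int assignedGroups = 0;
        for (int i = 0; i < shares.Length; i++)
        {
            // Step halfway toward the throughput-proportional share to damp oscillation.
            double ideal = globalRange * speed[i] / totalSpeed;
            double damped = 0.5 * shares[i] + 0.5 * ideal;
            // Round down to whole work groups, but never starve a device completely.
            int g = Math.Max(1, (int)(damped / localRange));
            next[i] = g * localRange;
            assignedGroups += g;
        }

        // Give the rounding remainder to the fastest device
        // (it can be negative when the one-group minimum was applied).
        int fastest = Array.IndexOf(speed, speed.Max());
        next[fastest] += (groups - assignedGroups) * localRange;
        return next;
    }

    public static void Main()
    {
        // Simulated devices: the first is three times faster than the second.
        double[] trueSpeed = { 3000.0, 1000.0 };
        int[] shares = { 512, 512 };
        for (int iter = 0; iter < 6; iter++)
        {
            double[] seconds = shares.Select((s, i) => s / trueSpeed[i]).ToArray();
            shares = Rebalance(shares, seconds, 1024, 64);
            Console.WriteLine("iteration " + iter + ": " + string.Join(", ", shares));
        }
    }
}
```

Starting from an even split, the printed shares should shift toward the faster simulated device over the first few iterations. This is the same reason the real load balancer needs several iterations before the distribution becomes fair.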

Documentation

Details and a tutorial are available in the Cekirdekler wiki.

Known Issues

  • C++ array wrappers like ClArray<float> have no out-of-bounds check; don't cross the array boundaries when indexing.
  • Don't use C++ array wrappers after they are disposed. These safety checks were left out to keep array indexing fast.
  • Don't use ClNumberCruncher or Core instances after they are disposed.
  • Pay attention to the "number of array elements used" per work item in the kernel and how it is passed as a parameter from the API compute() method.
  • Pay attention to the "partial read"/"read"/"write" array copy modifiers, depending on whether your kernel alters (or reads) the whole array or just a part of it.
  • No performance output on the first iteration. The load balancer needs at least several iterations to distribute work fairly, and the performance report needs at least 2 iterations before it prints to the console.
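
Since the wrappers do no bounds checking, it can help to validate ranges on the host before calling compute(). The following is a minimal sketch, not part of the Cekirdekler API: `RangeCheckSketch`, `IsValidRange` and `FitsInArray` are hypothetical helper names, and the second check assumes that work item i touches the contiguous elements from i * n up to (i + 1) * n, where n is the number of elements per work item.

```csharp
using System;

// Hypothetical helper; these checks are not part of the Cekirdekler API.
public static class RangeCheckSketch
{
    // OpenCL 1.2 requires the global range to be a whole number of work groups.
    public static bool IsValidRange(int globalRange, int localRange)
    {
        return localRange > 0 && globalRange > 0 && globalRange % localRange == 0;
    }

    // True when every work item can access its n elements without leaving the array.
    // Assumes work item i uses elements [i * n, (i + 1) * n).
    public static bool FitsInArray(int arrayLength, int globalRange, int elementsPerWorkItem)
    {
        return (long)globalRange * elementsPerWorkItem <= arrayLength;
    }

    public static void Main()
    {
        // Values matching a 1000-element array run with 1000 work items in groups of 100.
        Console.WriteLine(IsValidRange(1000, 100));
        Console.WriteLine(FitsInArray(1000, 1000, 1));
        // Four elements per work item would need a 4000-element array.
        Console.WriteLine(FitsInArray(1000, 1000, 4));
    }
}
```

Running checks like these once before the first compute() call costs little and catches the most common source of out-of-bounds access in a kernel.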

Example

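        // Compile the kernel source for the available GPU device(s).
        Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
            Cekirdekler.AcceleratorType.GPU, @"
                __kernel void hello(__global char * arr)
                {
                    printf(""hello world"");
                }
            ");

        // C++ array wrapper with 1000 elements, passed to the kernel as its arr parameter.
        Cekirdekler.ClArrays.ClArray<byte> array = new Cekirdekler.ClArrays.ClArray<byte>(1000);
        // Run "hello" with compute id 1, 1000 work items in total and 100 per work group.
        array.compute(cr, 1, "hello", 1000, 100);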
