Cekirdekler

A very simple C# multi-device GPGPU (OpenCL) compute API for developers' custom OpenCL kernels, with iterative inter-device load balancing and pipelining of read/write/compute operations.

64-bit only: set "project settings -> build -> platform target -> x64". The solution's configuration manager needs a matching 64-bit setup.

It needs an extra C++ DLL, built for 64-bit (x86_64) from https://github.com/tugrul512bit/CekirdeklerCPP, which must be named KutuphaneCL.dll.

The other required DLL is Microsoft's System.Threading.dll (with its XML helper file) for .NET 2.0. Alternatively, adjust the "using" directives and target .NET 3.5+ in your own project, in which case System.Threading.dll is not needed.

In total, Cekirdekler.dll, KutuphaneCL.dll and .NET 3.5 should be enough.

Usage: add only Cekirdekler.dll and System.Threading.dll as references to your C# project. The other files need to exist in the same folder as Cekirdekler.dll or the main project's executable.

This project is being enhanced using ZenHub.

Features

Implicit multi-device control: from CPUs to any number of GPUs and accelerators. Device handling is explicit on the library side for compatibility and performance, and implicit for the client coder, who can concentrate on OpenCL kernel code. Devices can be selected implicitly for ease of setup or explicitly for a detailed device query.
Iterative load balancing between devices: done separately for each compute (explicit control with a user-given compute id). Devices receive increasingly fair workloads until the work-distribution ratio converges. Partitioning the workload completes a kernel with less latency, which suits hot-spot loops and simple embarrassingly parallel algorithms. It works even better for streaming data with the pipelining option enabled.
Pipelining for reads, computes and writes (host - device link): either left to the device drivers or done through explicit event-based queue management. Hides the latency of the less time-consuming parts (such as writes) behind the most time-consuming part (such as compute). GPUs can run buffer copies and OpenCL kernels concurrently.
Pipelining between devices (device - host - device): runs multiple stages concurrently so that they overlap in the timeline, taking advantage of multiple GPUs (and FPGAs, CPUs) even for kernels of a time-consuming pipeline that are non-separable (because of atomics and low-level optimizations). Each device runs a different kernel at the same time as the other devices, and double buffers overlap even the data movements between pipeline stages.
Working with different numeric arrays: either C# arrays such as float[], int[], byte[], ... or C++ array wrappers such as ClFloatArray, ClArray<float>, ClByteArray, ClArray<byte>.
Automatic buffer copy optimizations for devices: if a device shares RAM with the CPU, it uses map/unmap commands (instead of read/write) to reduce the number of array copies. If that device is also given a C++ wrapper array (such as ClArray<float>), it additionally uses the CL_MEM_USE_HOST_PTR flag on the buffer for zero-copy access, a.k.a. "streaming". By default, all devices have their own buffers.
Two different usage types: the first lets the developer specify all kernel parameters as arrays explicitly, for a more readable execution; the second builds the same thing from a much shorter definition, taking fewer lines of code and changing only the necessary flags instead of all of them.
Automatic resource disposal: when C++ array wrappers are finalized (out of scope, garbage collected), they release their resources. The dispose method can also be called explicitly by the developer.
Uses OpenCL 1.2: built on the C++ bindings from Khronos.org.
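
The idea behind the iterative load balancing feature can be illustrated with a small, library-independent sketch. It is not Cekirdekler's actual implementation: the class `LoadBalanceSketch`, its methods and the damping factor are hypothetical names and choices made for illustration. Each step estimates a device's throughput from the share it was given and the time it took, then moves every share part of the way toward a throughput-proportional split, so the ratios settle over a few iterations.

```csharp
using System;
using System.Linq;

public static class LoadBalanceSketch
{
    // Hypothetical helper (not a Cekirdekler API): one balancing step.
    // shares[i] : fraction of the global range given to device i (all > 0, sum = 1)
    // millis[i] : measured time (> 0) device i needed for that share last iteration
    // damping   : 0..1, how far to move toward the newly estimated split
    public static double[] Rebalance(double[] shares, double[] millis, double damping = 0.5)
    {
        // Estimated throughput: fraction of work completed per millisecond.
        double[] speed = shares.Select((s, i) => s / millis[i]).ToArray();
        double total = speed.Sum();

        // Move each share part of the way toward its throughput-proportional target.
        double[] next = shares
            .Select((s, i) => s + damping * (speed[i] / total - s))
            .ToArray();

        // Renormalize to guard against floating-point drift.
        double norm = next.Sum();
        return next.Select(s => s / norm).ToArray();
    }

    // Simulates devices with fixed relative speeds, starting from an equal split.
    public static double[] Simulate(double[] deviceSpeeds, int iterations)
    {
        int n = deviceSpeeds.Length;
        double[] shares = Enumerable.Repeat(1.0 / n, n).ToArray();
        for (int it = 0; it < iterations; it++)
        {
            // In a real run these times would be measured, not computed.
            double[] millis = shares.Select((s, i) => s / deviceSpeeds[i]).ToArray();
            shares = Rebalance(shares, millis);
        }
        return shares;
    }
}
```

For example, `LoadBalanceSketch.Simulate(new double[] { 1.0, 3.0 }, 20)` starts from an equal split and approaches one quarter of the work for the slower device and three quarters for the faster one. The damping factor trades convergence speed against sensitivity to noisy timings.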

Documentation

Details and a tutorial are available in the Cekirdekler wiki.

Known Issues

For C++ array wrappers like ClArray<float> there is no out-of-bounds check; don't cross the array boundaries when indexing.
Don't use C++ array wrappers after they are disposed. These checks are left out to speed up array indexing.
Don't use ClNumberCruncher or Core instances after they are disposed.
Pay attention to the "number of array elements used" per workitem in the kernel and how these are given as parameters to the API's compute() method.
Pay attention to the "partial read"/"read"/"write" array copy modifiers when your kernel alters (or reads) the whole array or just a part of it.
No performance output in the first iteration. The load balancer needs at least several iterations to distribute work fairly, and the performance report needs at least 2 iterations for console output.
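
Since the array wrappers skip bounds checks, it can help to verify the sizes on the host before launching a kernel. The following is a minimal sketch of such a check, not part of Cekirdekler: the class `RangeCheckSketch` and its `Validate` method are hypothetical, and it assumes each workitem touches `elementsPerWorkItem` consecutive array elements. It also checks that the global range is a multiple of the local range, which OpenCL 1.2 requires when a work-group size is specified.

```csharp
using System;

public static class RangeCheckSketch
{
    // Hypothetical pre-launch check (not a Cekirdekler API).
    // Returns null when the combination looks consistent, otherwise a message.
    public static string Validate(int arrayLength, int elementsPerWorkItem,
                                  int globalRange, int localRange)
    {
        if (globalRange <= 0 || localRange <= 0 || elementsPerWorkItem <= 0)
            return "ranges and elements per workitem must be positive";

        // OpenCL 1.2: the global work size must be divisible by the work-group size.
        if (globalRange % localRange != 0)
            return "global range must be a multiple of local range";

        // Assumption: workitem i touches elements [i * n, (i + 1) * n).
        long needed = (long)globalRange * elementsPerWorkItem;
        if (needed > arrayLength)
            return "array too short: kernel touches " + needed + " elements";

        return null;
    }
}
```

With the sizes from the example in this README, `RangeCheckSketch.Validate(1000, 1, 1000, 100)` returns null, while asking for 4 elements per workitem on the same 1000-element array returns a message.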

Example

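        // Compile the kernel source for the GPU devices.
        // The doubled quotes ("") are C# verbatim-string escapes for a single quote.
        Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
            Cekirdekler.AcceleratorType.GPU, @"
                __kernel void hello(__global char * arr)
                {
                    printf(""hello world"");
                }
            ");

        // A C++ array wrapper with 1000 byte elements, passed to the kernel as arr.
        Cekirdekler.ClArrays.ClArray<byte> array = new Cekirdekler.ClArrays.ClArray<byte>(1000);
        // Compute id 1, kernel "hello", 1000 workitems in total, 100 per work group.
        array.compute(cr, 1, "hello", 1000, 100);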
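
The kernel above is passed as a C# verbatim string, which is why the quotes around the printf text are doubled. A small standalone sketch, using a hypothetical `KernelSourceSketch` class that is not part of the library, shows the text the OpenCL compiler would actually receive:

```csharp
using System;

public static class KernelSourceSketch
{
    // Same kernel text as in the example, kept in a verbatim string.
    // Inside @"...", a doubled quote ("") stands for one literal quote.
    public const string Hello = @"
__kernel void hello(__global char * arr)
{
    printf(""hello world"");
}";

    // True when the string holds an ordinary OpenCL C printf call.
    public static bool HasPlainPrintf()
    {
        return Hello.Contains("printf(\"hello world\");");
    }
}
```

Printing `KernelSourceSketch.Hello` with `Console.WriteLine` is a quick way to inspect a kernel string before handing it to the ClNumberCruncher constructor.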
