Releases: tugrul512bit/Cekirdekler
v1.2.12
needs KutuphaneCL.dll (CekirdeklerCPP project) v1.2.12
added a concurrency option to the single-device pipeline class, to limit its number of command queues between 1 and 16 inclusive
optimized it for performance
fixed minor bugs.
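The notes above don't show the actual API for the new concurrency option, so the following is only a hypothetical sketch; the property name `concurrency` is an assumption, not the library's real identifier:

```csharp
using System.IO;
using Cekirdekler;

// hypothetical: cap the single-device pipeline at 4 command queues (valid range: 1-16)
var device = ClPlatforms.all().gpus()[0];
DevicePipeline pipeline = new DevicePipeline(device, File.ReadAllText("..//..//..//test.cl"));
pipeline.concurrency = 4; // hypothetical property name; consult the project wiki for the real one
```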
v1.2.11
added single gpu pipeline sub-feature
runs all stages in parallel on the GPU, concurrently with host code; device-to-host and host-to-device transfers also run in parallel with all stage kernels
```csharp
// pick the first GPU among all platforms and print its info
var deviceForCompute = ClPlatforms.all().gpus()[0];
deviceForCompute.logInfo();

// build a single-device pipeline from the kernel source file
DevicePipeline gpuPipeline = new DevicePipeline(deviceForCompute, File.ReadAllText("..//..//..//test.cl"));
//gpuPipeline.enableSerialMode();

// one stage per kernel: global size = maxImgSizeResult * maxImgSizeResult, local size = 256
DevicePipelineStage stage1 = new DevicePipelineStage("resize", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage2 = new DevicePipelineStage("parameterSet", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage3 = new DevicePipelineStage("gaussianBlur", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage4 = new DevicePipelineStage("rotateImgRad", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage5 = new DevicePipelineStage("blendImg", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage6 = new DevicePipelineStage("postProcess", maxImgSizeResult * maxImgSizeResult, 256);

// host-side arrays
ClArray<byte> stage1Input = imageBytes;
ClArray<byte> stage5Input = imageBlendBytes;
ClArray<int> parameters = new int[1024];
ClArray<int> accumulator = new int[1024];
ClArray<int> parametersPipe = new int[1024];
ClArray<int> parametersPipe2 = new int[1024];
ClArray<int> parametersPipe3 = new int[1024];
ClArray<int> parametersPipe4 = new int[1024];
ClArray<int> parametersPipe5 = new int[1024];
ClArray<int> parametersPipe6 = new int[1024];
ClArray<byte> resultImage = resultImageBytes;
ClArray<byte> pipeBuffer = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer2 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer3 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer4 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer5 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);

// INPUT: fed from the host, TRANSITION: carries data between consecutive stages,
// INTERNAL: stays on the device, OUTPUT: read back to the host
DevicePipelineArray bufInput = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage1Input);
DevicePipelineArray bufBlendInput = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage5Input);
var bufAccumulator = new DevicePipelineArray(DevicePipelineArrayType.INTERNAL, accumulator);
var bufPipe1 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer);
var bufPipe2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer2);
var bufPipe3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer3);
var bufPipe4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer4);
var bufPipe5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer5);
var bufPipeParameter = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe);
var bufPipeParameter2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe2);
var bufPipeParameter3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe3);
var bufPipeParameter4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe4);
var bufPipeParameter5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe5);
var bufPipeParameter6 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe6);
var bufResult = new DevicePipelineArray(DevicePipelineArrayType.OUTPUT, resultImage);

// bind the arrays each stage's kernel uses
stage1.bindArray(bufInput);
stage1.bindArray(new DevicePipelineArray(DevicePipelineArrayType.INPUT, parameters));
stage1.bindArray(bufPipeParameter);
stage1.bindArray(bufPipe1);
stage2.bindArray(bufPipe1);
stage2.bindArray(bufPipeParameter);
stage2.bindArray(bufPipeParameter2);
stage2.bindArray(bufPipe2);
stage2.bindArray(bufAccumulator);
stage3.bindArray(bufPipe2);
stage3.bindArray(bufPipeParameter2);
stage3.bindArray(bufPipeParameter3);
stage3.bindArray(bufPipe3);
stage4.bindArray(bufPipe3);
stage4.bindArray(bufPipeParameter3);
stage4.bindArray(bufPipeParameter4);
stage4.bindArray(bufPipe4);
stage5.bindArray(bufPipe4);
stage5.bindArray(bufPipeParameter4);
stage5.bindArray(bufPipeParameter5);
stage5.bindArray(bufPipe5);
stage5.bindArray(bufBlendInput);
stage6.bindArray(bufPipe5);
stage6.bindArray(bufPipeParameter5);
stage6.bindArray(bufPipeParameter6);
stage6.bindArray(bufResult);

// chain the stages in execution order
gpuPipeline.addStage(stage1);
gpuPipeline.addStage(stage2);
gpuPipeline.addStage(stage3);
gpuPipeline.addStage(stage4);
gpuPipeline.addStage(stage5);
gpuPipeline.addStage(stage6);
```
v1.2.10
added async enqueue mode to number cruncher.
usage:
```csharp
for (int i = 0; i < 15; i++)
{
    benchStart();
    cruncher.enqueueMode = true;

    // runs always in queue-0
    dataArrayA.nextParam(dataArrayB, constant).compute(cruncher, 1, "vecAdd", 1024 * 1024);

    // runs in the next concurrent queue, which is queue-1
    cruncher.enqueueModeAsyncEnable = true;
    dataArrayC.nextParam(dataArrayD, constant2).compute(cruncher, 1, "vecMul", 1024 * 1024);
    cruncher.enqueueModeAsyncEnable = false;

    // runs always in queue-0, so it is serialized after vecAdd
    dataArrayE.nextParam(dataArrayF, constant3).compute(cruncher, 1, "vecDiv", 1024 * 1024);

    // runs in the next concurrent queue, which is queue-2, and is serialized after vecMul
    cruncher.enqueueModeAsyncEnable = true;
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    cruncher.enqueueModeAsyncEnable = false;

    cruncher.enqueueMode = false;
    benchStop();
}
```
Resulting timeline (each row is a concurrent queue):

```
*********vecAdd**********vecDiv*********
*********vecMul*********
*********vecAddInt******vecAddInt******vecAddInt******vecAddInt
```

Partial overlapping saves time.
v1.2.9_hotpatch
added ClArray.zeroCopy to let developers enable map/unmap at the array level instead of only at the device level.
If the device has its streaming parameter set and the array has its zeroCopy field set, kernels access these arrays without copying (when possible).
Once the device buffer is created in compute(), clearing or setting zeroCopy no longer has any effect for the device that was used in that compute().
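A minimal sketch of the per-array opt-in, using the `zeroCopy` field named above (the array size and type are illustrative):

```csharp
using Cekirdekler;

ClArray<float> data = new ClArray<float>(1024);
data.zeroCopy = true; // opt this array into map/unmap access on streaming-enabled devices
// set or clear zeroCopy before the first compute() on a device:
// once that device's buffer exists, changing the field has no effect there
```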
v1.2.9
added readOnly and writeOnly properties to the ClArray class, so some kernels may run up to 10% faster when one array is only read and another is only written
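A short sketch of the two properties named above (array names and sizes are illustrative):

```csharp
using Cekirdekler;

ClArray<float> src = new ClArray<float>(1024 * 1024);
ClArray<float> dst = new ClArray<float>(1024 * 1024);
src.readOnly = true;  // the kernel only reads this array
dst.writeOnly = true; // the kernel only writes this array
// with these hints, some kernels may run up to 10% faster
```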
v1.2.8
unnecessary clSetKernelArg commands are removed
added ClArray.writeAll, to get result arrays written back as a whole instead of only a number of elements; similar to non-partial reads via ClArray.read = true and ClArray.partialRead = false. If multiple GPUs are used, each GPU writes only one of the result arrays (instead of all GPUs writing the same array, which is undefined behavior).
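A sketch contrasting the new `writeAll` field with the non-partial read configuration it mirrors (array names and sizes are illustrative):

```csharp
using Cekirdekler;

ClArray<float> result = new ClArray<float>(1024 * 1024);
result.writeAll = true; // result array is written back as a whole

// the analogous non-partial read configuration:
ClArray<float> input = new ClArray<float>(1024 * 1024);
input.read = true;         // array is read as a kernel input
input.partialRead = false; // as a whole, not as element ranges
```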
C# char array bug fixed
Enqueue mode performance query bug fixed
v1.2.7
Added an enqueueMode flag to the number cruncher and pipeline stage classes, so they can do thousands of operations with just a single synchronization between host and device (up to 60x faster for light workloads)
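A sketch of the intended pattern, following the compute() call shape used in the v1.2.10 example; the kernel name and arrays are illustrative, and the assumption that the single synchronization happens when the flag is cleared is mine:

```csharp
using Cekirdekler;

cruncher.enqueueMode = true; // enqueue work without per-call host-device sync
for (int i = 0; i < 1000; i++)
    data.nextParam(parameters).compute(cruncher, 1, "lightKernel", 1024 * 1024);
cruncher.enqueueMode = false; // assumed: one synchronization covers all enqueued work
```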
v1.2.6
Added kernel(s) repeat feature to number cruncher.
Compatible with CekirdeklerCPP v1.2.6+
v1.2.5
can query device names of explicit device instances via logInfo (returns a multiline string, one line per device, including vendor names)
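A minimal sketch, matching the logInfo call shown in the v1.2.11 example above:

```csharp
using Cekirdekler;

var gpu = ClPlatforms.all().gpus()[0];
gpu.logInfo(); // prints one line per device, including vendor names
```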
v1.2.4_hotfix
Minor bug fix in the console-output algorithm related to the 'performanceFeed' flag.