Releases: tugrul512bit/Cekirdekler

v1.2.12

03 Jun 16:40

Requires KutuphaneCL.dll (CekirdeklerCPP project) v1.2.12.

Added a concurrency option to the single-device pipeline class that limits its number of command queues to between 1 and 16 inclusive.

Optimized the single-device pipeline for performance.

Fixed minor bugs.

v1.2.11

02 Jun 21:33

Added a single-GPU pipeline sub-feature.

It runs all stages in parallel on the GPU, concurrently with host code. Device-to-host and host-to-device transfers also run in parallel with the kernels of all stages.

            var deviceForCompute = ClPlatforms.all().gpus()[0];
            deviceForCompute.logInfo();
            DevicePipeline gpuPipeline = new DevicePipeline(deviceForCompute, File.ReadAllText("..//..//..//test.cl"));
            //gpuPipeline.enableSerialMode();
            DevicePipelineStage stage1 = new DevicePipelineStage("resize", maxImgSizeResult * maxImgSizeResult, 256);
            DevicePipelineStage stage2 = new DevicePipelineStage("parameterSet", maxImgSizeResult * maxImgSizeResult, 256);
            DevicePipelineStage stage3 = new DevicePipelineStage("gaussianBlur", maxImgSizeResult * maxImgSizeResult, 256);
            DevicePipelineStage stage4 = new DevicePipelineStage("rotateImgRad", maxImgSizeResult * maxImgSizeResult, 256);
            DevicePipelineStage stage5 = new DevicePipelineStage("blendImg", maxImgSizeResult * maxImgSizeResult, 256);
            DevicePipelineStage stage6 = new DevicePipelineStage("postProcess", maxImgSizeResult * maxImgSizeResult, 256);

            ClArray<byte> stage1Input = imageBytes;
            ClArray<byte> stage5Input = imageBlendBytes;
            ClArray<int> parameters = new int[1024];
            ClArray<int> acculumulator = new int[1024];
            ClArray<int> parametersPipe = new int[1024];
            ClArray<int> parametersPipe2 = new int[1024];
            ClArray<int> parametersPipe3 = new int[1024];
            ClArray<int> parametersPipe4 = new int[1024];
            ClArray<int> parametersPipe5 = new int[1024];
            ClArray<int> parametersPipe6 = new int[1024];
            ClArray<byte> resultImage = resultImageBytes;
            ClArray<byte> pipeBuffer = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
            ClArray<byte> pipeBuffer2 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
            ClArray<byte> pipeBuffer3 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
            ClArray<byte> pipeBuffer4 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
            ClArray<byte> pipeBuffer5 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);

            DevicePipelineArray bufInput  = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage1Input);
            DevicePipelineArray bufBlendInput  = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage5Input);
            var bufAccumulator = new DevicePipelineArray(DevicePipelineArrayType.INTERNAL , acculumulator);
            var bufPipe1 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer);
            var bufPipe2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer2);
            var bufPipe3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer3);
            var bufPipe4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer4);
            var bufPipe5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , pipeBuffer5);
            var bufPipeParameter = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe);
            var bufPipeParameter2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe2);
            var bufPipeParameter3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe3);
            var bufPipeParameter4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe4);
            var bufPipeParameter5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe5);
            var bufPipeParameter6 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION , parametersPipe6);
            var bufResult = new DevicePipelineArray(DevicePipelineArrayType.OUTPUT , resultImage);

            stage1.bindArray(bufInput);
            stage1.bindArray(new DevicePipelineArray(DevicePipelineArrayType.INPUT, parameters));
            stage1.bindArray(bufPipeParameter);
            stage1.bindArray(bufPipe1);

            stage2.bindArray(bufPipe1);
            stage2.bindArray(bufPipeParameter);
            stage2.bindArray(bufPipeParameter2);
            stage2.bindArray(bufPipe2);
            stage2.bindArray(bufAccumulator);

            stage3.bindArray(bufPipe2);
            stage3.bindArray(bufPipeParameter2);
            stage3.bindArray(bufPipeParameter3);
            stage3.bindArray(bufPipe3);

            stage4.bindArray(bufPipe3);
            stage4.bindArray(bufPipeParameter3);
            stage4.bindArray(bufPipeParameter4);
            stage4.bindArray(bufPipe4);

            stage5.bindArray(bufPipe4);
            stage5.bindArray(bufPipeParameter4);
            stage5.bindArray(bufPipeParameter5);
            stage5.bindArray(bufPipe5);
            stage5.bindArray(bufBlendInput);

            stage6.bindArray(bufPipe5);
            stage6.bindArray(bufPipeParameter5);
            stage6.bindArray(bufPipeParameter6);
            stage6.bindArray(bufResult);

            gpuPipeline.addStage(stage1);
            gpuPipeline.addStage(stage2);
            gpuPipeline.addStage(stage3);
            gpuPipeline.addStage(stage4);
            gpuPipeline.addStage(stage5);
            gpuPipeline.addStage(stage6);

v1.2.10

30 May 17:46

Added an async enqueue mode to the number cruncher.

Usage:

            for (int i = 0; i < 15; i++)
            {
                benchStart();
                cruncher.enqueueMode = true;

                // runs always in queue-0
                dataArrayA.nextParam(dataArrayB, constant).compute(cruncher, 1, "vecAdd", 1024 * 1024);

                // runs in next concurrent queue which is 1
                cruncher.enqueueModeAsyncEnable = true;
                dataArrayC.nextParam(dataArrayD, constant2).compute(cruncher, 1, "vecMul", 1024 * 1024);
                cruncher.enqueueModeAsyncEnable = false;

                // runs always in queue-0 so serialized after vecAdd
                dataArrayE.nextParam(dataArrayF, constant3).compute(cruncher, 1, "vecDiv", 1024 * 1024);

                // runs in next concurrent queue which is 2 and is serialized after vecMul
                cruncher.enqueueModeAsyncEnable = true;
                dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
                dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
                dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
                dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
                cruncher.enqueueModeAsyncEnable = false;

                cruncher.enqueueMode = false;
                benchStop();

            }

Resulting timeline, one line per command queue:

queue-0: *********vecAdd**********vecDiv*********
queue-1: *********vecMul*********
queue-2: *********vecAddInt******vecAddInt******vecAddInt******vecAddInt

The partial overlap between queues saves time.

v1.2.9_hotpatch

27 May 21:31

Added ClArray.zeroCopy to let developers enable map/unmap at the array level instead of only at the device level.

If the device has its streaming parameter set and the array has its zeroCopy field set, kernels access the array without copying (if possible).

Once the device buffer has been created in compute(), clearing or setting zeroCopy has no effect for the device that was used in that compute() call.
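
A minimal sketch of the ordering this implies: the flag is set before the first compute() call on the array. The kernel, the array size, the namespaces, and the `ClNumberCruncher(AcceleratorType.GPU, ...)` construction are illustrative assumptions rather than something stated in these notes, and enabling the device's streaming parameter is assumed, not shown.

```csharp
using Cekirdekler;          // assumed namespace for ClNumberCruncher and AcceleratorType
using Cekirdekler.ClArrays; // assumed namespace for ClArray<T>

class ZeroCopySketch
{
    static void Main()
    {
        // Illustrative kernel: doubles every element in place.
        string kernels = @"
            __kernel void scale2(__global float * data)
            {
                int i = get_global_id(0);
                data[i] *= 2.0f;
            }";

        // Assumption: the device behind this cruncher has its streaming parameter set.
        ClNumberCruncher cruncher = new ClNumberCruncher(AcceleratorType.GPU, kernels);

        ClArray<float> data = new ClArray<float>(1024 * 1024);

        // Set zeroCopy before the first compute(): changing it after the device
        // buffer exists has no effect for that device.
        data.zeroCopy = true;

        data.compute(cruncher, 1, "scale2", 1024 * 1024);
    }
}
```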

v1.2.9

27 May 15:41

Added readOnly and writeOnly properties to the ClArray class, so some kernels may run up to 10% faster when one array is only read and another is only written.
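
As a hedged illustration, the two properties might be set on an input/output pair like this. The kernel `copyFloats`, the array sizes, the namespaces, and the `ClNumberCruncher` construction are assumptions made for the sketch; only the `readOnly` and `writeOnly` property names come from the note above.

```csharp
using Cekirdekler;          // assumed namespace for ClNumberCruncher and AcceleratorType
using Cekirdekler.ClArrays; // assumed namespace for ClArray<T>

class ReadWriteHintsSketch
{
    static void Main()
    {
        // Illustrative kernel: reads only from src, writes only to dst.
        string kernels = @"
            __kernel void copyFloats(__global float * src, __global float * dst)
            {
                int i = get_global_id(0);
                dst[i] = src[i];
            }";

        ClNumberCruncher cruncher = new ClNumberCruncher(AcceleratorType.GPU, kernels);

        ClArray<float> src = new ClArray<float>(1024 * 1024);
        ClArray<float> dst = new ClArray<float>(1024 * 1024);

        // Hint that the kernel never writes src and never reads dst.
        src.readOnly = true;
        dst.writeOnly = true;

        src.nextParam(dst).compute(cruncher, 1, "copyFloats", 1024 * 1024);
    }
}
```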

v1.2.8

26 May 20:35

Removed unnecessary clSetKernelArg commands.

Added ClArray.writeAll to get result arrays as a whole instead of just a number of elements, similar to non-partial reads with ClArray.read = true and ClArray.partialRead = false. If multiple GPUs are used, each GPU writes only one of the result arrays (instead of several GPUs writing the same array, which is undefined behavior).

C# char array bug fixed

Enqueue mode performance query bug fixed

v1.2.7

23 May 20:06

Added an enqueueMode flag to the number cruncher and pipeline stage classes so they can do thousands of operations with just a single synchronization between host and device (up to 60x faster for light workloads).
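
A rough sketch of that pattern for the number cruncher: many light compute() calls are issued while the flag is set, and the flag is cleared once at the end. The kernel `addOne`, the iteration count, the namespaces, and the `ClNumberCruncher` construction are assumptions for illustration.

```csharp
using Cekirdekler;          // assumed namespace for ClNumberCruncher and AcceleratorType
using Cekirdekler.ClArrays; // assumed namespace for ClArray<T>

class EnqueueModeSketch
{
    static void Main()
    {
        // Illustrative light workload: increments every element.
        string kernels = @"
            __kernel void addOne(__global int * data)
            {
                int i = get_global_id(0);
                data[i] += 1;
            }";

        ClNumberCruncher cruncher = new ClNumberCruncher(AcceleratorType.GPU, kernels);
        ClArray<int> data = new ClArray<int>(1024);

        // While enqueueMode is set, compute() calls are only enqueued.
        cruncher.enqueueMode = true;
        for (int i = 0; i < 1000; i++)
        {
            data.compute(cruncher, 1, "addOne", 1024);
        }
        // Clearing the flag is where host and device are expected to synchronize.
        cruncher.enqueueMode = false;
    }
}
```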

v1.2.6

16 May 14:33

Added kernel(s) repeat feature to number cruncher.

Compatible with CekirdeklerCPP v1.2.6+

v1.2.5

13 May 10:45

Device names of explicit device instances can now be queried with logInfo (returns a multiline string, one line per device, including vendor names).

v1.2.4_hotfix

12 May 14:49

Minor bugfix in the console output algorithm related to the 'performanceFeed' flag.