Releases: tugrul512bit/Cekirdekler
v1.2.12
needs KutuphaneCL.dll (CekirdeklerCPP project) v1.2.12
added a concurrency option to the single-device pipeline class, to limit its number of command queues between 1 and 16 inclusive
optimized it for performance
fixed minor bugs.
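The notes above don't show the actual API for the new concurrency option, so the following is only a hypothetical sketch; the property name `concurrency` is an assumption, not the library's real identifier:

```csharp
using System.IO;
using Cekirdekler;

// hypothetical: cap the single-device pipeline at 4 command queues (valid range: 1-16)
var device = ClPlatforms.all().gpus()[0];
DevicePipeline pipeline = new DevicePipeline(device, File.ReadAllText("..//..//..//test.cl"));
pipeline.concurrency = 4; // hypothetical property name; consult the project wiki for the real one
```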
v1.2.11
added single gpu pipeline sub-feature
runs all stages in parallel on the GPU, concurrently with host code; device-to-host and host-to-device transfers also run in parallel with all stage kernels
```csharp
// pick the first GPU among all platforms and print its info
var deviceForCompute = ClPlatforms.all().gpus()[0];
deviceForCompute.logInfo();

// build a single-device pipeline from the kernel source file
DevicePipeline gpuPipeline = new DevicePipeline(deviceForCompute, File.ReadAllText("..//..//..//test.cl"));
//gpuPipeline.enableSerialMode();

// one stage per kernel: global size = maxImgSizeResult * maxImgSizeResult, local size = 256
DevicePipelineStage stage1 = new DevicePipelineStage("resize", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage2 = new DevicePipelineStage("parameterSet", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage3 = new DevicePipelineStage("gaussianBlur", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage4 = new DevicePipelineStage("rotateImgRad", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage5 = new DevicePipelineStage("blendImg", maxImgSizeResult * maxImgSizeResult, 256);
DevicePipelineStage stage6 = new DevicePipelineStage("postProcess", maxImgSizeResult * maxImgSizeResult, 256);

// host-side arrays
ClArray<byte> stage1Input = imageBytes;
ClArray<byte> stage5Input = imageBlendBytes;
ClArray<int> parameters = new int[1024];
ClArray<int> accumulator = new int[1024];
ClArray<int> parametersPipe = new int[1024];
ClArray<int> parametersPipe2 = new int[1024];
ClArray<int> parametersPipe3 = new int[1024];
ClArray<int> parametersPipe4 = new int[1024];
ClArray<int> parametersPipe5 = new int[1024];
ClArray<int> parametersPipe6 = new int[1024];
ClArray<byte> resultImage = resultImageBytes;
ClArray<byte> pipeBuffer = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer2 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer3 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer4 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);
ClArray<byte> pipeBuffer5 = new ClArray<byte>(maxImgSizeResult * maxImgSizeResult * 4);

// INPUT: fed from the host, TRANSITION: carries data between consecutive stages,
// INTERNAL: stays on the device, OUTPUT: read back to the host
DevicePipelineArray bufInput = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage1Input);
DevicePipelineArray bufBlendInput = new DevicePipelineArray(DevicePipelineArrayType.INPUT, stage5Input);
var bufAccumulator = new DevicePipelineArray(DevicePipelineArrayType.INTERNAL, accumulator);
var bufPipe1 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer);
var bufPipe2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer2);
var bufPipe3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer3);
var bufPipe4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer4);
var bufPipe5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, pipeBuffer5);
var bufPipeParameter = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe);
var bufPipeParameter2 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe2);
var bufPipeParameter3 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe3);
var bufPipeParameter4 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe4);
var bufPipeParameter5 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe5);
var bufPipeParameter6 = new DevicePipelineArray(DevicePipelineArrayType.TRANSITION, parametersPipe6);
var bufResult = new DevicePipelineArray(DevicePipelineArrayType.OUTPUT, resultImage);

// bind the arrays each stage's kernel uses
stage1.bindArray(bufInput);
stage1.bindArray(new DevicePipelineArray(DevicePipelineArrayType.INPUT, parameters));
stage1.bindArray(bufPipeParameter);
stage1.bindArray(bufPipe1);
stage2.bindArray(bufPipe1);
stage2.bindArray(bufPipeParameter);
stage2.bindArray(bufPipeParameter2);
stage2.bindArray(bufPipe2);
stage2.bindArray(bufAccumulator);
stage3.bindArray(bufPipe2);
stage3.bindArray(bufPipeParameter2);
stage3.bindArray(bufPipeParameter3);
stage3.bindArray(bufPipe3);
stage4.bindArray(bufPipe3);
stage4.bindArray(bufPipeParameter3);
stage4.bindArray(bufPipeParameter4);
stage4.bindArray(bufPipe4);
stage5.bindArray(bufPipe4);
stage5.bindArray(bufPipeParameter4);
stage5.bindArray(bufPipeParameter5);
stage5.bindArray(bufPipe5);
stage5.bindArray(bufBlendInput);
stage6.bindArray(bufPipe5);
stage6.bindArray(bufPipeParameter5);
stage6.bindArray(bufPipeParameter6);
stage6.bindArray(bufResult);

// chain the stages in execution order
gpuPipeline.addStage(stage1);
gpuPipeline.addStage(stage2);
gpuPipeline.addStage(stage3);
gpuPipeline.addStage(stage4);
gpuPipeline.addStage(stage5);
gpuPipeline.addStage(stage6);
```
v1.2.10
added async enqueue mode to number cruncher.
usage:
```csharp
for (int i = 0; i < 15; i++)
{
    benchStart();
    cruncher.enqueueMode = true;

    // runs always in queue-0
    dataArrayA.nextParam(dataArrayB, constant).compute(cruncher, 1, "vecAdd", 1024 * 1024);

    // runs in the next concurrent queue, which is queue-1
    cruncher.enqueueModeAsyncEnable = true;
    dataArrayC.nextParam(dataArrayD, constant2).compute(cruncher, 1, "vecMul", 1024 * 1024);
    cruncher.enqueueModeAsyncEnable = false;

    // runs always in queue-0, so it is serialized after vecAdd
    dataArrayE.nextParam(dataArrayF, constant3).compute(cruncher, 1, "vecDiv", 1024 * 1024);

    // runs in the next concurrent queue, which is queue-2, and is serialized after vecMul
    cruncher.enqueueModeAsyncEnable = true;
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    dataArrayG.nextParam(dataArrayH, constant4).compute(cruncher, 1, "vecAddInt", 1024 * 1024);
    cruncher.enqueueModeAsyncEnable = false;

    cruncher.enqueueMode = false;
    benchStop();
}
```
Resulting timeline (each row is a concurrent queue):

```
*********vecAdd**********vecDiv*********
*********vecMul*********
*********vecAddInt******vecAddInt******vecAddInt******vecAddInt
```

Partial overlapping saves time.
v1.2.9_hotpatch
added ClArray.zeroCopy to let developers enable map/unmap at the array level instead of only at the device level.
If the device has its streaming parameter set and the array has its zeroCopy field set, kernels access these arrays without copying (when possible).
Once the device buffer is created in compute(), clearing or setting zeroCopy no longer has any effect for the device that was used in that compute().
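A minimal sketch of the per-array opt-in, using the `zeroCopy` field named above (the array size and type are illustrative):

```csharp
using Cekirdekler;

ClArray<float> data = new ClArray<float>(1024);
data.zeroCopy = true; // opt this array into map/unmap access on streaming-enabled devices
// set or clear zeroCopy before the first compute() on a device:
// once that device's buffer exists, changing the field has no effect there
```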
v1.2.9
added readOnly and writeOnly properties to the ClArray class, so some kernels may run up to 10% faster when one array is only read and another is only written
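A short sketch of the two properties named above (array names and sizes are illustrative):

```csharp
using Cekirdekler;

ClArray<float> src = new ClArray<float>(1024 * 1024);
ClArray<float> dst = new ClArray<float>(1024 * 1024);
src.readOnly = true;  // the kernel only reads this array
dst.writeOnly = true; // the kernel only writes this array
// with these hints, some kernels may run up to 10% faster
```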
v1.2.8
unnecessary clSetKernelArg commands are removed
added ClArray.writeAll, to get result arrays written back as a whole instead of only a number of elements; similar to non-partial reads via ClArray.read = true and ClArray.partialRead = false. If multiple GPUs are used, each GPU writes only one of the result arrays (instead of all GPUs writing the same array, which is undefined behavior).
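A sketch contrasting the new `writeAll` field with the non-partial read configuration it mirrors (array names and sizes are illustrative):

```csharp
using Cekirdekler;

ClArray<float> result = new ClArray<float>(1024 * 1024);
result.writeAll = true; // result array is written back as a whole

// the analogous non-partial read configuration:
ClArray<float> input = new ClArray<float>(1024 * 1024);
input.read = true;         // array is read as a kernel input
input.partialRead = false; // as a whole, not as element ranges
```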
C# char array bug fixed
Enqueue mode performance query bug fixed
v1.2.7
Added an enqueueMode flag to the number cruncher and pipeline stage classes, so they can do thousands of operations with just a single synchronization between host and device (up to 60x faster for light workloads)
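A sketch of the intended pattern, following the compute() call shape used in the v1.2.10 example; the kernel name and arrays are illustrative, and the assumption that the single synchronization happens when the flag is cleared is mine:

```csharp
using Cekirdekler;

cruncher.enqueueMode = true; // enqueue work without per-call host-device sync
for (int i = 0; i < 1000; i++)
    data.nextParam(parameters).compute(cruncher, 1, "lightKernel", 1024 * 1024);
cruncher.enqueueMode = false; // assumed: one synchronization covers all enqueued work
```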
v1.2.6
Added kernel(s) repeat feature to number cruncher.
Compatible with CekirdeklerCPP v1.2.6+
v1.2.5
can query device names of explicit device instances via logInfo (returns a multiline string, one line per device, including vendor names)
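A minimal sketch, matching the logInfo call shown in the v1.2.11 example above:

```csharp
using Cekirdekler;

var gpu = ClPlatforms.all().gpus()[0];
gpu.logInfo(); // prints one line per device, including vendor names
```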
v1.2.4_hotfix
Minor bug fix in the console-output algorithm related to the 'performanceFeed' flag.