KernelIntrinsics #562

Open · vchuravy wants to merge 1 commit into main
Conversation

@vchuravy vchuravy (Member) commented Feb 4, 2025

The goal is to allow kernels to be written without relying on KernelAbstractions macros.

cc: @maleadt @pxl-th
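
As a rough illustration of the idea, a kernel written directly against the proposed intrinsics (no @kernel macro) might look like the sketch below. The saxpy example mirrors the benchmark suite; the function name, the 1D indexing, and the launch details are illustrative assumptions, not part of this PR's diff.

```julia
# Sketch only: a macro-free saxpy kernel using the proposed intrinsics.
# Assumes get_global_id() returns a NamedTuple with x, y, z fields, as in the
# docstrings discussed below.
function saxpy_kernel!(y, a, x)
    i = get_global_id().x
    if i <= length(y)          # explicit bounds check; no implicit @kernel handling
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end
```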

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 0% with 18 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (9741962) to head (8ea6ad2).

Files with missing lines | Patch % | Lines
src/KernelAbstractions.jl | 0.00% | 12 Missing ⚠️
src/pocl/backend.jl | 0.00% | 6 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff          @@
##            main    #562   +/-   ##
=====================================
  Coverage   0.00%   0.00%           
=====================================
  Files         21      21           
  Lines       1575    1581    +6     
=====================================
- Misses      1575    1581    +6     



Returns the unique local work-item ID.
"""
function get_local_id end
Member
So IIUC, backends should implement these like below, right?

function get_local_id()
    return (threadIdx().x, threadIdx().y, threadIdx().z)
end

Member Author
Yeah, basically. My goal is to replace the old internal functions that people had to override with definitions based on these functions.
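
As a sketch of what that could look like on the backend side, connecting the snippet above with the NamedTuple return type used in the docstrings (the use of CUDA.jl's threadIdx() and the exact override mechanism are assumptions, not part of this PR):

```julia
# Hypothetical backend-side definition (CUDA.jl-style); sketch only.
# threadIdx() is 1-based and returns Int32 components in CUDA.jl.
function get_local_id()
    t = threadIdx()
    return (; x = Int(t.x), y = Int(t.y), z = Int(t.z))
end
```

This keeps the per-backend surface to a handful of intrinsic definitions, with the shared index helpers written once in terms of them.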

vchuravy (Member Author) commented Feb 5, 2025

This stack of pull requests is managed by Graphite.

github-actions bot (Contributor) commented Feb 5, 2025

Benchmark Results

Columns: benchmark, time on main, time at 8ea6ad2... (this PR's head), ratio (main / 8ea6ad2457a1cf...).
saxpy/default/Float16/1024 0.0496 ± 0.026 ms 0.0612 ± 0.026 ms 0.811
saxpy/default/Float16/1048576 0.896 ± 0.024 ms 0.899 ± 0.023 ms 0.997
saxpy/default/Float16/16384 0.0639 ± 0.028 ms 0.0658 ± 0.028 ms 0.971
saxpy/default/Float16/2048 0.05 ± 0.023 ms 0.0583 ± 0.024 ms 0.858
saxpy/default/Float16/256 0.0603 ± 0.027 ms 0.064 ± 0.027 ms 0.943
saxpy/default/Float16/262144 0.273 ± 0.026 ms 0.273 ± 0.026 ms 0.998
saxpy/default/Float16/32768 0.0772 ± 0.028 ms 0.0779 ± 0.028 ms 0.991
saxpy/default/Float16/4096 0.0619 ± 0.025 ms 0.0642 ± 0.024 ms 0.964
saxpy/default/Float16/512 0.0597 ± 0.027 ms 0.0599 ± 0.026 ms 0.997
saxpy/default/Float16/64 0.062 ± 0.027 ms 0.063 ± 0.026 ms 0.984
saxpy/default/Float16/65536 0.108 ± 0.028 ms 0.11 ± 0.028 ms 0.98
saxpy/default/Float32/1024 0.0574 ± 0.026 ms 0.0585 ± 0.026 ms 0.982
saxpy/default/Float32/1048576 0.484 ± 0.035 ms 0.481 ± 0.031 ms 1.01
saxpy/default/Float32/16384 0.0563 ± 0.026 ms 0.0576 ± 0.026 ms 0.978
saxpy/default/Float32/2048 0.0502 ± 0.024 ms 0.0545 ± 0.024 ms 0.921
saxpy/default/Float32/256 0.0572 ± 0.026 ms 0.0616 ± 0.026 ms 0.927
saxpy/default/Float32/262144 0.167 ± 0.035 ms 0.168 ± 0.034 ms 0.994
saxpy/default/Float32/32768 0.0622 ± 0.027 ms 0.0636 ± 0.027 ms 0.977
saxpy/default/Float32/4096 0.0594 ± 0.025 ms 0.0625 ± 0.025 ms 0.95
saxpy/default/Float32/512 0.0573 ± 0.027 ms 0.0638 ± 0.026 ms 0.898
saxpy/default/Float32/64 0.0591 ± 0.026 ms 0.0635 ± 0.026 ms 0.93
saxpy/default/Float32/65536 0.0827 ± 0.029 ms 0.0821 ± 0.028 ms 1.01
saxpy/default/Float64/1024 0.0509 ± 0.026 ms 0.0607 ± 0.026 ms 0.838
saxpy/default/Float64/1048576 0.516 ± 0.042 ms 0.517 ± 0.038 ms 0.998
saxpy/default/Float64/16384 0.0582 ± 0.026 ms 0.0583 ± 0.026 ms 1
saxpy/default/Float64/2048 0.048 ± 0.023 ms 0.0519 ± 0.024 ms 0.925
saxpy/default/Float64/256 0.061 ± 0.027 ms 0.061 ± 0.026 ms 1
saxpy/default/Float64/262144 0.176 ± 0.028 ms 0.177 ± 0.029 ms 0.99
saxpy/default/Float64/32768 0.0666 ± 0.026 ms 0.0666 ± 0.026 ms 1
saxpy/default/Float64/4096 0.0537 ± 0.025 ms 0.0629 ± 0.025 ms 0.854
saxpy/default/Float64/512 0.0551 ± 0.027 ms 0.0611 ± 0.027 ms 0.901
saxpy/default/Float64/64 0.0566 ± 0.027 ms 0.0644 ± 0.027 ms 0.878
saxpy/default/Float64/65536 0.0897 ± 0.027 ms 0.0915 ± 0.027 ms 0.981
saxpy/static workgroup=(1024,)/Float16/1024 0.0482 ± 0.026 ms 0.0549 ± 0.026 ms 0.877
saxpy/static workgroup=(1024,)/Float16/1048576 0.913 ± 0.027 ms 0.904 ± 0.028 ms 1.01
saxpy/static workgroup=(1024,)/Float16/16384 0.0626 ± 0.026 ms 0.0616 ± 0.026 ms 1.02
saxpy/static workgroup=(1024,)/Float16/2048 0.0515 ± 0.024 ms 0.0577 ± 0.024 ms 0.892
saxpy/static workgroup=(1024,)/Float16/256 0.0594 ± 0.026 ms 0.0598 ± 0.026 ms 0.993
saxpy/static workgroup=(1024,)/Float16/262144 0.274 ± 0.028 ms 0.272 ± 0.027 ms 1.01
saxpy/static workgroup=(1024,)/Float16/32768 0.0771 ± 0.026 ms 0.0761 ± 0.026 ms 1.01
saxpy/static workgroup=(1024,)/Float16/4096 0.0565 ± 0.027 ms 0.0587 ± 0.026 ms 0.962
saxpy/static workgroup=(1024,)/Float16/512 0.0569 ± 0.026 ms 0.0586 ± 0.026 ms 0.971
saxpy/static workgroup=(1024,)/Float16/64 0.0592 ± 0.026 ms 0.062 ± 0.025 ms 0.956
saxpy/static workgroup=(1024,)/Float16/65536 0.108 ± 0.027 ms 0.106 ± 0.026 ms 1.02
saxpy/static workgroup=(1024,)/Float32/1024 0.0538 ± 0.026 ms 0.0586 ± 0.027 ms 0.919
saxpy/static workgroup=(1024,)/Float32/1048576 0.472 ± 0.043 ms 0.472 ± 0.038 ms 1
saxpy/static workgroup=(1024,)/Float32/16384 0.0551 ± 0.025 ms 0.0561 ± 0.026 ms 0.98
saxpy/static workgroup=(1024,)/Float32/2048 0.0481 ± 0.023 ms 0.0548 ± 0.023 ms 0.878
saxpy/static workgroup=(1024,)/Float32/256 0.0615 ± 0.026 ms 0.0634 ± 0.026 ms 0.97
saxpy/static workgroup=(1024,)/Float32/262144 0.162 ± 0.035 ms 0.164 ± 0.035 ms 0.986
saxpy/static workgroup=(1024,)/Float32/32768 0.0595 ± 0.026 ms 0.0614 ± 0.026 ms 0.969
saxpy/static workgroup=(1024,)/Float32/4096 0.0584 ± 0.026 ms 0.0602 ± 0.026 ms 0.971
saxpy/static workgroup=(1024,)/Float32/512 0.0581 ± 0.026 ms 0.0601 ± 0.026 ms 0.968
saxpy/static workgroup=(1024,)/Float32/64 0.0576 ± 0.026 ms 0.0599 ± 0.025 ms 0.961
saxpy/static workgroup=(1024,)/Float32/65536 0.0806 ± 0.029 ms 0.0827 ± 0.028 ms 0.975
saxpy/static workgroup=(1024,)/Float64/1024 0.057 ± 0.026 ms 0.0536 ± 0.026 ms 1.06
saxpy/static workgroup=(1024,)/Float64/1048576 0.501 ± 0.04 ms 0.517 ± 0.041 ms 0.969
saxpy/static workgroup=(1024,)/Float64/16384 0.0572 ± 0.026 ms 0.0595 ± 0.025 ms 0.962
saxpy/static workgroup=(1024,)/Float64/2048 0.0486 ± 0.023 ms 0.0539 ± 0.023 ms 0.901
saxpy/static workgroup=(1024,)/Float64/256 0.0575 ± 0.026 ms 0.0608 ± 0.026 ms 0.946
saxpy/static workgroup=(1024,)/Float64/262144 0.175 ± 0.031 ms 0.174 ± 0.029 ms 1.01
saxpy/static workgroup=(1024,)/Float64/32768 0.0665 ± 0.026 ms 0.0665 ± 0.025 ms 1
saxpy/static workgroup=(1024,)/Float64/4096 0.0546 ± 0.026 ms 0.0589 ± 0.026 ms 0.927
saxpy/static workgroup=(1024,)/Float64/512 0.0527 ± 0.026 ms 0.0593 ± 0.026 ms 0.889
saxpy/static workgroup=(1024,)/Float64/64 0.0584 ± 0.026 ms 0.0585 ± 0.026 ms 0.998
saxpy/static workgroup=(1024,)/Float64/65536 0.0883 ± 0.027 ms 0.0888 ± 0.027 ms 0.994
time_to_load 1.19 ± 0.003 s 1.15 ± 0.01 s 1.04

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

function get_global_size end

"""
get_global_id()::@NamedTuple{x::Int32, y::Int32, z::Int32}
Member Author
Should this be Int32 or Int64?

Member Author
OpenCL defines these as Csize_t

Member
Is there any way to allow the platform to define the specific <:Integer index type? Metal uses uint3 by default, which is three UInt32 values. I liked that CUDA Thrust allowed the indices to be templated, so I could use Int64 only when dealing with billions of data points.

Member Author
GPU dev call: We should make these Int/size_t
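
In code, the documented signatures would then move in this direction; this is a sketch of the agreed outcome of the call, not the actual diff. On typical 64-bit targets, Julia's Int is pointer-sized and so lines up with size_t.

```julia
"""
    get_global_id()::@NamedTuple{x::Int, y::Int, z::Int}

Returns the unique global work-item ID.
"""
function get_global_id end
```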

@maleadt maleadt (Member) commented Feb 6, 2025

So the idea is to decouple the back-ends from KA.jl, having them implement KernelIntrinsics.jl instead? What's the advantage; do you envision packages other than KA.jl building their kernel DSL on top of KernelIntrinsics.jl?

@github-actions github-actions bot (Contributor) left a comment
Some suggestions could not be made:

  • src/pocl/nanoOpenCL.jl
    • lines 670-674

@vchuravy vchuravy changed the base branch from vc/pocl to 02-07-allow_opt-out_of_implicit_bounds-checking on February 7, 2025 11:31
@anicusan anicusan (Member) commented Feb 7, 2025

Will KA/KI still be a greatest common denominator of the GPU backends, or are you looking to introduce optional intrinsics? How will the groupreduce API do in terms of portability?

@vchuravy vchuravy (Member Author) commented Feb 7, 2025

> Will KA/KI still be a greatest common denominator of the GPU backends

The intrinsics proposed here are the greatest common denominator. I could see us adding some more intrinsics for reductions, but that is TBD.
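
For context on why reductions are harder: a group-level reduction cannot be written with the index intrinsics alone; it also needs work-group local memory and a barrier. The sketch below is purely illustrative, and localmemory, barrier, and get_local_size are hypothetical names that are not part of this PR.

```julia
# Purely illustrative sketch of a groupreduce on top of such intrinsics.
# Assumes 1-based work-item IDs and hypothetical localmemory/barrier/get_local_size
# intrinsics; none of these are defined by this PR.
function groupreduce(op, val::T) where {T}
    lid = get_local_id().x
    lsz = get_local_size().x
    scratch = localmemory(T, 1024)   # assumed static work-group local allocation (size illustrative)
    scratch[lid] = val
    barrier()                        # assumed work-group synchronization
    s = 1
    while s < lsz
        if (lid - 1) % (2s) == 0 && lid + s <= lsz
            scratch[lid] = op(scratch[lid], scratch[lid + s])
        end
        barrier()
        s *= 2
    end
    return scratch[1]
end
```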

@vchuravy vchuravy force-pushed the 02-07-allow_opt-out_of_implicit_bounds-checking branch from 48e3752 to e565304 on February 7, 2025 13:51
@vchuravy vchuravy changed the base branch from 02-07-allow_opt-out_of_implicit_bounds-checking to vc/pocl on February 7, 2025 13:52
@vchuravy vchuravy force-pushed the vc/pocl branch 2 times, most recently from 777c099 to 3bb80ac on February 12, 2025 15:23
@vchuravy vchuravy force-pushed the vc/intrinsics branch 2 times, most recently from 0d72b34 to 1f3b249 on February 12, 2025 15:25
@vchuravy vchuravy changed the base branch from vc/pocl to graphite-base/562 on February 17, 2025 12:42
@vchuravy vchuravy changed the base branch from graphite-base/562 to main on February 17, 2025 12:47