Use unified memory for scalar indexing of permutation matrices #313
How should we approach doing the same for scalar indexing inside of GPUArrays.jl?
I recently reworked indexing in GPUArrays to make exactly this possible, see JuliaGPU/GPUArrays.jl#499 and JuliaGPU/CUDA.jl#2138 for an implementation.
This looks good to me.
Does this include scalar indexing inside of GPUArrays.jl? If not, should an issue be filed so we don't forget to eventually get to it?
I also noticed some stuff that's not really in the scope of this PR so I'll submit a separate one.
Metal Benchmarks
Benchmark suite | Current: 45fe9d1 | Previous: 71b784e | Ratio |
---|---|---|---|
private array/construct | 27791.666666666668 ns | 23715.25 ns | 1.17 |
private array/broadcast | 461833.5 ns | 474145.5 ns | 0.97 |
private array/random/randn/Float32 | 997792 ns | 994125 ns | 1.00 |
private array/random/randn!/Float32 | 622708 ns | 644458.5 ns | 0.97 |
private array/random/rand!/Int64 | 565833 ns | 569958 ns | 0.99 |
private array/random/rand!/Float32 | 581375 ns | 606250 ns | 0.96 |
private array/random/rand/Int64 | 885500 ns | 831750 ns | 1.06 |
private array/random/rand/Float32 | 963375 ns | 897625 ns | 1.07 |
private array/copyto!/gpu_to_gpu | 489791 ns | 660666 ns | 0.74 |
private array/copyto!/cpu_to_gpu | 744542 ns | 555208 ns | 1.34 |
private array/copyto!/gpu_to_cpu | 566020.5 ns | 709417 ns | 0.80 |
private array/accumulate/1d | 1431521 ns | 1430125 ns | 1.00 |
private array/accumulate/2d | 1475146 ns | 1499500 ns | 0.98 |
private array/iteration/findall/int | 2276479.5 ns | 2210520.5 ns | 1.03 |
private array/iteration/findall/bool | 2036750 ns | 2041209 ns | 1.00 |
private array/iteration/findfirst/int | 1685416.5 ns | 1704833 ns | 0.99 |
private array/iteration/findfirst/bool | 1667834 ns | 1645334 ns | 1.01 |
private array/iteration/scalar | 2383604 ns | 2430625 ns | 0.98 |
private array/iteration/logical | 3429416.5 ns | 3432895.5 ns | 1.00 |
private array/iteration/findmin/1d | 1792438 ns | 1763667 ns | 1.02 |
private array/iteration/findmin/2d | 1377625 ns | 1353479 ns | 1.02 |
private array/reductions/reduce/1d | 795479.5 ns | 730853.5 ns | 1.09 |
private array/reductions/reduce/2d | 725292 ns | 709708 ns | 1.02 |
private array/reductions/mapreduce/1d | 783999.5 ns | 800041 ns | 0.98 |
private array/reductions/mapreduce/2d | 718375 ns | 713125 ns | 1.01 |
private array/permutedims/4d | 951520.5 ns | 949333 ns | 1.00 |
private array/permutedims/2d | 924896 ns | 930958 ns | 0.99 |
private array/permutedims/3d | 999959 ns | 1018708.5 ns | 0.98 |
private array/copy | 865333 ns | 582583 ns | 1.49 |
latency/precompile | 4410091792 ns | 4403995333 ns | 1.00 |
latency/ttfp | 6887169042 ns | 6895957979 ns | 1.00 |
latency/import | 725338979.5 ns | 723655188 ns | 1.00 |
integration/metaldevrt | 753042 ns | 757604 ns | 0.99 |
integration/byval/slices=1 | 1542104 ns | 1623541 ns | 0.95 |
integration/byval/slices=3 | 8907292 ns | 8853854 ns | 1.01 |
integration/byval/reference | 1589042 ns | 1573521 ns | 1.01 |
integration/byval/slices=2 | 2730792 ns | 2624459 ns | 1.04 |
kernel/indexing | 462750 ns | 455583 ns | 1.02 |
kernel/indexing_checked | 435021 ns | 461916 ns | 0.94 |
kernel/launch | 10958 ns | 10875 ns | 1.01 |
metal/synchronization/stream | 19208 ns | 19250 ns | 1.00 |
metal/synchronization/context | 19708 ns | 19791 ns | 1.00 |
shared array/construct | 23972.25 ns | 23972.166666666668 ns | 1.00 |
shared array/broadcast | 466584 ns | 478708 ns | 0.97 |
shared array/random/randn/Float32 | 1017250 ns | 987500 ns | 1.03 |
shared array/random/randn!/Float32 | 626250 ns | 641062.5 ns | 0.98 |
shared array/random/rand!/Int64 | 573875 ns | 576520.5 ns | 1.00 |
shared array/random/rand!/Float32 | 585292 ns | 592333.5 ns | 0.99 |
shared array/random/rand/Int64 | 849417 ns | 870458 ns | 0.98 |
shared array/random/rand/Float32 | 888729 ns | 935229 ns | 0.95 |
shared array/copyto!/gpu_to_gpu | 537375 ns | 546667 ns | 0.98 |
shared array/copyto!/cpu_to_gpu | 83875 ns | 94125 ns | 0.89 |
shared array/copyto!/gpu_to_cpu | 85709 ns | 84208 ns | 1.02 |
shared array/accumulate/1d | 1428958.5 ns | 1434979 ns | 1.00 |
shared array/accumulate/2d | 1491271 ns | 1497729 ns | 1.00 |
shared array/iteration/findall/int | 2012833 ns | 1971125 ns | 1.02 |
shared array/iteration/findall/bool | 1778541 ns | 1777500 ns | 1.00 |
shared array/iteration/findfirst/int | 1415437.5 ns | 1410291 ns | 1.00 |
shared array/iteration/findfirst/bool | 1392125 ns | 1388708 ns | 1.00 |
shared array/iteration/scalar | 189208 ns | 189562.5 ns | 1.00 |
shared array/iteration/logical | 3203417 ns | 3205291 ns | 1.00 |
shared array/iteration/findmin/1d | 1491375 ns | 1479229 ns | 1.01 |
shared array/iteration/findmin/2d | 1389708 ns | 1373083.5 ns | 1.01 |
shared array/reductions/reduce/1d | 678083 ns | 616666 ns | 1.10 |
shared array/reductions/reduce/2d | 711229.5 ns | 716854.5 ns | 0.99 |
shared array/reductions/mapreduce/1d | 686437 ns | 686417 ns | 1.00 |
shared array/reductions/mapreduce/2d | 716667 ns | 710584 ns | 1.01 |
shared array/permutedims/4d | 950125 ns | 960250 ns | 0.99 |
shared array/permutedims/2d | 923583 ns | 925458.5 ns | 1.00 |
shared array/permutedims/3d | 998875 ns | 1015208.5 ns | 0.98 |
shared array/copy | 867500 ns | 598354.5 ns | 1.45 |
This comment was automatically generated by workflow using github-action-benchmark.