Add example that profiles parallel sum #774

FattiMei · 2024-07-15T20:25:13Z

Description

This program profiles the parallel sum kernel when adding two arrays of increasing size, interpreting the data as char, int, float, and so on. The results are plotted with matplotlib, which is an external dependency but a common one.

Rationale

When experimenting with GPU computing, it is useful to estimate an upper bound on performance. This PR offers an example of how you could use pyopencl events to profile a kernel (I believe this measures only the kernel execution time, without the data transfers).
The results show a powerful idea in parallel computing: using shorter types improves throughput.
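For readers unfamiliar with event profiling, here is a minimal sketch of the idea, not the code from this PR. The kernel source, `elapsed_seconds`, and `profile_sum` are illustrative names. It assumes profiling timestamps are in nanoseconds, and that the returned event covers only the kernel command, so buffer creation should not be part of the measured interval.

```python
KERNEL_SRC = """
__kernel void sum(__global const float *x,
                  __global const float *y,
                  __global float *z)
{
    int i = get_global_id(0);
    z[i] = x[i] + y[i];
}
"""


def elapsed_seconds(start_ns, end_ns):
    """Convert two profiling timestamps (assumed nanoseconds) to seconds."""
    return (end_ns - start_ns) / 1e9


def profile_sum(n=1 << 20):
    """Time one launch of the sum kernel on n float32 elements.

    Imports are deferred so this module loads without an OpenCL runtime.
    """
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    # Profiling has to be requested when the queue is created.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    mf = cl.mem_flags
    x = np.random.rand(n).astype(np.float32)
    y = np.random.rand(n).astype(np.float32)
    x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
    y_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y)
    z_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    event = prg.sum(queue, (n,), None, x_buf, y_buf, z_buf)
    event.wait()  # timestamps are only meaningful once the command is done

    return elapsed_seconds(event.profile.start, event.profile.end)
```

Calling `profile_sum()` on a machine with an OpenCL device returns the kernel time in seconds for a single launch.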

Possible corner cases

I assumed that the profiling times are in nanoseconds.
I wanted to work with fp16, but not all devices support it, so I commented it out. Maybe we could query at runtime whether fp16 arithmetic is supported.
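One possible way to do the runtime query, sketched under the assumption that checking for the `cl_khr_fp16` extension in the device's extension string is sufficient. `has_fp16` and `fp16_devices` are hypothetical helper names, not part of pyopencl.

```python
def has_fp16(extensions):
    """True if a space-separated extension string lists cl_khr_fp16."""
    return "cl_khr_fp16" in extensions.split()


def fp16_devices():
    """Names of all visible devices that advertise cl_khr_fp16.

    The import is deferred so has_fp16 can be used without an OpenCL runtime.
    """
    import pyopencl as cl

    return [dev.name
            for platform in cl.get_platforms()
            for dev in platform.get_devices()
            if has_fp16(dev.extensions)]
```

The example could then add the half type to its list of tested types only when `has_fp16(device.extensions)` is true for the device in use.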

inducer

Thanks for contributing this. Some comments below.

inducer · 2024-07-16T01:13:02Z

examples/demo_flops.py

+            FLOPS = 1e9 * sums / (event.profile.end - event.profile.start)
+            GFLOPS = FLOPS / 1e6
+
+            data[row, col] = GFLOPS


Arguably, this workload will be bandwidth-bound, so GB/s will be the more appropriate measure.

I made this choice because GPU performance is commonly evaluated in TFLOPS (and that number is computed with similar workloads), and because it highlights the fact that FLOPS go up when working with smaller types.
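To make the two measures concrete, here is a small sketch of both conversions. It assumes nanosecond timings, one operation per output element, and that `z = x + y` moves three arrays' worth of data (two reads and one write). The function names and the timing value are hypothetical.

```python
def gflops(n_ops, elapsed_ns):
    """Operations per nanosecond is numerically the same as GFLOP/s."""
    return n_ops / elapsed_ns


def gbytes_per_s(n_elements, itemsize, elapsed_ns):
    """Effective bandwidth in GB/s, assuming 2 reads + 1 write per element."""
    bytes_moved = 3 * n_elements * itemsize
    return bytes_moved / elapsed_ns


if __name__ == "__main__":
    n = 1_000_000
    elapsed_ns = 1_000_000  # hypothetical timing, not a measured result
    for name, itemsize in [("char", 1), ("int", 4), ("float", 4)]:
        print(name, gflops(n, elapsed_ns),
              gbytes_per_s(n, itemsize, elapsed_ns))
```

If the kernel is bandwidth-bound, the GB/s figure should stay roughly constant across types while the operation rate rises for smaller types.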

inducer · 2024-07-16T01:15:50Z

examples/demo_flops.py

+            header  = f'#define T {literal}\n'
+            kernel  = cl.Program(ctx, header + src).build().sum
+
+            event   = kernel(queue, (sums,), None, x, y, z)


It's generally good practice to do a few "warmup" rounds before timing, to better measure the steady-state rate.

There is a problem with caches, however. In CPU runs I get implausibly high GFLOPS for medium-size arrays because they already live in the cache; the GPU doesn't seem to suffer from this problem.
With the new commits, though, one can choose to do no warmup runs and a single hot run, so it's OK.
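A rough sketch of a timing loop with configurable warmup, not the code added in the later commits. `steady_state_times` and `measure` are hypothetical names. `measure` stands for any callable that performs one launch and returns its elapsed time, for example from `event.profile`.

```python
def steady_state_times(measure, warmup=3, repeats=5):
    """Run `measure` warmup + repeats times and keep only the last `repeats`.

    `measure` is assumed to perform one kernel launch and return its
    elapsed time (e.g. event.profile.end - event.profile.start).
    """
    for _ in range(warmup):
        measure()  # warmup results are discarded
    return [measure() for _ in range(repeats)]
```

Setting `warmup=0, repeats=1` gives the single cold run described above, which avoids measuring data that is already cached. Larger values approximate the steady-state rate, and the samples can be reduced with `min` or `statistics.median`.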

inducer · 2024-07-16T01:16:13Z

examples/demo_flops.py

Could you look over the CI failures?

Add example that profiles parallel sum

15d981a

inducer reviewed Jul 16, 2024

FattiMei and others added 4 commits July 16, 2024 07:12

refactor: comply with ruff requirements

dfef25a

Add warm-up runs and multiple measurements per run

df99f35

Add matplotlib dependency to examples ci

72b1ab4

Merge branch 'main' into main

766873e
