Firedrake's strategy for GPU support #4586
connorjward started this conversation in General
Replies: 2 comments 2 replies
- This is super interesting. Thanks a lot for sharing! Do you have some estimates of the potential speed-ups? Also, what is the timeline for when this could land in Firedrake?
- This sounds great! What is the rationale for adopting IREE as the backend instead of upstream MLIR itself? You can generate/compile/orchestrate both GPU and host code with MLIR too. Admittedly IREE might be more feature-complete in terms of GPU op support at the moment compared to upstream MLIR, but I'm hopeful we can close the gap in the not-so-distant future.
Following exploratory work this summer with @indiamai (some notes here), we have decided on an approach for enabling GPU support within Firedrake. PETSc already has considerable GPU capability, so the plan here focuses on how we enable FE assembly to take place on the GPU.
What happens at the moment?
At present, code generation for assembly looks like this:
(Figure: the current compilation pipeline; roughly, UFL forms are lowered by TSFC to GEM, then to Impero and loopy, and finally to C code that is executed over the mesh.)
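For a flavour of the loop-level abstraction that TSFC currently targets, here is a minimal hand-written loopy kernel for a P1 mass matrix. This is a hypothetical stand-in, not the code TSFC actually emits:

```python
import numpy as np
import loopy as lp

# A hand-written stand-in for the kind of kernel TSFC generates:
# accumulate a 3x3 element mass matrix from a tabulated basis
# phi (quadrature point x basis function) and quadrature weights w.
knl = lp.make_kernel(
    "{ [i, j, q]: 0 <= i, j < 3 and 0 <= q < 4 }",
    "A[i, j] = sum(q, w[q] * phi[q, i] * phi[q, j])",
    target=lp.CTarget(),
)
knl = lp.add_dtypes(knl, {"A": np.float64, "w": np.float64, "phi": np.float64})

# loopy lowers the explicit loop nest to C source.
print(lp.generate_code_v2(knl).device_code())
```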
What is the new plan?
Over the project we learned that the machine learning community follow a very similar-looking approach when they run their models. They also have domain-specific languages that are JIT-compiled via multiple intermediate representations (using MLIR) into high-performance code. Of particular note for us is that along the way it is usual to have an IR consisting of linear algebra and tensor operations (using the `linalg` and `tensor` dialects of MLIR). Since Firedrake also has a tensor representation (GEM), the plan is therefore to take GEM and convert it into something that these other tools can process. Compared to lowering to Impero and loopy this has a number of advantages:
- The tensor-level IR (the `linalg` and `tensor` dialects) is at a higher level than the loop-based abstractions of loopy. This means that we can hand over responsibility for compilation at a higher level and thus need less bespoke code (a sketch of this tensor-level form follows below).

As a starting point we plan on using IREE as the backend. IREE provides both a `linalg`/`tensor` compiler and a runtime, so it should be able to handle both the compilation to CPU/GPU backends and the coordination of the computation's execution.

We should look to use xDSL to perform the GEM to `linalg`/`tensor` transformation. xDSL is a rewrite of MLIR in Python, so it should make writing any of our own compiler passes easier.

As a later goal we plan on targeting TileIR (eg). TileIR is a brand-new MLIR dialect and compiler from NVIDIA. It uses a block-based abstraction very similar to Triton. The abstraction is lower-level than `linalg`, so targeting TileIR will involve writing the appropriate lowering passes.
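To make the contrast concrete, here is a hedged sketch in plain NumPy (not actual `linalg`/`tensor` IR, and not real Firedrake code) of the tensor-level form of the same mass-matrix computation. The whole kernel collapses into a single named contraction, which is the level of abstraction at which IREE's `linalg`/`tensor` pipeline operates:

```python
import numpy as np

# Hypothetical tabulated P1 basis (4 quadrature points x 3 basis
# functions) and quadrature weights, mirroring the loopy sketch above.
rng = np.random.default_rng(0)
phi = rng.standard_normal((4, 3))
w = rng.standard_normal(4)

# One tensor contraction instead of an explicit loop nest: the
# backend keeps the freedom to pick loop orders, tiling and
# vectorisation itself.
A = np.einsum("q,qi,qj->ij", w, phi, phi)
assert A.shape == (3, 3)
```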
What are the challenges?
FEM vs ML tensors
The tensors in FEM are typically a lot smaller than those found in ML models. Clever batching strategies will therefore be needed in pyop3 and TSFC to use the hardware effectively; a sketch of the idea follows.
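As a hedged NumPy illustration (made-up sizes, not pyop3's actual strategy) of the kind of batching we have in mind: the cell index becomes a leading tensor dimension, so one operation covers the whole batch instead of launching a tiny kernel per cell.

```python
import numpy as np

# Made-up sizes: many cells, few basis functions per cell.
n_cells, n_basis, n_qp = 10_000, 3, 4
rng = np.random.default_rng(0)
phi = rng.standard_normal((n_qp, n_basis))   # shared tabulated basis
w = rng.standard_normal(n_qp)                # quadrature weights
detJ = rng.standard_normal(n_cells)          # per-cell geometry factor

# One contraction produces all 10,000 3x3 element matrices at once,
# yielding tensors large enough to keep a GPU busy.
A = np.einsum("c,q,qi,qj->cij", detJ, w, phi, phi)
assert A.shape == (n_cells, n_basis, n_basis)
```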
Determinants
It is apparently uncommon for machine learning models to need to compute determinants, and support is therefore spotty. We will need to make sure that whatever backend we end up using supports them; one possible fallback is sketched below.
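If a backend lacks a determinant primitive, one fallback (a hedged NumPy sketch, relying on the small fixed-size Jacobians typical of FEM) is to expand the determinant in closed form, which lowers to plain multiplies and adds that any tensor backend can handle:

```python
import numpy as np

def det3x3(J):
    # Closed-form cofactor expansion for a batch of 3x3 Jacobians,
    # shaped (n_cells, 3, 3). Only multiplies and adds, so it needs
    # no `det` op from the backend.
    a, b, c = J[:, 0, 0], J[:, 0, 1], J[:, 0, 2]
    d, e, f = J[:, 1, 0], J[:, 1, 1], J[:, 1, 2]
    g, h, i = J[:, 2, 0], J[:, 2, 1], J[:, 2, 2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

J = np.random.default_rng(0).standard_normal((1000, 3, 3))
assert np.allclose(det3x3(J), np.linalg.det(J))
```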
Gathers and scatters
Pack (gather) and unpack (scatter) operations are critical for FE assembly; a sketch of what they mean follows. They are expressible using the `tensor` dialect (`tensor.gather`, `tensor.scatter`), but we have found support to be patchy (e.g. with Triton) since they are not common operations in ML workloads. In principle things should work, but it is something to watch out for.
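As a hedged NumPy sketch (made-up sizes, not pyop3's actual implementation) of what pack and unpack mean here:

```python
import numpy as np

n_dofs, n_cells, n_basis = 12, 10, 3
rng = np.random.default_rng(0)
# Map from each cell to the global dof numbers of its local dofs.
cell_node_map = rng.integers(0, n_dofs, size=(n_cells, n_basis))
u_global = rng.standard_normal(n_dofs)

# Pack (gather): pull each cell's dof values out of the global vector.
u_local = u_global[cell_node_map]          # shape (n_cells, n_basis)

# ... the per-cell kernel would run here; identity for illustration ...
r_local = u_local

# Unpack (scatter-add): accumulate local contributions back into the
# global vector; np.add.at handles repeated indices (shared dofs).
r_global = np.zeros(n_dofs)
np.add.at(r_global, cell_node_map, r_local)
```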
What comes next?
We have multiple grant applications in the pipeline that would give us the funding to make this a reality. In the meantime we also hope to have a Master's student working to progress things.