Firedrake's strategy for GPU support #4586
connorjward started this conversation in General
Replies: 2 comments 2 replies
- This is super interesting. Thanks a lot for sharing! Do you have some estimates of the potential speed-ups? Also, what is the timeline for when this could land in Firedrake?
- This sounds great! What is the rationale for adopting IREE as the backend instead of upstream MLIR itself? You can generate/compile/orchestrate both GPU and host code with MLIR too. Admittedly IREE might be more feature-complete in terms of GPU op support at the moment compared to upstream MLIR, but I'm hopeful we can close the gap in the not-so-distant future.
Following exploratory work this summer with @indiamai (some notes here), we have decided on an approach for enabling GPU support within Firedrake. PETSc already has considerable GPU capability, so the plan here focuses on how we enable FE assembly to take place on the GPU.
What happens at the moment?
At present, code generation for assembly looks like this:
(Figure: the current compilation pipeline; roughly, UFL forms are lowered by TSFC to GEM, then to Impero and loopy, and finally to C code that is executed over the mesh.)
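For a flavour of the loop-level abstraction that TSFC currently targets, here is a minimal hand-written loopy kernel for a P1 mass matrix. This is a hypothetical stand-in, not the code TSFC actually emits:

```python
import numpy as np
import loopy as lp

# A hand-written stand-in for the kind of kernel TSFC generates:
# accumulate a 3x3 element mass matrix from a tabulated basis
# phi (quadrature point x basis function) and quadrature weights w.
knl = lp.make_kernel(
    "{ [i, j, q]: 0 <= i, j < 3 and 0 <= q < 4 }",
    "A[i, j] = sum(q, w[q] * phi[q, i] * phi[q, j])",
    target=lp.CTarget(),
)
knl = lp.add_dtypes(knl, {"A": np.float64, "w": np.float64, "phi": np.float64})

# loopy lowers the explicit loop nest to C source.
print(lp.generate_code_v2(knl).device_code())
```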
What is the new plan?
Over the project we learned that the machine learning community follow a very similar-looking approach when they run their models. They also have domain-specific languages that are JIT-compiled via multiple intermediate representations (using MLIR) into high-performance code. Of particular note for us is that along the way it is usual to have an IR consisting of linear algebra and tensor operations (using the `linalg` and `tensor` dialects of MLIR). Since Firedrake also has a tensor representation (GEM), the plan is therefore to take GEM and convert it into something that these other tools can process. Compared to lowering to Impero and loopy this has a number of advantages:
- The tensor-level IR (the `linalg` and `tensor` dialects) is at a higher level than the loop-based abstractions of loopy. This means that we can hand over responsibility for compilation at a higher level and thus need less bespoke code (a sketch of this tensor-level form follows below).

As a starting point we plan on using IREE as the backend. IREE provides both a `linalg`/`tensor` compiler and a runtime, so it should be able to handle both the compilation to CPU/GPU backends and the coordination of the computation's execution.

We should look to use xDSL to perform the GEM to `linalg`/`tensor` transformation. xDSL is a rewrite of MLIR in Python, so it should make writing any of our own compiler passes easier.

As a later goal we plan on targeting TileIR (eg). TileIR is a brand-new MLIR dialect and compiler from NVIDIA. It uses a block-based abstraction very similar to Triton. The abstraction is lower-level than `linalg`, so targeting TileIR will involve writing the appropriate lowering passes.
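To make the contrast concrete, here is a hedged sketch in plain NumPy (not actual `linalg`/`tensor` IR, and not real Firedrake code) of the tensor-level form of the same mass-matrix computation. The whole kernel collapses into a single named contraction, which is the level of abstraction at which IREE's `linalg`/`tensor` pipeline operates:

```python
import numpy as np

# Hypothetical tabulated P1 basis (4 quadrature points x 3 basis
# functions) and quadrature weights, mirroring the loopy sketch above.
rng = np.random.default_rng(0)
phi = rng.standard_normal((4, 3))
w = rng.standard_normal(4)

# One tensor contraction instead of an explicit loop nest: the
# backend keeps the freedom to pick loop orders, tiling and
# vectorisation itself.
A = np.einsum("q,qi,qj->ij", w, phi, phi)
assert A.shape == (3, 3)
```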
What are the challenges?
FEM vs ML tensors
The tensors in FEM are typically a lot smaller than those found in ML models. Clever batching strategies will therefore be needed in pyop3 and TSFC to use the hardware effectively; a sketch of the idea follows.
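As a hedged NumPy illustration (made-up sizes, not pyop3's actual strategy) of the kind of batching we have in mind: the cell index becomes a leading tensor dimension, so one operation covers the whole batch instead of launching a tiny kernel per cell.

```python
import numpy as np

# Made-up sizes: many cells, few basis functions per cell.
n_cells, n_basis, n_qp = 10_000, 3, 4
rng = np.random.default_rng(0)
phi = rng.standard_normal((n_qp, n_basis))   # shared tabulated basis
w = rng.standard_normal(n_qp)                # quadrature weights
detJ = rng.standard_normal(n_cells)          # per-cell geometry factor

# One contraction produces all 10,000 3x3 element matrices at once,
# yielding tensors large enough to keep a GPU busy.
A = np.einsum("c,q,qi,qj->cij", detJ, w, phi, phi)
assert A.shape == (n_cells, n_basis, n_basis)
```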
Determinants
It is apparently uncommon for machine learning models to need to compute determinants, and support is therefore spotty. We will need to make sure that whatever backend we end up using supports them; one possible fallback is sketched below.
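If a backend lacks a determinant primitive, one fallback (a hedged NumPy sketch, relying on the small fixed-size Jacobians typical of FEM) is to expand the determinant in closed form, which lowers to plain multiplies and adds that any tensor backend can handle:

```python
import numpy as np

def det3x3(J):
    # Closed-form cofactor expansion for a batch of 3x3 Jacobians,
    # shaped (n_cells, 3, 3). Only multiplies and adds, so it needs
    # no `det` op from the backend.
    a, b, c = J[:, 0, 0], J[:, 0, 1], J[:, 0, 2]
    d, e, f = J[:, 1, 0], J[:, 1, 1], J[:, 1, 2]
    g, h, i = J[:, 2, 0], J[:, 2, 1], J[:, 2, 2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

J = np.random.default_rng(0).standard_normal((1000, 3, 3))
assert np.allclose(det3x3(J), np.linalg.det(J))
```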
Gathers and scatters
Pack (gather) and unpack (scatter) operations are critical for FE assembly; a sketch of what they mean follows. They are expressible using the `tensor` dialect (`tensor.gather`, `tensor.scatter`), but we have found support to be patchy (e.g. with Triton) since they are not common operations in ML workloads. In principle things should work, but it is something to watch out for.
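As a hedged NumPy sketch (made-up sizes, not pyop3's actual implementation) of what pack and unpack mean here:

```python
import numpy as np

n_dofs, n_cells, n_basis = 12, 10, 3
rng = np.random.default_rng(0)
# Map from each cell to the global dof numbers of its local dofs.
cell_node_map = rng.integers(0, n_dofs, size=(n_cells, n_basis))
u_global = rng.standard_normal(n_dofs)

# Pack (gather): pull each cell's dof values out of the global vector.
u_local = u_global[cell_node_map]          # shape (n_cells, n_basis)

# ... the per-cell kernel would run here; identity for illustration ...
r_local = u_local

# Unpack (scatter-add): accumulate local contributions back into the
# global vector; np.add.at handles repeated indices (shared dofs).
r_global = np.zeros(n_dofs)
np.add.at(r_global, cell_node_map, r_local)
```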
What comes next?
We have multiple grant applications in the pipeline that would give us the funding to make this a reality. In the meantime we also hope to have a Master's student working to progress things.