[WIP] PFT rewrite-based do-concurrent parallelization #230

skatrak · 2024-12-12T17:12:50Z

This is a proof of concept of a PFT rewrite-based approach to OpenMP-based parallelization of do concurrent Fortran loops. The main advantage of this approach over an MLIR pass-based one is that it should allow us to avoid re-implementing and sharing significant pieces of PFT to MLIR lowering between Flang lowering and the MLIR pass, potentially also making it much simpler to keep feature parity.

The current WIP replicates the PFT structure of an !$omp parallel do when encountering a do concurrent loop. It is still in very early stages and the resulting PFT cannot be lowered to MLIR yet, as it seems to be missing some symbol updates. However, it can already be tested:

! test.f90
subroutine foo()
  implicit none
  integer :: i

  do concurrent(i=1:10)
  end do

  !$omp parallel do
  do i=1,10
  end do
end subroutine

$ flang-new -fc1 -fdebug-unparse -fopenmp test.f90
SUBROUTINE foo
 IMPLICIT NONE
 INTEGER i
!$OMP PARALLEL DO
 DO i=1_4,10_4
 END DO
!$OMP PARALLEL DO
 DO i=1_4,10_4
 END DO
END SUBROUTINE

$ flang-new -fc1 -fdebug-dump-parse-tree -fopenmp test.f90
Program -> ProgramUnit -> SubroutineSubprogram
| SubroutineStmt
| | Name = 'foo'
| SpecificationPart
| | ImplicitPart -> ImplicitPartStmt -> ImplicitStmt ->
| | DeclarationConstruct -> SpecificationConstruct -> TypeDeclarationStmt
| | | DeclarationTypeSpec -> IntrinsicTypeSpec -> IntegerTypeSpec ->
| | | EntityDecl
| | | | Name = 'i'
| ExecutionPart -> Block
| | ExecutionPartConstruct -> ExecutableConstruct -> OpenMPConstruct -> OpenMPLoopConstruct
| | | OmpBeginLoopDirective
| | | | OmpLoopDirective -> llvm::omp::Directive = parallel do
| | | | OmpClauseList ->
| | | DoConstruct
| | | | NonLabelDoStmt
| | | | | LoopControl -> LoopBounds
| | | | | | Scalar -> Name = 'i'
| | | | | | Scalar -> Expr = '1_4'
| | | | | | | LiteralConstant -> IntLiteralConstant = '1'
| | | | | | Scalar -> Expr = '10_4'
| | | | | | | LiteralConstant -> IntLiteralConstant = '10'
| | | | Block
| | | | EndDoStmt ->
| | ExecutionPartConstruct -> ExecutableConstruct -> OpenMPConstruct -> OpenMPLoopConstruct
| | | OmpBeginLoopDirective
| | | | OmpLoopDirective -> llvm::omp::Directive = parallel do
| | | | OmpClauseList ->
| | | DoConstruct
| | | | NonLabelDoStmt
| | | | | LoopControl -> LoopBounds
| | | | | | Scalar -> Name = 'i'
| | | | | | Scalar -> Expr = '1_4'
| | | | | | | LiteralConstant -> IntLiteralConstant = '1'
| | | | | | Scalar -> Expr = '10_4'
| | | | | | | LiteralConstant -> IntLiteralConstant = '10'
| | | | Block
| | | | EndDoStmt ->
| EndSubroutineStmt ->


ergawy · 2024-12-13T05:17:21Z

Thanks Sergio for working on this.

At this stage, this definitely looks simpler than the pass solution. The initial WIP for the pass (here: https://github.com/llvm/llvm-project/pull/77285/files) was similar to your proposal in terms of simplicity; if you remove all comments, tests, and boilerplate, you end up with a few lines of logic to do the actual conversion. However, I do understand that PFT rewriting will probably be much simpler than the pass when we map to target teams distribute parallel do.

One important point I would like to make clear: the issues we are currently facing with do concurrent mapping are not related to rewriting the loops for the host or the device; that part has been quite stable since we took in the recent changes to combined and composite constructs and to wrapper ops. The main problems we need to solve at this point cross from the do concurrent domain into the OpenMP-proper domain. The most notable one is handling mapping without polluting the user codebase with OpenMP (or other parallelism model) constructs. To do that, we need to extend our implicit mapping logic quite extensively. This is a problem we will face even if we do PFT rewriting. There are other problems as well, none of which require touching do concurrent mapping at all; the work for them is in OpenMP-land.
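To make the "pollution" concern concrete, here is a minimal sketch. The module and procedure names are made up for illustration, and the `map` clauses are an assumption that fits this one loop only; which mappings the implicit logic should infer in general is exactly the open question. The first procedure is what a user would like to keep writing, the second is the hand-annotated form they would otherwise need for offloading.

```fortran
module mapping_sketch
  implicit none
contains

  ! What the user writes: plain Fortran, no parallelism-model constructs.
  subroutine scale_plain(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(:)
    integer :: i

    do concurrent (i = 1:size(a))
      b(i) = 2.0 * a(i)
    end do
  end subroutine scale_plain

  ! Hand-written OpenMP form with explicit data mapping.
  ! The map types are chosen by hand for this loop (assumption).
  subroutine scale_explicit(a, b)
    real, intent(in)  :: a(:)
    real, intent(out) :: b(:)
    integer :: i

    !$omp target teams distribute parallel do map(to: a) map(from: b)
    do i = 1, size(a)
      b(i) = 2.0 * a(i)
    end do
  end subroutine scale_explicit

end module mapping_sketch
```

Both procedures compute the same result; when compiled without OpenMP enabled, the directive is treated as a comment and the second loop simply runs sequentially.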

Additionally, the PFT rewriting approach is simpler but, I think, also quite limiting, for the following reasons:

It won't be easy to extend it to analyze whether the mapped loops are safe to parallelize (which is something we need to do eventually, at least to warn the user whenever possible). See: https://github.com/llvm/llvm-project/blob/main/flang/docs/DoConcurrent.md. At the very least, it won't be as easy to do that at the PFT level as at the MLIR level. This is something that we have to consider if we want to upstream this work.
With the pass, we can detect at least slightly more complex loop nests and automatically map them to proper parallel constructs. We don't have to limit ourselves to rewriting the AST of one do concurrent loop at a time.
With the pass, we can convert do concurrent loops created by the compiler itself if we choose to. For example, OptimizedBufferization.cpp generates loop nests from hlfir.elemental operations. We do not have that flexibility with the PFT rewriting.
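As a rough illustration of the loop-nest point, consider a perfect nest of two single-index do concurrent loops. The thread does not spell out which nest shapes the pass recognizes, so this simple case is an assumption, and the names below are hypothetical. Treating the nest as a unit allows a single collapsed OpenMP loop, whereas rewriting one loop at a time would presumably wrap each level in its own `parallel do`.

```fortran
module nest_sketch
  implicit none
contains

  ! Source form: two perfectly nested do concurrent loops.
  subroutine add_nested(a, b, c)
    real, intent(in)  :: a(:, :), b(:, :)
    real, intent(out) :: c(:, :)
    integer :: i, j

    do concurrent (j = 1:size(a, 2))
      do concurrent (i = 1:size(a, 1))
        c(i, j) = a(i, j) + b(i, j)
      end do
    end do
  end subroutine add_nested

  ! One possible hand-written target: the whole nest as one collapsed loop.
  subroutine add_collapsed(a, b, c)
    real, intent(in)  :: a(:, :), b(:, :)
    real, intent(out) :: c(:, :)
    integer :: i, j

    !$omp parallel do collapse(2)
    do j = 1, size(a, 2)
      do i = 1, size(a, 1)
        c(i, j) = a(i, j) + b(i, j)
      end do
    end do
  end subroutine add_collapsed

end module nest_sketch
```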

Admittedly, the pass looks like a lot of code compared to the PFT rewriting at the current stage of the PR. However, much of that code is:

Utils that can be easily shared between OpenMP-proper and the pass; for example, creating map.info ops and mapping temporary values.
Utils that can be easily reused if we want to target other models than OpenMP; for example, loop-nest detection and induction variable detection. This is something that we have to consider if we want to upstream our work.
Boilerplate to set up the pass, patterns, and such.
All of this accounts for more than 50% of the pass's code, so the core is not as huge as it looks.

The pass has been validated on LBL's inference engine (which is quite a large codebase with annoying features):

On the CPU, the inference engine already runs with OpenMP-like speedups.
On the GPU, we got to a stage where it compiles and links, but we need to solve some runtime issues, the biggest of which is the mapping issue I explained above (which is not do concurrent specific).
So, I think it is more productive to focus our efforts here.

I have to admit that I am biased though. The pass is one of my ugly babies that I have contributed since joining the team. Therefore, I am adding Michael Klemm and Michael Kruse to chime in. Maybe they have further input. And it is a very nice discussion to have regardless of the result, so thanks for opening the WIP.

Meinersbur · 2024-12-16T12:37:51Z

I share @ergawy's concerns here. DO CONCURRENT should regularly need program analysis, for instance regarding localization rules. Just adding a default(firstprivate) clause will probably not do it, some variable may need to be lastprivate or reduction. https://flang.llvm.org/docs/DoConcurrent.html shows how much more complexity there is. Another problem arises when the DO CONCURRENT is already nested in an OpenMP parallel construct, including loop and simd.

For our first implementation, which must be explicitly enabled by the user with -fparallelize-do-concurrent, we can make some liberal assumptions, but I think the end goal is for the compiler to be able to parallelize by itself, without making the additional assumptions required by !$OMP PARALLEL DO.

mjklemm · 2024-12-17T10:01:01Z

I also agree that for a proper translation of DO CONCURRENT we need an analysis-based approach and cannot simply rely on a mechanical translation.

Just adding a default(firstprivate) clause will probably not do it, some variable may need to be lastprivate or reduction

firstprivate won't cut it anyways, as it's the wrong thread-privatization clause for DO CONCURRENT.
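The thread does not say which clause would be the right one. As one illustration of why OpenMP privatization clauses and DO CONCURRENT locality do not line up mechanically, the sketch below (hypothetical names, and not necessarily the case the comment has in mind) contrasts Fortran 2018 `local_init` with `firstprivate`. With `local_init`, every iteration starts from the outside value of the variable; with `firstprivate`, each thread's copy is initialized once, so a thread that runs several iterations carries the value from one iteration into the next. It needs a compiler that implements Fortran 2018 locality specifiers.

```fortran
module locality_sketch
  implicit none
contains

  ! LOCAL_INIT: each iteration gets a copy of t initialized from the
  ! outside variable, so b(i) is always x + i.
  subroutine with_local_init(x, b)
    integer, intent(in)  :: x
    integer, intent(out) :: b(:)
    integer :: i, t

    t = x
    do concurrent (i = 1:size(b)) local_init(t)
      t = t + i          ! modifies only this iteration's copy
      b(i) = t
    end do
  end subroutine with_local_init

  ! firstprivate: the copy is initialized once per thread, not once per
  ! iteration, so the values in b depend on how iterations are
  ! distributed across threads.
  subroutine with_firstprivate(x, b)
    integer, intent(in)  :: x
    integer, intent(out) :: b(:)
    integer :: i, t

    t = x
    !$omp parallel do firstprivate(t)
    do i = 1, size(b)
      t = t + i          ! accumulates across this thread's iterations
      b(i) = t
    end do
  end subroutine with_firstprivate

end module locality_sketch
```

Calling `with_local_init(10, b)` with a four-element `b` yields 11, 12, 13, 14 regardless of scheduling, while the `firstprivate` variant only matches that when no thread executes more than one iteration. The `local` specifier is closer to `private`, in that neither initializes the copy.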

skatrak · 2025-01-08T13:13:26Z

Thank you @ergawy, @Meinersbur and @mjklemm for your comments. The idea with this was mainly to give a preview of how this feature could be implemented following a different approach than what we currently have. The main benefit would be that we would be able to work directly in terms of Fortran code to OpenMP construct translations, and rely on the existing infrastructure to lower these resulting high-level constructs.

I understand that there are many edge cases and analyses that are needed, and that we can't really translate every case in a straightforward way, but I'm not sure I follow the specific concern about doing these at the PFT level as opposed to MLIR. We'd be looking for specific language patterns, so the PFT seems to be a good place to do this, since it's also where semantic checks are done. It also seems like it would be easier to e.g. add a firstprivate(x) clause to the PFT than an MLIR privatizer for x based on its type, adding it to a private operand of the applicable OpenMP MLIR operation and then replacing nested references with the new entry block argument created for it.

In any case, I'm not against the current pass approach. It's already developed and supports many cases, so it makes sense for it to be the preferred approach unless we find out about important limitations or we find there's a much simpler alternative. I was hoping the PFT rewrite approach would potentially be that second case, but I can see that most of the important issues both approaches will have to deal with are actually the same.

I think we just need to focus on making sure OpenMP lowering and the do concurrent transformation pass are able to effectively share code, to keep to a minimum the chance for divergence in supported features by both and to avoid making things harder for ourselves by having to re-implement significant amounts of MLIR code generation.

skatrak requested review from ergawy and kparzysz December 12, 2024 17:12

ergawy requested a review from mjklemm December 13, 2024 05:18

mjklemm closed this Dec 17, 2024

mjklemm reopened this Dec 17, 2024

skatrak closed this Jan 9, 2025
