[MLIR][OpenMP] Skip host omp ops when compiling for the target device #85239
Conversation
@llvm/pr-subscribers-flang-openmp @llvm/pr-subscribers-mlir-openmp

Author: Jan Leyonberg (jsjodin)

Changes

This patch separates the lowering dispatch for host and target devices. For the target device, if the current operation is not a top-level operation (e.g. omp.target) or is inside a target device code region, it will be ignored, since it belongs to the host code. This is an alternative approach to #84611; the new test in this PR was taken from there.

Full diff: https://github.com/llvm/llvm-project/pull/85239.diff

6 Files Affected:
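To make the behavior concrete, here is a minimal MLIR sketch (hand-written for this description, not taken from the PR; the function name is hypothetical) of a module compiled for the target device. With this patch, host-only ops such as the omp.barrier below are skipped during device translation, while the omp.target region is still lowered because it contains device code.

module attributes {omp.is_target_device = true} {
  llvm.func @hypothetical_host_func() {
    // Host-only construct: skipped when translating for the device.
    omp.barrier
    // Top-level target op: still lowered; its region is device code.
    omp.target {
      omp.terminator
    }
    llvm.return
  }
}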
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index bef227f2c583ed..9ad0b99078005b 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -2922,6 +2922,178 @@ convertDeclareTargetAttr(Operation *op, mlir::omp::DeclareTargetAttr attribute,
return success();
}
+static bool isInternalTargetDeviceOp(Operation *op) {
+ // Assumes no reverse offloading
+ if (op->getParentOfType<omp::TargetOp>())
+ return true;
+
+ auto parentFn = op->getParentOfType<LLVM::LLVMFuncOp>();
+ if (auto declareTargetIface =
+ llvm::dyn_cast<mlir::omp::DeclareTargetInterface>(
+ parentFn.getOperation()))
+ if (declareTargetIface.isDeclareTarget() &&
+ declareTargetIface.getDeclareTargetDeviceType() !=
+ mlir::omp::DeclareTargetDeviceType::host)
+ return true;
+
+ return false;
+}
+
+/// Given an OpenMP MLIR operation, create the corresponding LLVM IR
+/// (including OpenMP runtime calls).
+static LogicalResult
+convertCommonOperation(Operation *op, llvm::IRBuilderBase &builder,
+ LLVM::ModuleTranslation &moduleTranslation) {
+
+ llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder();
+
+ return llvm::TypeSwitch<Operation *, LogicalResult>(op)
+ .Case([&](omp::BarrierOp) {
+ ompBuilder->createBarrier(builder.saveIP(), llvm::omp::OMPD_barrier);
+ return success();
+ })
+ .Case([&](omp::TaskwaitOp) {
+ ompBuilder->createTaskwait(builder.saveIP());
+ return success();
+ })
+ .Case([&](omp::TaskyieldOp) {
+ ompBuilder->createTaskyield(builder.saveIP());
+ return success();
+ })
+ .Case([&](omp::FlushOp) {
+ // No support in Openmp runtime function (__kmpc_flush) to accept
+ // the argument list.
+ // OpenMP standard states the following:
+ // "An implementation may implement a flush with a list by ignoring
+ // the list, and treating it the same as a flush without a list."
+ //
+ // The argument list is discarded so that, flush with a list is treated
+ // same as a flush without a list.
+ ompBuilder->createFlush(builder.saveIP());
+ return success();
+ })
+ .Case([&](omp::ParallelOp op) {
+ return convertOmpParallel(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::ReductionOp reductionOp) {
+ return convertOmpReductionOp(reductionOp, builder, moduleTranslation);
+ })
+ .Case([&](omp::MasterOp) {
+ return convertOmpMaster(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::CriticalOp) {
+ return convertOmpCritical(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::OrderedRegionOp) {
+ return convertOmpOrderedRegion(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::OrderedOp) {
+ return convertOmpOrdered(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::WsLoopOp) {
+ return convertOmpWsLoop(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::SimdLoopOp) {
+ return convertOmpSimdLoop(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::AtomicReadOp) {
+ return convertOmpAtomicRead(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::AtomicWriteOp) {
+ return convertOmpAtomicWrite(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::AtomicUpdateOp op) {
+ return convertOmpAtomicUpdate(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::AtomicCaptureOp op) {
+ return convertOmpAtomicCapture(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::SectionsOp) {
+ return convertOmpSections(*op, builder, moduleTranslation);
+ })
+ .Case([&](omp::SingleOp op) {
+ return convertOmpSingle(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::TeamsOp op) {
+ return convertOmpTeams(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::TaskOp op) {
+ return convertOmpTaskOp(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::TaskGroupOp op) {
+ return convertOmpTaskgroupOp(op, builder, moduleTranslation);
+ })
+ .Case<omp::YieldOp, omp::TerminatorOp, omp::ReductionDeclareOp,
+ omp::CriticalDeclareOp>([](auto op) {
+ // `yield` and `terminator` can be just omitted. The block structure
+ // was created in the region that handles their parent operation.
+ // `reduction.declare` will be used by reductions and is not
+ // converted directly, skip it.
+ // `critical.declare` is only used to declare names of critical
+ // sections which will be used by `critical` ops and hence can be
+ // ignored for lowering. The OpenMP IRBuilder will create unique
+ // name for critical section names.
+ return success();
+ })
+ .Case([&](omp::ThreadprivateOp) {
+ return convertOmpThreadprivate(*op, builder, moduleTranslation);
+ })
+ .Case<omp::DataOp, omp::EnterDataOp, omp::ExitDataOp, omp::UpdateDataOp>(
+ [&](auto op) {
+ return convertOmpTargetData(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::TargetOp) {
+ return convertOmpTarget(*op, builder, moduleTranslation);
+ })
+ .Case<omp::MapInfoOp, omp::DataBoundsOp, omp::PrivateClauseOp>(
+ [&](auto op) {
+ // No-op, should be handled by relevant owning operations e.g.
+ // TargetOp, EnterDataOp, ExitDataOp, DataOp etc. and then
+ // discarded
+ return success();
+ })
+ .Default([&](Operation *inst) {
+ return inst->emitError("unsupported OpenMP operation: ")
+ << inst->getName();
+ });
+}
+
+static LogicalResult
+convertInternalTargetOp(Operation *op, llvm::IRBuilderBase &builder,
+ LLVM::ModuleTranslation &moduleTranslation) {
+ return convertCommonOperation(op, builder, moduleTranslation);
+}
+
+static LogicalResult
+convertTopLevelTargetOp(Operation *op, llvm::IRBuilderBase &builder,
+ LLVM::ModuleTranslation &moduleTranslation) {
+ return llvm::TypeSwitch<Operation *, LogicalResult>(op)
+ .Case<omp::DataOp, omp::EnterDataOp, omp::ExitDataOp, omp::UpdateDataOp>(
+ [&](auto op) {
+ return convertOmpTargetData(op, builder, moduleTranslation);
+ })
+ .Case([&](omp::TargetOp) {
+ return convertOmpTarget(*op, builder, moduleTranslation);
+ })
+ // Skip omp ops that are not legal top level ops for the target device
+ .Case<omp::BarrierOp, omp::TaskwaitOp, omp::TaskyieldOp, omp::FlushOp,
+ omp::ParallelOp, omp::ReductionOp, omp::MasterOp, omp::CriticalOp,
+ omp::OrderedRegionOp, omp::OrderedOp, omp::WsLoopOp,
+ omp::SimdLoopOp, omp::AtomicReadOp, omp::AtomicWriteOp,
+ omp::AtomicUpdateOp, omp::AtomicCaptureOp, omp::SectionsOp,
+ omp::SingleOp, omp::TeamsOp, omp::TaskOp, omp::TaskGroupOp,
+ omp::YieldOp, omp::TerminatorOp, omp::ReductionDeclareOp,
+ omp::ThreadprivateOp, omp::DistributeOp, omp::MapInfoOp,
+ omp::DataBoundsOp, omp::CriticalDeclareOp>(
+ [&](auto op) {
+ return success();
+ })
+ .Default([&](Operation *inst) {
+ return inst->emitError("unsupported OpenMP operation: ")
+ << inst->getName();
+ });
+}
+
namespace {
/// Implementation of the dialect interface that converts operations belonging
@@ -2937,8 +3109,8 @@ class OpenMPDialectLLVMIRTranslationInterface
convertOperation(Operation *op, llvm::IRBuilderBase &builder,
LLVM::ModuleTranslation &moduleTranslation) const final;
- /// Given an OpenMP MLIR attribute, create the corresponding LLVM-IR, runtime
- /// calls, or operation amendments
+ /// Given an OpenMP MLIR attribute, create the corresponding LLVM-IR,
+ /// runtime calls, or operation amendments
LogicalResult
amendOperation(Operation *op, ArrayRef<llvm::Instruction *> instructions,
NamedAttribute attribute,
@@ -3043,116 +3215,15 @@ LogicalResult OpenMPDialectLLVMIRTranslationInterface::convertOperation(
LLVM::ModuleTranslation &moduleTranslation) const {
llvm::OpenMPIRBuilder *ompBuilder = moduleTranslation.getOpenMPBuilder();
+ if (ompBuilder->Config.isTargetDevice()) {
+ if (isInternalTargetDeviceOp(op)) {
+ return convertInternalTargetOp(op, builder, moduleTranslation);
+ } else {
+ return convertTopLevelTargetOp(op, builder, moduleTranslation);
+ }
+ }
- return llvm::TypeSwitch<Operation *, LogicalResult>(op)
- .Case([&](omp::BarrierOp) {
- ompBuilder->createBarrier(builder.saveIP(), llvm::omp::OMPD_barrier);
- return success();
- })
- .Case([&](omp::TaskwaitOp) {
- ompBuilder->createTaskwait(builder.saveIP());
- return success();
- })
- .Case([&](omp::TaskyieldOp) {
- ompBuilder->createTaskyield(builder.saveIP());
- return success();
- })
- .Case([&](omp::FlushOp) {
- // No support in Openmp runtime function (__kmpc_flush) to accept
- // the argument list.
- // OpenMP standard states the following:
- // "An implementation may implement a flush with a list by ignoring
- // the list, and treating it the same as a flush without a list."
- //
- // The argument list is discarded so that, flush with a list is treated
- // same as a flush without a list.
- ompBuilder->createFlush(builder.saveIP());
- return success();
- })
- .Case([&](omp::ParallelOp op) {
- return convertOmpParallel(op, builder, moduleTranslation);
- })
- .Case([&](omp::ReductionOp reductionOp) {
- return convertOmpReductionOp(reductionOp, builder, moduleTranslation);
- })
- .Case([&](omp::MasterOp) {
- return convertOmpMaster(*op, builder, moduleTranslation);
- })
- .Case([&](omp::CriticalOp) {
- return convertOmpCritical(*op, builder, moduleTranslation);
- })
- .Case([&](omp::OrderedRegionOp) {
- return convertOmpOrderedRegion(*op, builder, moduleTranslation);
- })
- .Case([&](omp::OrderedOp) {
- return convertOmpOrdered(*op, builder, moduleTranslation);
- })
- .Case([&](omp::WsLoopOp) {
- return convertOmpWsLoop(*op, builder, moduleTranslation);
- })
- .Case([&](omp::SimdLoopOp) {
- return convertOmpSimdLoop(*op, builder, moduleTranslation);
- })
- .Case([&](omp::AtomicReadOp) {
- return convertOmpAtomicRead(*op, builder, moduleTranslation);
- })
- .Case([&](omp::AtomicWriteOp) {
- return convertOmpAtomicWrite(*op, builder, moduleTranslation);
- })
- .Case([&](omp::AtomicUpdateOp op) {
- return convertOmpAtomicUpdate(op, builder, moduleTranslation);
- })
- .Case([&](omp::AtomicCaptureOp op) {
- return convertOmpAtomicCapture(op, builder, moduleTranslation);
- })
- .Case([&](omp::SectionsOp) {
- return convertOmpSections(*op, builder, moduleTranslation);
- })
- .Case([&](omp::SingleOp op) {
- return convertOmpSingle(op, builder, moduleTranslation);
- })
- .Case([&](omp::TeamsOp op) {
- return convertOmpTeams(op, builder, moduleTranslation);
- })
- .Case([&](omp::TaskOp op) {
- return convertOmpTaskOp(op, builder, moduleTranslation);
- })
- .Case([&](omp::TaskGroupOp op) {
- return convertOmpTaskgroupOp(op, builder, moduleTranslation);
- })
- .Case<omp::YieldOp, omp::TerminatorOp, omp::ReductionDeclareOp,
- omp::CriticalDeclareOp>([](auto op) {
- // `yield` and `terminator` can be just omitted. The block structure
- // was created in the region that handles their parent operation.
- // `reduction.declare` will be used by reductions and is not
- // converted directly, skip it.
- // `critical.declare` is only used to declare names of critical
- // sections which will be used by `critical` ops and hence can be
- // ignored for lowering. The OpenMP IRBuilder will create unique
- // name for critical section names.
- return success();
- })
- .Case([&](omp::ThreadprivateOp) {
- return convertOmpThreadprivate(*op, builder, moduleTranslation);
- })
- .Case<omp::DataOp, omp::EnterDataOp, omp::ExitDataOp, omp::UpdateDataOp>(
- [&](auto op) {
- return convertOmpTargetData(op, builder, moduleTranslation);
- })
- .Case([&](omp::TargetOp) {
- return convertOmpTarget(*op, builder, moduleTranslation);
- })
- .Case<omp::MapInfoOp, omp::DataBoundsOp, omp::PrivateClauseOp>(
- [&](auto op) {
- // No-op, should be handled by relevant owning operations e.g.
- // TargetOp, EnterDataOp, ExitDataOp, DataOp etc. and then
- // discarded
- return success();
- })
- .Default([&](Operation *inst) {
- return inst->emitError("unsupported OpenMP operation: ")
- << inst->getName();
- });
+ return convertCommonOperation(op, builder, moduleTranslation);
}
void mlir::registerOpenMPDialectTranslation(DialectRegistry &registry) {
diff --git a/mlir/test/Target/LLVMIR/omptarget-parallel-wsloop.mlir b/mlir/test/Target/LLVMIR/omptarget-parallel-wsloop.mlir
index 8ab50f05f07167..b0fe642238f14f 100644
--- a/mlir/test/Target/LLVMIR/omptarget-parallel-wsloop.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-parallel-wsloop.mlir
@@ -4,10 +4,10 @@
// for nested omp do loop inside omp target region
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memory_space", 5 : ui32>>, llvm.data_layout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8", llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true } {
- llvm.func @target_parallel_wsloop(%arg0: !llvm.ptr) attributes {
+ llvm.func @target_parallel_wsloop(%arg0: !llvm.ptr) attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>,
target_cpu = "gfx90a",
- target_features = #llvm.target_features<["+gfx9-insts", "+wavefrontsize64"]>
- } {
+ target_features = #llvm.target_features<["+gfx9-insts", "+wavefrontsize64"]>}
+ {
omp.parallel {
%loop_ub = llvm.mlir.constant(9 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
diff --git a/mlir/test/Target/LLVMIR/omptarget-teams-llvm.mlir b/mlir/test/Target/LLVMIR/omptarget-teams-llvm.mlir
index 96cced7a1d584b..c5f89eb2c3274c 100644
--- a/mlir/test/Target/LLVMIR/omptarget-teams-llvm.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-teams-llvm.mlir
@@ -5,7 +5,7 @@
module attributes {omp.is_target_device = true} {
llvm.func @foo(i32)
- llvm.func @omp_target_teams_shared_simple(%arg0 : i32) {
+ llvm.func @omp_target_teams_shared_simple(%arg0 : i32) attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
omp.teams {
llvm.call @foo(%arg0) : (i32) -> ()
omp.terminator
diff --git a/mlir/test/Target/LLVMIR/omptarget-wsloop-collapsed.mlir b/mlir/test/Target/LLVMIR/omptarget-wsloop-collapsed.mlir
index e246c551886cfa..0d77423abcb4f1 100644
--- a/mlir/test/Target/LLVMIR/omptarget-wsloop-collapsed.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-wsloop-collapsed.mlir
@@ -4,7 +4,7 @@
// for nested omp do loop with collapse clause inside omp target region
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memory_space", 5 : ui32>>, llvm.data_layout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8", llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true } {
- llvm.func @target_collapsed_wsloop(%arg0: !llvm.ptr) {
+ llvm.func @target_collapsed_wsloop(%arg0: !llvm.ptr) attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
%loop_ub = llvm.mlir.constant(99 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
%loop_step = llvm.mlir.constant(1 : index) : i32
diff --git a/mlir/test/Target/LLVMIR/omptarget-wsloop.mlir b/mlir/test/Target/LLVMIR/omptarget-wsloop.mlir
index 220eb85b3483ec..0f3f503dfa5377 100644
--- a/mlir/test/Target/LLVMIR/omptarget-wsloop.mlir
+++ b/mlir/test/Target/LLVMIR/omptarget-wsloop.mlir
@@ -4,7 +4,7 @@
// for nested omp do loop inside omp target region
module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memory_space", 5 : ui32>>, llvm.data_layout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8", llvm.target_triple = "amdgcn-amd-amdhsa", omp.is_gpu = true, omp.is_target_device = true } {
- llvm.func @target_wsloop(%arg0: !llvm.ptr ){
+ llvm.func @target_wsloop(%arg0: !llvm.ptr ) attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
%loop_ub = llvm.mlir.constant(9 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
%loop_step = llvm.mlir.constant(1 : i32) : i32
@@ -16,7 +16,7 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<"dlti.alloca_memo
llvm.return
}
- llvm.func @target_empty_wsloop(){
+ llvm.func @target_empty_wsloop() attributes {omp.declare_target = #omp.declaretarget<device_type = (any), capture_clause = (to)>} {
%loop_ub = llvm.mlir.constant(9 : i32) : i32
%loop_lb = llvm.mlir.constant(0 : i32) : i32
%loop_step = llvm.mlir.constant(1 : i32) : i32
diff --git a/mlir/test/Target/LLVMIR/openmp-tark-target-device.mlir b/mlir/test/Target/LLVMIR/openmp-tark-target-device.mlir
new file mode 100644
index 00000000000000..c167604bcd12c0
--- /dev/null
+++ b/mlir/test/Target/LLVMIR/openmp-tark-target-device.mlir
@@ -0,0 +1,27 @@
+// RUN: mlir-translate -mlir-to-llvmir %s | FileCheck %s
+
+// This tests the fix for https://github.com/llvm/llvm-project/issues/84606
+// We are only interested in ensuring that the -mlir-to-llvmir pass doesn't crash.
+// CHECK: {{.*}} = add i32 {{.*}}, 5
+module attributes {omp.is_target_device = true } {
+ llvm.func @_QQmain() attributes {fir.bindc_name = "main", omp.declare_target = #omp.declaretarget<device_type = (host), capture_clause = (to)>} {
+ %0 = llvm.mlir.constant(0 : i32) : i32
+ %1 = llvm.mlir.constant(1 : i64) : i64
+ %2 = llvm.alloca %1 x i32 {bindc_name = "a"} : (i64) -> !llvm.ptr<5>
+ %3 = llvm.addrspacecast %2 : !llvm.ptr<5> to !llvm.ptr
+ omp.task {
+ llvm.store %0, %3 : i32, !llvm.ptr
+ omp.terminator
+ }
+ %4 = omp.map_info var_ptr(%3 : !llvm.ptr, i32) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr {name = "a"}
+ omp.target map_entries(%4 -> %arg0 : !llvm.ptr) {
+ ^bb0(%arg0: !llvm.ptr):
+ %5 = llvm.mlir.constant(5 : i32) : i32
+ %6 = llvm.load %arg0 : !llvm.ptr -> i32
+ %7 = llvm.add %6, %5 : i32
+ llvm.store %7, %arg0 : i32, !llvm.ptr
+ omp.terminator
+ }
+ llvm.return
+ }
+}
✅ With the latest revision this PR passed the C/C++ code formatter.
.Case([&](omp::TargetOp) {
  return convertOmpTarget(*op, builder, moduleTranslation);
})
// Skip omp ops that are not legal top level ops for the target device
If I understand this correctly, this target region would be ignored:
omp.task {
  omp.target {
  }
}
In general, by not processing the body of host operations, we will miss target regions nested within them. I don't know if there's a clean way to support that while following this approach.
Of course, if it isn't legal to nest target regions inside of any other constructs, this is a non-issue, but I haven't noticed anything like that in the spec. Maybe @mjklemm can enlighten us on that.
Thinking about it a bit more, maybe it's not too difficult to support these cases. For those host ops that might contain a target region nested (immediately or not), we could just call op.walk<omp::TargetOp>(...) and trigger lowering for these target ops. The only problem this could have (I'm not sure, though) is that some arguments of that omp.target operation might not have been lowered to LLVM IR by then, since we would have skipped parent regions.
program main
integer :: A(10), B(10), C(10)
do I = 1, 10
A(I) = 1
B(I) = 2
end do
!$omp target data map(to: A, B) map(alloc: C)
!$omp target map(from: C)
do I = 1, 10
C(I) = A(I) + B(I) ! assigns 3, A:1 + B:2
end do
!$omp end target
!$omp target update from(C) ! updates C device -> host
!$omp end target data
print *, C ! should be all 3's
end program
This is another use-case we need to keep in mind: target data is a host construct whose region can contain more than one target within it, as well as other host-related directives (e.g. the update in the example).
This currently has a small bug in it that @TIFitis has looked into and attempted to address in #85218.
This case does seem to be addressed above via DataOp in the switch case, but I thought I'd bring it up to shed light on that particular use-case.
It's possibly worth stating that, to my understanding, Exit/Enter/Update are also host directives; unlike Data, they do not own a region that can contain device code, and they likely should not be processed in device passes (there's been at least one error with this in the past). I don't think they can be invoked inside of a target directive/region either as far as the specification is concerned, but that would need some verification as it isn't something I've dug into too deeply.
If I understand this correctly, this target region would be ignored:
omp.task { omp.target { } }
In general, by not processing the body of host operations, we will miss target regions nested within them. I don't know if there's a clean way to support that while following this approach.
Of course, if it isn't legal to nest target regions inside of any other constructs, this is a non-issue, but I haven't noticed anything like that in the spec. Maybe @mjklemm can enlighten us on that.
This exact structure is what both of my PRs (#85130 & #83966) will generate. According to the spec, this shouldn't be illegal (@mjklemm can correct me if I am wrong). Quoting the spec:
"When a target construct is encountered, a new target task is generated. The target task region encloses the target region."
This target task could be interpreted as an implicit task, but in the case of the depend clause (and I think the nowait clause as well) it needs an explicit task created (AFAICT, as implied by the following statement in spec 5.2, Section 13.8):
"If a depend clause is present, it is associated with the target task."
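For illustration, a rough MLIR sketch of the structure being described: an explicit omp.task generated for the depend clause, wrapping the target region. (This is hand-written for this discussion, not output from those PRs; the depend-clause syntax and the %x value are assumptions.)

omp.task depend(taskdependin -> %x : !llvm.ptr) {
  omp.target {
    // ... device code ...
    omp.terminator
  }
  omp.terminator
}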
I put up a fix for handling inner regions before reading these comments, so keep that in mind if you look at the latest commit. But yes, the omp.task { omp.target { } } case was ignored in the original code. I added a case where the inner region of an op is traversed, calling the top-level dispatch function recursively if it is an omp op. Probably not all operations with inner regions can contain top-level ops, but that should be handled earlier in the compiler, so we shouldn't have to check for illegal combinations here.
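For reference, a minimal C++ sketch of what that traversal could look like (an assumption reconstructing the described fix, not the committed code; traverseInnerRegions is a hypothetical helper name):

static LogicalResult
traverseInnerRegions(Operation *op, llvm::IRBuilderBase &builder,
                     LLVM::ModuleTranslation &moduleTranslation) {
  for (Region &region : op->getRegions())
    for (Block &block : region)
      for (Operation &inner : block) {
        // Lower any target op we find; otherwise keep descending through
        // OpenMP ops, since omp.target may be nested arbitrarily deep.
        if (isa<omp::TargetOp>(&inner)) {
          if (failed(convertOmpTarget(inner, builder, moduleTranslation)))
            return failure();
        } else if (isa<omp::OpenMPDialect>(inner.getDialect()) &&
                   failed(traverseInnerRegions(&inner, builder,
                                               moduleTranslation))) {
          return failure();
        }
      }
  return success();
}

As the rest of the thread shows, this was later simplified into a single op->walk over omp::TargetOp.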
Thinking about it a bit more, maybe it's not too difficult to support these cases. For those host ops that might contain a target region nested (immediately or not), we could just call op.walk<omp::TargetOp>(...) and trigger lowering for these target ops. The only problem this could have (I'm not sure, though) is that some arguments of that omp.target operation might not have been lowered to LLVM IR by then, since we would have skipped parent regions.

I think we are okay since omp.target is isolated from above, so I believe there shouldn't be any values that need to be translated (codegen seems to work for the omp.task { omp.target { } } case). I'm not sure if it is possible for the top-level target data ops to have this issue.
Probably not all operations with inner regions can contain top-level ops, but that should be handled earlier in the compiler, so we shouldn't have to check for illegal combinations.

I agree, this would be way too late to realize that OpenMP operations were nested in an unsupported way.

I think we are okay since omp.target is isolated from above, so I believe there shouldn't be any values that need to be translated (codegen seems to work for the omp.task { omp.target { } } case).

I would think that the region inside of omp.target should be translatable for that reason. My concern was more about arguments to that operation. For example, there's some processing of map arguments as part of convertOmpTarget that is done for both host and device. Will that still work if we haven't translated them to LLVM IR as part of lowering its enclosing region?
%c1 = llvm.mlir.constant(1 : i64) : i64
%x = llvm.alloca %c1 x i32 : (i64) -> !llvm.ptr
omp.task {
  %map = omp.map_info var_ptr(%x : !llvm.ptr, i32) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr
  omp.target map_entries(%map -> %arg0 : !llvm.ptr) {
  ^bb0(%arg0: !llvm.ptr):
    ...
    omp.terminator
  }
  omp.terminator
}
If that code makes this approach crash, then we would have to rethink the approach. We would need to create some dummy values, or somehow detect missing LLVM IR values and create them on demand. I think the former was basically what the early outlining did with the outlined function's arguments, if I remember correctly.
The map arguments are already skipped in the common case, since they are supposed to be lowered by the omp.target op lowering and that lowering should not go beyond the omp.map_info operations, because that would include host code in the target codegen.
Of course, if it isn't legal to nest target regions inside of any other constructs, this is a non-issue, but I haven't noticed anything like that in the spec. Maybe @mjklemm can enlighten us on that.

It's allowed to nest target in other OpenMP constructs like task, parallel, for, etc.
omp::DataBoundsOp, omp::CriticalDeclareOp>(
    [&](auto op) { return success(); })
// Recursively traverse inner regions
.Case<omp::ParallelOp, omp::MasterOp, omp::CriticalOp,
I am wary of such bespoke listings of ops. It makes for one more arguably hard-to-find place to update if and when we add new ops.
My initial intuition was that if an op contains a region, we could recurse into it and find top-level target ops; ops that don't have regions can then be ignored. This logic works for all ops listed on line 3079 except SimdLoopOp and ReductionDeclareOp, which have regions. :( So that won't work unless we are willing to needlessly walk all the ops inside these two, knowing full well we won't find top-level target ops there.
+1 on trying to process the op's contents based on the existence of regions attached to it rather than listing the types. Probably, rather than the current type switch, we could do something similar to this instead:
if (isa<TargetOp>(op))
  return convertOmpTarget(*op, builder, moduleTranslation);

bool interrupted =
    op->walk<WalkOrder::PreOrder>([&](omp::TargetOp targetOp) {
        if (failed(convertOmpTarget(*targetOp, builder, moduleTranslation)))
          return WalkResult::interrupt();
        return WalkResult::skip();
      }).wasInterrupted();
return failure(interrupted);
That should do the trick of recursively looking for target regions, I think.
Yes, this is probably easier maintenance-wise, since we don't really need to control what is allowed and what is not, and the micro-optimization of skipping ops that should not contain omp.target ops is probably not worth the effort. I will change the code to do the generic handling of sub-regions.
The code above works fine, so I used that. Thanks!
return llvm::TypeSwitch<Operation *, LogicalResult>(op)
    .Case<omp::DataOp, omp::EnterDataOp, omp::ExitDataOp, omp::UpdateDataOp>(
        [&](auto op) {
          return convertOmpTargetData(op, builder, moduleTranslation);
I think there is a risk here. @agozillon is right that omp::DataOp is a host op that can contain a region with omp.target ops. So we need to recurse into this, and calling convertOmpTargetData is fine, but only from the point of view of looking inside it for other omp.target ops. I am sure convertOmpTargetData does more than that, which on the device (i.e. when ompBuilder->Config.isTargetDevice() == true) is unnecessary. Hopefully it doesn't lead to state that needs to be rolled back.
Secondly, we shouldn't be processing EnterDataOp, UpdateDataOp, ExitDataOp at all, because their actions are performed on the host and they don't have regions that we need to recurse into to look for other omp.target ops, as we have to in the case of omp.task below.
Okay, in that case only omp.target is a top-level target device op.
Thank you Jan for addressing previous comments. I think this approach looks good; I just have a couple of nits to try to make the names a bit easier to understand.
There's also a comment to try out a small modification to one of the tests, which is the main concern I had with this approach. Even if it doesn't work, I think it's still worth merging this PR, but hopefully it does.
@@ -2922,6 +2922,162 @@ convertDeclareTargetAttr(Operation *op, mlir::omp::DeclareTargetAttr attribute,
return success();
}
static bool isInternalTargetDeviceOp(Operation *op) { |
Suggested change:
-static bool isInternalTargetDeviceOp(Operation *op) {
+static bool isTargetDeviceOp(Operation *op) {
I'll make this change once we agree on the other comments.
✅ With the latest revision this PR passed the Python code formatter.
Thanks Jan for working on this and addressing my comments. I think this looks good; I'd suggest giving it just a day or so so others have a chance to express any remaining concerns before merging.
Thanks Jan for the changes to make this work for omp.target_data as well. In my opinion this looks ready; just note when you rebase that some of the MLIR operations have been renamed (DataOp -> TargetDataOp, DataBoundsOp -> MapBoundsOp, ...). Some of the ops in the new tests will have to be updated as well (omp.map_info -> omp.map.info, ...).
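For instance, taking the omp.map_info line from the test added in this PR, the rename would look like this (a hypothetical before/after based on the renames listed above):

// Before the rebase (as written in this PR's test):
%4 = omp.map_info var_ptr(%3 : !llvm.ptr, i32) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr {name = "a"}
// After the rename:
%4 = omp.map.info var_ptr(%3 : !llvm.ptr, i32) map_clauses(tofrom) capture(ByRef) -> !llvm.ptr {name = "a"}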
Force-pushed from aaadaad to eaa836c
This patch separates the lowering dispatch for host and target devices. For the target device, if the current operation is not a top-level operation (e.g. omp.target) or is inside a target device code region it will be ignored, since it belongs to the host code.
Force-pushed from 47e56ce to 98f256c
This patch separates the lowering dispatch for host and target devices. For the target device, if the current operation is not a top-level operation (e.g. omp.target) or is inside a target device code region it will be ignored, since it belongs to the host code.
This is an alternative approach to #84611, the new test in this PR was taken from there.