-
Notifications
You must be signed in to change notification settings - Fork 13.4k
[LinkerWrapper] Support relocatable linking for offloading #80066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-llvm-binary-utilities @llvm/pr-subscribers-clang-driver Author: Joseph Huber (jhuber6) ChangesSummary: One downside to RDC linking is that it can become quite big for very This primarily requires manually editing the embedded data in the Given this support, the following compilation path should produce two
Or similarly for HIP to effectively perform non-RDC mode compilation for
One question is whether or not this should be the default behaviour of Full diff: https://github.com/llvm/llvm-project/pull/80066.diff 5 Files Affected:
diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
// RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
// RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected<std::string> findProgram(StringRef Name, ArrayRef<StringRef> Paths) {
return *Path;
}
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+ llvm::MD5 Hasher;
+ llvm::MD5::MD5Result Hash;
+ Hasher.update(Str);
+ Hasher.final(Hash);
+ return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList &Args, StringRef Output) {
+ Expected<std::string> ObjcopyPath =
+ findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+ if (!ObjcopyPath)
+ return ObjcopyPath.takeError();
+
+ // Use the linker output file to get a unique hash. This creates a unique
+ // identifier to rename the sections to that is deterministic to the contents.
+ auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+ : MemoryBuffer::getFileOrSTDIN(Output);
+ if (!BufferOrErr)
+ return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+ Output.str().c_str());
+ std::string Suffix = "_" + getHash((*BufferOrErr)->getBuffer());
+
+ SmallVector<StringRef> ObjcopyArgs = {
+ *ObjcopyPath,
+ Output,
+ };
+
+ // Remove the old .llvm.offloading section to prevent further linking.
+ ObjcopyArgs.emplace_back("--remove-section");
+ ObjcopyArgs.emplace_back(".llvm.offloading");
+ for (StringRef Prefix : {"omp", "cuda", "hip"}) {
+ auto Section = (Prefix + "_offloading_entries").str();
+ // Rename the offloading entires to make them private to this link unit.
+ ObjcopyArgs.emplace_back("--rename-section");
+ ObjcopyArgs.emplace_back(
+ Args.MakeArgString(Section + "=" + Section + Suffix));
+
+ // Rename the __start_ / __stop_ symbols appropriately to iterate over the
+ // newly renamed section containing the offloading entries.
+ ObjcopyArgs.emplace_back("--redefine-sym");
+ ObjcopyArgs.emplace_back(Args.MakeArgString("__start_" + Section + "=" +
+ "__start_" + Section + Suffix));
+ ObjcopyArgs.emplace_back("--redefine-sym");
+ ObjcopyArgs.emplace_back(Args.MakeArgString("__stop_" + Section + "=" +
+ "__stop_" + Section + Suffix));
+ }
+
+ if (Error Err = executeCommands(*ObjcopyPath, ObjcopyArgs))
+ return Err;
+
+ return Error::success();
+}
+
/// Runs the wrapped linker job with the newly created input.
Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
llvm::TimeTraceScope TimeScope("Execute host linker");
@@ -265,6 +322,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
LinkerArgs.push_back(Arg);
if (Error Err = executeCommands(LinkerPath, LinkerArgs))
return Err;
+
+ if (Args.hasArg(OPT_relocatable))
+ if (Error Err = relocateOffloadSection(Args, ExecutableName))
+ return Err;
+
return Error::success();
}
@@ -910,7 +972,8 @@ wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
case OFK_OpenMP:
if (Error Err = offloading::wrapOpenMPBinaries(
M, BuffersToWrap,
- offloading::getOffloadEntryArray(M, "omp_offloading_entries")))
+ offloading::getOffloadEntryArray(M, "omp_offloading_entries"),
+ /*Suffix=*/"", /*Relocatable=*/Args.hasArg(OPT_relocatable)))
return std::move(Err);
break;
case OFK_Cuda:
@@ -1356,12 +1419,6 @@ Expected<SmallVector<SmallVector<OffloadFile>>>
getDeviceInput(const ArgList &Args) {
llvm::TimeTraceScope TimeScope("ExtractDeviceCode");
- // If the user is requesting a reloctable link we ignore the device code. The
- // actual linker will merge the embedded device code sections so they can be
- // linked when the executable is finally created.
- if (Args.hasArg(OPT_relocatable))
- return SmallVector<SmallVector<OffloadFile>>{};
-
StringRef Root = Args.getLastArgValue(OPT_sysroot_EQ);
SmallVector<StringRef> LibraryPaths;
for (const opt::Arg *Arg : Args.filtered(OPT_library_path, OPT_libpath))
diff --git a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
index e3ded00b573f7..3f9e0e8b98059 100644
--- a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
+++ b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
/// \param EntryArray Optional pair pointing to the `__start` and `__stop`
/// symbols holding the `__tgt_offload_entry` array.
/// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.
llvm::Error wrapOpenMPBinaries(llvm::Module &M,
llvm::ArrayRef<llvm::ArrayRef<char>> Images,
EntryArrayTy EntryArray,
- llvm::StringRef Suffix = "");
+ llvm::StringRef Suffix = "",
+ bool Relocatable = false);
/// Wraps the input fatbinary image into the module \p M as global symbols and
/// registers the images with the CUDA runtime.
diff --git a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
index 76a8eebdb3622..fec1bdbe9d8c7 100644
--- a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
+++ b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
@@ -112,7 +112,8 @@ PointerType *getBinDescPtrTy(Module &M) {
///
/// Global variable that represents BinDesc is returned.
GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
- EntryArrayTy EntryArray, StringRef Suffix) {
+ EntryArrayTy EntryArray, StringRef Suffix,
+ bool Relocatable) {
LLVMContext &C = M.getContext();
auto [EntriesB, EntriesE] = EntryArray;
@@ -129,7 +130,8 @@ GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
GlobalVariable::InternalLinkage, Data,
".omp_offloading.device_image" + Suffix);
Image->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
- Image->setSection(".llvm.offloading");
+ Image->setSection(Relocatable ? ".llvm.offloading.relocatable"
+ : ".llvm.offloading");
Image->setAlignment(Align(object::OffloadBinary::getAlignment()));
StringRef Binary(Buf.data(), Buf.size());
@@ -582,8 +584,9 @@ void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
Error offloading::wrapOpenMPBinaries(Module &M, ArrayRef<ArrayRef<char>> Images,
EntryArrayTy EntryArray,
- llvm::StringRef Suffix) {
- GlobalVariable *Desc = createBinDesc(M, Images, EntryArray, Suffix);
+ llvm::StringRef Suffix, bool Relocatable) {
+ GlobalVariable *Desc =
+ createBinDesc(M, Images, EntryArray, Suffix, Relocatable);
if (!Desc)
return createStringError(inconvertibleErrorCode(),
"No binary descriptors created.");
diff --git a/llvm/lib/Object/OffloadBinary.cpp b/llvm/lib/Object/OffloadBinary.cpp
index bfc35e41fe658..22d604b125c58 100644
--- a/llvm/lib/Object/OffloadBinary.cpp
+++ b/llvm/lib/Object/OffloadBinary.cpp
@@ -83,7 +83,7 @@ Error extractFromObject(const ObjectFile &Obj,
if (!NameOrErr)
return NameOrErr.takeError();
- if (!NameOrErr->equals(".llvm.offloading"))
+ if (!NameOrErr->starts_with(".llvm.offloading"))
continue;
}
|
@llvm/pr-subscribers-clang Author: Joseph Huber (jhuber6) ChangesSummary: One downside to RDC linking is that it can become quite big for very This primarily requires manually editing the embedded data in the Given this support, the following compilation path should produce two
Or similarly for HIP to effectively perform non-RDC mode compilation for
One question is whether or not this should be the default behaviour of Full diff: https://github.com/llvm/llvm-project/pull/80066.diff 5 Files Affected:
diff --git a/clang/test/Driver/linker-wrapper.c b/clang/test/Driver/linker-wrapper.c
index a8667c99977c5..21898faf295d4 100644
--- a/clang/test/Driver/linker-wrapper.c
+++ b/clang/test/Driver/linker-wrapper.c
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x;
// RUN: --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \
// RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK
-// RELOCATABLE-LINK-NOT: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
+// RELOCATABLE-LINK: clang{{.*}} -o {{.*}}.img --target=x86_64-unknown-linux-gnu
// RELOCATABLE-LINK: /usr/bin/ld.lld{{.*}}-r
+// RELOCATABLE-LINK: llvm-objcopy{{.*}}a.out --remove-section .llvm.offloading
diff --git a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
index b682cc293d54b..53412eb2346de 100644
--- a/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
+++ b/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp
@@ -241,6 +241,63 @@ Expected<std::string> findProgram(StringRef Name, ArrayRef<StringRef> Paths) {
return *Path;
}
+/// Returns the hashed value for a constant string.
+std::string getHash(StringRef Str) {
+ llvm::MD5 Hasher;
+ llvm::MD5::MD5Result Hash;
+ Hasher.update(Str);
+ Hasher.final(Hash);
+ return llvm::utohexstr(Hash.low(), /*LowerCase=*/true);
+}
+
+/// Renames offloading entry sections in a relocatable link so they do not
+/// conflict with a later link job.
+Error relocateOffloadSection(const ArgList &Args, StringRef Output) {
+ Expected<std::string> ObjcopyPath =
+ findProgram("llvm-objcopy", {getMainExecutable("llvm-objcopy")});
+ if (!ObjcopyPath)
+ return ObjcopyPath.takeError();
+
+ // Use the linker output file to get a unique hash. This creates a unique
+ // identifier to rename the sections to that is deterministic to the contents.
+ auto BufferOrErr = DryRun ? MemoryBuffer::getMemBuffer("")
+ : MemoryBuffer::getFileOrSTDIN(Output);
+ if (!BufferOrErr)
+ return createStringError(inconvertibleErrorCode(), "Failed to open %s",
+ Output.str().c_str());
+ std::string Suffix = "_" + getHash((*BufferOrErr)->getBuffer());
+
+ SmallVector<StringRef> ObjcopyArgs = {
+ *ObjcopyPath,
+ Output,
+ };
+
+ // Remove the old .llvm.offloading section to prevent further linking.
+ ObjcopyArgs.emplace_back("--remove-section");
+ ObjcopyArgs.emplace_back(".llvm.offloading");
+ for (StringRef Prefix : {"omp", "cuda", "hip"}) {
+ auto Section = (Prefix + "_offloading_entries").str();
+ // Rename the offloading entires to make them private to this link unit.
+ ObjcopyArgs.emplace_back("--rename-section");
+ ObjcopyArgs.emplace_back(
+ Args.MakeArgString(Section + "=" + Section + Suffix));
+
+ // Rename the __start_ / __stop_ symbols appropriately to iterate over the
+ // newly renamed section containing the offloading entries.
+ ObjcopyArgs.emplace_back("--redefine-sym");
+ ObjcopyArgs.emplace_back(Args.MakeArgString("__start_" + Section + "=" +
+ "__start_" + Section + Suffix));
+ ObjcopyArgs.emplace_back("--redefine-sym");
+ ObjcopyArgs.emplace_back(Args.MakeArgString("__stop_" + Section + "=" +
+ "__stop_" + Section + Suffix));
+ }
+
+ if (Error Err = executeCommands(*ObjcopyPath, ObjcopyArgs))
+ return Err;
+
+ return Error::success();
+}
+
/// Runs the wrapped linker job with the newly created input.
Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
llvm::TimeTraceScope TimeScope("Execute host linker");
@@ -265,6 +322,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) {
LinkerArgs.push_back(Arg);
if (Error Err = executeCommands(LinkerPath, LinkerArgs))
return Err;
+
+ if (Args.hasArg(OPT_relocatable))
+ if (Error Err = relocateOffloadSection(Args, ExecutableName))
+ return Err;
+
return Error::success();
}
@@ -910,7 +972,8 @@ wrapDeviceImages(ArrayRef<std::unique_ptr<MemoryBuffer>> Buffers,
case OFK_OpenMP:
if (Error Err = offloading::wrapOpenMPBinaries(
M, BuffersToWrap,
- offloading::getOffloadEntryArray(M, "omp_offloading_entries")))
+ offloading::getOffloadEntryArray(M, "omp_offloading_entries"),
+ /*Suffix=*/"", /*Relocatable=*/Args.hasArg(OPT_relocatable)))
return std::move(Err);
break;
case OFK_Cuda:
@@ -1356,12 +1419,6 @@ Expected<SmallVector<SmallVector<OffloadFile>>>
getDeviceInput(const ArgList &Args) {
llvm::TimeTraceScope TimeScope("ExtractDeviceCode");
- // If the user is requesting a reloctable link we ignore the device code. The
- // actual linker will merge the embedded device code sections so they can be
- // linked when the executable is finally created.
- if (Args.hasArg(OPT_relocatable))
- return SmallVector<SmallVector<OffloadFile>>{};
-
StringRef Root = Args.getLastArgValue(OPT_sysroot_EQ);
SmallVector<StringRef> LibraryPaths;
for (const opt::Arg *Arg : Args.filtered(OPT_library_path, OPT_libpath))
diff --git a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
index e3ded00b573f7..3f9e0e8b98059 100644
--- a/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
+++ b/llvm/include/llvm/Frontend/Offloading/OffloadWrapper.h
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>;
/// \param EntryArray Optional pair pointing to the `__start` and `__stop`
/// symbols holding the `__tgt_offload_entry` array.
/// \param Suffix An optional suffix appended to the emitted symbols.
+/// \param Relocatable Indicate if we need to change the offloading section.
llvm::Error wrapOpenMPBinaries(llvm::Module &M,
llvm::ArrayRef<llvm::ArrayRef<char>> Images,
EntryArrayTy EntryArray,
- llvm::StringRef Suffix = "");
+ llvm::StringRef Suffix = "",
+ bool Relocatable = false);
/// Wraps the input fatbinary image into the module \p M as global symbols and
/// registers the images with the CUDA runtime.
diff --git a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
index 76a8eebdb3622..fec1bdbe9d8c7 100644
--- a/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
+++ b/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp
@@ -112,7 +112,8 @@ PointerType *getBinDescPtrTy(Module &M) {
///
/// Global variable that represents BinDesc is returned.
GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
- EntryArrayTy EntryArray, StringRef Suffix) {
+ EntryArrayTy EntryArray, StringRef Suffix,
+ bool Relocatable) {
LLVMContext &C = M.getContext();
auto [EntriesB, EntriesE] = EntryArray;
@@ -129,7 +130,8 @@ GlobalVariable *createBinDesc(Module &M, ArrayRef<ArrayRef<char>> Bufs,
GlobalVariable::InternalLinkage, Data,
".omp_offloading.device_image" + Suffix);
Image->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
- Image->setSection(".llvm.offloading");
+ Image->setSection(Relocatable ? ".llvm.offloading.relocatable"
+ : ".llvm.offloading");
Image->setAlignment(Align(object::OffloadBinary::getAlignment()));
StringRef Binary(Buf.data(), Buf.size());
@@ -582,8 +584,9 @@ void createRegisterFatbinFunction(Module &M, GlobalVariable *FatbinDesc,
Error offloading::wrapOpenMPBinaries(Module &M, ArrayRef<ArrayRef<char>> Images,
EntryArrayTy EntryArray,
- llvm::StringRef Suffix) {
- GlobalVariable *Desc = createBinDesc(M, Images, EntryArray, Suffix);
+ llvm::StringRef Suffix, bool Relocatable) {
+ GlobalVariable *Desc =
+ createBinDesc(M, Images, EntryArray, Suffix, Relocatable);
if (!Desc)
return createStringError(inconvertibleErrorCode(),
"No binary descriptors created.");
diff --git a/llvm/lib/Object/OffloadBinary.cpp b/llvm/lib/Object/OffloadBinary.cpp
index bfc35e41fe658..22d604b125c58 100644
--- a/llvm/lib/Object/OffloadBinary.cpp
+++ b/llvm/lib/Object/OffloadBinary.cpp
@@ -83,7 +83,7 @@ Error extractFromObject(const ObjectFile &Obj,
if (!NameOrErr)
return NameOrErr.takeError();
- if (!NameOrErr->equals(".llvm.offloading"))
+ if (!NameOrErr->starts_with(".llvm.offloading"))
continue;
}
|
This is related to the discussions at the #77018 issue. |
@@ -181,5 +181,6 @@ __attribute__((visibility("protected"), used)) int x; | |||
// RUN: --linker-path=/usr/bin/ld.lld -- -r --whole-archive %t.a --no-whole-archive \ | |||
// RUN: %t.o -o a.out 2>&1 | FileCheck %s --check-prefix=RELOCATABLE-LINK | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check device fatbin bundling is done and device wrapper fatbin variable using internal linkage and in postfixed section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test for the HIP and CUDA cases as well as an additional check on passing -r
to the wrapper image generation.
Summary: The standard GPU compilation process embeds each intermediate object file into the host file at the `.llvm.offloading` section so it can be linked later. We also use a sepcial section called something like `omp_offloading_entries` to store all the globals that need to be registered by the runtime. The linker-wrapper's job is to link the embedded device code stored at this section and then emit code to register the linked image and the kernels and globals in the offloading entry section. One downside to RDC linking is that it can become quite big for very large projects that wish to make use of static linking. This patch changes the support for relocatable linking via `-r` to support a kind of "partial" RDC compilation for offloading languages. This primarily requires manually editing the embedded data in the output object file for the relocatable link. We need to rename the output section to make it distinct from the input sections that will be merged. We then delete the old embedded object code so it won't be linked further. We then need to rename the old offloading section so that it is private to the module. A runtime solution could also be done to defer entires that don't belong to the given GPU executable, but this is easier. Note that this does not work with COFF linking, only the ELF method for handling offloading entries, that could be made to work similarly. Given this support, the following compilation path should produce two distinct images for OpenMP offloading. ``` $ clang foo.c -fopenmp --offload-arch=native -c $ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang main.c merged.o -fopenmp --offload-arch=native $ ./a.out ``` Or similarly for HIP to effectively perform non-RDC mode compilation for a subset of files. ``` $ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c $ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o $ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc $ ./a.out ``` One question is whether or not this should be the default behaviour of `-r` when run through the linker-wrapper or a special option. Standard `-r` behavior is still possible if used without invoking the linker-wrapper and it guranteed to be correct.
So, the idea is to carry two separate embedded offloading sections -- one for already fully linked GPU executables, and another for GPU objects to be linked at the final link stage.
Typo in 'special' in the description. |
It's more or less doing
Thanks, will fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@@ -265,6 +329,11 @@ Error runLinker(ArrayRef<StringRef> Files, const ArgList &Args) { | |||
LinkerArgs.push_back(Arg); | |||
if (Error Err = executeCommands(LinkerPath, LinkerArgs)) | |||
return Err; | |||
|
|||
if (Args.hasArg(OPT_relocatable)) | |||
if (Error Err = relocateOffloadSection(Args, ExecutableName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just return relocateOffloadSection(Args, ExecutableName)
@@ -20,10 +20,12 @@ using EntryArrayTy = std::pair<GlobalVariable *, GlobalVariable *>; | |||
/// \param EntryArray Optional pair pointing to the `__start` and `__stop` | |||
/// symbols holding the `__tgt_offload_entry` array. | |||
/// \param Suffix An optional suffix appended to the emitted symbols. | |||
/// \param Relocatable Indicate if we need to change the offloading section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "Indicate whether the binary is a relocatable object" may work a bit better for describing intent. Current description seems to describe an implementation detail.
✅ With the latest revision this PR passed the C/C++ code formatter. |
Supporting such mixed mode opens an interesting set of issues we may need to consider going forward:
This would be OK for something like CUDA where cross-TU references are usually limited to host, but would be surprising for someone who would expect C++-like behavior, which sort of the ultimate goal for offloading use case. This will eventually become a problem if/when we grow large enough subset of independent offload-enabled libraries. The top-level user will have a hard time figuring out what's visible and what is not, unless the libraries deliberately expose only host-level APIs, if/when they fully link GPU side code. |
I'm assuming you're talking about GPU-side constructors? I don't think the CUDA runtime supports those, but OpenMP runs them when the image is loaded, so it would handle both independantly.
This has the same semantics as a
The idea is that users already get C++-like behavior with the new driver and |
Yes. I'm thinking of the expectations from a C++ user standpoint, and this is one of the areas where there will be observable differences. First, because there will be subsets of the code that are no longer part of the main GPU-side executable. Second, the side effects of the initializers will be different depending on whether we do link such subsets separately or not. E.g. the initializer call order will change. The global state changes in one subset will not be visible in the other. Weak symbol resolution will produce different results. Etc.
Yes. That will set the default expectations that things work just like in C++, which is a great thing. But introduction of partial subset linking will break some of those "just works" assumptions and it may be triggered by the parts of the build outside of user's control (e.g. by a third-party library). Side note: we do need a good term for this kind of subset linking. "partial linking" already has established meaning and it's not a good fit here as we actually produce a fully linked GPU executable.
We do need to document how it works. Documenting what does not work, or works differently is also important, IMO. |
It'll definitely have an effect different from full linking, but the idea is that it would be the desired effect if someone went out of their way to do this GPU subset linking thing.
This was one of the things I was wondering about, since we could alternatively make a new flag for this outside of
Yeah, coming up with a name is difficult. You could just call it device linking, since it's more or less just doing the device link step ahead of time instead of passing it to when we make the final executable.
Yes, I should probably update this with some documentation. I'm not sure where it would go however, maybe just in the |
That would only be true when someone owns the whole build. That will not be the case in practice. A large enough project is usually a bunch of libraries created by different teams and vendors. They may or may not be built together and how a particular library is built is often controlled by its owner and may not be visible to the end user. The owners may consider switching to device linking to be benign or irrelevant to the end users, but it will be observable by those upstream users. Being aware of the quirks introduced by device linking will be required for the owners of those libraries. You do know how it all works under the hood. Pretty much nobody else on the planet does. :-) Anyways. I think we're in agreement that we do need to document possible implications. clang-linker-wrapper docs would do. |
To me, if the person owning the library chooses to do this in their build system it would speak to an intentional decision not to leak the GPU parts of their code. with shipped via a static library. More or less, this option would give people a way to ship static libraries with the same GPU / Offloading semantics as a shared library. Another advantage of this option is that it allows someone to build an application with GPU offloading using
Very true, I need to remind myself that I'm probably the only one who knows how this thing works in detail.
Will do. |
ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks
Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the
.llvm.offloading
section so it can belinked later. We also use a special section called something like
omp_offloading_entries
to store all the globals that need to beregistered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.
One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via
-r
to support a kindof "partial" RDC compilation for offloading languages.
This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entries that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.
Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.
One question is whether or not this should be the default behavior of
-r
when run through the linker-wrapper or a special option. Standard-r
behavior is still possible if used without invoking thelinker-wrapper and it guaranteed to be correct.