I've been trying to build IOR with GPU Direct Storage support and have been running into some issues. I believe there are some mistakes in how this has been implemented in the build system:
Expected behavior
From the configure --help:
...
--with-gpfs support configurable GPFS [default=check]
--with-cuda support configurable CUDA [default=check]
--with-gpuDirect support configurable GPUDirect [default=check]
...
This seems to suggest that configure will check whether the correct binaries/libraries/headers are present to support gpuDirect/CUDA (note that this indeed seems to be the behaviour for GPFS).
Thus, I would expect that:
./configure
successfully autodetects that I have CUDA installed, and enables gpuDirect.
Attempt 1
However, when doing
./configure
make -j 128 V=1
It errors with:
mpicc -g -O2 -Lcheck/lib64 -Wl,--enable-new-dtags -Wl,-rpath=check/lib64 -Lcheck/lib64 -Wl,--enable-new-dtags -Wl,-rpath=check/lib64 -o md-workbench md_workbench-md-workbench-main.o md_workbench-aiori.o md_workbench-aiori-DUMMY.o md_workbench-aiori-MPIIO.o md_workbench-aiori-MMAP.o md_workbench-aiori-POSIX.o libaiori.a -lcufile -lcudart -lgpfs -lm
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `update_write_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:100: undefined reference to `update_write_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `generate_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:140: undefined reference to `generate_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `verify_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:186: undefined reference to `verify_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `update_write_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:100: undefined reference to `update_write_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `generate_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:140: undefined reference to `generate_memory_pattern_gpu'
/sw/arch/RHEL8/EB_production/2023/software/binutils/2.40-GCCcore-12.3.0/bin/ld: libaiori.a(libaiori_a-utilities.o): in function `verify_memory_pattern':
/home/casparl/.local/easybuild/sources/i/IOR/ior-4.0.0/src/utilities.c:186: undefined reference to `verify_memory_pattern_gpu'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:1026: ior] Error 1
The missing symbols are defined in utilities-gpu.cu, and indeed that file never seems to be compiled into an object file.
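Presumably the fix is to make sure utilities-gpu.cu ends up in the list of sources whenever GPUDirect support is enabled. Purely as a hypothetical sketch (I haven't checked how the Makefile.am is actually organised, and the conditional name here is my own invention):

# Hypothetical Makefile.am fragment: only compile the CUDA helper when
# configure decided that GPUDirect support should be built.
if HAVE_GPU_DIRECT
libaiori_a_SOURCES += utilities-gpu.cu
endif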
It is also clear that link paths like -Lcheck/lib64 are not intended to be there: configure is taking the literal default value of the --with-cuda argument ("check") and passing it to the linker as a search directory. Note that those come from e.g. this line. I think you should only append to LDFLAGS and CPPFLAGS if the user has passed a non-standard location as the argument. When the argument is still the default, you should just try to locate the headers; in my case, they are on the CPATH and the compiler will find them just fine, so there is no need to append anything.
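To illustrate what I mean, here is a minimal configure.ac sketch under my own assumptions (this is not IOR's actual code; the checks chosen are just an example):

# Sketch: only touch CPPFLAGS/LDFLAGS when the user gave an explicit path,
# i.e. something other than the literal "yes"/"no"/"check" values.
AC_ARG_WITH([cuda],
  [AS_HELP_STRING([--with-cuda], [support configurable CUDA (default: check)])],
  [], [with_cuda=check])

AS_IF([test "x$with_cuda" != xno], [
  dnl Only an explicit path (not yes/check) gets appended to the flags;
  dnl otherwise rely on the compiler's default search paths / CPATH.
  AS_CASE([$with_cuda],
    [yes|check], [],
    [CPPFLAGS="$CPPFLAGS -I$with_cuda/include"
     LDFLAGS="$LDFLAGS -L$with_cuda/lib64"])
  AC_CHECK_HEADERS([cuda_runtime.h])
  AC_CHECK_LIB([cudart], [cudaMalloc])
])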
On a side note: I see you are setting an rpath in your LDFLAGS; you might want to reconsider that. It is not really standard behaviour and could, for example, cause issues when the CUDA installation is in a different location on the build machine than on the machine the benchmark runs on (not unthinkable on an HPC system). In my humble opinion, it is the end user who is responsible for making sure that linked libraries are found at runtime.
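Users who do want an rpath baked in can always pass it themselves at configure time, for example (illustrative path):

./configure LDFLAGS="-Wl,-rpath,/path/to/cuda/lib64"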
Attempt 2
In a second attempt, I was more explicit:
./configure --with-gpuDirect --with-cuda=/sw/arch/RHEL8/EB_production/2023/software/CUDA/12.1.1
The standard output of this configure run looks the same as before, but the config.log does not. Now my build does complete, but the config.log still looks messy:
It seems to be using yes as a prefix somewhere; see e.g. the CPPFLAGS, which includes -Iyes/include
It seems to define #define HAVE_GPU_DIRECT twice now
Anyway, these points don't actually seem to break anything, but it would be nice to clean them up nonetheless.
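The duplicate define presumably means AC_DEFINE is reached twice, once from the CUDA branch and once from the GPUDirect branch; guarding a single call would avoid that. A hypothetical sketch (the shell variable name is my assumption):

# Sketch: define HAVE_GPU_DIRECT exactly once, after all checks have run.
AS_IF([test "x$with_gpuDirect" = xyes],
  [AC_DEFINE([HAVE_GPU_DIRECT], [1], [Build with GPUDirect Storage support])])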
Attempt 3
In a third attempt, I tried to build with optimization arguments. I'm building software for HPC systems, and we optimize all software by default for the hardware architecture it is going to run on.
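Concretely, that means passing architecture-specific flags through CFLAGS at configure time, along these lines (illustrative only; the flags are the ones that show up in the failing nvcc line below):

./configure --with-gpuDirect CFLAGS="-O3 -mavx2 -mfma -fno-math-errno"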
$ make -j 128 V=1
Making all in src
make all-recursive
Making all in .
nvcc -O3 -mavx2 -mfma -fno-math-errno -c -o utilities-gpu.o utilities-gpu.cu
nvcc fatal : Unknown option '-mavx2'
make[3]: *** [Makefile:3721: utilities-gpu.o] Error 1
This makes sense: nvcc doesn't know about such optimization arguments. When and where to pass CFLAGS/CXXFLAGS/etc. is always a bit of a pain. However, it is my understanding that the best practice for CUDA code is not to pass CFLAGS/CXXFLAGS/etc. to nvcc as-is, since they are likely to contain arguments unknown to the CUDA compiler. The best example might be NVIDIA's own CUDA-samples, see e.g. here, where the problem is solved by using a separate NVCCFLAGS variable for nvcc-specific flags and passing the CFLAGS through the -Xcompiler option ("Specify options directly to the compiler/preprocessor", i.e. they are forwarded to the host compiler). In that case, replacing this with $(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) -c -o $@ $< is already a good first step and will avoid most issues, though you may want $(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) $(NVCCFLAGS) -c -o $@ $< to also allow the user to specify nvcc-specific flags.
I made the replacement with $(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) -c -o $@ $< in Makefile.in and Makefile.am manually, and with that the corrected nvcc invocation is used during the build and the build completes successfully.
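For concreteness, the rule I ended up with looks roughly like the following sketch (how NVCC and NVCCFLAGS get substituted, and where exactly the rule lives, depends on the build setup, so treat those details as assumptions on my side):

# Sketch of a .cu suffix rule for src/Makefile.am. GNU make's addprefix
# puts -Xcompiler in front of every host-compiler flag, so nvcc forwards
# them to the host compiler instead of choking on e.g. -mavx2.
SUFFIXES = .cu
.cu.o:
	$(NVCC) $(addprefix -Xcompiler ,$(CFLAGS)) $(NVCCFLAGS) -c -o $@ $<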
Thanks for the detailed bug report and for trying this out; I know this is a bit too messy.
Can you try adding --with-nvcc to the configure call, to see if that at least picks up the nvcc invocation?
As you realized, nvcc is actually needed to compile utilities-gpu.cu.
Your feedback on the NVCC invocation makes sense; if it at least tries to build with --with-nvcc, the NVCCFLAGS fix can be added.
Do you mean that adding --with-nvcc would trigger a compilation of utilities-gpu.cu? I'm afraid it doesn't. Configuring like that and building, I still get my original error, and indeed no object file is found for utilities-gpu.cu:
That also results in the same issue (utilities-gpu.cu not being compiled). I also tried:
./configure --with-gpuDirect
That does compile the object file for utilities-gpu.cu (and the build then completes, provided I put the fix in place to prefix the CFLAGS with -Xcompiler for the nvcc-compiled part).
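So, to summarize what eventually worked for me (the Makefile edit being the manual -Xcompiler change described above):

./configure --with-gpuDirect
# manual edit: prefix $(CFLAGS) with -Xcompiler in the nvcc rule in Makefile.am / Makefile.in
make -j 128 V=1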