tpetra: TpetraCore_FECrs_MatrixMatrix_UnitTests hanging in Cuda Debug build #13339
Comments
"Debug mode" meaning a Debug build? Can you please share how you're configuring Trilinos?
Hi @cwpearson, Indeed, I meant "Debug build". This is the These are a few more details:
Don't hesitate to let us know if you need more info. Thanks in advance!
I believe I've reproduced this with the following:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y git python3 libopenblas-dev
ENV OPENMPI_SRC /opt/openmpi-5.0.4
ENV OPENMPI_BUILD /opt/openmpi-build-5.0.4
ADD https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.4.tar.bz2 \
/opt/.
RUN tar -C /opt -xf /opt/openmpi-5.0.4.tar.bz2
RUN mkdir -p ${OPENMPI_BUILD} \
&& cd ${OPENMPI_BUILD} \
&& ${OPENMPI_SRC}/configure --prefix=/usr/local --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs \
&& make -j$(nproc) all \
&& make -j$(nproc) install \
&& ldconfig
ENV TRILINOS_SRC /opt/trilinos-develop
ENV TRILINOS_BUILD /opt/trilinos-build-develop
RUN git clone --branch develop --depth 1 https://github.com/trilinos/Trilinos.git $TRILINOS_SRC
ADD https://github.com/Kitware/CMake/releases/download/v3.30.2/cmake-3.30.2-linux-x86_64.tar.gz \
/opt/.
RUN ls /opt
RUN tar -C /usr/local --strip-components=1 -xvf /opt/cmake-3.30.2-linux-x86_64.tar.gz
# openmpi doesn't want to run as root
RUN useradd -ms /bin/bash runner
RUN mkdir -p $TRILINOS_BUILD
RUN chown -R runner /opt/trilinos-build-develop
RUN chown -R runner /opt/trilinos-develop
USER runner
WORKDIR /home/runner

podman build -t tr-issue-13339 .
podman run --device nvidia.com/gpu=all --rm -it tr-issue-13339

export OMPI_CXX=$TRILINOS_SRC/packages/kokkos/bin/nvcc_wrapper
mkdir -p $TRILINOS_BUILD
cmake -S $TRILINOS_SRC -B $TRILINOS_BUILD \
-DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_CXX_COMPILER=mpicxx \
-DTrilinos_ENABLE_Fortran=OFF \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON \
-DTPL_ENABLE_MPI=ON \
-DTPL_ENABLE_CUDA=ON \
-DTPL_ENABLE_CUSPARSE=ON \
-DKokkos_ENABLE_CUDA=ON \
-DKokkos_ENABLE_CUDA_LAMBDA=ON \
-DKokkos_ENABLE_CUDA_CONSTEXPR=ON \
-DKokkos_ENABLE_CUDA_UVM=OFF \
-DKokkos_ARCH_REPL_STR_CUDA_FAMILY=ON \
-DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON \
-DTrilinos_ENABLE_Tpetra=ON \
-DTpetra_ENABLE_TESTS=ON \
|& tee $TRILINOS_BUILD/configure.log
nice -n20 make -j $(nproc) -C $TRILINOS_BUILD/packages/tpetra/core/test/MatrixMatrix
ctest --test-dir $TRILINOS_BUILD/packages/tpetra/core/test/MatrixMatrix -R FECrs_MatrixMatrix -V
I think what is happening is that #13052 caused many more configurations of Trilinos to use cuSparse SpGEMM rather than the internal Kokkos Kernels implementation. Some of the ranks are bailing out when the B matrix is unsorted (a check we only do for debug builds + TPLs, so even your debug build wasn't hitting that check before). I don't know why we're not seeing a throw message here, but this is where two of the ranks die, causing a hang:
Trilinos/packages/kokkos-kernels/sparse/src/KokkosSparse_spgemm_symbolic.hpp
Lines 158 to 161 in 26e307f
@csiefer2, is there something about this test we should adjust to handle sorted/unsorted matrices?
Trilinos/packages/tpetra/core/test/MatrixMatrix/FECrs_MatrixMatrix_UnitTests.cpp
Lines 346 to 347 in 26e307f
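To make the "unsorted B matrix" condition concrete, here is a minimal host-side sketch of such a check on plain CRS arrays. It is not the Kokkos Kernels code referenced above: the function name `crs_rows_sorted` and the use of `std::vector` are assumptions made for illustration, and it treats a row as sorted when its column indices are in non-decreasing order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Kokkos Kernels or Tpetra.
// Returns true if, within every row of a CRS matrix, the column
// indices appear in non-decreasing order.
bool crs_rows_sorted(const std::vector<std::size_t>& row_ptr,
                     const std::vector<int>& col_idx) {
  // row_ptr has (num_rows + 1) entries and ends at the number of nonzeros.
  assert(!row_ptr.empty() && row_ptr.back() == col_idx.size());
  for (std::size_t row = 0; row + 1 < row_ptr.size(); ++row) {
    // Compare each entry with its predecessor in the same row.
    for (std::size_t k = row_ptr[row] + 1; k < row_ptr[row + 1]; ++k) {
      if (col_idx[k - 1] > col_idx[k]) {
        return false;
      }
    }
  }
  return true;
}
```

If a check like this fails on only some ranks and those ranks abort while the others proceed to a collective operation, the remaining ranks wait forever, which would match the hang described here.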
@brian-kelley says that these matrices are not expected to be sorted, so I guess we need to handle this inside Tpetra when Kokkos Kernels uses the cuSparse SpGEMM.
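One way to handle unsorted input is to sort each row's column indices, together with the matching values, before handing the matrix to a routine that requires sorted rows. The sketch below shows that idea on plain host arrays; it is an assumption-laden illustration, not the approach taken in Tpetra or in #13424, and the name `sort_crs_rows` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of Tpetra or Kokkos Kernels.
// Sorts the entries of each CRS row by column index, keeping every
// value paired with its column.
void sort_crs_rows(const std::vector<std::size_t>& row_ptr,
                   std::vector<int>& col_idx,
                   std::vector<double>& values) {
  assert(!row_ptr.empty() && row_ptr.back() == col_idx.size());
  assert(col_idx.size() == values.size());

  std::vector<std::size_t> perm;
  std::vector<int> cols_tmp;
  std::vector<double> vals_tmp;

  for (std::size_t row = 0; row + 1 < row_ptr.size(); ++row) {
    const std::size_t begin = row_ptr[row];
    const std::size_t len = row_ptr[row + 1] - begin;

    // Build the permutation that orders this row by column index.
    perm.resize(len);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) {
                       return col_idx[begin + a] < col_idx[begin + b];
                     });

    // Apply the permutation to columns and values via scratch buffers.
    cols_tmp.resize(len);
    vals_tmp.resize(len);
    for (std::size_t i = 0; i < len; ++i) {
      cols_tmp[i] = col_idx[begin + perm[i]];
      vals_tmp[i] = values[begin + perm[i]];
    }
    const auto offset = static_cast<std::ptrdiff_t>(begin);
    std::copy(cols_tmp.begin(), cols_tmp.end(), col_idx.begin() + offset);
    std::copy(vals_tmp.begin(), vals_tmp.end(), values.begin() + offset);
  }
}
```

A real implementation for this issue would operate on device-resident data, which this host-only sketch does not attempt.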
@maartenarnst can you check if #13424 solves your problem?
Hi @cwpearson. Thanks for the PR! I have just launched our build pipeline with the PR as a patch. It will take a few hours to compile everything. Once it's done, I'll report back. Fingers crossed :)
@maartenarnst how are things looking?
Hi @cwpearson. Thanks for the reminder! Sorry for not having gotten back to you. The fix solved the issue indeed. So, no news = good news :)
@trilinos/tpetra
@cwpearson
Bug Report
We observe that the TpetraCore_FECrs_MatrixMatrix_UnitTests test hangs in our Cuda build in Debug mode.
We use Cuda 12.4.1, compiled with nvcc with gcc 12.3, and with Cusparse enabled as a TPL. The hang only arises in Debug mode. It is not random: the test hangs every time.
It seems the test hangs on the following line:
Trilinos/packages/tpetra/core/test/MatrixMatrix/FECrs_MatrixMatrix_UnitTests.cpp
Line 346 in 441b0e1
and, therein, on the following line:
Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix_Cuda.hpp
Line 203 in 441b0e1
This line was introduced in a recent PR:
We are not sure what exactly is causing the issue or how to solve it.
Also tagging @romintomasetti.