tpetra: TpetraCore_FECrs_MatrixMatrix_UnitTests hanging in Cuda Debug build #13339

Closed
maartenarnst opened this issue Aug 9, 2024 · 9 comments
Labels
pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@maartenarnst
Contributor

@trilinos/tpetra
@cwpearson

Bug Report

We observe that the TpetraCore_FECrs_MatrixMatrix_UnitTests test hangs in our CUDA build in Debug mode.

We use CUDA 12.4.1, compiling with nvcc on top of gcc 12.3, with cuSPARSE enabled as a TPL. The hang arises only in Debug mode. It is not random: the test hangs every time.

It seems the test hangs on the following line:

and, therein, on the following line:

This line was introduced in a recent PR:

We are not sure what exactly is causing the issue or how to solve it.

Also tagging @romintomasetti.

@maartenarnst maartenarnst added the type: bug The primary issue is a bug in Trilinos code or tests label Aug 9, 2024
@cwpearson cwpearson self-assigned this Aug 9, 2024
@cwpearson
Contributor

"Debug mode" meaning a Debug build? Can you please share how you're configuring Trilinos?

@maartenarnst
Contributor Author

Hi @cwpearson,

Indeed, I meant "Debug build".

This is the CMakePresets.json that we use to configure Trilinos: trilinos.cmake.presets.json.txt

These are a few more details:

  • We build our Cuda build in a Docker image based on nvidia/cuda:12.4.1-devel-ubuntu22.04.
  • In this image, we compile a GPU-aware OpenMPI v5.
  • The TPLs that we enable are: mkl, cusparse, metis, parmetis, and scotch.
  • In the presets file, you will see a few keywords that start with "REPL_STR", such as "REPL_STR_MPIDISTRO_CXX". Our build script replaces these keywords with the appropriate ones, such as "OMPI_CXX".

Don't hesitate to let us know if you need more info. Thanks in advance!

@cwpearson
Contributor

I believe I've reproduced this with the following:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y git python3 libopenblas-dev

ENV OPENMPI_SRC /opt/openmpi-5.0.4
ENV OPENMPI_BUILD /opt/openmpi-build-5.0.4

ADD https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.4.tar.bz2 \
  /opt/.

RUN tar -C /opt -xf /opt/openmpi-5.0.4.tar.bz2

RUN mkdir -p ${OPENMPI_BUILD} \
 && cd ${OPENMPI_BUILD} \
 && ${OPENMPI_SRC}/configure --prefix=/usr/local --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs \
 && make -j$(nproc) all \
 && make -j$(nproc) install \
 && ldconfig

ENV TRILINOS_SRC /opt/trilinos-develop
ENV TRILINOS_BUILD /opt/trilinos-build-develop

RUN git clone --branch develop --depth 1 https://github.com/trilinos/Trilinos.git $TRILINOS_SRC

ADD https://github.com/Kitware/CMake/releases/download/v3.30.2/cmake-3.30.2-linux-x86_64.tar.gz \
 /opt/.
RUN ls /opt
RUN tar -C /usr/local --strip-components=1 -xvf /opt/cmake-3.30.2-linux-x86_64.tar.gz

# openmpi doesn't want to run as root
RUN useradd -ms /bin/bash runner
RUN mkdir -p $TRILINOS_BUILD
RUN chown -R runner /opt/trilinos-build-develop
RUN chown -R runner /opt/trilinos-develop
USER runner
WORKDIR /home/runner

podman build -t tr-issue-13339 .
podman run --device nvidia.com/gpu=all --rm -it tr-issue-13339

export OMPI_CXX=$TRILINOS_SRC/packages/kokkos/bin/nvcc_wrapper
mkdir -p $TRILINOS_BUILD
cmake -S $TRILINOS_SRC -B $TRILINOS_BUILD \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DTrilinos_ENABLE_Fortran=OFF \
  -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON \
  -DTPL_ENABLE_MPI=ON \
  -DTPL_ENABLE_CUDA=ON \
  -DTPL_ENABLE_CUSPARSE=ON \
  -DKokkos_ENABLE_CUDA=ON \
  -DKokkos_ENABLE_CUDA_LAMBDA=ON \
  -DKokkos_ENABLE_CUDA_CONSTEXPR=ON \
  -DKokkos_ENABLE_CUDA_UVM=OFF \
  -DKokkos_ARCH_REPL_STR_CUDA_FAMILY=ON \
  -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=ON \
  -DTrilinos_ENABLE_Tpetra=ON \
  -DTpetra_ENABLE_TESTS=ON \
  |& tee $TRILINOS_BUILD/configure.log


nice -n20 make -j $(nproc) -C $TRILINOS_BUILD/packages/tpetra/core/test/MatrixMatrix

ctest --test-dir $TRILINOS_BUILD/packages/tpetra/core/test/MatrixMatrix -R FECrs_MatrixMatrix -V

@cwpearson
Contributor

cwpearson commented Aug 12, 2024

I think what is happening is that #13052 caused many more configurations of Trilinos to use cuSparse SpGEMM rather than the internal Kokkos Kernels implementation. Some of the ranks are bailing out when the B matrix is unsorted (a check we only do for debug builds + TPLs, so even your debug build wasn't hitting that check before). I don't know why we're not seeing a throw message here, but this is where two of the ranks die, causing a hang:

if (!KokkosSparse::Impl::isCrsGraphSorted(const_b_r, const_b_l))
  throw std::runtime_error(
      "KokkosSparse::spgemm_symbolic: entries of B are not sorted within "
      "rows. May use KokkosSparse::sort_crs_matrix to sort it.");

@csiefer2, is there something about this test we should adjust to handle sorted/unsorted matrices?

Tpetra::MatrixMatrix::Multiply(feA, transA, feB, transB, feC, true, "", params);
Tpetra::MatrixMatrix::Multiply( A, transA, B, transB, C, true, "", params);
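
For readers unfamiliar with the check quoted above, here is a minimal host-side sketch of what "sorted within rows" means for a CRS graph. It is plain C++ with standard containers, not the Kokkos Kernels implementation; the names `rowsAreSorted` and `requireSortedRows` and the use of `std::vector` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the Kokkos Kernels code): returns true if, in every
// row of a CRS graph, the column indices appear in non-decreasing order.
// rowPtr has one entry per row plus a final entry equal to colInd.size().
bool rowsAreSorted(const std::vector<std::size_t>& rowPtr,
                   const std::vector<int>& colInd) {
  assert(!rowPtr.empty() && rowPtr.back() == colInd.size());
  for (std::size_t row = 0; row + 1 < rowPtr.size(); ++row) {
    // Compare each entry with its predecessor inside the same row.
    for (std::size_t k = rowPtr[row] + 1; k < rowPtr[row + 1]; ++k) {
      if (colInd[k - 1] > colInd[k]) return false;
    }
  }
  return true;
}

// Hypothetical wrapper mirroring the shape of the quoted check: throw when
// the rows are not sorted. The message text here is illustrative.
void requireSortedRows(const std::vector<std::size_t>& rowPtr,
                       const std::vector<int>& colInd) {
  if (!rowsAreSorted(rowPtr, colInd)) {
    throw std::runtime_error("entries are not sorted within rows");
  }
}
```

With row pointers {0, 2, 4}, the column indices {0, 2, 1, 3} pass the check, while {2, 0, 1, 3} fail it because row 0 lists column 2 before column 0.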

@jhux2 jhux2 added this to Tpetra Aug 12, 2024
@jhux2 jhux2 moved this to Needs Triage in Tpetra Aug 12, 2024
@cwpearson
Contributor

@brian-kelley says that these matrices are not expected to be sorted, so I guess we need to handle this inside Tpetra when Kokkos Kernels uses the cuSparse SpGEMM.
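
One way to picture "handling this" is to sort each row's entries before the matrix reaches a routine that requires sorted rows. The sketch below does that on the host for a toy CRS container. `CrsMatrix` and `sortRowsInPlace` are hypothetical names, and this is neither the change made in Tpetra nor the `KokkosSparse::sort_crs_matrix` routine named in the error message above; it only illustrates the idea of permuting column indices and values together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical minimal host-side CRS container, for illustration only.
struct CrsMatrix {
  std::vector<std::size_t> rowPtr;  // size: numRows + 1
  std::vector<int> colInd;          // column index of each stored entry
  std::vector<double> values;       // value of each stored entry
};

// Sorts the entries of every row by column index, moving each value along
// with its column index so the matrix itself is unchanged.
void sortRowsInPlace(CrsMatrix& m) {
  assert(!m.rowPtr.empty() && m.rowPtr.back() == m.colInd.size());
  assert(m.colInd.size() == m.values.size());
  std::vector<std::size_t> perm;
  std::vector<int> cols;
  std::vector<double> vals;
  for (std::size_t row = 0; row + 1 < m.rowPtr.size(); ++row) {
    const std::size_t begin = m.rowPtr[row];
    const std::size_t len = m.rowPtr[row + 1] - begin;
    // Build the permutation that orders this row by column index.
    perm.resize(len);
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) {
                       return m.colInd[begin + a] < m.colInd[begin + b];
                     });
    // Apply the permutation to both arrays via temporary buffers.
    cols.resize(len);
    vals.resize(len);
    for (std::size_t i = 0; i < len; ++i) {
      cols[i] = m.colInd[begin + perm[i]];
      vals[i] = m.values[begin + perm[i]];
    }
    const auto offset = static_cast<std::ptrdiff_t>(begin);
    std::copy(cols.begin(), cols.end(), m.colInd.begin() + offset);
    std::copy(vals.begin(), vals.end(), m.values.begin() + offset);
  }
}
```

Sorting the column indices alone would silently change the matrix, which is why the sketch moves the values with the same permutation.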

@cwpearson
Contributor

@maartenarnst can you check if #13424 solves your problem?

@maartenarnst
Contributor Author

Hi @cwpearson. Thanks for the PR! I have just launched our build pipeline with the PR as a patch. It will take a few hours to compile everything. Once it's done, I'll report back. Fingers crossed :)

@cwpearson
Contributor

@maartenarnst how are things looking?

@maartenarnst
Contributor Author

Hi @cwpearson. Thanks for the reminder! Sorry for not having gotten back to you. The fix solved the issue indeed. So, no news = good news :)

@github-project-automation github-project-automation bot moved this from Needs Triage to Done in Tpetra Sep 19, 2024