SUMMA Algorithm does not successfully run anymore #491
Comments
I cannot reproduce this locally, both using Clang and GCC with Open MPI. I cannot make much of the stack trace. Can you post a stack trace acquired using GDB? I also threw Valgrind at it and it didn't report any oddities. |
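For anyone wanting to supply the requested backtrace, one possible way is to start every rank under GDB in batch mode. The sketch below only prints the command so it can be inspected first. The binary path `./dash-test-mpi` is an assumption and has to be replaced by the test executable of the actual build; `-batch`, `-ex` and `--args` are standard GDB options and `--gtest_filter` is the GoogleTest test selector.

```shell
#!/bin/sh
# Sketch: print a command that runs a single GoogleTest case under GDB on
# every MPI rank and dumps a backtrace once a rank stops.
# TEST_BIN is a hypothetical path; point it at the test binary of your build.
TEST_BIN="${TEST_BIN:-./dash-test-mpi}"
UNITS="${UNITS:-2}"

gdb_backtrace_cmd() {
  # $1: number of units (MPI processes), $2: path to the test executable.
  # gdb -batch exits after the -ex commands have run;
  # --args hands everything after the binary to the test executable.
  echo "mpirun -n $1 gdb -batch -ex run -ex bt --args $2 --gtest_filter=SUMMATest.SeqTilePatternMatrix"
}

gdb_backtrace_cmd "$UNITS" "$TEST_BIN"
```

The printed line can be pasted into the shell as is; the output of all ranks is interleaved, so redirecting it to a file makes the backtrace easier to read.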
I have tested with Clang, Open MPI and the following build file. The most important thing is that shared memory windows are disabled. Also make sure that the SUMMATest is actually executed, i.e., see the
export CC=clang
export CXX=clang++
mkdir -p $BUILD_DIR
rm -Rf $BUILD_DIR/*
(cd $BUILD_DIR && cmake -DCMAKE_BUILD_TYPE=Debug \
-DBUILD_SHARED_LIBS=OFF \
-DBUILD_GENERIC=OFF \
-DENVIRONMENT_TYPE=default \
-DINSTALL_PREFIX=${INSTALL_PREFIX:=$HOME/opt/dash-0.3.0/} \
-DDART_IMPLEMENTATIONS=mpi \
-DENABLE_THREADSUPPORT=ON \
-DENABLE_DEV_COMPILER_WARNINGS=ON \
-DENABLE_EXT_COMPILER_WARNINGS=ON \
-DENABLE_LT_OPTIMIZATION=OFF \
-DENABLE_ASSERTIONS=ON \
\
-DENABLE_SHARED_WINDOWS=OFF \
-DENABLE_DYNAMIC_WINDOWS=ON \
-DENABLE_UNIFIED_MEMORY_MODEL=ON \
-DENABLE_DEFAULT_INDEX_TYPE_LONG=ON \
\
-DENABLE_LOGGING=${ENABLE_LOGGING:=ON} \
-DENABLE_TRACE_LOGGING=${ENABLE_TRACE_LOGGING:=ON} \
-DENABLE_DART_LOGGING=${ENABLE_DART_LOGGING:=ON} \
\
-DENABLE_LIBNUMA=ON \
-DENABLE_LIKWID=OFF \
-DENABLE_HWLOC=ON \
-DENABLE_PAPI=ON \
-DENABLE_MKL=ON \
-DENABLE_BLAS=ON \
-DENABLE_LAPACK=ON \
-DENABLE_SCALAPACK=OFF \
-DENABLE_PLASMA=OFF \
-DENABLE_HDF5=OFF \
-DENABLE_MEMKIND=ON \
\
-DBUILD_EXAMPLES=OFF \
-DBUILD_TESTS=ON \
-DBUILD_DOCS=OFF \
\
-DIPM_PREFIX=${IPM_HOME} \
-DPAPI_PREFIX=${PAPI_HOME} \
\
-DGTEST_LIBRARY_PATH=${HOME}/opt/gtest.clang/lib \
-DGTEST_INCLUDE_PATH=${HOME}/opt/gtest.clang/include \
\
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
../ && \
|
GDB backtrace:
|
Still no luck. I'm pretty certain that the SUMMATest runs, though (takes approx. 15s). What version of Open MPI are you using? (it looks like it's a system installation so it might be hopelessly out of date...) Single or multiple node? What does Valgrind report? |
Just ran on our Linux cluster, seeing the following OOB exception when running with 2 units:
Fun fact: it works with 4 and 16 processes but not with 6 or 8. I suspect this is a bug in the data distribution or SUMMA (as it works in cases in which there is an equal number of processes in both dimensions). Not gonna touch that... |
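To make the pattern in these unit counts explicit, here is a small sketch that classifies them. It assumes that "an equal number of processes in both dimensions" means the unit count is a perfect square; how DASH actually arranges the units is not shown in this thread.

```shell
#!/bin/sh
# Sketch: classify unit counts by whether they are perfect squares,
# i.e. whether they allow an n x n arrangement of units.
is_square() {
  # Count up until i*i reaches or passes $1, then compare.
  i=0
  while [ $((i * i)) -lt "$1" ]; do
    i=$((i + 1))
  done
  [ $((i * i)) -eq "$1" ]
}

# Unit counts mentioned in this thread.
for units in 2 4 6 8 16; do
  if is_square "$units"; then
    echo "$units: square"
  else
    echo "$units: not square"
  fi
done
```

Under that assumption, the counts reported as working (4 and 16) are exactly the square ones, while 2, 6 and 8 are not.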
Does it work with a previous state of DASH? Esp. prior to changes on GlobRef and the like? |
... ah, and: There is a comprehensive performance evaluation of dash::summa using arbitrary numbers of units; test results, job setups and environment configs are documented in the wiki. |
I cannot really reproduce this issue and would attribute it to an old Open MPI version (2.1.0). With versions 2.1.0 and 3.1.0 everything works as expected. |
After merging #451 the SUMMATest does not successfully run anymore. It seems that there is some memory corruption. See the attached stack trace.
I built with clang 5.0.1. Tracing and debug flags are enabled. Unfortunately, I had to cut a large portion of the trace file. Nevertheless, the interesting part is included.
Steps to reproduce: Start SUMMATest.SeqTilePatternMatrix with at least 2 units. The attached trace file was produced with 3 processes because, for whatever reason, that gave a more detailed trace of the memory corruption.
log.txt
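The reproduction step could be scripted roughly as follows. This is a sketch under assumptions: `./dash-test-mpi` is a hypothetical name for the test executable, and the script defaults to a dry run that only prints the command. Set `DRY_RUN=0` to actually launch it.

```shell
#!/bin/sh
# Sketch: run only SUMMATest.SeqTilePatternMatrix with a given number of units.
# TEST_BIN is a hypothetical path; adjust it to your build directory.
TEST_BIN="${TEST_BIN:-./dash-test-mpi}"
# Dry run by default: print the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run_summa_test() {
  # $1: number of units (MPI processes)
  if [ "$DRY_RUN" = "1" ]; then
    echo "mpirun -n $1 $TEST_BIN --gtest_filter=SUMMATest.SeqTilePatternMatrix"
  else
    mpirun -n "$1" "$TEST_BIN" --gtest_filter=SUMMATest.SeqTilePatternMatrix
  fi
}

# 3 units matches the attached trace; any count of at least 2 should do.
run_summa_test "${UNITS:-3}"
```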