SUMMA Algorithm does not successfully run anymore #491

Closed
rkowalewski opened this issue Feb 15, 2018 · 8 comments

@rkowalewski

After merging #451, the SUMMATest no longer runs successfully. There seems to be some memory corruption; see the attached stack trace.
I built with clang 5.0.1, with tracing and debug flags enabled. Unfortunately, I had to cut a large portion of the trace file, but the interesting part is included.

Steps to reproduce: Start SUMMATest.SeqTilePatternMatrix with at least 2 units. The attached trace file was produced with 3 processes because that run gave (for whatever reason) a more detailed trace of the memory corruption.

log.txt

@rkowalewski rkowalewski changed the title SUMMA does not run anymore SUMMA Algorithm does not successfully run anymore Feb 15, 2018
@rkowalewski rkowalewski added this to the dash-0.3.0 milestone Feb 15, 2018
@devreal
Member

devreal commented Feb 16, 2018

I cannot reproduce this locally, both using Clang and GCC with Open MPI. I cannot make much of the stack trace. Can you post a stack trace acquired using GDB? I also threw Valgrind at it and it didn't report any oddities.

@rkowalewski
Author

I have tested with clang, openmpi and the following build file. The most important thing is that shared memory windows are disabled. Also make sure that the SUMMATest is actually executed rather than skipped (see the SKIP_TEST_IF_NO_SUMMA macro). @pascalj can reproduce it as well.

export CC=clang
export CXX=clang++
mkdir -p $BUILD_DIR
rm -Rf $BUILD_DIR/*
(cd $BUILD_DIR && cmake -DCMAKE_BUILD_TYPE=Debug \
                        -DBUILD_SHARED_LIBS=OFF \
                        -DBUILD_GENERIC=OFF \
                        -DENVIRONMENT_TYPE=default \
                        -DINSTALL_PREFIX=${INSTALL_PREFIX:=$HOME/opt/dash-0.3.0/} \
                        -DDART_IMPLEMENTATIONS=mpi \
                        -DENABLE_THREADSUPPORT=ON \
                        -DENABLE_DEV_COMPILER_WARNINGS=ON \
                        -DENABLE_EXT_COMPILER_WARNINGS=ON \
                        -DENABLE_LT_OPTIMIZATION=OFF \
                        -DENABLE_ASSERTIONS=ON \
                        \
                        -DENABLE_SHARED_WINDOWS=OFF \
                        -DENABLE_DYNAMIC_WINDOWS=ON \
                        -DENABLE_UNIFIED_MEMORY_MODEL=ON \
                        -DENABLE_DEFAULT_INDEX_TYPE_LONG=ON \
                        \
                        -DENABLE_LOGGING=${ENABLE_LOGGING:=ON} \
                        -DENABLE_TRACE_LOGGING=${ENABLE_TRACE_LOGGING:=ON} \
                        -DENABLE_DART_LOGGING=${ENABLE_DART_LOGGING:=ON}  \
                        \
                        -DENABLE_LIBNUMA=ON \
                        -DENABLE_LIKWID=OFF \
                        -DENABLE_HWLOC=ON \
                        -DENABLE_PAPI=ON \
                        -DENABLE_MKL=ON \
                        -DENABLE_BLAS=ON \
                        -DENABLE_LAPACK=ON \
                        -DENABLE_SCALAPACK=OFF \
                        -DENABLE_PLASMA=OFF \
                        -DENABLE_HDF5=OFF \
                        -DENABLE_MEMKIND=ON \
                        \
                        -DBUILD_EXAMPLES=OFF \
                        -DBUILD_TESTS=ON \
                        -DBUILD_DOCS=OFF \
                        \
                        -DIPM_PREFIX=${IPM_HOME} \
                        -DPAPI_PREFIX=${PAPI_HOME} \
                        \
                        -DGTEST_LIBRARY_PATH=${HOME}/opt/gtest.clang/lib \
                        -DGTEST_INCLUDE_PATH=${HOME}/opt/gtest.clang/include \
                        \
                        -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
                        ../ && \

@rkowalewski
Author

GDB backtrace:


Thread 1 "dash-test-mpi" received signal SIGABRT, Aborted.
0x00007f64b1f75428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007f64b1f75428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f64b1f7702a in __GI_abort () at abort.c:89
#2  0x00007f64b1fb77ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f64b20d0ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007f64b1fc037a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7f64b20d0f50 "free(): invalid next size (fast)", action=3) at malloc.c:5006
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867
#5  0x00007f64b1fc453c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#6  0x00007f64a436abda in ?? () from /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so
#7  0x00007f64a4de3252 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so
#8  0x00007f64a5423a0e in mca_btl_vader_poll_handle_frag () from /usr/lib/openmpi/lib/openmpi/mca_btl_vader.so
#9  0x00007f64a5423abe in ?? () from /usr/lib/openmpi/lib/openmpi/mca_btl_vader.so
#10 0x00007f64b0f611ea in opal_progress () from /usr/lib/libopen-pal.so.13
#11 0x00007f64b304ff65 in ompi_request_default_wait_all () from /usr/lib/libmpi.so.12
#12 0x00007f64a3d0edcd in ompi_coll_tuned_barrier_intra_two_procs () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
#13 0x00007f64b3061c02 in PMPI_Barrier () from /usr/lib/libmpi.so.12
#14 0x0000000000e5127e in dart_barrier (teamid=0) at /home/kowalewski/workspaces/dash-project/dash-development/dart-impl/mpi/src/dart_communication.c:1741
#15 0x0000000000703eb1 in dash::Team::barrier (this=0x1247ec0 <dash::Team::_team_all>) at /home/kowalewski/workspaces/dash-project/dash-development/dash/include/dash/Team.h:464
#16 0x00000000008b5b5c in dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> >::barrier (this=0x7fffde36d1a8) at /home/kowalewski/workspaces/dash-project/dash-development/dash/include/dash/matrix/internal/Matrix-inl.h:352
#17 0x00000000008bddba in dash::summa<dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> > > (A=..., B=..., C=...)
    at /home/kowalewski/workspaces/dash-project/dash-development/dash/include/dash/algorithm/SUMMA.h:586
#18 0x00000000008885e5 in dash::mmult<dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::SeqTilePattern<2, (dash::MemArrange)1, long> > > (A=..., B=..., C=...)
    at /home/kowalewski/workspaces/dash-project/dash-development/dash/include/dash/algorithm/SUMMA.h:636
#19 0x000000000088693c in SUMMATest_SeqTilePatternMatrix_Test::TestBody (this=0x2817b80) at /home/kowalewski/workspaces/dash-project/dash-development/dash/test/algorithm/SUMMATest.cc:249
#20 0x0000000000de8f8e in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#21 0x0000000000dcc94b in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#22 0x0000000000da250c in testing::Test::Run() ()
#23 0x0000000000da37c0 in testing::TestInfo::Run() ()
#24 0x0000000000da44cc in testing::TestCase::Run() ()
#25 0x0000000000db12a1 in testing::internal::UnitTestImpl::RunAllTests() ()
#26 0x0000000000dec5be in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#27 0x0000000000dceebb in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#28 0x0000000000db0e8a in testing::UnitTest::Run() ()
#29 0x00000000006f8c01 in RUN_ALL_TESTS () at /home/kowalewski/opt/gtest.clang/include/gtest/gtest.h:2233
#30 0x00000000006f8201 in main (argc=1, argv=0x7fffde36ea88) at /home/kowalewski/workspaces/dash-project/dash-development/dash/test/main.cc:74
(gdb)

@devreal
Member

devreal commented Feb 17, 2018

Still no luck. I'm pretty certain that the SUMMATest runs, though (takes approx. 15s).

What version of Open MPI are you using? (it looks like it's a system installation so it might be hopelessly out of date...) Single or multiple node? What does Valgrind report?

@devreal
Member

devreal commented Mar 7, 2018

Just ran on our Linux cluster, seeing the following OOB exception when running with 2 units:

[=  0 LOG =]               TestBase.h : 254 | -==- DASH initialized with 2 units 
[=  0 LOG =]             SUMMATest.cc :  54 | Initialize matrix pattern ... 
[=  0 LOG =]             SUMMATest.cc :  64 | SizeSpec(40,40) TeamSpec(2,1) 
[=  0 LOG =]             SUMMATest.cc :  89 | Deduced pattern: dash::TilePattern<2, (dash::MemArrange)1, long> size(40,40) tilesize(20,40) teamsize(2,1) disttype(5,5) 
[=  0 LOG =]             SUMMATest.cc : 116 | Initialize matrix instances ... 
[=  0 LOG =]             SUMMATest.cc : 121 | Starting initialization of matrix values 
[=  0 LOG =]             SUMMATest.cc : 146 | Waiting for initialization of matrices ... 
[=  0 LOG =]             SUMMATest.cc : 150 | Calling dash::mmult ... 
[    0 ERROR ] [ 28666 ] Cartesian.h              :458  | dash::exception::OutOfRange                  | [ Unit 0 ] Range assertion 0 <= 1 <= 0 failed: Given coordinate for CartesianIndexSpace::at() exceeds extent /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/Cartesian.h:458 
terminate called after throwing an instance of 'dash::exception::OutOfRange'
  what():  [ Unit 0 ] Range assertion 0 <= 1 <= 0 failed: Given coordinate for CartesianIndexSpace::at() exceeds extent /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/Cartesian.h:458

Program received signal SIGABRT, Aborted.
0x000000330aa32495 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.x86_64 hwloc-1.5-3.el6_5.x86_64 libcxgb3-1.3.1-3.el6.x86_64 libibverbs-1.1.8-4.el6.x86_64 libipathverbs-1.3-3.el6_5.x86_64 libmlx4-1.0.6-7.el6.x86_64 libmthca-1.0.6-4.el6.x86_64 libnl-1.1.4-2.el6.x86_64 libpciaccess-0.13.4-1.el6.x86_64 librdmacm-1.0.21-0.el6.x86_64 libudev-147-2.73.el6_8.2.x86_64 libxml2-2.7.6-21.el6_8.1.x86_64 lustre-client-2.7.19.11-2.6.32_696.18.7.el6.x86_64_gb187bfd.x86_64 numactl-2.0.9-2.el6.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000330aa32495 in raise () from /lib64/libc.so.6
#1  0x000000330aa33c75 in abort () from /lib64/libc.so.6
#2  0x00002aaaac1ce425 in __gnu_cxx::__verbose_terminate_handler() ()
    at /lustre/nec/ws2/ws/hpcoftet-gcc-7.1-install/gcc-7.1.0/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x00002aaaac1cc1d6 in __cxxabiv1::__terminate(void (*)()) ()
    at /lustre/nec/ws2/ws/hpcoftet-gcc-7.1-install/gcc-7.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4  0x00002aaaac1cc221 in std::terminate() ()
    at /lustre/nec/ws2/ws/hpcoftet-gcc-7.1-install/gcc-7.1.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5  0x00002aaaac1cc464 in __cxa_throw ()
    at /lustre/nec/ws2/ws/hpcoftet-gcc-7.1-install/gcc-7.1.0/libstdc++-v3/libsupc++/eh_throw.cc:93
#6  0x000000000081f2eb in long dash::CartesianIndexSpace<2, (dash::MemArrange)1, long>::at<(dash::MemArrange)1, long>(std::array<long, 2ul> const&) const ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/Cartesian.h:454
#7  0x000000000087b78c in dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >::block(std::array<long, 2ul> const&) ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/matrix/internal/Matrix-inl.h:159
#8  0x00000000008768ee in void dash::summa<dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> > >(dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >&, dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >&, dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >&) ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/algorithm/SUMMA.h:471
#9  0x0000000000873abf in dash::mmult<dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> >, dash::Matrix<double, 2, long, dash::TilePattern<2, (dash::MemArrange)1, long> > > ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/include/dash/algorithm/SUMMA.h:636
#10 0x0000000000871787 in SUMMATest_Deduction_Test::TestBody() ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/test/algorithm/SUMMATest.cc:151
#11 0x0000000000aae09b in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#12 0x0000000000a94cb8 in testing::Test::Run() ()
#13 0x0000000000a95558 in testing::TestInfo::Run() ()
#14 0x0000000000a95b7d in testing::TestCase::Run() ()
#15 0x0000000000a9c570 in testing::internal::UnitTestImpl::RunAllTests() ()
#16 0x0000000000aaec69 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#17 0x0000000000a9b1d0 in testing::UnitTest::Run() ()
#18 0x00000000007db1b7 in RUN_ALL_TESTS() ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/vendor/googletest/googletest/include/gtest/gtest.h:2235
#19 0x00000000007dae60 in main ()
    at /zhome/academic/HLRS/hlrs/hpcjschu/src/dash/dash-development/dash/test/main.cc:74

Fun fact: it works with 4 and 16 processes but not with 6 or 8. I suspect this is a bug in the data distribution or SUMMA (as it works in cases in which there is an equal number of processes in both dimensions). Not gonna touch that...

@fuchsto
Member

fuchsto commented May 15, 2018

Does it work with a previous state of DASH? Esp. prior to changes on GlobRef and the like?
Edit: @fuerlinger That's what I was referring to

@fuchsto
Member

fuchsto commented May 15, 2018

... ah, and: There is a comprehensive performance evaluation of dash::summa using arbitrary numbers of units; test results, job setups and environment configs are documented in the wiki.
So: This used to work.

@rkowalewski
Author

I cannot really reproduce this issue and I would attribute it to an old Open MPI version (2.1.0). With versions 2.1.0 and 3.1.0 everything works as expected.
