With a debug build of HugeCTR, the core dump provides more information:
(gdb)
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=139851934344768) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=139851934344768) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=139851934344768, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f3280642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f32806287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007f3262ead42a in __gnu_cxx::__verbose_terminate_handler() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f3262eab20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007f3262eaa1e9 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8 0x00007f3262eaa959 in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x00007f32804f6884 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#10 0x00007f32804f6f41 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007f3262eab4cb in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f30c448aa91 in HugeCTR::GPUResource::set_stream (this=0x7f2fa009e020, name="default", priority=0) at /hugectr/HugeCTR/include/gpu_resource.hpp:80
#13 0x00007f30c4df1751 in HugeCTR::StreamContext::~StreamContext (this=0x7f31d0de1ce0, __in_chrg=<optimized out>) at /hugectr/HugeCTR/include/gpu_resource.hpp:116
#14 0x00007f30c4df05f1 in HugeCTR::StreamContextScheduleable::run (this=0x7ef9f3abe4f0, gpu=std::shared_ptr<HugeCTR::GPUResource> (use count 28, weak count 0) = {...}, use_graph=true) at /hugectr/HugeCTR/src/pipeline.cpp:102
#15 0x00007f30c4df0fdc in HugeCTR::Pipeline::run_graph (this=0x55ae7a79b6e8) at /hugectr/HugeCTR/src/pipeline.cpp:152
#16 0x00007f30c4f208ef in _ZN7HugeCTR5Model23train_pipeline_with_ebcEv._omp_fn.1(void) () at /hugectr/HugeCTR/src/pybind/model_pipeline.cpp:468
#17 0x00007f3263cf7c0e in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:129
#18 0x00007f3280694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#19 0x00007f3280726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
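One possible reading of frames #11 to #13: `GPUResource::set_stream` throws while it is running inside `~StreamContext`. Since C++11 a destructor is implicitly `noexcept` unless declared otherwise, so an exception escaping it ends in `std::terminate`, which would match the `abort` at the top of the trace. This does not explain why `set_stream` fails on L20/H20. The sketch below only illustrates the mechanism and one way to contain such a failure; `FakeResource` and `ScopedStream` are hypothetical stand-ins written for this example, not HugeCTR classes.

```cpp
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a resource that owns named streams.
class FakeResource {
 public:
  FakeResource() : current_("default") { streams_["default"] = 0; }

  void add_stream(const std::string& name) { streams_[name] = next_id_++; }
  void remove_stream(const std::string& name) { streams_.erase(name); }

  // Throws when the stream is unknown, analogous to the throwing frame #12.
  void set_stream(const std::string& name) {
    if (streams_.find(name) == streams_.end()) {
      throw std::runtime_error("unknown stream: " + name);
    }
    current_ = name;
  }

  const std::string& current() const { return current_; }

 private:
  std::map<std::string, int> streams_;
  std::string current_;
  int next_id_ = 1;
};

// Hypothetical RAII guard: switches streams and restores the previous one.
class ScopedStream {
 public:
  ScopedStream(FakeResource& res, const std::string& name)
      : res_(res), previous_(res.current()) {
    res_.set_stream(name);
  }

  // Catching here keeps the exception from escaping the implicitly
  // noexcept destructor, which would otherwise call std::terminate.
  ~ScopedStream() {
    try {
      res_.set_stream(previous_);
    } catch (const std::exception& e) {
      std::cerr << "failed to restore stream: " << e.what() << '\n';
    }
  }

  ScopedStream(const ScopedStream&) = delete;
  ScopedStream& operator=(const ScopedStream&) = delete;

 private:
  FakeResource& res_;
  std::string previous_;
};

// The destructor is noexcept even though nothing in the class says so.
static_assert(std::is_nothrow_destructible<ScopedStream>::value,
              "destructors are implicitly noexcept");

// Forces the restore step to fail and reports whether execution continued.
bool restore_failure_is_contained() {
  FakeResource res;
  res.add_stream("side");
  {
    ScopedStream guard(res, "side");
    res.remove_stream("default");  // restoring "default" will now throw
  }
  return res.current() == "side";  // reached only if no terminate occurred
}
```

Calling `restore_failure_is_contained()` from a small `main` logs the failure and returns instead of aborting. Removing the `try`/`catch` from the destructor makes the same call end in `std::terminate`, similar to the trace above.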
With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 and cuda-gdb, more information can be provided
Describe the bug
The process aborts with a core dump when benchmarking samples/dlrm on L20/H20 GPUs.
To Reproduce
Steps to reproduce the behavior:
cd /workspace/samples/dlrm
pip install -r requirements.txt
python train.py
Expected behavior
Running train.py without any exception.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 and cuda-gdb, more information can be provided.