
When a GPU cannot initialize (OOM), per-stream infos need cleanup #636

@abouteiller

If I simulate being unable to allocate memory on the device, both for data and for streams, I get the following stack trace:
#5  0x00007ffff7e57995 in parsec_list_destruct (list=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_list.c:45
#6  0x00007ffff7e5bdaa in parsec_obj_run_destructors (object=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#7  0x00007ffff7e5c102 in parsec_info_destructor (obj=0x7ffff7fbd260 <parsec_per_stream_infos>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/info.c:34
#8  0x00007ffff7eb0ceb in parsec_obj_run_destructors (object=0x7ffff7fbd260 <parsec_per_stream_infos>)
    at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#9  0x00007ffff7eb35bd in parsec_mca_device_fini () at /home/bosilca/unstable/parsec/parsec/parsec/mca/device/device.c:572
#10 0x00007ffff7e764d0 in parsec_fini (pcontext=0x7fffffff49a0) at /home/bosilca/unstable/parsec/parsec/parsec/parsec.c:1235
#11 0x000000000040374f in main (argc=1, argv=0x7fffffff4b38)
    at /home/bosilca/unstable/parsec/parsec/tests/dsl/dtd/dtd_test_allreduce.c:237

The issue seems to occur during the release of parsec_per_stream_infos, because some infos are still registered inside it. The CUDA code itself behaves well: the devices that fail to allocate memory are removed, and execution proceeds without them.

Originally posted by @bosilca in #630 (comment)
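
To illustrate the failure mode described above, here is a minimal sketch in plain C. It is a toy model, not PaRSEC code: `toy_registry_t`, `toy_device_t`, and every `toy_*` function are hypothetical names, and the assumption that the registry can only be torn down once it is empty is inferred from the report, not confirmed against the PaRSEC sources. The point it shows is that a device whose initialization fails should unregister whatever it registered before it is dropped, so the later teardown finds nothing left behind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an info registry such as
 * parsec_per_stream_infos: it only counts live registrations. */
typedef struct {
    int nb_registered;
} toy_registry_t;

/* Hypothetical stand-in for one GPU device. */
typedef struct {
    toy_registry_t *registry;
    bool            info_registered;
    void           *memory;
} toy_device_t;

static void toy_info_register(toy_registry_t *r)   { r->nb_registered++; }
static void toy_info_unregister(toy_registry_t *r) { r->nb_registered--; }

/* Undo whatever toy_device_init() managed to set up.
 * Safe to call on a partially initialized device. */
static void toy_device_fini(toy_device_t *d)
{
    free(d->memory);
    d->memory = NULL;
    if (d->info_registered) {
        toy_info_unregister(d->registry);
        d->info_registered = false;
    }
}

/* Returns 0 on success, -1 if the (simulated) allocation fails. */
static int toy_device_init(toy_device_t *d, toy_registry_t *r, bool simulate_oom)
{
    d->registry = r;
    d->memory = NULL;
    d->info_registered = false;

    toy_info_register(r);
    d->info_registered = true;

    d->memory = simulate_oom ? NULL : malloc(64);
    if (NULL == d->memory) {
        /* The cleanup step the failure path must not skip. */
        toy_device_fini(d);
        return -1;
    }
    return 0;
}

/* Assumed precondition: the registry may only be destroyed
 * when no registration is left inside it. */
static bool toy_registry_can_destruct(const toy_registry_t *r)
{
    return 0 == r->nb_registered;
}
```

In this model, a device initialized with `simulate_oom` set to true leaves the registry count unchanged, so once the remaining healthy devices are finalized `toy_registry_can_destruct()` returns true. Removing the `toy_device_fini(d)` call from the failure path leaves one stale registration behind, which is the situation the stack trace above appears to reflect.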

Labels: bug
