
dask-mpi not using timeout values set via config.yaml or dask environment variables #82

Closed
@lastephey

Description


What happened:

I was attempting to start a dask-mpi cluster on 20 (admittedly slow) 68-core Intel KNL nodes (1320 workers, each with a single thread). I observed the scheduler start and the workers attempt to start and connect, but they eventually failed with messages like

distributed.comm.tcp - INFO - Connection from tcp://10.128.9.135:41230 closed before handshake completed

Knowing that KNL has a slow clock speed of 1.4 GHz, I attempted to increase the timeout by setting values in my ~/.config/dask/config.yaml file as recommended in the Dask docs. I also attempted to set environment variables via export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=240s.

I tried a very extreme case where I set

export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=1000s
export DASK_DISTRIBUTED__COMM_TIMEOUTS_TCP=1000s 

but I still saw timeout failures within a minute or two while dask-mpi attempted to start my cluster. Based on that, it seems dask-mpi is not respecting these values.
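
For reference, here is a small check I can run in the same conda environment to see which timeout values Dask actually resolves (a minimal sketch using dask.config.get with the standard distributed.comm.timeouts.* keys):

# check_config.py -- print the timeouts Dask resolves from config.yaml / environment variables
import dask

print("connect timeout:", dask.config.get("distributed.comm.timeouts.connect"))
print("tcp timeout:", dask.config.get("distributed.comm.timeouts.tcp"))

If these print the defaults rather than 1000s, the settings are not reaching Dask at all, independently of dask-mpi.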

What you expected to happen:

I would expect dask-mpi to honor the configuration options advertised in the Dask docs, such as values set in ~/.config/dask/config.yaml. It's not clear to me whether it does or should. Whatever the outcome of this issue, it would be helpful to add a note to the docs stating whether dask-mpi supports these configuration options.
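
As a possible workaround (an untested sketch, in case only the CLI path is affected), I could switch from the dask-mpi command line to the programmatic dask_mpi.initialize() interface and set the timeouts in the script before the cluster is created:

# workaround sketch: set timeouts programmatically before dask-mpi starts anything
import dask
from dask_mpi import initialize
from distributed import Client

dask.config.set({
    "distributed.comm.timeouts.connect": "1000s",
    "distributed.comm.timeouts.tcp": "1000s",
})

initialize(nthreads=1, nanny=False, local_directory="/tmp")  # rank 0 becomes the scheduler
client = Client()  # connects to the scheduler started by initialize()

Even so, it would be good to know whether the CLI is supposed to pick up config.yaml and the environment variables.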

Minimal Complete Verifiable Example:

I launched my dask-mpi cluster on NERSC's Cori system with the following commands inside a custom conda environment:

salloc --nodes=20 --ntasks=1360 --cpus-per-task=1 --time=240  --constraint=knl --qos=interactive
export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=1000s
export DASK_DISTRIBUTED__COMM_TIMEOUTS_TCP=1000s 
srun -u dask-mpi --scheduler-file=scheduler.json --dashboard-address=0 --memory-limit="1.2 GiB" --nthreads=1 --no-nanny --local-directory=/tmp

After 1-2 minutes, I saw many timeout messages:

distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.121:44235'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.113:41129'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.126:37758'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.123:45557'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.116:58355'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.129:35394'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.126:58587'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.113:57881'

Anything else we need to know?:
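
In case it helps with debugging, here is a quick diagnostic I could run under srun to check whether the environment variables actually reach every MPI rank on the compute nodes (assumes mpi4py, which dask-mpi already requires):

# check_timeouts.py -- print the timeout each MPI rank resolves
import dask
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
connect = dask.config.get("distributed.comm.timeouts.connect")
tcp = dask.config.get("distributed.comm.timeouts.tcp")
print(f"rank {rank}: connect={connect} tcp={tcp}", flush=True)

run with something like srun -u python check_timeouts.py.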

Environment:

  • Dask version:
conda list | grep "dask"
dask                      2021.10.0          pyhd3eb1b0_0  
dask-core                 2021.10.0          pyhd3eb1b0_0  
dask-mpi                  2.21.0           py38h4ecba47_2    conda-forge
  • Python version:
python --version
Python 3.8.8
  • Operating System:
cat /etc/os-release
NAME="SLES"
VERSION="15"
VERSION_ID="15"
PRETTY_NAME="SUSE Linux Enterprise Server 15"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15"
  • Install method (conda, pip, source): conda

Thank you very much,
Laurie
