Description
What happened:
I was attempting to start a dask-mpi cluster on 20 (admittedly slow) 68-core Intel KNL nodes (1320 workers, each with a single thread). I observed the scheduler start and the workers attempt to start and connect, but they eventually failed with messages like:
distributed.comm.tcp - INFO - Connection from tcp://10.128.9.135:41230 closed before handshake completed
Knowing that KNL has a slow clock speed of 1.4 GHz, I attempted to increase the timeout by setting values in my ~/.config/dask/config.yaml file, as recommended in the Dask docs. I also attempted to set environment variables via export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=240s.
I tried a very extreme case where I set
export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=1000s
export DASK_DISTRIBUTED__COMM_TIMEOUTS_TCP=1000s
but I still saw timeout failures within a minute or two while dask-mpi attempted to start my cluster, which suggests that dask-mpi is not respecting these values.
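One thing that may be worth checking is how these variable names map onto configuration keys. The sketch below assumes the rule described in the Dask configuration docs: the DASK_ prefix is stripped, the name is lowercased, and double underscores denote nesting. The helper env_to_key is hypothetical and written only for illustration; it is not part of Dask, and it does not read or change any real configuration.

```python
def env_to_key(name: str) -> str:
    """Hypothetical helper (not part of Dask): map an env var name to a config key.

    Assumes the documented rule: strip the DASK_ prefix, lowercase the rest,
    and treat only double underscores as nesting separators.
    """
    prefix = "DASK_"
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    return name[len(prefix):].lower().replace("__", ".")


for name in (
    "DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT",    # spelling exported above
    "DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT",  # double underscores throughout
):
    print(f"{name} -> {env_to_key(name)}")
```

If that reading of the docs is right, the names I exported map to distributed.comm_timeouts_connect rather than the nested distributed.comm.timeouts.connect key, so the spelling of the variables may be part of the problem.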
What you expected to happen:
I would like/expect dask-mpi to use the configuration options advertised in the Dask docs, such as those set in ~/.config/dask/config.yaml. It's not clear to me whether it does or should. Whatever the outcome of this issue, it would be helpful to add a note to the docs about whether dask-mpi supports these configuration options.
Minimal Complete Verifiable Example:
I launched my dask-mpi cluster on NERSC's Cori system with the following commands inside a custom conda environment:
salloc --nodes=20 --ntasks=1360 --cpus-per-task=1 --time=240 --constraint=knl --qos=interactive
export OMP_NUM_THREADS=1
export PYTHONUNBUFFERED=1
export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=1000s
export DASK_DISTRIBUTED__COMM_TIMEOUTS_TCP=1000s
srun -u dask-mpi --scheduler-file=scheduler.json --dashboard-address=0 --memory-limit="1.2 GiB" --nthreads=1 --no-nanny --local-directory=/tmp
After 1-2 minutes, I saw many timeout messages:
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.121:44235'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.113:41129'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.126:37758'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.123:45557'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.116:58355'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.129:35394'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.126:58587'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.9.113:57881'
Anything else we need to know?:
Environment:
- Dask version:
conda list | grep "dask"
dask 2021.10.0 pyhd3eb1b0_0
dask-core 2021.10.0 pyhd3eb1b0_0
dask-mpi 2.21.0 py38h4ecba47_2 conda-forge
- Python version:
python --version
Python 3.8.8
- Operating System:
cat /etc/os-release
NAME="SLES"
VERSION="15"
VERSION_ID="15"
PRETTY_NAME="SUSE Linux Enterprise Server 15"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15"
- Install method (conda, pip, source):
conda
Thank you very much,
Laurie