dask-mpi not using timeout values set via config.yaml or dask environment variables #82
Comments
@lastephey: Thanks for the contribution! I no longer have an account on NERSC systems, nor any hours to spend on the NERSC machines to debug. There may be a way of reproducing this error on a smaller system, but I'm not sure what that would look like.
Hi @kmpaul, I am happy to get you a NERSC account if you'd like to do some testing, so please let me know. Can you clarify whether dask-mpi is expected to support these configuration settings? Thank you,
Thanks, @lastephey! I might take you up on that, but let's see if we can drill down a bit before taking that measure. I'm not aware of anything in dask-mpi that would interfere with the normal Dask configuration system.
@lastephey: Before going too far down the road, can you test out the latest version of dask-mpi?
Latest version should be
Sure, I'll see if I can reproduce with the latest version.
Hi @kmpaul, sorry for the delay. I re-tried this with
and ran
I still saw the same timeout failures during cluster startup. I tried to verify that the environment variable settings I used to increase the timeout were being set:
so at least Dask is ingesting them, although I don't know whether these are actually the correct settings or whether they are actually being used. I think 1300 workers may be hard or impossible on Cori, since our network interconnect (Aries) just doesn't handle TCP very well. It would still be nice to be able to increase this timeout, though, to try to accommodate bigger pools of workers.
Are you able to connect a client to the running cluster? If so you could run:

>>> import dask
>>> from dask.distributed import Client
>>> client = Client(scheduler_file="scheduler.json")
>>> print(client.run(lambda: dask.config.config))
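(For context: client.run executes the given function on every connected worker and returns a dict keyed by worker address, so this prints each worker's view of the Dask config.)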
Hi everyone, I'm sorry for my very slow response here. Using @jacobtomlinson's suggestion, I can confirm that the settings are making their way to the workers:
So that's good. I did try again to start the original cluster with
but I still saw quite a few of the same timeout errors
about 1-2 minutes after I launched the cluster, so it still doesn't seem to be obeying my 1000s timeout request. Thank you for your help,
Ah, I see the reason the timeout is being ignored. The env vars should be:

export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=1000s
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=1000s

Note the double underscores between each word. There is a utility on the docs page that helps generate env vars, and you can check that it will create the right config nesting. It can be helpful because the Dask config system can be a little finicky to work out.
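As an illustration (a sketch, not from the original thread), each double underscore becomes one level of nesting in the Dask config, which can be verified locally with dask.config:

import os
import dask

# Set the corrected variables (the 1000s values are the ones discussed above).
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = "1000s"
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP"] = "1000s"

# Re-read environment variables into the in-memory config.
dask.config.refresh()

# Each "__" maps to one level of nesting: distributed -> comm -> timeouts -> connect
print(dask.config.get("distributed.comm.timeouts.connect"))  # expected: 1000s
print(dask.config.get("distributed.comm.timeouts.tcp"))      # expected: 1000s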
Ah, good catch @jacobtomlinson! And thanks for the pointer to the utility in the Dask docs. I'll test and report back.
Well, this sounded promising but unfortunately I still don't think it's working. I tried to start up my 1300 worker test cluster but still saw timeouts after 1-2 minutes, so I went back to a 2 node cluster for testing. I started my 2 node test cluster up via:
making sure to use the correct double underscore variables. I used @jacobtomlinson's trick to see if the config had propagated to the workers:
I can see that these appear at the top of the dict, which suggests to me that Dask is aware I have tried to set these values. They are still at their default values here, though, so I am confused. Thanks again for your help,
I'm not sure. It seems the config isn't being propagated to the workers correctly. I don't know enough about SLURM, but would you expect all local environment variables to be propagated to the tasks in the cluster? I know in other Dask projects we manually bundle up the local config and pass it along to the worker processes, but I don't think dask-mpi does that.

This seems like a good thing to fix, but I also wonder if there is a way we can work around this for you in the meantime. Do you know what the filesystem on the remote nodes looks like? Does it have your home directory? If so you could drop a YAML file into ~/.config/dask/.

Alternatively you could use the Python API instead of the CLI to launch your cluster and set your config within the script.
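A rough sketch of those two workarounds (illustrative only; the file name timeouts.yaml and the 1000s values are assumptions, not from the thread):

import pathlib
import dask

# Workaround 1: write a YAML file into ~/.config/dask/ on the shared
# filesystem so every process that imports dask picks it up.
config_dir = pathlib.Path.home() / ".config" / "dask"
config_dir.mkdir(parents=True, exist_ok=True)
(config_dir / "timeouts.yaml").write_text(
    "distributed:\n"
    "  comm:\n"
    "    timeouts:\n"
    "      connect: 1000s\n"
    "      tcp: 1000s\n"
)

# Workaround 2: set the config programmatically inside the launch script
# (see the Python API example further down in the thread).
dask.config.set({
    "distributed.comm.timeouts.connect": "1000s",
    "distributed.comm.timeouts.tcp": "1000s",
})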
Thanks @jacobtomlinson. As far as I know, Slurm copies all environment variable settings to our compute nodes. Here is a tiny demo:
so I think those settings should be available to the workers. I did try to test your other suggestions, too. We do have a shared filesystem that is mounted on all compute nodes, so I put the following into my ~/.config/dask/config.yaml:
I used this setup to start a 2 node cluster, but when I checked, I didn't see the settings present in the worker config. They were still the default values:
Finally I tried the API, although I haven't used it before and I'm not sure I'm doing it right. I started this script via
but I don't see any output.
Could you try that approach? Your API example looks good; you should be able to set the Dask config in there with dask.config.set.
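For reference, a minimal sketch of that approach (assumptions: dask_mpi.initialize() with default options and a toy config check; this is not the poster's actual script):

import dask
from dask.distributed import Client
from dask_mpi import initialize

# Set the timeouts before the scheduler and workers are created; every MPI
# rank runs this script, so each process sees the same config.
dask.config.set({
    "distributed.comm.timeouts.connect": "1000s",
    "distributed.comm.timeouts.tcp": "1000s",
})

# Rank 0 becomes the scheduler, rank 1 continues as this client code,
# and the remaining MPI ranks become workers.
initialize()

# Client() with no arguments connects to the scheduler started by initialize().
client = Client()

# Confirm the setting actually reached the workers.
print(client.run(lambda: dask.config.get("distributed.comm.timeouts.connect")))

The script would be launched under MPI (e.g. with mpirun or srun); the exact invocation depends on the site.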
Ah yes, with your suggestions @jacobtomlinson the API does seem to work and the settings appear to be used, so thank you very much.
And in the cluster output I see
so that's good! Then I went to start up my 1300-worker cluster using the API, but unfortunately I still started to see the same timeout messages
I am baffled as to why it still seems to be using a 60s timeout limit.
There is an ongoing discussion in dask/distributed#3691 about scaling to large numbers of workers on HPC clusters. Folks there may have run into some of these same issues. Have you seen that issue?
Thanks @jacobtomlinson. I had seen it a while ago, but I went back and read it again. I tried the Client.wait_for_workers suggestion there and I think that's helpful. When I tried to launch our big 1300-worker test with
and
I took ORNL's advice and started up my client while the workers were still coming alive (although I'm not sure if the timing actually matters). I did still see a lot of the same timeout messages while the workers were starting, but in the end all the workers arrived and I was able to complete a toy calculation with all 1300 workers. Can you give any insight into those timeout messages? Thank you very much.
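A sketch of the Client.wait_for_workers pattern being described (the scheduler file name and the worker count are assumptions):

from dask.distributed import Client

# Connect to the scheduler started by dask-mpi while workers are still arriving.
client = Client(scheduler_file="scheduler.json")

# Block until the full pool has registered with the scheduler.
client.wait_for_workers(n_workers=1300)

print(len(client.scheduler_info()["workers"]), "workers connected")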
There should be some retries built in there. Given the log is only at the info level, it's probably nothing to worry about.
I see, that is very helpful to know. When I saw those messages I assumed the workers were failing, but it sounds like after a retry or two it was fine. Since I have a workaround for now, do you think I should close this issue? Or do you want to look into adjusting the config settings further? Either way I am very grateful for all your help with this @kmpaul and @jacobtomlinson.
I think thanks go mostly to @jacobtomlinson. 😄 Excellent finds with the issue on dask/distributed!
(Oh, and I'm happy to close this issue for now, if you want. We can always reopen it.)
What happened:
I was attempting to start a dask-mpi cluster on 20 (admittedly slow) 68-core Intel KNL nodes (1320 workers, each with a single thread). I observed the scheduler start and the workers attempt to start and connect, but eventually fail with messages like
Knowing that KNL has a slow clock speed of 1.4 GHz, I attempted to increase the timeout by setting values in my
~/.config/dask/config.yaml
file as recommended in the Dask docs. I also attempted to set environment variables via export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=240s. I tried a very extreme case where I set
but I still saw timeout failures within a minute or two while dask-mpi attempted to start my cluster, so based on that it seems like dask-mpi is not respecting these values.
What you expected to happen:
I would like/expect dask-mpi to use the configuration options advertised in the dask docs like
~/.config/dask/config.yaml
. It's not clear to me if it does or should. Whatever the outcome of this issue is, it would be helpful to add a note to the docs about whether dask-mpi does support these configuration options.

Minimal Complete Verifiable Example:
I launched my dask-mpi cluster on NERSC's Cori system with the following commands inside a custom conda environment:
After 1-2 minutes, I saw many timeout messages:
Anything else we need to know?:
Environment:
conda
Thank you very much,
Laurie