dask-mpi not using timeout values set via config.yaml or dask environment variables #82
Comments
@lastephey: Thanks for the contribution! I no longer have an account on NERSC systems, nor any hours to spend on the NERSC machines to debug. There may be a way of reproducing this error on a smaller system, but I'm not sure what that would look like.
Hi @kmpaul, I am happy to get you a NERSC account if you'd like to do some testing, so please let me know. Can you clarify whether dask-mpi is expected to support these configuration settings? Thank you,
Thanks, @lastephey! I might take you up on that, but let's see if we can drill down a bit before taking that measure. I'm not aware of anything in dask-mpi that would interfere with the normal Dask configuration system.
@lastephey: Before going too far down the road, can you test out the latest version of dask-mpi?
Latest version should be
Sure, I'll see if I can reproduce with the latest version.
Hi @kmpaul, sorry for the delay. I re-tried this with
and ran
I still saw the same timeout failures during cluster startup. I tried to verify that the environment variable settings I used to increase the timeout were being set:
so at least Dask is ingesting them, although I don't know whether these are actually the correct settings or whether they are actually being used. I think 1300 workers may be hard or impossible on Cori, since our network interconnect (Aries) just doesn't handle TCP very well. It would still be nice to be able to increase this timeout, though, to try to accommodate bigger pools of workers.
Are you able to connect a client to the running cluster? If so you could run:

>>> import dask
>>> from dask.distributed import Client
>>> client = Client(scheduler_file="scheduler.json")
>>> print(client.run(lambda: dask.config.config))
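(For context: client.run executes the given function on every connected worker and returns a dict keyed by worker address, so this prints each worker's view of the Dask config.)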
Hi everyone, I'm sorry for my very slow response here. Using @jacobtomlinson's suggestion, I can confirm that the settings are making their way to the workers:
So that's good. I did try again to start the original cluster with
but I still saw quite a few of the same timeout errors
about 1-2 minutes after I launched the cluster, so it still doesn't seem to be obeying my 1000s timeout request. Thank you for your help,
Ah, I see the reason the timeout is being ignored. The env vars should be:

export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=1000s
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=1000s

Note the double underscores between each word. There is a utility on the docs page that helps generate env vars, and you can check that it will create the right config nesting. It can be helpful because the Dask config system can be a little finicky to work out.
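As an illustration (a sketch, not from the original thread), each double underscore becomes one level of nesting in the Dask config, which can be verified locally with dask.config:

import os
import dask

# Set the corrected variables (the 1000s values are the ones discussed above).
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"] = "1000s"
os.environ["DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP"] = "1000s"

# Re-read environment variables into the in-memory config.
dask.config.refresh()

# Each "__" maps to one level of nesting: distributed -> comm -> timeouts -> connect
print(dask.config.get("distributed.comm.timeouts.connect"))  # expected: 1000s
print(dask.config.get("distributed.comm.timeouts.tcp"))      # expected: 1000s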
Ah, good catch @jacobtomlinson! And thanks for the pointer to the utility in the Dask docs. I'll test and report back.
Well, this sounded promising but unfortunately I still don't think it's working. I tried to start up my 1300 worker test cluster but still saw timeouts after 1-2 minutes, so I went back to a 2 node cluster for testing. I started my 2 node test cluster up via:
making sure to use the correct double underscore variables. I used @jacobtomlinson's trick to see if the config had propagated to the workers:
I can see that these appear at the top of the dict, which suggests to me that Dask is aware I have tried to set these values. They are still at their default values here, though, so I am confused. Thanks again for your help,
I'm not sure. It seems the config isn't being propagated to the workers correctly. I don't know enough about SLURM, but would you expect all local environment variables to be propagated to the tasks in the cluster? I know in other Dask projects we manually bundle up the local config and pass it along to the worker processes, but I don't think dask-mpi does that.

This seems like a good thing to fix, but I also wonder if there is a way we can work around this for you in the meantime. Do you know what the filesystem on the remote nodes looks like? Does it have your home directory? If so you could drop a YAML file into ~/.config/dask/.

Alternatively you could use the Python API instead of the CLI to launch your cluster and set your config within the script.
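A rough sketch of those two workarounds (illustrative only; the file name timeouts.yaml and the 1000s values are assumptions, not from the thread):

import pathlib
import dask

# Workaround 1: write a YAML file into ~/.config/dask/ on the shared
# filesystem so every process that imports dask picks it up.
config_dir = pathlib.Path.home() / ".config" / "dask"
config_dir.mkdir(parents=True, exist_ok=True)
(config_dir / "timeouts.yaml").write_text(
    "distributed:\n"
    "  comm:\n"
    "    timeouts:\n"
    "      connect: 1000s\n"
    "      tcp: 1000s\n"
)

# Workaround 2: set the config programmatically inside the launch script
# (see the Python API example further down in the thread).
dask.config.set({
    "distributed.comm.timeouts.connect": "1000s",
    "distributed.comm.timeouts.tcp": "1000s",
})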
Thanks @jacobtomlinson. As far as I know, Slurm copies all environment variable settings to our compute nodes. Here is a tiny demo:
so I think those settings should be available to the workers. I did try to test your other suggestions, too. We do have a shared filesystem that is mounted on all compute nodes, so I put the following into my ~/.config/dask/config.yaml:
I used this setup to start a 2 node cluster, but when I checked, I didn't see the settings present in the worker config. They were still the default values:
Finally I tried the API, although I haven't used it before and I'm not sure I'm doing it right. I started this script via
but I don't see any output.
Could you try that approach? Your API example looks good; you should be able to set the Dask config in there with dask.config.set.
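For reference, a minimal sketch of that approach (assumptions: dask_mpi.initialize() with default options and a toy config check; this is not the poster's actual script):

import dask
from dask.distributed import Client
from dask_mpi import initialize

# Set the timeouts before the scheduler and workers are created; every MPI
# rank runs this script, so each process sees the same config.
dask.config.set({
    "distributed.comm.timeouts.connect": "1000s",
    "distributed.comm.timeouts.tcp": "1000s",
})

# Rank 0 becomes the scheduler, rank 1 continues as this client code,
# and the remaining MPI ranks become workers.
initialize()

# Client() with no arguments connects to the scheduler started by initialize().
client = Client()

# Confirm the setting actually reached the workers.
print(client.run(lambda: dask.config.get("distributed.comm.timeouts.connect")))

The script would be launched under MPI (e.g. with mpirun or srun); the exact invocation depends on the site.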
Ah yes, with your suggestions @jacobtomlinson the API does seem to work and the settings appear to be used, so thank you very much.
And in the cluster output I see
so that's good! Then I went to start up my 1300-worker cluster using the API, but unfortunately I still started to see the same timeout messages
I am baffled as to why it still seems to be using a 60s timeout limit.
There is an ongoing discussion in dask/distributed#3691 about scaling to large numbers of workers on HPC clusters. Folks there may have run into some of these same issues. Have you seen that issue?
Thanks @jacobtomlinson. I had seen it a while ago, but I went back and read it again. I tried the Client.wait_for_workers suggestion there and I think that's helpful. When I tried to launch our big 1300-worker test with
and
I took ORNL's advice and started up my client while the workers were still coming alive (although I'm not sure if the timing actually matters). I did still see a lot of the same timeout messages while the workers were starting, but in the end all the workers arrived and I was able to complete a toy calculation with all 1300 workers. Can you give any insight into those timeout messages? Thank you very much.
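A sketch of the Client.wait_for_workers pattern being described (the scheduler file name and the worker count are assumptions):

from dask.distributed import Client

# Connect to the scheduler started by dask-mpi while workers are still arriving.
client = Client(scheduler_file="scheduler.json")

# Block until the full pool has registered with the scheduler.
client.wait_for_workers(n_workers=1300)

print(len(client.scheduler_info()["workers"]), "workers connected")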
There should be some retries built in there. Given the log is only at the info level, it's probably nothing to worry about.
I see, that is very helpful to know. When I saw those messages I assumed the workers were failing, but it sounds like after a retry or two it was fine. Since I have a workaround for now, do you think I should close this issue? Or do you want to look into adjusting the config settings further? Either way I am very grateful for all your help with this @kmpaul and @jacobtomlinson.
I think thanks go mostly to @jacobtomlinson. 😄 Excellent finds with the issue on dask/distributed!
(Oh, and I'm happy to close this issue for now, if you want. We can always reopen it.)
What happened:
I was attempting to start a dask-mpi cluster on 20 (admittedly slow) 68-core Intel KNL nodes (1320 workers, each with a single thread). I observed the scheduler start and the workers attempt to start and connect, but eventually fail with messages like
Knowing that KNL has a slow clock speed of 1.4 GHz, I attempted to increase the timeout by setting values in my
~/.config/dask/config.yaml
file as recommended in the Dask docs. I also attempted to set environment variables via export DASK_DISTRIBUTED__COMM_TIMEOUTS_CONNECT=240s. I tried a very extreme case where I set
but I still saw timeout failures within a minute or two while dask-mpi attempted to start my cluster, so based on that it seems like dask-mpi is not respecting these values.
What you expected to happen:
I would like/expect dask-mpi to use the configuration options advertised in the dask docs like
~/.config/dask/config.yaml
. It's not clear to me if it does or should. Whatever the outcome of this issue is, it would be helpful to add a note to the docs about whether dask-mpi does support these configuration options.

Minimal Complete Verifiable Example:
I launched my dask-mpi cluster on NERSC's Cori system with the following commands inside a custom conda environment:
After 1-2 minutes, I saw many timeout messages:
Anything else we need to know?:
Environment:
conda
Thank you very much,
Laurie