I've been able to manually launch a Dask GPU cluster using Slurm successfully. The setup for that looks like:
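Roughly this shape (a sketch rather than the verbatim batch script; site-specific Slurm flags are omitted and the paths are illustrative):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4

# Shared scheduler file on scratch that the scheduler, the workers, and the
# client can all see.
SCHED_FILE=$SCRATCH/scheduler.json

# Scheduler on the first node, registering its address in the scheduler file.
dask-scheduler --scheduler-file $SCHED_FILE &
sleep 10   # give the scheduler a moment to write the scheduler file

# One dask-cuda-worker launch per node; each spawns one worker per visible GPU.
srun --ntasks-per-node=1 dask-cuda-worker --scheduler-file $SCHED_FILE &

wait
```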
This works OK for, say, 2 nodes with 4 GPUs each, but I find that when I request more workers (say, 80 GPUs), some of them just never come up, or they take an inordinately long time to come up. I had this problem with manual startup in a CPU-only context with Dask Distributed, and the fix for that kind of on-demand scale was `dask-mpi`. Combined with containers, it's a very good solution we've found for reliable Dask startup on HPC, and I'm hoping for the same kind of solution with GPUs.
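For reference, the CPU-only pattern that has been reliable for us is essentially a single srun of `dask-mpi` pointed at a scheduler file; in sketch form (illustrative counts and flags, not the exact production launch):

```bash
# Sketch: rank 0 becomes the scheduler, every other MPI rank becomes a worker,
# and everyone rendezvouses through the scheduler file.
srun -n 81 dask-mpi \
    --scheduler-file $SCRATCH/scheduler.json \
    --nthreads 8
```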
I've tried out `dask-mpi` support for `dask_cuda.CUDAWorker` but I can't seem to find a working invocation. I tried just adapting the above, or using something similar to what I've done before:
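Something along these lines (a sketch of the pattern; I've tried a number of flag variations):

```bash
# Same dask-mpi launch, but asking for dask_cuda.CUDAWorker instead of the
# default distributed.Worker. Rank 0 still runs the scheduler; each remaining
# rank starts a CUDAWorker, which spawns one worker per visible GPU.
srun --ntasks-per-node=1 dask-mpi \
    --scheduler-file $SCRATCH/scheduler.json \
    --worker-class dask_cuda.CUDAWorker
```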
The first thing I noticed is that I have to specify `--nthreads 1`, because otherwise you get an error that might be easy to fix:
```
TypeError: '<' not supported between instances of 'NoneType' and 'int'
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_mpi/cli.py", line 147, in main
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_mpi/cli.py", line 144, in run_worker
    async with WorkerType(**opts) as worker:
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_cuda/cuda_worker.py", line 95, in __init__
    if nthreads < 1:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
```
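The comparison that blows up is the nthreads validation in `CUDAWorker.__init__`, which evidently receives `None` when the CLI option isn't given. Presumably an easy fix is to drop unset options before passing them to the worker class, so its own defaults apply; in sketch form (illustrative values, not the actual dask-mpi code):

```python
# Strip options that were left unset on the command line before forwarding
# them, so the worker class can fall back to its own defaults instead of
# choking on None.
opts = {"nthreads": None, "memory_limit": "auto", "local_directory": None}  # illustrative
opts = {k: v for k, v in opts.items() if v is not None}
print(opts)  # {'memory_limit': 'auto'}
```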
Once you have `--nthreads 1` in place, though, you hit this:
```
Traceback (most recent call last):
  File "/pscratch/sd/r/rthomas/dask/env/bin/dask-mpi", line 8, in <module>
    sys.exit(go())
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_mpi/cli.py", line 152, in go
    main()
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_mpi/cli.py", line 147, in main
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_mpi/cli.py", line 144, in run_worker
    async with WorkerType(**opts) as worker:
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_cuda/cuda_worker.py", line 216, in __init__
    self.nannies = [
  File "/pscratch/sd/r/rthomas/dask/env/lib/python3.8/site-packages/dask_cuda/cuda_worker.py", line 217, in <listcomp>
    Nanny(
TypeError: __init__() got multiple values for argument 'scheduler_ip'
```
as if a keyword arg is trying to overwrite a positional one. I started wondering how people had been trying this out themselves, and whether maybe they weren't using a scheduler file for the use cases tested so far, but I had trouble properly specifying options and arguments for `dask_cuda.CUDAWorker` via `--worker-options` (they seem to have no effect) or specifying `SCHEDULER_ADDRESS` on the command line.
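For what it's worth, that TypeError is the generic Python failure when the same argument arrives both positionally and through forwarded keyword options, which fits the symptom here (toy reproduction below, not the real Nanny signature):

```python
# Stand-in for Nanny to show the failure mode: the scheduler address is passed
# positionally while the forwarded options still carry scheduler_ip.
def nanny(scheduler_ip=None, **kwargs):
    pass

opts = {"scheduler_ip": "tcp://10.0.0.1:8786"}   # e.g. options derived from the CLI
nanny("tcp://10.0.0.1:8786", **opts)
# TypeError: nanny() got multiple values for argument 'scheduler_ip'
```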
**My conda env, based off RAPIDS stable**
For completeness, I did this to set up my Dask cluster env and Jupyter kernel env:
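The cluster env follows the usual RAPIDS stable conda recipe, roughly (the version pins below are illustrative, not necessarily the exact ones I used):

```bash
# Roughly the standard RAPIDS stable install; pins are illustrative.
conda create -n dask-gpu -c rapidsai -c nvidia -c conda-forge \
    rapids=21.06 python=3.8 cudatoolkit=11.2
conda activate dask-gpu
```

plus, installed into that env: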
```bash
pip install --no-cache-dir --force git+https://github.com/dask/dask-mpi
```
(to keep from overwriting my mpi4py)