Launching PyTorch Elastic with torchrun sometimes fails at startup with a transient error similar to the one below:
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 541, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 610, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 647, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
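The last frame shows the failure comes from a blocking `store.get(...)` call that raised `RuntimeError: Socket Timeout`. To illustrate what "transient" means here, the sketch below retries such a call a few times before giving up. It is only an illustration under assumptions: `get_with_retry` and `FlakyStore` are hypothetical names, `FlakyStore` is a stand-in rather than a PyTorch store, and nothing here is part of the torchrun or elastic agent API or a confirmed fix for this issue.

```python
import time


def get_with_retry(store, key, attempts=3, delay_s=0.0):
    """Hypothetical helper: call store.get(key), retrying on RuntimeError."""
    for attempt in range(1, attempts + 1):
        try:
            return store.get(key)
        except RuntimeError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay_s)  # brief pause before the next attempt


class FlakyStore:
    """Hypothetical stand-in for a key-value store; fails the first N gets."""

    def __init__(self, data, failures):
        self._data = data
        self._failures = failures

    def get(self, key):
        if self._failures > 0:
            self._failures -= 1
            raise RuntimeError("Socket Timeout")
        return self._data[key]


# Two simulated timeouts, then the third attempt succeeds.
store = FlakyStore({"agent0": b"role-info"}, failures=2)
value = get_with_retry(store, "agent0", attempts=3)
```

If the timeouts outlast the retry budget, the original `RuntimeError` propagates unchanged, which matches the behavior reported in the traceback.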