Torch distributed RuntimeError: Socket Timeout #89

Open
ajayvohra2005 opened this issue Mar 19, 2024 · 1 comment

@ajayvohra2005
Contributor

Launching PyTorch Elastic with torchrun sometimes fails at startup with a transient error similar to the one below:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 541, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 610, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 647, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/usr/local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
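The last frames of the traceback show the agent blocked in `store.get(f"{prefix}{idx}")` inside `get_all`, i.e. waiting for every agent to publish its key. A toy model of that step may help when reading the trace. `ToyStore` below is a hypothetical in-memory stand-in, not the real torch store, and this `get_all` is a simplified sketch that mirrors only the lookup visible in the traceback:

```python
import threading


class ToyStore:
    """Hypothetical in-memory stand-in for a rendezvous store (NOT torch's store)."""

    def __init__(self, timeout_s):
        self._data = {}
        self._cond = threading.Condition()
        self._timeout_s = timeout_s

    def set(self, key, value):
        # Publish a key and wake up any waiting readers.
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def get(self, key):
        # Block until the key exists; give up after the timeout.
        with self._cond:
            found = self._cond.wait_for(
                lambda: key in self._data, timeout=self._timeout_s
            )
            if not found:
                raise RuntimeError("Socket Timeout")
            return self._data[key]


def get_all(store, prefix, world_size):
    # Simplified: read the key of every agent, in rank order.
    return [store.get(f"{prefix}{idx}") for idx in range(world_size)]


if __name__ == "__main__":
    store = ToyStore(timeout_s=0.1)
    store.set("agent/0", b"info-0")  # agent 1 never publishes its key
    try:
        get_all(store, "agent/", world_size=2)
    except RuntimeError as err:
        print(f"RuntimeError: {err}")
```

In this model, a single agent that never publishes its key is enough to make every other agent time out. Whether that is what happens in this issue is not established.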
@ajayvohra2005
Contributor Author

The root cause is unknown at this time, but uninstalling and re-installing the Helm chart resolves the issue.
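The uninstall and re-install workaround can be scripted. This is a minimal sketch, assuming the `helm` CLI is on the PATH. The release name, chart path, and namespace in the usage example are hypothetical placeholders, not values from this repository:

```python
import subprocess


def reinstall_commands(release, chart, namespace):
    """Build the helm uninstall/install argument lists (no side effects)."""
    return [
        ["helm", "uninstall", release, "--namespace", namespace],
        ["helm", "install", release, chart, "--namespace", namespace],
    ]


def reinstall(release, chart, namespace):
    """Run uninstall, then install; stop at the first failing command."""
    for cmd in reinstall_commands(release, chart, namespace):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical names; substitute the release and chart actually in use.
    for cmd in reinstall_commands("my-release", "./my-chart", "my-namespace"):
        print(" ".join(cmd))
```

Calling `reinstall(...)` instead of printing would execute the two commands. Add any values files or `--set` overrides the original install used.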
