Pickle FileNotFoundError in evaluation phase of distributed training #584

Answered by C1rN09
austinmw asked this question in Q&A

Thanks for reporting the bug! I checked the code and found that it's a bug in our dist.collect_results_cpu.

The bug occurs when you evaluate a model on multiple instances that have no shared storage. We'll fix it ASAP. In the meantime, here are some workarounds:

  • Make sure a .dist_test folder that is shared across instances exists in the directory where you run torchrun. If your shared storage lives in another directory (e.g. /mnt/your_shared), you can create a soft link to it with ln -s. If you have no shared storage, you can try mounting one through NFS or something similar.
  • Alternatively, add collect_device='gpu' to your metrics' config to enable GPU collecting. This is currently experimental and may not be fully stable.
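
Both workarounds can be sketched in Python, roughly as below. This is only an illustration under stated assumptions, not the official fix: the helper name link_dist_test is hypothetical, /mnt/your_shared is the example path from above, and the metric type 'Accuracy' is a placeholder for whatever metric your config already uses. Only collect_device='gpu' comes from the suggestion itself.

```python
import os
from pathlib import Path


def link_dist_test(shared_root: str, run_dir: str = ".") -> Path:
    """Hypothetical helper: point <run_dir>/.dist_test at shared storage.

    Roughly what `ln -s <shared_root>/.dist_test <run_dir>/.dist_test` does.
    """
    # Real folder on the shared mount (created if it does not exist yet).
    target = Path(shared_root) / ".dist_test"
    target.mkdir(parents=True, exist_ok=True)

    link = Path(run_dir) / ".dist_test"
    if link.is_symlink() or link.exists():
        # Leave an existing folder or link untouched.
        return link

    # Use an absolute target so the link works regardless of run_dir.
    os.symlink(target.resolve(), link, target_is_directory=True)
    return link


# Workaround 1 (assumed usage): run on every instance before torchrun, e.g.
#   link_dist_test("/mnt/your_shared")

# Workaround 2: config fragment that switches result collection to GPU.
# 'Accuracy' is a placeholder metric type; keep your own metric settings.
val_evaluator = dict(type="Accuracy", collect_device="gpu")
test_evaluator = val_evaluator
```

With the first workaround, each instance needs the link in place in its own working directory before launching, so that every rank ends up reading and writing the same folder.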

Answer selected by ZwwWayne
Category: Q&A
Labels: bug (Something isn't working)
4 participants

This discussion was converted from issue #557 on October 08, 2022 03:38.