-
Describe the bug
…

Reproduction
I used torchrun to kick off the run across instances (a rough sketch of the launch is below). Here's my config (…)

Environment
The environment script seems to be broken. I'm running in a Docker container on Amazon SageMaker with the latest version of MMEngine and MMDetection 3.0.

Error traceback
…
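For reference, the launch on each instance was roughly this shape (node count, addresses, and paths below are placeholders, not my actual setup; the same applies if you launch `tools/train.py` instead):

```bash
# Hypothetical two-node launch; run on every instance, with --node_rank set to
# 0 on the first node and 1 on the second. MASTER_ADDR is the first node's IP.
torchrun \
    --nnodes 2 \
    --node_rank 0 \
    --nproc_per_node 8 \
    --master_addr "$MASTER_ADDR" \
    --master_port 29500 \
    tools/test.py path/to/config.py path/to/checkpoint.pth --launcher pytorch
```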
Replies: 3 comments 2 replies
-
Thanks for reporting the bug! I checked the code and found it's a bug in our `dist.collect_results_cpu`. The bug occurs when you are evaluating a model on multiple instances without shared storage. We'll fix it ASAP. Some workarounds for now (see the sketches after this list):

1. Make sure a `.dist_test` folder that is shared across instances exists in the directory where you run `torchrun`. If you have shared storage in another directory (e.g. `/mnt/your_shared`), you can create a soft link to it via `ln -s`. If you don't have shared storage, you can try mounting some through `nfs` or something similar.
2. Add `collect_device='gpu'` to your metrics' config to enable GPU collecting. This is currently experimental and may not be that stable.
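For the first workaround, a minimal sketch, assuming the shared storage is mounted at `/mnt/your_shared` (an example path) and that you run this from the working directory on every instance before launching `torchrun`:

```bash
# Create the temp-collection folder on the shared storage (example path)
mkdir -p /mnt/your_shared/.dist_test
# Link it into the working directory so collect_results_cpu on every rank
# reads and writes the same location
ln -s /mnt/your_shared/.dist_test .dist_test
```

For the second workaround, the change is a single key in the metric config. The snippet below uses `CocoMetric` purely as an illustration; apply it to whichever evaluator you already have:

```python
# Illustrative MMDetection 3.x evaluator config; type, ann_file, and metric are placeholders.
val_evaluator = dict(
    type='CocoMetric',
    ann_file='data/coco/annotations/instances_val2017.json',
    metric='bbox',
    collect_device='gpu',  # gather results via the distributed backend instead of a shared temp dir
)
test_evaluator = val_evaluator
```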
-
Yep, that matches my scenario (multiple AWS p3.16xlarge instances without shared storage), thanks!
-
When I set …