I was running fit-model on 18 tomograms using the config.yaml attached below and got this error for "too many open files" (ulimit is actually unlimited already). Have you seen this before or know what the cause might be?
DeadlockDetectedException: DeadLock detected from rank: 3
Traceback (most recent call last):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 296, in on_advance_end
    self.trainer._call_lightning_module_hook("on_train_epoch_end")
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 84, in on_train_epoch_end
    self.update_subtomo_missing_wedges()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 125, in update_subtomo_missing_wedges
    for batch in tqdm.tqdm(loader, desc="Updating subtomo missing wedges"):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1160, in _try_get_data
    raise RuntimeError(
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code
Killed
shared:
  project_dir: "."
  tomo0_files:
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_ODD_Vol.mrc"
  tomo1_files:
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_EVN_Vol.mrc"
  subtomo_size: 96
  mw_angle: 92
  num_workers: 32
  gpu: [0, 1, 2, 3]
  seed: 42
prepare_data:
  val_fraction: 0.1
  extract_larger_subtomos_for_rotating: true
  overwrite: true
fit_model:
  unet_params_dict:
    chans: 64
    num_downsample_layers: 3
    drop_prob: 0.0
  adam_params_dict:
    lr: 0.0004
  num_epochs: 1000
  batch_size: 5
  update_subtomo_missing_wedges_every_n_epochs: 10
  check_val_every_n_epochs: 10
  save_n_models_with_lowest_val_loss: 5
  save_n_models_with_lowest_fitting_loss: 5
  save_model_every_n_epochs: 50
  logger: "csv"
refine_tomogram:
  model_checkpoint_file: "logs/version_0/checkpoints/epoch/epoch=999.ckpt"
  subtomo_overlap: 32
  batch_size: 10
Update: `ulimit -n` was actually only 1024, not unlimited as I first wrote. I increased it to 4096, and that allowed ddw to resume when I re-ran the command (instead of failing immediately with the same error). I have ~1700 subtomograms in each half. However, after validation, DDW exits with "Killed" and no other output.

Attachment: config.yaml.txt
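
For anyone hitting the same error, here is a minimal sketch of checking and raising the open-file limit from inside Python rather than in the shell. It uses only the standard-library `resource` module (Unix only); the helper name `raise_nofile_limit` and the 4096 target are my own choices for illustration, not part of ddw. Note that in bash a bare `ulimit` reports the file-size limit (`-f`), not the open-file limit (`-n`), which may explain an "unlimited" reading. The alternative named in the error message is shown only as a comment because it needs PyTorch.

```python
import resource


def raise_nofile_limit(target: int = 4096) -> tuple[int, int]:
    """Raise the soft open-file limit toward `target`, never above the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process may raise its soft limit only up to the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= new_soft:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("RLIMIT_NOFILE (soft, hard):", raise_nofile_limit(4096))

    # Alternative suggested by the error message (requires PyTorch):
    # import torch.multiprocessing as mp
    # mp.set_sharing_strategy("file_system")
```

Either change has to happen before the DataLoader workers are started, since child processes inherit the limit from the parent. Whether it has any bearing on the later "Killed" exit is unknown to me.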