Too many open files during fitting #30

@asarnow

While running fit-model on 18 tomograms with the config.yaml attached below, I got a "Too many open files" error (I initially thought ulimit was already unlimited). Have you seen this before, or do you know what the cause might be?

Update: ulimit -n was actually only 1024. I increased it to 4096, and that allowed ddw to resume when I re-ran the command instead of failing immediately with the same error. I have ~1700 subtomograms in each half. However, after validation DDW now exits with "Killed" and no other output.
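
For anyone checking the same limit from inside Python rather than the shell, here is a minimal sketch using the standard-library resource module. The helper name raise_nofile_limit and the 4096 target are only illustrative (4096 is the value I used with ulimit -n); none of this is part of ddw.

```python
import resource


def raise_nofile_limit(target=4096):
    """Raise the soft open-file limit toward `target`, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    # The soft limit cannot exceed the hard limit without extra privileges.
    new_soft = target if hard == unlimited else min(target, hard)
    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != unlimited and soft < new_soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            pass  # some platforms cap the value below the reported hard limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_nofile_limit(4096))
```

The limit is per process and inherited by child processes, so a call like this would have to run before any DataLoader workers are started.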

DeadlockDetectedException: DeadLock detected from rank: 3
Traceback (most recent call last):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
    results = self._run_stage()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
    self._run_train()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
    self.fit_loop.run()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 296, in on_advance_end
    self.trainer._call_lightning_module_hook("on_train_epoch_end")
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 84, in on_train_epoch_end
    self.update_subtomo_missing_wedges()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/ddw/utils/unet.py", line 125, in update_subtomo_missing_wedges
    for batch in tqdm.tqdm(loader, desc="Updating subtomo missing wedges"):
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/home/asarnow/local/opt/miniforge3/envs/ddw_env/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1160, in _try_get_data
    raise RuntimeError(
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code

Killed
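
The RuntimeError above also suggests changing the sharing strategy. A minimal sketch of that alternative, assuming PyTorch is installed, is below. The wrapper name use_file_system_sharing is hypothetical, and where such a call would belong in ddw is not something the traceback shows; it would need to run before any DataLoader workers are created.

```python
try:
    import torch.multiprocessing as mp
except ImportError:  # lets the sketch load on machines without PyTorch
    mp = None


def use_file_system_sharing():
    """Switch tensor sharing to 'file_system' if available; return the active strategy."""
    if mp is None:
        return None
    # Only switch when the platform actually offers this strategy.
    if "file_system" in mp.get_all_sharing_strategies():
        mp.set_sharing_strategy("file_system")
    return mp.get_sharing_strategy()


if __name__ == "__main__":
    print("sharing strategy:", use_file_system_sharing())
```

As far as I understand the PyTorch documentation, this strategy shares tensors through named files in shared memory instead of keeping a file descriptor open per tensor, at the cost of possible leftover shared-memory files if processes die abnormally. I have not verified that it resolves this particular failure.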

config.yaml.txt

shared:
  project_dir: "."
  tomo0_files: 
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_ODD_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_ODD_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_ODD_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_ODD_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_ODD_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_ODD_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_ODD_Vol.mrc"
  tomo1_files:
    - "../23dec02b/aretomo/23dec02b_ts10.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts12.mrc_EVN_Vol.mrc"
    - "../23dec02b/aretomo/23dec02b_ts13.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts108.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts109.mrc_EVN_Vol.mrc"
    - "../23dec05a/aretomo/23dec05a_ts120.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts38.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts49.mrc_EVN_Vol.mrc"
    - "../23dec25a/aretomo/23dec25a_ts54.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_29.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_34.mrc_EVN_Vol.mrc"
    - "../24feb09a/aretomo/da27-3_35.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_10.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_13.mrc_EVN_Vol.mrc"
    - "../24feb16a/aretomo/da8-1_17.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_10.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_11.mrc_EVN_Vol.mrc"
    - "../24feb19a/aretomo/DA16-2_12.mrc_EVN_Vol.mrc"
  subtomo_size: 96
  mw_angle: 92
  num_workers: 32
  gpu: [0, 1, 2, 3]
  seed: 42

prepare_data:
  val_fraction: 0.1
  extract_larger_subtomos_for_rotating: true
  overwrite: true

fit_model:
    unet_params_dict:
      chans: 64
      num_downsample_layers: 3
      drop_prob: 0.0
    adam_params_dict: 
      lr: 0.0004
    num_epochs: 1000
    batch_size: 5
    update_subtomo_missing_wedges_every_n_epochs: 10
    check_val_every_n_epochs: 10
    save_n_models_with_lowest_val_loss: 5
    save_n_models_with_lowest_fitting_loss: 5
    save_model_every_n_epochs: 50
    logger: "csv"


refine_tomogram:
    model_checkpoint_file: "logs/version_0/checkpoints/epoch/epoch=999.ckpt"
    subtomo_overlap: 32
    batch_size: 10
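
As a rough sanity check on the config above, here is the worker-count arithmetic as a small sketch. It assumes one training process per GPU, each starting its own num_workers DataLoader workers, which is common for multi-GPU PyTorch Lightning runs but not confirmed for ddw here. The function name is illustrative only.

```python
import os


def total_loader_workers(num_gpus, num_workers):
    """Worker processes started if every per-GPU process spawns its own workers."""
    return num_gpus * num_workers


# Values copied from the `shared` section of the config.
gpus = [0, 1, 2, 3]
num_workers = 32

total = total_loader_workers(len(gpus), num_workers)  # 4 * 32 = 128

if __name__ == "__main__":
    print("DataLoader workers across all GPUs:", total)
    print("CPU cores visible to this process:", os.cpu_count())
```

If that assumption holds, lowering num_workers would be a cheap experiment for both the open-files error and the later "Killed" exit, although I have no evidence yet that either is caused by the worker count.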

Labels: bug, help wanted
