PyTorch crashed during tutorial #14

@davidhoover

I recently installed spIsoNet and attempted to run the tutorial. PyTorch crashed as soon as training started, with the following output:

07-01 10:44:32, INFO     voxel_size 1.309999942779541
07-01 10:44:33, INFO     spIsoNet correction until resolution 3.5A!
                     Information beyond 3.5A remains unchanged
07-01 10:44:42, INFO     Start preparing subvolumes!
07-01 10:44:48, INFO     Done preparing subvolumes!
07-01 10:44:48, INFO     Start training!
07-01 10:44:52, INFO     Port number: 42933
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
  0%|                                                                                                                              | 0/250 [00:00<?, ?batch/s]/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/conv.py:605: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)
  return F.conv3d(
  0%|                                                                                                                              | 0/250 [00:05<?, ?batch/s]
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 160, in ddp_train
    loss.backward()
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Which version of PyTorch does spIsoNet require? We have torch 2.3.1+cu118 installed. This was run on a single P100 GPU.
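For reference, here is a minimal sketch of how this class of RuntimeError arises in plain PyTorch, independent of spIsoNet. It is not a diagnosis of what `network_n2n.py` does at line 160; the function names and the toy tensors are made up for illustration. It runs on CPU and uses the anomaly detection mode that the error hint mentions, which prints the forward-pass traceback of the operation whose saved tensor was modified.

```python
import torch


def backward_after_inplace_edit() -> str:
    """Trigger the 'modified by an inplace operation' error and return its message."""
    w = torch.ones(3, requires_grad=True)
    # Anomaly mode records forward tracebacks so the failing op can be located.
    with torch.autograd.detect_anomaly():
        y = w.exp()   # exp() saves its output y for the backward pass
        y.add_(1.0)   # in-place edit bumps y's version counter
        try:
            y.sum().backward()  # autograd sees the saved y has changed
        except RuntimeError as err:
            return str(err)
    return ""


def backward_after_out_of_place_edit() -> torch.Tensor:
    """Same computation, but the addition creates a new tensor instead."""
    w = torch.ones(3, requires_grad=True)
    y = w.exp()
    y = y + 1.0       # out-of-place: the tensor saved by exp() is untouched
    y.sum().backward()
    return w.grad


if __name__ == "__main__":
    print(backward_after_inplace_edit())
    print(backward_after_out_of_place_edit())
```

The first function fails the same version-counter check reported in the traceback above; the second passes because nothing that autograd saved is overwritten. Whether the tutorial crash comes from an explicit in-place operation or from something else that updates a tensor between the forward and backward passes is not something this sketch can tell.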
