
NaN loss during training #22

@mgflast


Hi,

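Some of my training runs fail with the loss becoming NaN, and I'm not sure why. I increased the learning rate quite a bit (to 0.005), but I wouldn't expect that alone to produce NaNs. I'm training on denoised volumes. In the run below, the loss is stable through epoch 38 and turns to NaN in epoch 39, while the learning rate is climbing back toward its maximum (4.5512e-03 at that point). Command and log: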

isonet.py refine datasets/001_HELA/tomograms.star --method isonet2 --arch unet-medium --cube_size 128 --epochs 50 --input_column rlnTomoName --CTF_mode None --bfactor 0 --noise_level 0 --mw_weight 200 --learning_rate 0.005 --gpuID 0,1,2,3,4,5,6,7 --ncpus 64 --output_dir datasets/001_HELA/isonet_refine --with_preview False

02-01 17:47:36, INFO     The datasets/001_HELA/isonet_refine folder already exists, outputs will write into this folder
02-01 17:47:36, INFO     8 CPU cores per GPU, total 64 CPUs
02-01 17:47:36, INFO     Your noise_level is 0, we recommend to increase noise_level for denoising during isonet2 training
02-01 17:47:36, INFO     Enabling mw_weight
02-01 17:47:36, INFO     Total number of parameters: 23989825
02-01 17:47:36, INFO     Port number: 56077

Preprocess tomograms: 100%|█████████████████████████████████████████| 53/53 [02:21<00:00,  2.67s/it]
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
Epoch 1: 100%|██████████████████████████████████████| 189/189 [02:37<00:00,  1.20 batch/s, Loss: 0.04167, Learning rate: 5.0000e-03]
Epoch [  1/ 50] Loss: 0.17983, inside_loss: 0.04344, outside_loss: 0.05730
Epoch 2: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.10 batch/s, Loss: 0.04481, Learning rate: 4.8850e-03]
Epoch [  2/ 50] Loss: 0.04499, inside_loss: 0.00010, outside_loss: 0.03616
Epoch 3: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03294, Learning rate: 4.5512e-03]
Epoch [  3/ 50] Loss: 0.04524, inside_loss: 0.00013, outside_loss: 0.03642
Epoch 4: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01759, Learning rate: 4.0313e-03]
Epoch [  4/ 50] Loss: 0.04454, inside_loss: 0.00018, outside_loss: 0.03545
Epoch 5: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.05451, Learning rate: 3.3762e-03]
Epoch [  5/ 50] Loss: 0.04476, inside_loss: 0.00024, outside_loss: 0.03578
Epoch 6: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.07039, Learning rate: 2.6500e-03]
Epoch [  6/ 50] Loss: 0.04420, inside_loss: 0.00030, outside_loss: 0.03514
Epoch 7: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.06034, Learning rate: 1.9238e-03]
Epoch [  7/ 50] Loss: 0.04362, inside_loss: 0.00033, outside_loss: 0.03424
Epoch 8: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.04348, Learning rate: 1.2687e-03]
Epoch [  8/ 50] Loss: 0.04290, inside_loss: 0.00035, outside_loss: 0.03354
Epoch 9: 100%|██████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.02170, Learning rate: 7.4881e-04]
Epoch [  9/ 50] Loss: 0.04186, inside_loss: 0.00037, outside_loss: 0.03279
Epoch 10: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.10 batch/s, Loss: 0.01545, Learning rate: 4.1502e-04]
Epoch [ 10/ 50] Loss: 0.04210, inside_loss: 0.00037, outside_loss: 0.03295
Epoch 11: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03982, Learning rate: 3.0000e-04]
Epoch [ 11/ 50] Loss: 0.04138, inside_loss: 0.00039, outside_loss: 0.03204
Epoch 12: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01589, Learning rate: 4.1502e-04]
Epoch [ 12/ 50] Loss: 0.04114, inside_loss: 0.00040, outside_loss: 0.03195
Epoch 13: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01341, Learning rate: 7.4881e-04]
Epoch [ 13/ 50] Loss: 0.04053, inside_loss: 0.00043, outside_loss: 0.03148
Epoch 14: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.04002, Learning rate: 1.2687e-03]
Epoch [ 14/ 50] Loss: 0.04083, inside_loss: 0.00045, outside_loss: 0.03143
Epoch 15: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.05732, Learning rate: 1.9238e-03]
Epoch [ 15/ 50] Loss: 0.03989, inside_loss: 0.00048, outside_loss: 0.03060
Epoch 16: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03485, Learning rate: 2.6500e-03]
Epoch [ 16/ 50] Loss: 0.03888, inside_loss: 0.00049, outside_loss: 0.02967
Epoch 17: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.00839, Learning rate: 3.3762e-03]
Epoch [ 17/ 50] Loss: 0.03790, inside_loss: 0.00050, outside_loss: 0.02880
Epoch 18: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.05629, Learning rate: 4.0313e-03]
Epoch [ 18/ 50] Loss: 0.03690, inside_loss: 0.00049, outside_loss: 0.02796
Epoch 19: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03838, Learning rate: 4.5512e-03]
Epoch [ 19/ 50] Loss: 0.03503, inside_loss: 0.00053, outside_loss: 0.02618
Epoch 20: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.02492, Learning rate: 4.8850e-03]
Epoch [ 20/ 50] Loss: 0.03455, inside_loss: 0.00052, outside_loss: 0.02578
Epoch 21: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.10 batch/s, Loss: 0.01577, Learning rate: 5.0000e-03]
Epoch [ 21/ 50] Loss: 0.03390, inside_loss: 0.00053, outside_loss: 0.02508
Epoch 22: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.06428, Learning rate: 4.8850e-03]
Epoch [ 22/ 50] Loss: 0.04081, inside_loss: 0.00037, outside_loss: 0.03154
Epoch 23: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03822, Learning rate: 4.5512e-03]
Epoch [ 23/ 50] Loss: 0.03822, inside_loss: 0.00047, outside_loss: 0.02887
Epoch 24: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.02705, Learning rate: 4.0313e-03]
Epoch [ 24/ 50] Loss: 0.03542, inside_loss: 0.00051, outside_loss: 0.02650
Epoch 25: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.04784, Learning rate: 3.3762e-03]
Epoch [ 25/ 50] Loss: 0.03452, inside_loss: 0.00051, outside_loss: 0.02588
Epoch 26: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.04727, Learning rate: 2.6500e-03]
Epoch [ 26/ 50] Loss: 0.03336, inside_loss: 0.00053, outside_loss: 0.02468
Epoch 27: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01179, Learning rate: 1.9238e-03]
Epoch [ 27/ 50] Loss: 0.03312, inside_loss: 0.00051, outside_loss: 0.02448
Epoch 28: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03676, Learning rate: 1.2687e-03]
Epoch [ 28/ 50] Loss: 0.03212, inside_loss: 0.00052, outside_loss: 0.02380
Epoch 29: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01638, Learning rate: 7.4881e-04]
Epoch [ 29/ 50] Loss: 0.03219, inside_loss: 0.00050, outside_loss: 0.02397
Epoch 30: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.03195, Learning rate: 4.1502e-04]
Epoch [ 30/ 50] Loss: 0.03215, inside_loss: 0.00052, outside_loss: 0.02380
Epoch 31: 100%|█████████████████████████████████████| 189/189 [01:51<00:00,  1.69 batch/s, Loss: 0.03081, Learning rate: 3.0000e-04]
Epoch [ 31/ 50] Loss: 0.03218, inside_loss: 0.00051, outside_loss: 0.02383
Epoch 32: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01386, Learning rate: 4.1502e-04]
Epoch [ 32/ 50] Loss: 0.03242, inside_loss: 0.00052, outside_loss: 0.02402
Epoch 33: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.09 batch/s, Loss: 0.01729, Learning rate: 7.4881e-04]
Epoch [ 33/ 50] Loss: 0.03212, inside_loss: 0.00051, outside_loss: 0.02388
Epoch 34: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.08 batch/s, Loss: 0.04065, Learning rate: 1.2687e-03]
Epoch [ 34/ 50] Loss: 0.03231, inside_loss: 0.00050, outside_loss: 0.02400
Epoch 35: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.08 batch/s, Loss: 0.03545, Learning rate: 1.9238e-03]
Epoch [ 35/ 50] Loss: 0.03242, inside_loss: 0.00052, outside_loss: 0.02412
Epoch 36: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.08 batch/s, Loss: 0.00677, Learning rate: 2.6500e-03]
Epoch [ 36/ 50] Loss: 0.03266, inside_loss: 0.00053, outside_loss: 0.02409
Epoch 37: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.08 batch/s, Loss: 0.03407, Learning rate: 3.3762e-03]
Epoch [ 37/ 50] Loss: 0.03228, inside_loss: 0.00052, outside_loss: 0.02394
Epoch 38: 100%|█████████████████████████████████████| 189/189 [01:30<00:00,  2.08 batch/s, Loss: 0.03742, Learning rate: 4.0313e-03]
Epoch [ 38/ 50] Loss: 0.03222, inside_loss: 0.00052, outside_loss: 0.02384
Epoch 39: 100%|██████████████████████████████████████| 189/189 [01:31<00:00,  2.07 batch/s, Loss:    nan, Learning rate: 4.5512e-03]
Epoch [ 39/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 40: 100%|██████████████████████████████████████| 189/189 [01:31<00:00,  2.06 batch/s, Loss:    nan, Learning rate: 4.8850e-03]
Epoch [ 40/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 41: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 5.0000e-03]
Epoch [ 41/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 42: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 4.8850e-03]
Epoch [ 42/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 43: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 4.5512e-03]
Epoch [ 43/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 44: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 4.0313e-03]
Epoch [ 44/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 45: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 3.3762e-03]
Epoch [ 45/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 46: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 2.6500e-03]
Epoch [ 46/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 47: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 1.9238e-03]
Epoch [ 47/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 48: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 1.2687e-03]
Epoch [ 48/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 49: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 7.4881e-04]
Epoch [ 49/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
Epoch 50: 100%|██████████████████████████████████████| 189/189 [01:32<00:00,  2.05 batch/s, Loss:    nan, Learning rate: 4.1502e-04]
Epoch [ 50/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
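
To make the failure point easier to see, here is a minimal sketch that scans a log like the one above for the first epoch whose progress-bar loss is NaN and reports the learning rate at that point. The helper names (`first_nan_epoch`, `cyclic_cosine_lr`) are hypothetical and are not part of IsoNet. The schedule function is only a guess fitted to the logged learning rates (a cosine cycle between 5.0e-03 and 3.0e-04 with a 10-epoch half period); I have not checked it against the IsoNet source.

```python
import math
import re

# Matches the per-epoch progress-bar lines from the log, e.g.
# "Epoch 39: 100%|...| 189/189 [..., Loss:    nan, Learning rate: 4.5512e-03]"
# The "Epoch [ 39/ 50] ..." summary lines are intentionally not matched.
PROGRESS_LINE = re.compile(
    r"^Epoch\s+(\d+):.*Loss:\s*([^,\s]+),\s*Learning rate:\s*([0-9.eE+-]+)\]"
)


def first_nan_epoch(log_text):
    """Return (epoch, learning_rate) for the first NaN loss, or None if there is none."""
    for line in log_text.splitlines():
        m = PROGRESS_LINE.match(line.strip())
        if m is None:
            continue
        epoch, loss, lr = int(m.group(1)), float(m.group(2)), float(m.group(3))
        if math.isnan(loss):
            return epoch, lr
    return None


def cyclic_cosine_lr(epoch, lr_max=5e-3, lr_min=3e-4, half_period=10):
    """Assumed schedule, fitted to the logged values only (not taken from IsoNet)."""
    phase = math.pi * (epoch - 1) / half_period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(phase))


# Four lines copied from the log above (progress bar shortened).
sample = """\
Epoch 38: 100%|████| 189/189 [01:30<00:00,  2.08 batch/s, Loss: 0.03742, Learning rate: 4.0313e-03]
Epoch [ 38/ 50] Loss: 0.03222, inside_loss: 0.00052, outside_loss: 0.02384
Epoch 39: 100%|████| 189/189 [01:31<00:00,  2.07 batch/s, Loss:    nan, Learning rate: 4.5512e-03]
Epoch [ 39/ 50] Loss:    nan, inside_loss:    nan, outside_loss:    nan
"""

print(first_nan_epoch(sample))
print(f"{cyclic_cosine_lr(39):.4e}")
```

If the fitted schedule is right, epoch 39 falls on the third ramp back up toward the peak rate, so the NaN may be tied to the high learning rate after all, although the two earlier peaks (epochs 1 and 21) did not diverge.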
