Some of my training runs fail with the loss reaching a NaN value. I'm not sure what's going on. I did increase the learning rate quite a bit, but I don't think that alone should produce NaNs? I'm training on denoised volumes. Here is the log:
02-01 17:47:36, INFO The datasets/001_HELA/isonet_refine folder already exists, outputs will write into this folder
02-01 17:47:36, INFO 8 CPU cores per GPU, total 64 CPUs
02-01 17:47:36, INFO Your noise_level is 0, we recommend to increase noise_level for denoising during isonet2 training
02-01 17:47:36, INFO Enabling mw_weight
02-01 17:47:36, INFO Total number of parameters: 23989825
02-01 17:47:36, INFO Port number: 56077
Preprocess tomograms: 100%|█████████████████████████████████████████| 53/53 [02:21<00:00, 2.67s/it]
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:56077 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [localhost]:56077 (errno: 97 - Address family not supported by protocol).
Epoch 1: 100%|██████████████████████████████████████| 189/189 [02:37<00:00, 1.20 batch/s, Loss: 0.04167, Learning rate: 5.0000e-03]
Epoch [ 1/ 50] Loss: 0.17983, inside_loss: 0.04344, outside_loss: 0.05730
Epoch 2: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.10 batch/s, Loss: 0.04481, Learning rate: 4.8850e-03]
Epoch [ 2/ 50] Loss: 0.04499, inside_loss: 0.00010, outside_loss: 0.03616
Epoch 3: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03294, Learning rate: 4.5512e-03]
Epoch [ 3/ 50] Loss: 0.04524, inside_loss: 0.00013, outside_loss: 0.03642
Epoch 4: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01759, Learning rate: 4.0313e-03]
Epoch [ 4/ 50] Loss: 0.04454, inside_loss: 0.00018, outside_loss: 0.03545
Epoch 5: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.05451, Learning rate: 3.3762e-03]
Epoch [ 5/ 50] Loss: 0.04476, inside_loss: 0.00024, outside_loss: 0.03578
Epoch 6: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.07039, Learning rate: 2.6500e-03]
Epoch [ 6/ 50] Loss: 0.04420, inside_loss: 0.00030, outside_loss: 0.03514
Epoch 7: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.06034, Learning rate: 1.9238e-03]
Epoch [ 7/ 50] Loss: 0.04362, inside_loss: 0.00033, outside_loss: 0.03424
Epoch 8: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.04348, Learning rate: 1.2687e-03]
Epoch [ 8/ 50] Loss: 0.04290, inside_loss: 0.00035, outside_loss: 0.03354
Epoch 9: 100%|██████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.02170, Learning rate: 7.4881e-04]
Epoch [ 9/ 50] Loss: 0.04186, inside_loss: 0.00037, outside_loss: 0.03279
Epoch 10: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.10 batch/s, Loss: 0.01545, Learning rate: 4.1502e-04]
Epoch [ 10/ 50] Loss: 0.04210, inside_loss: 0.00037, outside_loss: 0.03295
Epoch 11: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03982, Learning rate: 3.0000e-04]
Epoch [ 11/ 50] Loss: 0.04138, inside_loss: 0.00039, outside_loss: 0.03204
Epoch 12: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01589, Learning rate: 4.1502e-04]
Epoch [ 12/ 50] Loss: 0.04114, inside_loss: 0.00040, outside_loss: 0.03195
Epoch 13: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01341, Learning rate: 7.4881e-04]
Epoch [ 13/ 50] Loss: 0.04053, inside_loss: 0.00043, outside_loss: 0.03148
Epoch 14: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.04002, Learning rate: 1.2687e-03]
Epoch [ 14/ 50] Loss: 0.04083, inside_loss: 0.00045, outside_loss: 0.03143
Epoch 15: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.05732, Learning rate: 1.9238e-03]
Epoch [ 15/ 50] Loss: 0.03989, inside_loss: 0.00048, outside_loss: 0.03060
Epoch 16: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03485, Learning rate: 2.6500e-03]
Epoch [ 16/ 50] Loss: 0.03888, inside_loss: 0.00049, outside_loss: 0.02967
Epoch 17: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.00839, Learning rate: 3.3762e-03]
Epoch [ 17/ 50] Loss: 0.03790, inside_loss: 0.00050, outside_loss: 0.02880
Epoch 18: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.05629, Learning rate: 4.0313e-03]
Epoch [ 18/ 50] Loss: 0.03690, inside_loss: 0.00049, outside_loss: 0.02796
Epoch 19: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03838, Learning rate: 4.5512e-03]
Epoch [ 19/ 50] Loss: 0.03503, inside_loss: 0.00053, outside_loss: 0.02618
Epoch 20: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.02492, Learning rate: 4.8850e-03]
Epoch [ 20/ 50] Loss: 0.03455, inside_loss: 0.00052, outside_loss: 0.02578
Epoch 21: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.10 batch/s, Loss: 0.01577, Learning rate: 5.0000e-03]
Epoch [ 21/ 50] Loss: 0.03390, inside_loss: 0.00053, outside_loss: 0.02508
Epoch 22: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.06428, Learning rate: 4.8850e-03]
Epoch [ 22/ 50] Loss: 0.04081, inside_loss: 0.00037, outside_loss: 0.03154
Epoch 23: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03822, Learning rate: 4.5512e-03]
Epoch [ 23/ 50] Loss: 0.03822, inside_loss: 0.00047, outside_loss: 0.02887
Epoch 24: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.02705, Learning rate: 4.0313e-03]
Epoch [ 24/ 50] Loss: 0.03542, inside_loss: 0.00051, outside_loss: 0.02650
Epoch 25: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.04784, Learning rate: 3.3762e-03]
Epoch [ 25/ 50] Loss: 0.03452, inside_loss: 0.00051, outside_loss: 0.02588
Epoch 26: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.04727, Learning rate: 2.6500e-03]
Epoch [ 26/ 50] Loss: 0.03336, inside_loss: 0.00053, outside_loss: 0.02468
Epoch 27: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01179, Learning rate: 1.9238e-03]
Epoch [ 27/ 50] Loss: 0.03312, inside_loss: 0.00051, outside_loss: 0.02448
Epoch 28: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03676, Learning rate: 1.2687e-03]
Epoch [ 28/ 50] Loss: 0.03212, inside_loss: 0.00052, outside_loss: 0.02380
Epoch 29: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01638, Learning rate: 7.4881e-04]
Epoch [ 29/ 50] Loss: 0.03219, inside_loss: 0.00050, outside_loss: 0.02397
Epoch 30: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.03195, Learning rate: 4.1502e-04]
Epoch [ 30/ 50] Loss: 0.03215, inside_loss: 0.00052, outside_loss: 0.02380
Epoch 31: 100%|█████████████████████████████████████| 189/189 [01:51<00:00, 1.69 batch/s, Loss: 0.03081, Learning rate: 3.0000e-04]
Epoch [ 31/ 50] Loss: 0.03218, inside_loss: 0.00051, outside_loss: 0.02383
Epoch 32: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01386, Learning rate: 4.1502e-04]
Epoch [ 32/ 50] Loss: 0.03242, inside_loss: 0.00052, outside_loss: 0.02402
Epoch 33: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.09 batch/s, Loss: 0.01729, Learning rate: 7.4881e-04]
Epoch [ 33/ 50] Loss: 0.03212, inside_loss: 0.00051, outside_loss: 0.02388
Epoch 34: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.08 batch/s, Loss: 0.04065, Learning rate: 1.2687e-03]
Epoch [ 34/ 50] Loss: 0.03231, inside_loss: 0.00050, outside_loss: 0.02400
Epoch 35: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.08 batch/s, Loss: 0.03545, Learning rate: 1.9238e-03]
Epoch [ 35/ 50] Loss: 0.03242, inside_loss: 0.00052, outside_loss: 0.02412
Epoch 36: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.08 batch/s, Loss: 0.00677, Learning rate: 2.6500e-03]
Epoch [ 36/ 50] Loss: 0.03266, inside_loss: 0.00053, outside_loss: 0.02409
Epoch 37: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.08 batch/s, Loss: 0.03407, Learning rate: 3.3762e-03]
Epoch [ 37/ 50] Loss: 0.03228, inside_loss: 0.00052, outside_loss: 0.02394
Epoch 38: 100%|█████████████████████████████████████| 189/189 [01:30<00:00, 2.08 batch/s, Loss: 0.03742, Learning rate: 4.0313e-03]
Epoch [ 38/ 50] Loss: 0.03222, inside_loss: 0.00052, outside_loss: 0.02384
Epoch 39: 100%|██████████████████████████████████████| 189/189 [01:31<00:00, 2.07 batch/s, Loss: nan, Learning rate: 4.5512e-03]
Epoch [ 39/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 40: 100%|██████████████████████████████████████| 189/189 [01:31<00:00, 2.06 batch/s, Loss: nan, Learning rate: 4.8850e-03]
Epoch [ 40/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 41: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 5.0000e-03]
Epoch [ 41/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 42: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 4.8850e-03]
Epoch [ 42/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 43: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 4.5512e-03]
Epoch [ 43/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 44: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 4.0313e-03]
Epoch [ 44/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 45: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 3.3762e-03]
Epoch [ 45/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 46: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 2.6500e-03]
Epoch [ 46/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 47: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 1.9238e-03]
Epoch [ 47/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 48: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 1.2687e-03]
Epoch [ 48/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 49: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 7.4881e-04]
Epoch [ 49/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Epoch 50: 100%|██████████████████████████████████████| 189/189 [01:32<00:00, 2.05 batch/s, Loss: nan, Learning rate: 4.1502e-04]
Epoch [ 50/ 50] Loss: nan, inside_loss: nan, outside_loss: nan
Hi,

Here is the exact command I ran:
isonet.py refine datasets/001_HELA/tomograms.star --method isonet2 --arch unet-medium --cube_size 128 --epochs 50 --input_column rlnTomoName --CTF_mode None --bfactor 0 --noise_level 0 --mw_weight 200 --learning_rate 0.005 --gpuID 0,1,2,3,4,5,6,7 --ncpus 64 --output_dir datasets/001_HELA/isonet_refine --with_preview False
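In case it helps with debugging: this is not IsoNet code, just a minimal generic PyTorch sketch of the kind of guard rails (a finite-loss check plus gradient clipping) that can catch the first non-finite batch instead of silently training on NaNs for the remaining epochs. The tiny model, random data, and learning rate below are placeholders, not anything from the actual run.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for the real model and data (assumptions, not IsoNet's).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=5e-3)
x, y = torch.randn(64, 8), torch.randn(64, 1)

def train_step(x, y):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    # Fail fast on the first non-finite loss so the offending batch is known.
    if not torch.isfinite(loss):
        raise FloatingPointError(f"non-finite loss: {loss.item()}")
    loss.backward()
    # Clip gradients so one bad batch cannot blow up the weights.
    grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    return loss.item(), grad_norm.item()

for step in range(5):
    loss, gnorm = train_step(x, y)
    print(f"step {step}: loss={loss:.4f} grad_norm={gnorm:.4f}")
```

Wrapping a run in `torch.autograd.set_detect_anomaly(True)` can also report which backward op first produced the NaN, at the cost of slower training.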