Hi @samleoqh

Thank you for releasing the source code; it helps me a lot.
During training, I ran into a memory problem.

The process consumes a lot of memory, over 150 GB of RAM. I think the problem is in the `validate` function, because you append all of the input/output data to `inputs_all`, `gts_all` and `predictions_all`:
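As a rough check of this hypothesis, here is a small estimator (a sketch; the function name and every shape/dtype below are my assumptions, not taken from the repo) of how many bytes the three lists hold after one `validate` call. It assumes batch size 1 (suggested by the `squeeze(0)` calls), float32 inputs kept with probability `save_rate`, and int64 `(H, W)` arrays for ground truths and predictions (`max(1)[1]` returns int64 indices; int64 labels are an assumption).

```python
def estimate_val_lists_bytes(num_images, channels, height, width,
                             save_rate, input_bytes=4, label_bytes=8):
    """Approximate size of inputs_all + gts_all + predictions_all.

    Hypothetical helper. Assumptions:
    - batch size 1, so each list entry holds a single image;
    - an input is kept with probability save_rate (otherwise None is stored);
    - gts and predictions are (H, W) arrays with label_bytes per pixel.
    """
    pixels = height * width
    inputs = save_rate * num_images * channels * pixels * input_bytes
    gts = num_images * pixels * label_bytes
    preds = num_images * pixels * label_bytes
    return inputs + gts + preds


# Hypothetical numbers: 1,000 validation tiles of 3x512x512, save_rate=0.1
print(estimate_val_lists_bytes(1000, 3, 512, 512, save_rate=0.1))  # 4508876800.0
```

So even with these made-up sizes, a single call holds about 4.5 GB, and the memory grows linearly with the number of validation images. The lists are only freed once nothing references them, and `validate` returns them to its caller.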
```python
# AverageMeter, train_args and update_ckpt are defined elsewhere in the repo.
def validate(net, val_set, val_loader, criterion, optimizer, epoch, new_ep):
    net.eval()
    val_loss = AverageMeter()
    # These three lists grow with every validation batch.
    inputs_all, gts_all, predictions_all = [], [], []
    with torch.no_grad():
        for vi, (inputs, gts) in enumerate(val_loader):
            inputs, gts = inputs.cuda(), gts.cuda()
            N = inputs.size(0) * inputs.size(2) * inputs.size(3)
            outputs = net(inputs)
            val_loss.update(criterion(outputs, gts).item(), N)
            # val_loss.update(criterion(gts, outputs).item(), N)
            # Each input image is kept in CPU memory with probability save_rate.
            if random.random() > train_args.save_rate:
                inputs_all.append(None)
            else:
                inputs_all.append(inputs.data.squeeze(0).cpu())
            # Every ground truth and every prediction map is kept in CPU memory.
            gts_all.append(gts.data.squeeze(0).cpu().numpy())
            predictions = outputs.data.max(1)[1].squeeze(1).squeeze(0).cpu().numpy()
            predictions_all.append(predictions)
    update_ckpt(net, optimizer, epoch, new_ep, val_loss,
                inputs_all, gts_all, predictions_all)
    net.train()
    return val_loss, inputs_all, gts_all, predictions_all
```
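One possible way to keep validation memory bounded (a sketch under assumptions, not the repo's code): instead of storing every input, label and prediction, accumulate a confusion matrix for the metrics and keep only a fixed number of randomly chosen samples for visualisation via reservoir sampling. `BoundedValStore`, `num_classes` and `max_samples` are hypothetical names, and this assumes `update_ckpt` could be changed to use the confusion matrix and the kept samples instead of the full lists, which I haven't checked.

```python
import random

import numpy as np


class BoundedValStore:
    """Hypothetical helper (not part of the repo) that bounds validation memory.

    - Metrics: a running num_classes x num_classes confusion matrix replaces
      the full lists of ground truths and predictions.
    - Visualisation: at most max_samples (input, gt, pred) triples are kept,
      chosen uniformly at random with reservoir sampling (Algorithm R).
    """

    def __init__(self, num_classes, max_samples=8, seed=0):
        self.num_classes = num_classes
        self.max_samples = max_samples
        self.conf = np.zeros((num_classes, num_classes), dtype=np.int64)
        self.samples = []  # never grows beyond max_samples
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, inp, gt, pred):
        gt = np.asarray(gt)
        pred = np.asarray(pred)

        # Confusion matrix: rows = ground truth, columns = prediction.
        # Ground-truth labels outside [0, num_classes) (e.g. an ignore index)
        # are skipped; predictions are assumed to be class indices in range.
        g = gt.ravel().astype(np.int64)
        p = pred.ravel().astype(np.int64)
        valid = (g >= 0) & (g < self.num_classes)
        idx = self.num_classes * g[valid] + p[valid]
        self.conf += np.bincount(
            idx, minlength=self.num_classes ** 2
        ).reshape(self.num_classes, self.num_classes)

        # Reservoir sampling: after n updates, each one is kept w.p. max_samples / n.
        self.seen += 1
        item = (np.array(inp), gt.copy(), pred.copy())
        if len(self.samples) < self.max_samples:
            self.samples.append(item)
        else:
            j = self.rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.max_samples:
                self.samples[j] = item

    def mean_iou(self):
        # IoU per class = TP / (TP + FP + FN); a class that never appears in
        # either gt or pred gives NaN and is ignored by nanmean.
        tp = np.diag(self.conf).astype(np.float64)
        denom = self.conf.sum(axis=0) + self.conf.sum(axis=1) - tp
        with np.errstate(divide="ignore", invalid="ignore"):
            iou = tp / denom
        return float(np.nanmean(iou))
```

Inside the loop, something like `store.update(inputs.squeeze(0).cpu().numpy(), gts.squeeze(0).cpu().numpy(), outputs.argmax(1).squeeze(0).cpu().numpy())` would replace the three `append` calls, so memory no longer depends on the size of the validation set.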