logfile is not showing any runs for the test set. The plots also don't show anything for test set and accuracy. #4

Open
saharudra opened this issue Oct 5, 2018 · 2 comments

Comments

@saharudra

When I run the code, I get the following output:

(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ python                          
Python 3.6.6 (default, Jun 28 2018, 00:00:00)                                         
[GCC 4.8.4] on linux                                             
Type "help", "copyright", "credits" or "license" for more information.                 
>>> import torch                                                   
>>> exit()                                                                     
(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ pyton -m train --clevr-dir /data/DATASETS/CLEVR_v1.0/ --model 'original-fp' | tee logfile.log
No command 'pyton' found, did you mean:                           
 Command 'python' from package 'python-minimal' (main)                                                                                                                                                             
 Command 'pytone' from package 'pytone' (universe)                    
pyton: command not found                                           
(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ python -m train --clevr-dir /data/DATASETS/CLEVR_v1.0/ --model 'original-fp' | tee logfile.log                                                             
TRAIN:   0%|          | 0/350 [00:00<?, ?it/s]
Loaded hyperparameters from configuration config.json, model: original-fp: {'state_description': False, 'g_layers': [256, 256, 256, 256], 'question_injection_position': 0, 'f_fc1': 256, 'f_fc2': 256, 'dropout': 0.5, 'lstm_hidden': 128, 'lstm_word_emb': 32, 'rl_in_size': 52}
Building word dictionaries from all the words in the dataset...                                   
==> using cached dictionaries: /data/DATASETS/CLEVR_v1.0/questions/CLEVR_built_dictionaries.pkl
Word dictionary completed!                                                                                                                                                                                         
Initializing CLEVR dataset...
==> using cached questions: /data/DATASETS/CLEVR_v1.0/questions/CLEVR_train_questions.pkl
==> using cached questions: /data/DATASETS/CLEVR_v1.0/questions/CLEVR_val_questions.pkl
CLEVR dataset initialized!
Supposing original DeepMind model
Training (350 epochs) is starting...
Dataset reinitialized with batch size 640
Current learning rate: 1e-05
███▊| 1093/1094 [11:21:28<00:37, 37.41s/it, loss=1.92]
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/Rudra/RelationNetworks-CLEVR/train.py", line 418, in <module>
    main(args)
  File "/data/Rudra/RelationNetworks-CLEVR/train.py", line 356, in main
    train(clevr_train_loader, model, optimizer, epoch, args)
  File "/data/Rudra/RelationNetworks-CLEVR/train.py", line 40, in train
    output = model(img, qst)
  File "/data/Rudra/virtualenvs/rn_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/Rudra/RelationNetworks-CLEVR/model.py", line 200, in forward
    x = torch.cat([x, self.coord_tensor], 1)    # (B x 24+2 x 8*8)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 469 and 640 in dimension 0 at /pytorch/torch/lib/TH/generic/THTensorMath.c:2897
Train Epoch: 1 [0/700160 (0%)] Train loss: 39.945804595947266
Train Epoch: 1 [6400/700160 (1%)] Train loss: 36.57775611877442
Train Epoch: 1 [12800/700160 (2%)] Train loss: 29.848896408081053
Train Epoch: 1 [19200/700160 (3%)] Train loss: 24.984291648864748
Train Epoch: 1 [25600/700160 (4%)] Train loss: 20.945134353637695
.
.
.
Train Epoch: 1 [684800/700160 (98%)] Train loss: 1.8508247494697572
Train Epoch: 1 [691200/700160 (99%)] Train loss: 1.8768051743507386
Train Epoch: 1 [697600/700160 (100%)] Train loss: 1.8581566572189332

(rn_env) exx@ubuntu:/data/Rudra/RelationNetworks-CLEVR$ 

I have also attached my logfile. When I run the plot function, every plot except the training loss is empty. Please let me know where the issue might be. Thanks.

logfile.log
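
The numbers in the traceback above can be cross-checked with a little arithmetic. The sketch below is only a reading of the log, not something confirmed in this thread: it assumes that the `469` in the `RuntimeError` is the size of the final, partial batch and that the `640` is a tensor built for a full batch. The helper `last_batch_size` is a hypothetical name introduced for illustration.

```python
def last_batch_size(num_samples, batch_size):
    """Size of the final batch when the loader keeps a partial last batch."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


batch_size = 640     # "Dataset reinitialized with batch size 640"
num_batches = 1094   # total shown in the progress bar (1093/1094)
short_batch = 469    # first size reported by the RuntimeError (assumed: last batch)

# Sample count implied by 1093 full batches plus one batch of 469.
implied_samples = (num_batches - 1) * batch_size + short_batch
print(implied_samples)                               # 699989
print(last_batch_size(implied_samples, batch_size))  # 469
print(num_batches * batch_size)                      # 700160, the total printed in the "Train Epoch" lines
```

Under that reading, the crash happens on the very last training iteration of epoch 1, before any test pass could run, which would be consistent with the log and plots containing only training loss.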

@mesnico
Owner

mesnico commented Oct 9, 2018

Hi @saharudra, this is probably caused by a batch-handling problem in the multi-GPU setup.
You should be able to run the code by simply removing the following condition (the entire line):

if self.coord_tensor is None or torch.cuda.device_count() == 1:

This is not the most efficient solution; however, if that is the problem, I will fix it permanently as soon as possible using a better approach.
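
One possible shape for such an approach is to rebuild the coordinate tensor only when the incoming batch size changes, instead of on every forward pass. This is a minimal sketch, not the repository's code: `build_coords`, `append_coords`, the dict-based cache, the 8x8 grid, and the [-1, 1] coordinate range are all assumptions made for illustration. Only the shapes `(B x 24 x 8*8)` and `(B x 24+2 x 8*8)` come from the comment in the traceback.

```python
import torch


def build_coords(batch_size, side=8):
    """Return a (batch_size, 2, side*side) tensor of grid coordinates (hypothetical layout)."""
    axis = torch.linspace(-1.0, 1.0, side)
    xs = axis.repeat(side)               # x varies fastest, shape (side*side,)
    ys = axis.repeat_interleave(side)    # y varies slowest, shape (side*side,)
    grid = torch.stack([xs, ys], dim=0)  # (2, side*side)
    # expand() creates a broadcast view, so no per-sample copy is made here.
    return grid.unsqueeze(0).expand(batch_size, -1, -1)


def append_coords(x, cache, side=8):
    """Concatenate coordinates to x of shape (B, C, side*side) along dimension 1."""
    batch = x.size(0)
    coords = cache.get("coords")
    # Rebuild only when nothing is cached or the batch size differs (e.g. the last batch).
    if coords is None or coords.size(0) != batch:
        coords = build_coords(batch, side).to(device=x.device, dtype=x.dtype)
        cache["coords"] = coords
    return torch.cat([x, coords], 1)


cache = {}
full = append_coords(torch.zeros(640, 24, 64), cache)  # full batch
last = append_coords(torch.zeros(469, 24, 64), cache)  # smaller final batch
print(full.shape, last.shape)
```

Keying the cache on the batch size keeps the common case cheap while still handling a short final batch. Passing `drop_last=True` to `torch.utils.data.DataLoader` would be another way to avoid the mismatch, at the cost of skipping the leftover samples each epoch.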
Thanks!

@saharudra
Author

saharudra commented Oct 9, 2018

Hi @mesnico, I will give this a try and let you know the outcome here. Thanks!
