
some bugs #35

Open
xiaozhenxu opened this issue Jan 10, 2022 · 3 comments

Comments

@xiaozhenxu

Hi, Professors. There are some bugs when I run the code. Can you explain what they mean and how to solve them? Thank you very much!
Using gpu: 0
Files already downloaded and verified
Files already downloaded and verified
Order name:./logs/cifar100_nfg50_ncls10_nproto20_mnemonics/seed_1993_cifar100_order_run_0.pkl
Generating orders
pickle into ./logs/cifar100_nfg50_ncls10_nproto20_mnemonics/seed_1993_cifar100_order_run_0.pkl
[68, 56, 78, 8, 23, 84, 90, 65, 74, 76, 40, 89, 3, 92, 55, 9, 26, 80, 43, 38, 58, 70, 77, 1, 85, 19, 17, 50, 28, 53, 13, 81, 45, 82, 6, 59, 83, 16, 15, 44, 91, 41, 72, 60, 79, 52, 20, 10, 31, 54, 37, 95, 14, 71, 96, 98, 97, 2, 64, 66, 42, 22, 35, 86, 24, 34, 87, 21, 99, 0, 88, 27, 18, 94, 11, 12, 47, 25, 30, 46, 62, 69, 36, 61, 7, 63, 75, 5, 32, 4, 51, 48, 73, 93, 39, 67, 29, 49, 57, 33]
Out_features: 50
Batch of classes number 5 arrives
Max and min of train labels: 0, 49
Max and min of valid labels: 0, 49
Checkpoint name: ./logs/cifar100_nfg50_ncls10_nproto20_mnemonics/run_0_iteration_4_model.pth
Incremental train

Epoch: 0, LR: [0.1]
/opt/conda/conda-bld/pytorch_1603729138878/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
[... the same assertion is repeated for threads [4,0,0] through [31,0,0] ...]
Traceback (most recent call last):
File "main.py", line 78, in
trainer.train()
File "/data1/22160073/project/incremental learning/class-incremental-learning-main/mnemonics-training/1_train/trainer/mnemonics.py", line 237, in train
tg_model = incremental_train_and_eval(self.args.epochs, tg_model, ref_model, free_model, ref_free_model, tg_optimizer, tg_lr_scheduler, trainloader, testloader, iteration, start_iter, cur_lamda, self.args.dist, self.args.K, self.args.lw_mr)
File "/data1/22160073/project/incremental learning/class-incremental-learning-main/mnemonics-training/1_train/trainer/incremental.py", line 44, in incremental_train_and_eval
loss.backward()
File "/data1/22160073/anaconda3/envs/xxz/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/data1/22160073/anaconda3/envs/xxz/lib/python3.7/site-packages/torch/autograd/init.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered

@yaoyao-liu
Owner

Thank you for your interest in our work.

Since this is an error caused by CUDA, I don't know how to fix it.
May I know if you are using the same versions of Python and PyTorch as we are?
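
For reference, a quick way to check the versions in your current environment (a generic snippet, not part of this repository; the environment used by the authors is torch 0.4.0 with python 3.6.8, as noted in a later comment):

```python
# Print the Python and PyTorch versions of the active environment.
import sys
import torch

print(sys.version)        # the authors used Python 3.6.8
print(torch.__version__)  # the authors used torch 0.4.0
```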

@xiaozhenxu
Author

Thanks for your reply. I will try to use the same versions of Python and PyTorch as yours.

@EnnengYang

EnnengYang commented Mar 1, 2023

I had the same problem with a newer version of PyTorch. There are two solutions:

  • First, use the authors' older PyTorch environment (i.e. torch==0.4.0, python==3.6.8), i.e. the "same version of python and pytorch" mentioned above.
  • Second, modify the code to accommodate a newer PyTorch environment (torch==1.10.0+cu113 in my test environment), as described below.

Modify lines 62 to 65 in baseline.py:

```python
import torch
torch_version = torch.__version__

if '0.4.0' in str(torch_version):
    # Older torchvision datasets (used with torch 0.4.0) expose
    # train_data / train_labels and test_data / test_labels.
    X_train_total = np.array(self.trainset.train_data)
    Y_train_total = np.array(self.trainset.train_labels)
    X_valid_total = np.array(self.testset.test_data)
    Y_valid_total = np.array(self.testset.test_labels)
else:
    # Newer torchvision datasets expose data / targets instead.
    X_train_total = np.array(self.trainset.data)
    Y_train_total = np.array(self.trainset.targets)
    X_valid_total = np.array(self.testset.data)
    Y_valid_total = np.array(self.testset.targets)
```

Modify lines 188 to 189 in baseline.py:

```python
if '0.4.0' in str(torch_version):
    self.trainset.train_data = X_train.astype('uint8')
    self.trainset.train_labels = map_Y_train
else:
    self.trainset.data = X_train.astype('uint8')
    # Store the *mapped* labels (map_Y_train), not the original class IDs.
    self.trainset.targets = map_Y_train
```


It is important to store the mapped labels in self.trainset.targets; otherwise it would store the original class IDs, and the targets would go out of bounds when nn.CrossEntropyLoss is computed in incremental.py, i.e. the error above (see the illustration below).
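
As a concrete illustration of why the mapping matters, here is a minimal, hypothetical sketch (not the repository's code): nn.CrossEntropyLoss requires every target to lie in [0, n_classes), so the original CIFAR-100 class IDs have to be remapped to their positions in the class order before being stored.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical example: `order` stands in for the class order printed in the
# log above, and Y_train holds original CIFAR-100 class IDs from those classes.
order = [68, 56, 78, 8, 23]
Y_train = np.array([68, 8, 23, 78])

# Map each original class ID to its position in `order`, i.e. into [0, n_classes).
map_Y_train = np.array([order.index(y) for y in Y_train])  # -> [0, 3, 4, 2]

n_classes = len(order)
logits = torch.randn(len(map_Y_train), n_classes)
loss = nn.CrossEntropyLoss()(logits, torch.from_numpy(map_Y_train))  # OK

# Feeding the raw class IDs instead (e.g. 68 with n_classes == 5) violates the
# kernel assertion `t >= 0 && t < n_classes` and triggers the device-side
# assert shown in the log above (or an "out of bounds" IndexError on CPU).
```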


Finally, you need to modify the corresponding test-set loading and evaluation code in baseline.py in the same way (and make similar modifications in mnemonics.py); a sketch follows below.
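
A sketch of the analogous test-set change, in the same style as the snippets above (X_valid and map_Y_valid are placeholder names; the exact variable names and line numbers in baseline.py may differ):

```python
# Placeholder fragment: X_valid holds the test images, map_Y_valid the
# remapped test labels (positions in the class order, not original class IDs).
if '0.4.0' in str(torch_version):
    self.testset.test_data = X_valid.astype('uint8')
    self.testset.test_labels = map_Y_valid
else:
    self.testset.data = X_valid.astype('uint8')
    self.testset.targets = map_Y_valid
```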
