
ValueError: number sections must be larger than 0 #358

Open · SeolhwaLee opened this issue Feb 15, 2022 · 15 comments
Labels
bug Something isn't working


@SeolhwaLee

Hi,

I tried to apply Opacus to my model, but I got the error below when I ran my code.

Traceback (most recent call last):
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 772, in array_split
    Nsections = len(indices_or_sections) + 1
TypeError: object of type 'int' has no len()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main_dp.py", line 462, in <module>
    main(args)
  File "main_dp.py", line 166, in main
    for step, batch in enumerate(tqdm(memory_safe_data_loader)):
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 512, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/opacus/utils/batch_memory_manager.py", line 48, in __iter__
    batch_idxs, math.ceil(len(batch_idxs) / self.max_batch_size)
  File "<__array_function__ internals>", line 6, in array_split
  File "/home/seol/miniconda3/envs/pytorch_p37/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 778, in array_split
    raise ValueError('number sections must be larger than 0.')
ValueError: number sections must be larger than 0.
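For context, the failing call can be reproduced in isolation: with Poisson sampling a logical batch can be empty, and `np.array_split` refuses zero sections. A minimal sketch (my own repro, not Opacus code):

```python
import math

import numpy as np

# An empty logical batch, which Poisson sampling can legitimately produce.
batch_idxs = []
max_batch_size = 32

# Mirrors the expression in batch_memory_manager.py's __iter__:
n_sections = math.ceil(len(batch_idxs) / max_batch_size)  # == 0

try:
    np.array_split(np.array(batch_idxs), n_sections)
except ValueError as err:
    print(err)  # number sections must be larger than 0.
```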

One hack worked without error: I added the code below in opacus/utils/batch_memory_manager.py to skip empty batches.

def __iter__(self):
    for batch_idxs in self.samplers:
        if not bool(batch_idxs):
            continue

However, the generated output was quite strange, so I suspect this hack may be affecting model performance.

Could you give me feedback on whether this hack is okay?

Thank you in advance!

P.S. I used this tutorial https://opacus.ai/tutorials/building_text_classifier (But different dataset and model)
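For reference, the split that the memory manager performs, with the empty-batch guard from the hack above folded in, can be sketched in plain Python (a standalone sketch of the idea, not the actual Opacus implementation):

```python
import math

import numpy as np

def split_logical_batch(batch_idxs, max_batch_size):
    """Split one logical batch into physical batches of at most
    max_batch_size indices, skipping empty logical batches entirely."""
    if not batch_idxs:
        # Guard: np.array_split(x, 0) raises ValueError.
        return []
    n_sections = math.ceil(len(batch_idxs) / max_batch_size)
    return [s.tolist() for s in np.array_split(np.asarray(batch_idxs), n_sections)]

print(split_logical_batch([0, 1, 2, 3, 4], 2))  # [[0, 1], [2, 3], [4]]
print(split_logical_batch([], 2))               # []
```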

@karthikprasad
Contributor

Hello! Thanks for raising the issue. I'll take a look and get back to you on this.

@karthikprasad
Contributor

Hi @SeolhwaLee. Are you using a different sampler that is resulting in an empty batch? If yes, is that intentional?

@SeolhwaLee
Author

@karthikprasad I used RandomSampler like this.

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(
    dataset=train_dataset,
    sampler=train_sampler,
    collate_fn=train_dataset.collate_fn,
    batch_size=args.batch_size,
)

The empty batch is not intentional. Maybe this code needs some logic for handling empty batches?

@karthikprasad
Contributor

karthikprasad commented Feb 21, 2022

Hi @SeolhwaLee ,
The tutorial you linked does not use RandomSampler. When you call make_private() and pass your data loader, Opacus internally replaces it with a DPDataLoader, which uses a uniform batch sampler (Poisson sampling) to sample according to the sampled Gaussian mechanism (SGM).
Also note that DPDataLoader automatically handles empty batches.

Since you are explicitly using RandomSampler, I suspect you are trying something different? Could you share your code notebook and some context?
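To see why empty batches are expected under this kind of sampling, here is a toy model of Poisson batch sampling (my own sketch, not Opacus' actual sampler implementation):

```python
import random

def poisson_batches(n_examples, sample_rate, n_batches, seed=0):
    """Each example joins each batch independently with probability
    sample_rate, so a batch can legitimately come out empty."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        yield [i for i in range(n_examples) if rng.random() < sample_rate]

batches = list(poisson_batches(n_examples=100, sample_rate=0.005, n_batches=50))
sizes = [len(b) for b in batches]
# With these settings, a batch is empty with probability ~0.6,
# so empty batches show up almost immediately.
print(min(sizes))  # 0
```

Any downstream consumer of such batches must therefore either skip empty ones or handle them, which is what DPDataLoader does.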

@ashkan-software
Contributor

Hello @SeolhwaLee,

Has your issue been resolved?

@SeolhwaLee
Author

Hi @ashkan-software

I haven't had time to investigate this recently, but I will get to it as soon as possible.

I will comment here if I make any progress.

@ffuuugor ffuuugor added the bug Something isn't working label Mar 11, 2022
@lucacorbucci

Hi, I'm having the same issue. Have you found a way to solve it @SeolhwaLee?

@karthikprasad
Contributor

@lucacorbucci , would you mind sharing your fully reproducible code on a colab? That would help us tremendously to debug the issue.

@lucacorbucci

Hi @karthikprasad, I will try, but I don't know if it is possible to reproduce it on Colab. I am running a federated learning algorithm: several clients each train a model, and each of them uses Opacus.
The repo with the code I've written is not public yet, but I can share it with you when I publish it, together with a script to reproduce the bug.

@anirban-nath

Hi @lucacorbucci I am using Opacus in the same way as you are. My question is: do you feed the client models to the make_private function after each round of federation or only once at the start of the whole process?

I am trying to use the make_private_with_epsilon function and am worried about the privacy accounting process. I have to specify the number of epochs there and I am wondering if I should give epochs = federation rounds * local epochs or epochs = local epochs.

@lucacorbucci

I am trying to use the make_private_with_epsilon function and am worried about the privacy accounting process. I have to specify the number of epochs there and I am wondering if I should give epochs = federation rounds * local epochs or epochs = local epochs.

I also had this doubt in the past, and I don't know if there is a definitive answer. In a cross-device scenario with millions of nodes, I'd go with epochs = local epochs, because you don't know in advance how many federation rounds you will perform. In a cross-silo scenario, on the other hand, it may be possible to know the number of federation rounds in advance; in that case you could use epochs = federation rounds * local epochs.
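To make the accounting difference concrete, with made-up numbers (the values here are purely illustrative):

```python
# Hypothetical cross-silo setup where rounds are known in advance.
federation_rounds = 10
local_epochs = 3

# Each client passes over its local data local_epochs times per round,
# so over the whole federation the accountant should be told about
# every pass, not just the passes within one round:
total_epochs = federation_rounds * local_epochs
print(total_epochs)  # 30
```

With epochs = local epochs you would instead be accounting for only 3 passes, which understates the privacy spent across the full run.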

Have you considered using make_private and passing the noise multiplier as a parameter? In that case you don't need to specify the number of epochs in advance.

@anirban-nath

Have you considered the use of the function make_private passing the noise as a parameter? In this case, you don't need to specify the number of epochs in advance

I have. I am actually working on this as part of a research project, so in my case it makes more sense to fix the value of epsilon and check model performance at those values. I have not yet experimented with either case, though, because of this issue: there is a BatchNorm function in my code whose per-sample gradients are not being populated, so I am getting a "Per sample gradient is not initialized. Not updated in backward pass?" error.

I know for a fact that the BatchNorm layer is being used, because its gradient is being populated; for some reason its per-sample gradient is not. Any ideas about this?

@lucacorbucci

BatchNorm function in my code whose per_sample gradients are not being populated so I am getting a "Per sample gradient is not initialized. Not updated in backward pass?" error.

BatchNorm is not DP friendly, have you used the ModuleValidator? https://opacus.ai/tutorials/guide_to_module_validator
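For illustration, what ModuleValidator.fix effectively does for BatchNorm layers (replacing them with GroupNorm, which normalizes each sample independently and therefore admits per-sample gradients) can be sketched with plain PyTorch. This is my own simplified stand-in, not the real Opacus implementation:

```python
import torch.nn as nn

def replace_batchnorm(module: nn.Module) -> None:
    """Recursively swap BatchNorm2d layers for GroupNorm. BatchNorm mixes
    statistics across samples in a batch, which breaks per-sample
    gradient computation; GroupNorm does not."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(
                module,
                name,
                nn.GroupNorm(num_groups=min(32, child.num_features),
                             num_channels=child.num_features),
            )
        else:
            replace_batchnorm(child)

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
replace_batchnorm(model)
print(any(isinstance(m, nn.BatchNorm2d) for m in model.modules()))  # False
```

In practice you should just call ModuleValidator.fix(model), which handles more layer types than this sketch does.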

@anirban-nath

BatchNorm is not DP friendly, have you used the ModuleValidator? https://opacus.ai/tutorials/guide_to_module_validator

Absolutely, I used ModuleValidator before anything else. Even after that, this one particular LayerNorm is causing me issues. What I don't understand is under what circumstances it can happen that a layer's grad is populated but its per_sample grad is not. I opened an issue about it a few minutes ago.

@lucacorbucci

Hi, I've seen the issue you opened, but I don't have a solution.
A possible workaround could be to use Opacus with functorch. I'm not 100% sure it will work, but here they said: "With functorch, Opacus can now handle almost all input models, removing previous limitation where we could only handle certain standard layers."
I think it is worth a try.
