Hi, thanks for this contribution.

As a small exercise, I am training SD2 on the Pokémon dataset. I precomputed the latents, and training starts fine on one GPU. However, at evaluation time I get the following error:
```
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/composer/trainer/trainer.py", line 2814, in _eval_loop
    self.state.outputs = self._original_model.eval_forward(self.state.batch)
File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 255, in eval_forward
    gen_images = self.generate(tokenized_prompts=prompts,
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/diffusion/models/stable_diffusion.py", line 464, in generate
    pred = self.unet(latent_model_input,
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/diffusers/models/unet_2d_condition.py", line 934, in forward
    sample = self.conv_in(sample)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
File "/fsx_vfx/users/csegalin/code/diffusion/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (162 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
```
When I try to train on a multi-GPU machine (setting FSDP back to true, uncommenting the last two lines of the config, and adjusting the batch size accordingly), I get this error:

```
ValueError: The world_size(2) > 1 but dataloader does not use DistributedSampler. This will cause all ranks to train on the same data, removing any benefit from multi-GPU training. To resolve this, create a Dataloader with DistributedSampler. For example, DataLoader(..., sampler=composer.utils.dist.get_sampler(...)). Alternatively, the process group can be instantiated with composer.utils.dist.instantiate_dist(...) and DistributedSampler can directly be created with DataLoader(..., sampler=DistributedSampler(...)). For more information, see https://pytorch.org/docs/stable/data.html#torch.utils.data.distributed.DistributedSampler.
```
I don't see a DistributedSampler being set in the LAION or COCO dataloader functions.
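For what it's worth, here is a minimal sketch of what the error message is asking for, using plain `torch.utils.data.DistributedSampler` with the replica count and rank passed explicitly (in a real multi-GPU run Composer's `composer.utils.dist.get_sampler(...)` would infer these from the process group; the toy `TensorDataset` below is just an assumed stand-in for the precomputed-latents dataset):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Stand-in for the real latents dataset: 10 dummy samples.
dataset = TensorDataset(torch.arange(10).float())

# Each rank sees a disjoint shard of the data. With shuffle=False and
# num_replicas=2, rank 0 gets the even indices and rank 1 the odd ones.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)

# Pass the sampler to the DataLoader instead of shuffle=True.
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

print(list(sampler))  # indices this rank will iterate over
```

In the repo's dataloader-building functions, the equivalent change would presumably be constructing the sampler (or calling `composer.utils.dist.get_sampler`) and passing it via the `sampler=` argument of the `DataLoader`.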
This is my configuration: