Allow masking padding tokens in cross attention layers #94
This PR adds a new parameter, `mask_pad_tokens`, to the `stable_diffusion_xl` and `stable_diffusion_2` classes that allows masking out padding tokens in the cross attention layers. The `generate()` function had to get a bit more complicated because of the case where we pass in pre-tokenized inputs: we now want to allow passing the padding mask along with them (and likewise for pre-tokenized negative prompts). Let me know if you think of a better way of handling this :/
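For context, here is a minimal, self-contained sketch of the masking behavior itself (not the code in this PR): the tokenizer's padding mask is applied to the cross attention scores so padded token positions get zero attention weight. All shapes and variable names below are illustrative.

```python
# Illustrative sketch (not this PR's code) of masking padding tokens in a
# cross attention layer: padded text positions get -inf scores, hence zero
# attention weight after the softmax.
import torch
import torch.nn.functional as F

batch, img_tokens, txt_tokens, dim = 2, 16, 8, 32
image_hidden = torch.randn(batch, img_tokens, dim)  # queries: latent image features
text_hidden = torch.randn(batch, txt_tokens, dim)   # keys/values: text embeddings

# 1 for real tokens, 0 for padding (what the tokenizer's attention_mask provides).
pad_mask = torch.tensor([[1, 1, 1, 0, 0, 0, 0, 0],
                         [1, 1, 1, 1, 1, 1, 0, 0]])

scores = image_hidden @ text_hidden.transpose(-1, -2) / dim ** 0.5  # (batch, img, txt)
scores = scores.masked_fill(~pad_mask.bool().unsqueeze(1), float("-inf"))
attn = F.softmax(scores, dim=-1)
out = attn @ text_hidden

assert attn[0, :, 3:].sum() == 0  # no attention weight lands on padding tokens
```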
One small note: this change might be slightly redundant with the `zero_out_negative_prompt` arg (in `generate()`) and `zero_dropped_captions` (in the dataloader) that I added not long ago for zeroing out empty negative prompts and dropped captions. I think `mask_pad_tokens` ought to serve a similar purpose by masking out the empty text embeddings in the cross attention layers; however, `zero_out_negative_prompt`/`zero_dropped_captions` additionally zero out the pooled text embedding (used in SDXL as micro-conditioning), and I think we still want to keep that functionality. The cleanest way would probably be to merge these all into one flag?
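To make the overlap concrete, here is a hedged sketch of the two behaviors being compared; the variable names and the "dropped" check are illustrative, not the repo's actual code. `mask_pad_tokens` only hides padding positions inside cross attention, while the zero-out flags null the per-token embeddings and the pooled embedding used for SDXL micro-conditioning.

```python
# Illustrative sketch only (not the repo's code): contrast between masking
# padding tokens in cross attention and zeroing out dropped/empty captions.
import torch

batch, txt_tokens, dim, pooled_dim = 2, 8, 32, 16
text_embeds = torch.randn(batch, txt_tokens, dim)  # per-token text embeddings
pooled_embeds = torch.randn(batch, pooled_dim)     # pooled embedding (SDXL micro-conditioning)

# Suppose the second caption in the batch was dropped / is an empty negative prompt.
dropped = torch.tensor([False, True])

# mask_pad_tokens: embeddings are left untouched; cross attention simply never
# attends to padding positions (see the attention sketch above).

# zero_out_negative_prompt / zero_dropped_captions: the text embeddings AND the
# pooled embedding are zeroed for dropped captions.
keep = (~dropped).float()
text_embeds_zeroed = text_embeds * keep.view(-1, 1, 1)
pooled_embeds_zeroed = pooled_embeds * keep.view(-1, 1)
```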