Implement cached inference #409
Conversation
Wow!
@borzunov This is tremendous Alexander! You made it fit with so many of the other features in the repository too. Truly a wizard!
Thank you :)
This is awesome but I'm having issues loading old models. It's caused by the extra layers from the cached classes.
@EmaadKhwaja That's true, sorry for the inconvenience. You can use code like this to load older checkpoints:

```python
import torch
from collections import OrderedDict

# Rename the attention parameters to match the new (wrapped) module hierarchy
state_dict = torch.load('model_state.pt')
state_dict = OrderedDict([(key.replace('to_qkv', 'fn.fn.to_qkv').replace('to_out', 'fn.fn.to_out'), value)
                          for key, value in state_dict.items()])
print(model.load_state_dict(state_dict))
```

Other mismatches (if any) can be fixed in the same way.
Note: This PR is based on the branch from #408, and the diff shows changes from both PRs by default. See this diff to keep only the changes related to cached inference.
Description
This PR optimizes inference, reducing its time complexity from O(n^3) to O(n^2), where n is the sequence length.
When someone runs generate_images(..., use_cache=True):

- The text is processed as before, but the outputs of all Attention and PreShiftToken layers are cached.
- The image is generated token-by-token. For each token, we only pass the last token through the network (as if seq_len = 1). The layers involving token interaction (Attention and PreShiftToken) look up the necessary info (e.g., the keys and values of previous tokens) in the cache. A simplified sketch of this key/value caching is shown below.

This implementation should be more efficient than ai-forever/ru-dalle#12 since it doesn't concat the outputs of FFN layers and doesn't create excess tensors with a seq_len dimension.
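For readers less familiar with this trick, here is a minimal sketch of key/value caching in a single-head causal self-attention layer. It is written from scratch for illustration and is not the repository's Attention class; the class name, cache layout, and the toy usage at the bottom are all assumptions. It shows why each generated token costs O(n) instead of O(n^2): the new token's query only attends to the keys and values accumulated in the cache.

```python
import torch
from torch import nn

class CachedCausalAttention(nn.Module):
    """Single-head causal self-attention with a key/value cache (illustrative only)."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, cache=None):
        # x: (batch, seq_len, dim); during cached generation seq_len == 1
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        if cache is not None:
            if 'k' in cache:                          # reuse keys/values of all previous tokens
                k = torch.cat([cache['k'], k], dim=1)
                v = torch.cat([cache['v'], v], dim=1)
            cache['k'], cache['v'] = k, v             # remember them for the next step
        attn = q @ k.transpose(-2, -1) * self.scale
        if q.shape[1] == k.shape[1]:                  # full (prompt) pass: apply the causal mask
            mask = torch.triu(torch.ones(q.shape[1], k.shape[1], device=x.device), 1).bool()
            attn = attn.masked_fill(mask, float('-inf'))
        return self.to_out(attn.softmax(dim=-1) @ v)

# Toy usage: process the "prompt" once, then feed one new token at a time.
layer, cache = CachedCausalAttention(dim=64), {}
out = layer(torch.randn(1, 10, 64), cache)   # fills the cache (seq_len = 10)
out = layer(torch.randn(1, 1, 64), cache)    # new token: O(n) work instead of O(n^2)
```

The PR does the analogous bookkeeping for every Attention and PreShiftToken layer when generate_images(..., use_cache=True) is called.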
Supporting sparse attention

It's cumbersome to implement caching in sparse attention layers efficiently, so they are not cached by default. If you'd like to use cached inference with sparse attention layers, you can load their weights into a full attention layer and use an appropriate mask to simulate the sparse layer (a toy example of such a mask is sketched below).

The sparse implementation (like SparseAxialCausalAttention(...)) is faster for training, but the masked implementation (like Attention(..., static_mask=...)) is much faster for inference (thanks to caching). To enable the masked implementation, create the model as DALLE(..., optimize_for_inference=True).
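To make the "appropriate mask" idea concrete, here is a small illustrative sketch. The helper below is not part of the repository (its name and the exact sparsity pattern are assumptions); it just builds a boolean mask that restricts a dense causal attention layer to one axis of the image-token grid, which is the kind of static mask that can stand in for an axial sparse layer.

```python
import torch

def axial_causal_static_mask(grid_size: int, axis: int) -> torch.Tensor:
    """Boolean mask (True = may attend) restricting dense causal attention over a
    grid_size x grid_size image-token grid to one axis: axis=0 keeps attention
    within each row, axis=1 within each column. Illustrative only."""
    n = grid_size * grid_size
    idx = torch.arange(n)
    rows, cols = idx // grid_size, idx % grid_size
    line = rows if axis == 0 else cols
    same_line = line.unsqueeze(1) == line.unsqueeze(0)   # pairs on the same row/column
    causal = idx.unsqueeze(1) >= idx.unsqueeze(0)        # attend only to earlier (or same) positions
    return same_line & causal

mask = axial_causal_static_mask(grid_size=32, axis=1)    # 32x32 grid -> (1024, 1024) mask
print(mask.shape)
```

A mask like this could then be supplied to a dense layer in the spirit of Attention(..., static_mask=...) while loading the weights trained with the sparse layer.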
Experiments
In the "Training Transformers Together" demo, we train a model created as DALLE(..., optimize_for_inference=False) (see details on the model config in Add options for weight sharing #408). In our Colab for inference, we load its latest checkpoint into a model created as DALLE(..., optimize_for_inference=True) (actually, there's an older param name, but it doesn't matter).

Next, running generate_images(..., use_cache=True) gives a >9x speed-up compared to the inference time without the code from this PR. The results seem to be equivalent.
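For orientation, here is a rough, hedged sketch of how such a comparison could be reproduced. Everything here is a placeholder rather than the demo's actual setup: the DiscreteVAE stand-in, the model config, and the commented-out checkpoint path are illustrative, and the constructor arguments follow the dalle_pytorch README, so they may need adjusting for your installed version. Only optimize_for_inference and use_cache come from this PR.

```python
import time

import torch
from dalle_pytorch import DALLE, DiscreteVAE

# Placeholder VAE and model config -- NOT the configuration used in the demo
vae = DiscreteVAE(image_size=256, num_layers=3, num_tokens=8192,
                  codebook_dim=512, hidden_dim=64)
dalle = DALLE(dim=512, vae=vae, num_text_tokens=10000, text_seq_len=128,
              depth=4, heads=8, optimize_for_inference=True)
# dalle.load_state_dict(torch.load('checkpoint.pt'))  # hypothetical checkpoint path

text = torch.randint(1, 10000, (1, 128))  # a fake batch of tokenized captions

for use_cache in (False, True):
    start = time.time()
    with torch.no_grad():
        dalle.generate_images(text, use_cache=use_cache)
    print(f'use_cache={use_cache}: {time.time() - start:.1f} s')
```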