Change get_block_index func to non-blocking version #805
+3 −3
During my attempts to improve efficiency in a multi-GPU setup with Ampere + Turing cards, I found that disabling paged attention increases PP speed roughly 3x (non-TP mode, Mistral Large, 8.0bpw). For the past two days I profiled the `prefill()` code, trying to understand what actually slows paged attention down. In the end the problem appears to be in the `get_block_index` func, which was probably meant to be made non-blocking as well in commit 843cec5, but for some reason it wasn't. After making it non-blocking, the profiler's picture for paged and non-paged becomes identical.

To give some numbers, here is part of the log showing how much time is spent processing modules when looping here during the prefill stage:

Because the modules are created sequentially in layer order, the hidden states tensor is also moved sequentially from one GPU to another. Here we can see that we have reached the modules for a layer that lives on GPU 4, so the tensor has to move from GPU 3 to GPU 4 (module 150 is likely `ExLlamaV2MLP`, module 151 is `ExLlamaV2Attention`). Then, during attention processing, we spend almost 1.5 seconds inside this line (sorry for my weird startN namings; I had to come up with many different names to measure the time between each call):
```python
block_table = attn_params.get_block_index(self.device_idx)
```
And it is the only culprit for why the `forward` call of the attention module takes so long.

After changing `get_block_index` to a non-blocking version, the log looks like this:

So, now there is almost no wait here. The new place where we wait is the first attention module:
So, in the end, instead of waiting on each GPU per chunk, we wait only once at the beginning of the chunk.
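For context, the pattern the change boils down to looks roughly like the sketch below (my own illustration with made-up names, not the actual exllamav2 source): the block index tensor is cached per device and moved with `non_blocking=True`, so the host thread only enqueues the copy instead of stalling until everything already queued on the stream has finished.

```python
# Sketch only: caching + non-blocking move pattern; class and field names are invented.
import torch

class PagedParamsSketch:
    def __init__(self, block_index: torch.Tensor):
        self.block_index = block_index   # built once on its "home" device
        self._per_device = {}            # device index -> cached copy

    def get_block_index(self, device_idx: int) -> torch.Tensor:
        if device_idx not in self._per_device:
            # non_blocking=True: the copy is queued on the CUDA stream and the call
            # returns immediately, instead of the host waiting for the copy (and all
            # work already queued ahead of it) to complete.
            self._per_device[device_idx] = self.block_index.to(
                f"cuda:{device_idx}", non_blocking=True
            )
        return self._per_device[device_idx]
```

This is also consistent with the profile: the 1.5 s is most likely not the tiny copy itself but the host thread waiting for all previously queued kernels on that GPU to drain before the blocking copy can report completion.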
And the final numbers from tabbyAPI (I had to power-limit some cards, so I got a smaller speed improvement, but still):
Before the fix:
After the fix:
In my previous experiments comparing paged and non-paged, I got something like 280 T/s -> 740 T/s.
Frankly, I know very little about the attention mechanism itself and how it should be implemented (is a non-blocking tensor move even okay here, etc.), so I'd like someone to review this change and maybe test it in their multi-GPU setup. You don't have to turn paged attention off: this is a fix for paged attention, so just apply the change and test in non-tensor-parallel mode.
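To illustrate the blocking vs non-blocking behaviour in isolation, here is a minimal standalone demo (again just my sketch, not exllamav2 code; it assumes at least two visible CUDA devices). A blocking `.to()` makes the host thread wait for the copy and everything already queued ahead of it, while `non_blocking=True` only enqueues the copy; as far as I understand, an explicit synchronize is only needed when the result has to be read back on the CPU.

```python
# Standalone sketch: blocking vs non-blocking GPU-to-GPU tensor moves.
# Assumes at least two CUDA devices are visible.
import time
import torch

src = torch.randn(8192, 8192, device="cuda:0")
torch.cuda.synchronize()

# Blocking move: the call does not return until the copy has finished.
t0 = time.perf_counter()
dst_blocking = src.to("cuda:1")
t1 = time.perf_counter()

# Non-blocking move: the copy is merely enqueued; the call returns almost immediately.
t2 = time.perf_counter()
dst_async = src.to("cuda:1", non_blocking=True)
t3 = time.perf_counter()

# GPU work queued afterwards still runs after the copy in stream order; only
# reading the value back on the host forces an actual wait.
checksum = dst_async.sum()
torch.cuda.synchronize()

print(f"blocking .to():     {t1 - t0:.4f} s")
print(f"non-blocking .to(): {t3 - t2:.4f} s (the copy itself still runs in the background)")
print(f"sum on cuda:1: {checksum.item():.2f}")
```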