Description
When using a GPU backend, each token evaluation involves not only computation on the GPU but also significant CPU work, which can potentially be optimized.
Here are some timing measurements of the critical path for each token for llama2 Q4_K_M 7B and 13B models on A100 and H100 GPUs.
Firstly, here are absolute times:
and here are the same data presented as a percentage breakdown in each case:
CUDA Graph Execution
is the time spent executing the compute graph on the GPU, and is responsible for around 85-90% of the time taken to evaluate each token.
The remaining 10-15% is spent in CPU activities, the most significant of which are discussed below.
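For reference, the CUDA graph support from #6763 broadly follows the standard capture-and-replay pattern sketched below. This is the generic CUDA runtime API usage, not the actual llama.cpp code: the kernel launches for one token are captured once into a graph, and subsequent tokens replay that graph with a single launch.

```cpp
// Generic CUDA Graphs capture-and-replay sketch (illustrative only, not the
// actual #6763 implementation). The kernel launches for one token evaluation
// are captured once, instantiated into an executable graph, and then replayed
// with a single cudaGraphLaunch per token, removing most per-kernel CPU
// launch overhead.
#include <cuda_runtime.h>

void eval_token(cudaStream_t stream, bool & have_graph,
                cudaGraph_t & graph, cudaGraphExec_t & instance) {
    if (!have_graph) {
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        // ... enqueue all kernels for one token evaluation on `stream` ...
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&instance, graph, nullptr, nullptr, 0);
        have_graph = true;
    }
    // every token: one call replays the entire captured graph on the GPU
    cudaGraphLaunch(instance, stream);
}
```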
GGML Graph Preparation: llama_build_graph
and ggml_backend_sched_split_graph
are related to building/preparing the compute graph in GGML format for each token, which is ultimately translated into a CUDA graph for execution. However, we know from the CUDA graph implementation (#6763) that only very minor adjustments are required across the majority of tokens. Therefore, most of this work should not be necessary: we should be able to cache/reuse components of the GGML graph across tokens, in the same way that each CUDA graph is reused with only minor adjustments. E.g. in build_llama()
we could add some code to save state across tokens, rather than performing a full re-build for every token.
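As a rough illustration of the kind of caching that might be possible, the sketch below caches the GGML graph built for the first token and re-builds only when the graph topology would actually change, patching the per-token inputs otherwise. The helper names (`llm_graph_cache`, `update_graph_inputs`), the reuse conditions, and the `llama_build_graph` signature are assumptions for illustration and would need to be confirmed against the code.

```cpp
// Hypothetical sketch: cache the ggml cgraph built for the first token and
// reuse it for subsequent tokens, patching only the inputs that change
// (token ids, positions, KV-cache offsets). None of the cache/update helpers
// below exist in llama.cpp today.
struct llm_graph_cache {
    struct ggml_cgraph * graph    = nullptr; // graph built on the first token
    int32_t              n_tokens = -1;      // batch size the graph was built for
};

static struct ggml_cgraph * llama_build_graph_cached(
        llama_context & lctx,
        const llama_batch & batch,
        llm_graph_cache & cache) {

    // re-build only when the graph topology would actually change,
    // e.g. a different batch size (other conditions omitted here)
    const bool reusable = cache.graph != nullptr &&
                          cache.n_tokens == batch.n_tokens;

    if (!reusable) {
        cache.graph    = llama_build_graph(lctx, batch, /*worst_case=*/false);
        cache.n_tokens = batch.n_tokens;
        return cache.graph;
    }

    // patch the per-token inputs in place instead of re-building the graph
    update_graph_inputs(cache.graph, batch); // hypothetical helper

    return cache.graph;
}
```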
Sampling: llama_sampling_sample
uses the CPU to perform sampling on the logits that have been evaluated on the GPU for each token. In principle, this sampling could be ported to the GPU.
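As a rough sketch of what GPU-side sampling could look like, the CUDA kernel below performs greedy (argmax) sampling over the logits entirely on the device, so only the chosen token id needs to be copied back to the host. This is illustrative only and not part of llama.cpp; temperature, top-k/top-p and repetition penalties would need additional kernels.

```cuda
// Illustrative sketch: greedy (argmax) sampling on the GPU, avoiding the
// device-to-host copy of the full logits buffer. Launch with a single block,
// e.g. argmax_sample_kernel<<<1, 256>>>(d_logits, n_vocab, d_out_token);
#include <cuda_runtime.h>
#include <cfloat>

__global__ void argmax_sample_kernel(const float * logits, int n_vocab, int * out_token) {
    __shared__ float s_val[256];
    __shared__ int   s_idx[256];

    float best_val = -FLT_MAX;
    int   best_idx = 0;

    // each thread scans a strided slice of the vocabulary
    for (int i = threadIdx.x; i < n_vocab; i += blockDim.x) {
        const float v = logits[i];
        if (v > best_val) {
            best_val = v;
            best_idx = i;
        }
    }

    s_val[threadIdx.x] = best_val;
    s_idx[threadIdx.x] = best_idx;
    __syncthreads();

    // block-wide reduction to find the global argmax
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride && s_val[threadIdx.x + stride] > s_val[threadIdx.x]) {
            s_val[threadIdx.x] = s_val[threadIdx.x + stride];
            s_idx[threadIdx.x] = s_idx[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        *out_token = s_idx[0];
    }
}
```

A GPU-side sampler like this could also be captured into the same CUDA graph as the model evaluation, removing the sampling step from the CPU critical path entirely.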
I will continue to investigate these optimization possibilities.