
Simplify and improve CUDA graphs through use of indirect copy pointers #9017

Merged: 9 commits into ggml-org:master on Apr 3, 2025

Conversation

@agray3 (Contributor) commented on Aug 13, 2024

Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with K and V cache pointers. This patch simplifies the implementation by using indirection so that these parameters no longer change between launches, removing the need for frequent graph updates.
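For illustration, a minimal sketch of the indirection idea (kernel and buffer names are illustrative, not the actual llama.cpp kernels): the copy kernel reads its destination from a device-resident pointer table, so the kernel arguments captured in a CUDA graph never change; only the table entries are refreshed with a small host-to-device copy.

```cuda
// Minimal sketch of the indirection idea; names are illustrative, not the
// actual llama.cpp implementation.
#include <cuda_runtime.h>
#include <cstdio>

// The destination is fetched from a device-resident pointer table rather than
// passed as a kernel argument, so a CUDA graph capturing this launch does not
// need its kernel parameters patched when the destination changes.
__global__ void cpy_indirect(const float * src, char ** dst_ptrs, int slot, int n) {
    float * dst = (float *) dst_ptrs[slot];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
}

int main() {
    const int n = 256;
    float *src, *dst_a, *dst_b;
    cudaMalloc(&src,   n * sizeof(float));
    cudaMalloc(&dst_a, n * sizeof(float));
    cudaMalloc(&dst_b, n * sizeof(float));

    char ** dst_ptrs;  // device-side pointer table
    cudaMalloc(&dst_ptrs, 2 * sizeof(char *));

    // Per token, only this small copy is needed to redirect the kernel to a
    // new destination; the launch parameters below stay constant.
    char * host_ptrs[2] = {(char *) dst_a, (char *) dst_b};
    cudaMemcpy(dst_ptrs, host_ptrs, sizeof(host_ptrs), cudaMemcpyHostToDevice);

    cpy_indirect<<<(n + 255) / 256, 256>>>(src, dst_ptrs, 0, n);
    cudaDeviceSynchronize();
    printf("copy via indirect pointer complete\n");
    return 0;
}
```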

github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Aug 13, 2024
@agray3 (Contributor, Author) commented on Aug 13, 2024

@slaren could you possibly review this whenever you get the bandwidth? Note that as well as simplifying the CUDA graphs code, this change also gives a ~1-2% performance uplift by avoiding CUDA graph updates for each token.

@Nexesenex (Contributor) commented:

Is this PR compatible with #8366, or does it supersede it?

@agray3 (Contributor, Author) commented on Aug 13, 2024

> Is this PR compatible with #8366, or does it supersede it?

Yes, it is compatible (it doesn't supersede #8366, since that PR provides further benefits). If/when this change is merged I will rebase #8366 on it (which will actually also simplify #8366).

@slaren (Member) commented on Aug 13, 2024

The idea of keeping a list of pointers in device memory to avoid the update to the graphs is interesting, but the way this is implemented is shifting some of the complexity from the CUDA backend to the application side. My view generally is that adding custom functions to the backends that require special handling from the application side should only be done as a last resort, and the priority should be to provide a simple and unified interface to the applications.

I think it would be possible to implement this entirely in the CUDA backend side by scanning the graph to obtain and update the list of pointers. I suppose it may be worth it if updating the nodes in the CUDA graph is significantly slower than copying a list of pointers to device memory, but if the difference is small, it may be hard to justify the added complexity to the CUDA backend code.

@agray3 (Contributor, Author) commented on Aug 13, 2024

Thanks @slaren. The current code involves repeated updates to the graph, and the proposed approach does give a significant performance advantage (even with the extra memcopies). E.g. on an A100 for llama 7B Q4 I get (tokens/s):

|         | PR 9017 | Current |
|---------|---------|---------|
| Run 1   | 157.02  | 154.35  |
| Run 2   | 157.03  | 154.62  |
| Run 3   | 156.78  | 154.45  |
| Run 4   | 156.45  | 153.98  |
| Run 5   | 156.78  | 154.26  |
| Average | 156.82  | 154.33  |
| Speedup | 1.016   | 1       |

This 1.6% speedup is not dramatic, but given the huge worldwide usage of llama.cpp I'd argue that it would accumulate to an enormous overall time, cost and energy saving. Plus it is a step in the right direction (IMO) of reducing the need to do a full rebuild of the GGML graph every step.

But I acknowledge that it does add a few lines of extra complexity to the llama.cpp file. I'll have a think about how that can be better abstracted into GGML.

@agray3 (Contributor, Author) commented on Aug 14, 2024

I've now fully abstracted this into the GGML CUDA backend, with just a single call from llama.cpp.

@slaren (Member) commented on Aug 14, 2024

@agray3 I am sorry, I think there has been a misunderstanding. The problem is not the location of the few lines of code to build the list of pointers; the problem is skipping several layers of abstraction and going directly from llama.cpp to the CUDA backend code. Not only is this code going to be hard to maintain (and it will certainly require exceptions for some architectures), but ggml is a separate library from llama.cpp and is used in more applications, and the goal is to continue expanding the ability to use ggml in other projects. Simply put, it is not ok to add new functions to the CUDA backend interface to achieve this, and much less so to the ggml-backend interface. The only way I can see to implement this would be to build the list of pointers automatically and transparently by inspecting the graph within the CUDA backend.

@agray3 (Contributor, Author) commented on Aug 14, 2024

OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

Note that we can't do this by "inspecting the graph within the CUDA backend", since this pointer array doesn't exist there; it is built up token by token.

@slaren (Member) commented on Aug 14, 2024

> OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

The problem is that we cannot add new functions to the backend interface every time it is more convenient to implement some optimization by doing so, because it will pollute the application code and the backend interface, and will quickly become unmaintainable. Even if this is a small change now, there are currently 7 backends supported in ggml, and all of them would like to add similar functions to simplify their implementation. We cannot go this route unless it is absolutely necessary, and I don't think that this case qualifies.

> Note that we can't do this by "inspecting the graph within the CUDA backend", since this pointer array doesn't exist there; it is built up token by token.

Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destinations of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they currently are when updating the graph nodes. The only difference is how the kernel receives the updated pointers: either as a kernel argument updated in the CUDA graph, or as a pointer obtained from device memory.

@agray3 (Contributor, Author) commented on Aug 14, 2024

> Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destinations of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they currently are when updating the graph nodes.

Yes, you are right. I was getting mixed up between GGML and CUDA graphs. Currently we extract from the GGML graph and insert into the CUDA graph, but we could instead extract from the GGML graph and pass to the GPU via a memcpy. I'll experiment with that, thanks.
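For illustration, a rough sketch of the approach discussed above, assuming it lives inside the CUDA backend where the ggml graph internals (n_nodes, nodes, op, src, data) are visible; the dest_ptrs_d buffer and the helper name are assumptions, not the actual llama.cpp code:

```cuda
// Hedged sketch: collect the GGML_OP_CPY destination pointers from the ggml
// graph and ship them to the GPU in one small async copy, instead of patching
// each kernel node of the instantiated CUDA graph.
#include <vector>
#include <cuda_runtime.h>
#include "ggml.h"

static void update_cpy_dest_ptrs(const struct ggml_cgraph * cgraph,
                                 char ** dest_ptrs_d,  // device-side pointer table, assumed pre-allocated
                                 cudaStream_t stream) {
    std::vector<char *> host_ptrs;
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        if (node->op == GGML_OP_CPY) {
            // for GGML_OP_CPY, src[1] is the destination tensor (e.g. the K/V cache view)
            host_ptrs.push_back((char *) node->src[1]->data);
        }
    }
    // one host-to-device copy per graph evaluation replaces the per-node
    // kernel-parameter updates of the instantiated CUDA graph; for pageable
    // host memory the buffer is consumed before cudaMemcpyAsync returns
    cudaMemcpyAsync(dest_ptrs_d, host_ptrs.data(),
                    host_ptrs.size() * sizeof(char *),
                    cudaMemcpyHostToDevice, stream);
}
```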

agray3 marked this pull request as a draft on August 14, 2024.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 15, 2024.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 15, 2024: "…y pointers ggml-org#9017". This reverts commit 1dea402e4cb8f64737aa49ba98bc9647656e4d26.
@Nexesenex (Contributor) commented:

Hey @agray3. Would you mind bringing this PR up to date as well, so I can merge it into my fork?

Any boost of performance, even small, is welcome! :D

Thanks in any case!

@agray3 (Contributor, Author) commented on Oct 19, 2024

> Hey @agray3. Would you mind bringing this PR up to date as well, so I can merge it into my fork?
>
> Any boost of performance, even small, is welcome! :D
>
> Thanks in any case!

This will require a bit more rebasing to be compatible with my other patch - I'm away for a few days so will take a look when I'm back.

@Nexesenex (Contributor) commented:

@agray3: Thanks! Have a great time meanwhile!

…ointers

Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with K and V cache pointers. This patch simplifies the implementation by using indirection so that these parameters no longer change between launches, removing the need for frequent graph updates.

Fixes ggml-org#12152
agray3 force-pushed the ag_indirect_copy_dest branch from 38f4863 to e9a1be0 on March 11, 2025.
@agray3 (Contributor, Author) commented on Mar 11, 2025

@slaren I've now adapted this so that it is completely independent from llama.cpp, by copying the pointers from the GGML graph to the GPU as you suggested; I think it is now much more robust. Could you possibly take another look? It simplifies the code by removing the need for graph parameter updates, and also has the small performance benefit shown above.

@agray3 (Contributor, Author) commented on Mar 11, 2025

@IMbackK as above, this PR removes the need for graph parameter updates and the associated issues you reported. Could you possibly review? Thanks

@IMbackK (Collaborator) commented on Mar 24, 2025

Is this still supposed to be a draft?

agray3 marked this pull request as ready for review on March 25, 2025.
@agray3 (Contributor, Author) commented on Mar 25, 2025

> Is this still supposed to be a draft?

Thanks, I didn't notice I still had it as a draft; I've now marked it as ready for review. @slaren, could you possibly let us know whether or not you think this change is now acceptable?

@slaren (Member) left a review comment:

Sorry for the delay. This approach looks good to me. There are still some remaining issues:

  • The pointers need to be local to the ggml_backend_cuda_context, they cannot be global since multiple contexts may be used at the same time
  • Remove the declaration from ggml-backend.h

@agray3 (Contributor, Author) commented on Mar 26, 2025

> Sorry for the delay. This approach looks good to me. There are still some remaining issues:
>
>   • The pointers need to be local to the ggml_backend_cuda_context, they cannot be global since multiple contexts may be used at the same time
>   • Remove the declaration from ggml-backend.h

Thanks @slaren, I have now addressed these issues by moving the declaration to cpy.cuh, and the pointers to the ggml_cuda_graph structure in the CUDA context.
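For context, a hedged sketch of how that per-context state might be organised; the struct and field names below are assumptions for illustration, not the actual definitions in cpy.cuh or the ggml_cuda_graph structure:

```cuda
// Illustrative only: per-context storage for the indirect copy destinations,
// so multiple CUDA contexts can be used at the same time. Names are
// assumptions, not the actual llama.cpp definitions.
struct ggml_cuda_graph_sketch {
    char ** dest_ptrs_d    = nullptr;  // device-side table of GGML_OP_CPY destination pointers
    int     dest_ptrs_size = 0;        // current capacity of the table
    int     cpy_node_index = -1;       // which table slot the next copy kernel should use
};
```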

@IMbackK (Collaborator) left a review comment:

I don't like the complexity of this, but I can't think of a way of doing better than this.

@slaren (Member) left a review comment:

  • This causes a crash or garbage output, which I suspect is caused by evaluations with batch size > 1 still trying to use the indirect pointers
  • There may also be issues due to using the synchronous cudaMalloc/cudaMemcpy, since everything else is run in a stream
  • The ggml_cuda_cpy_fn_ptrs and the function to retrieve them no longer seem to serve a purpose, so they should be removed
  • It breaks the HIP and MUSA builds

@agray3 (Contributor, Author) commented on Apr 1, 2025

> • This causes a crash or garbage output, which I suspect is caused by evaluations with batch size > 1 still trying to use the indirect pointers

I've not been able to reproduce this yet (it previously looked OK in my local tests), but I've now added a guard to ensure the indirection is only in use when CUDA graphs are active. Please let me know if you still see any issues.

> • There may also be issues due to using the synchronous cudaMalloc/cudaMemcpy, since everything else is run in a stream

Fixed. Note that for the alloc I'm using a stream sync rather than a stream-ordered alloc, since the latter is only supported from CUDA 11.2 onwards, but I don't think it will make much difference since this will only happen occasionally.
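As a rough sketch of this pattern (struct, field, and helper names are illustrative, not the actual llama.cpp code): the pointer table only needs reallocating when the graph gains more copy nodes, so a plain cudaMalloc guarded by a stream synchronization is acceptable, while the per-token update stays ordered on the stream.

```cuda
// Hedged sketch of the occasional reallocation plus stream-ordered update
// described above; names are illustrative.
#include <vector>
#include <cuda_runtime.h>

struct cpy_dest_table {
    char ** d_ptrs   = nullptr;  // device-side pointer table
    size_t  capacity = 0;
};

static void upload_dest_ptrs(cpy_dest_table & tab, const std::vector<char *> & host_ptrs,
                             cudaStream_t stream) {
    if (host_ptrs.size() > tab.capacity) {
        // rare path: only taken when the graph gains more copy nodes
        cudaStreamSynchronize(stream);  // nothing may still be reading the old table
        if (tab.d_ptrs != nullptr) {
            cudaFree(tab.d_ptrs);
        }
        cudaMalloc(&tab.d_ptrs, host_ptrs.size() * sizeof(char *));  // plain alloc, works before CUDA 11.2
        tab.capacity = host_ptrs.size();
    }
    // common path: the per-token pointer update stays ordered on the stream
    cudaMemcpyAsync(tab.d_ptrs, host_ptrs.data(), host_ptrs.size() * sizeof(char *),
                    cudaMemcpyHostToDevice, stream);
}
```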

> • The ggml_cuda_cpy_fn_ptrs and the function to retrieve them no longer seem to serve a purpose, so they should be removed

Good point, I've now removed all these.

> • It breaks the HIP and MUSA builds

Fixed.

slaren merged commit 3f9da22 into ggml-org:master on Apr 3, 2025. 48 checks passed.