
Simplify and improve CUDA graphs through use of indirect copy pointers #9017

Merged: 9 commits into ggml-org:master on Apr 3, 2025

Conversation

@agray3 (Contributor) commented on Aug 13, 2024

Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with K and V cache pointers. This patch simplifies the implementation by using indirection so that these parameters no longer change between launches, removing the need for frequent graph updates.
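For illustration, a minimal sketch of the indirection idea (kernel and buffer names are illustrative, not the actual llama.cpp kernels): the copy kernel reads its destination from a device-resident pointer table, so the kernel arguments captured in a CUDA graph never change; only the table entries are refreshed with a small host-to-device copy.

```cuda
// Minimal sketch of the indirection idea; names are illustrative, not the
// actual llama.cpp implementation.
#include <cuda_runtime.h>
#include <cstdio>

// The destination is fetched from a device-resident pointer table rather than
// passed as a kernel argument, so a CUDA graph capturing this launch does not
// need its kernel parameters patched when the destination changes.
__global__ void cpy_indirect(const float * src, char ** dst_ptrs, int slot, int n) {
    float * dst = (float *) dst_ptrs[slot];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
}

int main() {
    const int n = 256;
    float *src, *dst_a, *dst_b;
    cudaMalloc(&src,   n * sizeof(float));
    cudaMalloc(&dst_a, n * sizeof(float));
    cudaMalloc(&dst_b, n * sizeof(float));

    char ** dst_ptrs;  // device-side pointer table
    cudaMalloc(&dst_ptrs, 2 * sizeof(char *));

    // Per token, only this small copy is needed to redirect the kernel to a
    // new destination; the launch parameters below stay constant.
    char * host_ptrs[2] = {(char *) dst_a, (char *) dst_b};
    cudaMemcpy(dst_ptrs, host_ptrs, sizeof(host_ptrs), cudaMemcpyHostToDevice);

    cpy_indirect<<<(n + 255) / 256, 256>>>(src, dst_ptrs, 0, n);
    cudaDeviceSynchronize();
    printf("copy via indirect pointer complete\n");
    return 0;
}
```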

github-actions bot added the labels Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Aug 13, 2024
@agray3 (Contributor, Author) commented on Aug 13, 2024

@slaren could you possibly review this whenever you get the bandwidth? Note that as well as simplifying the CUDA graphs code, this change also gives a ~1-2% performance uplift by avoiding CUDA graph updates for each token.

@Nexesenex (Contributor) commented:

Is this PR compatible with #8366, or does it supersede it?

@agray3 (Contributor, Author) commented on Aug 13, 2024

> Is this PR compatible with #8366, or does it supersede it?

Yes, it is compatible (it doesn't supersede #8366, since that PR provides further benefits). If/when this change is merged I will rebase #8366 on it (which will actually also simplify #8366).

@slaren (Member) commented on Aug 13, 2024

The idea of keeping a list of pointers in device memory to avoid the update to the graphs is interesting, but the way this is implemented is shifting some of the complexity from the CUDA backend to the application side. My view generally is that adding custom functions to the backends that require special handling from the application side should only be done as a last resort, and the priority should be to provide a simple and unified interface to the applications.

I think it would be possible to implement this entirely in the CUDA backend side by scanning the graph to obtain and update the list of pointers. I suppose it may be worth it if updating the nodes in the CUDA graph is significantly slower than copying a list of pointers to device memory, but if the difference is small, it may be hard to justify the added complexity to the CUDA backend code.

@agray3 (Contributor, Author) commented on Aug 13, 2024

Thanks @slaren. The current code involves repeated updates to the graph, and the proposed approach does give a significant performance advantage (even with the extra memcopies). E.g. on an A100 for llama 7B Q4 I get (tokens/s):

|         | PR 9017 | Current |
|---------|---------|---------|
| Run 1   | 157.02  | 154.35  |
| Run 2   | 157.03  | 154.62  |
| Run 3   | 156.78  | 154.45  |
| Run 4   | 156.45  | 153.98  |
| Run 5   | 156.78  | 154.26  |
| Average | 156.82  | 154.33  |
| Speedup | 1.016   | 1       |

This 1.6% speedup is not dramatic, but given the huge worldwide usage of llama.cpp I'd argue that it would accumulate to an enormous overall time, cost and energy saving. Plus it is a step in the right direction (IMO) of reducing the need to do a full rebuild of the GGML graph every step.

But I acknowledge that it does add a few lines of extra complexity to the llama.cpp file. I'll have a think about how that can be better abstracted into GGML.

@agray3 (Contributor, Author) commented on Aug 14, 2024

I've now fully abstracted this into the GGML CUDA backend, with just a single call from llama.cpp.

@slaren (Member) commented on Aug 14, 2024

@agray3 I am sorry, I think there has been a misunderstanding. The problem is not the location of the few lines of code to build the list of pointers; the problem is skipping several layers of abstraction and going directly from llama.cpp to the CUDA backend code. Not only is this code going to be hard to maintain (and it will certainly require exceptions for some architectures), but ggml is a separate library from llama.cpp and is used in more applications, and the goal is to continue expanding the ability to use ggml in other projects. Simply put, it is not ok to add new functions to the CUDA backend interface to achieve this, and much less so to the ggml-backend interface. The only way I can see to implement this would be to build the list of pointers automatically and transparently by inspecting the graph within the CUDA backend.

@agray3 (Contributor, Author) commented on Aug 14, 2024

OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

Note that we can't do this by "inspecting the graph within the CUDA backend", since this pointer array doesn't exist there; it is built up token by token.

@slaren (Member) commented on Aug 14, 2024

> OK, I understand, thanks for your patience (I'm still getting used to the ecosystem). If I now understand correctly, the problem is that GGML is now assuming that the application makes this new call, and will break if that call is not present. What if this call was made optional, with automatic fallback to the existing behavior if the call is not present?

The problem is that we cannot add new functions to the backend interface every time it is more convenient to implement some optimization by doing so, because it will pollute the application code and the backend interface, and will quickly become unmaintainable. Even if this is a small change now, there are currently 7 backends supported in ggml, and all of them would like to add similar functions to simplify their implementation. We cannot go this route unless it is absolutely necessary, and I don't think that this case qualifies.

> Note that we can't do this by "inspecting the graph within the CUDA backend", since this pointer array doesn't exist there; it is built up token by token.

Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destinations of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they currently are when updating the graph nodes. The only difference is how the kernel receives the updated pointers: either as a kernel argument updated in the CUDA graph, or as a pointer obtained from device memory.

@agray3 (Contributor, Author) commented on Aug 14, 2024

> Please correct me if I am wrong, but as far as I can tell, these pointers are the same ones that appear as the destinations of the GGML_OP_CPY operations, and thus could be collected from the graph in the same way that they currently are when updating the graph nodes.

Yes, you are right. I was getting mixed up between GGML and CUDA graphs. Currently we extract from the GGML graph and insert into the CUDA graph, but we could instead extract from the GGML graph and pass to the GPU via a memcpy. I'll experiment with that, thanks.
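For illustration, a rough sketch of the approach discussed above, assuming it lives inside the CUDA backend where the ggml graph internals (n_nodes, nodes, op, src, data) are visible; the dest_ptrs_d buffer and the helper name are assumptions, not the actual llama.cpp code:

```cuda
// Hedged sketch: collect the GGML_OP_CPY destination pointers from the ggml
// graph and ship them to the GPU in one small async copy, instead of patching
// each kernel node of the instantiated CUDA graph.
#include <vector>
#include <cuda_runtime.h>
#include "ggml.h"

static void update_cpy_dest_ptrs(const struct ggml_cgraph * cgraph,
                                 char ** dest_ptrs_d,  // device-side pointer table, assumed pre-allocated
                                 cudaStream_t stream) {
    std::vector<char *> host_ptrs;
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const struct ggml_tensor * node = cgraph->nodes[i];
        if (node->op == GGML_OP_CPY) {
            // for GGML_OP_CPY, src[1] is the destination tensor (e.g. the K/V cache view)
            host_ptrs.push_back((char *) node->src[1]->data);
        }
    }
    // one host-to-device copy per graph evaluation replaces the per-node
    // kernel-parameter updates of the instantiated CUDA graph; for pageable
    // host memory the buffer is consumed before cudaMemcpyAsync returns
    cudaMemcpyAsync(dest_ptrs_d, host_ptrs.data(),
                    host_ptrs.size() * sizeof(char *),
                    cudaMemcpyHostToDevice, stream);
}
```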

agray3 marked this pull request as a draft on August 14, 2024.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 15, 2024.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Aug 15, 2024: "…y pointers ggml-org#9017". This reverts commit 1dea402e4cb8f64737aa49ba98bc9647656e4d26.
@Nexesenex (Contributor) commented:

Hey @agray3. Would you mind bringing this PR up to date as well, so I can merge it into my fork?

Any boost of performance, even small, is welcome! :D

Thanks in any case!

@agray3 (Contributor, Author) commented on Oct 19, 2024

> Hey @agray3. Would you mind bringing this PR up to date as well, so I can merge it into my fork?
>
> Any boost of performance, even small, is welcome! :D
>
> Thanks in any case!

This will require a bit more rebasing to be compatible with my other patch - I'm away for a few days so will take a look when I'm back.

@Nexesenex (Contributor) commented:

@agray3: Thanks! Have a great time meanwhile!

…ointers

Previously there was complexity in the CUDA graphs implementation due to frequently changing parameters to the copy kernels associated with K and V cache pointers. This patch simplifies the implementation by using indirection so that these parameters no longer change between launches, removing the need for frequent graph updates.

Fixes ggml-org#12152
agray3 force-pushed the ag_indirect_copy_dest branch from 38f4863 to e9a1be0 on March 11, 2025.
@agray3 (Contributor, Author) commented on Mar 11, 2025

@slaren I've now adapted this so that it is completely independent from llama.cpp, by copying the pointers from the GGML graph to the GPU as you suggested; I think it is now much more robust. Could you possibly take another look? It simplifies the code by removing the need for graph parameter updates, and also has the small performance benefit shown above.

@agray3 (Contributor, Author) commented on Mar 11, 2025

@IMbackK as above, this PR removes the need for graph parameter updates and the associated issues you reported. Could you possibly review? Thanks

@IMbackK (Collaborator) commented on Mar 24, 2025

Is this still supposed to be a draft?

agray3 marked this pull request as ready for review on March 25, 2025.
@agray3 (Contributor, Author) commented on Mar 25, 2025

> Is this still supposed to be a draft?

Thanks, I didn't notice I still had it as a draft; I've now marked it as ready for review. @slaren, could you possibly let us know whether or not you think this change is now acceptable?

@slaren (Member) left a review comment:

Sorry for the delay. This approach looks good to me. There are still some remaining issues:

  • The pointers need to be local to the ggml_backend_cuda_context, they cannot be global since multiple contexts may be used at the same time
  • Remove the declaration from ggml-backend.h

@agray3 (Contributor, Author) commented on Mar 26, 2025

> Sorry for the delay. This approach looks good to me. There are still some remaining issues:
>
>   • The pointers need to be local to the ggml_backend_cuda_context, they cannot be global since multiple contexts may be used at the same time
>   • Remove the declaration from ggml-backend.h

Thanks @slaren, I have now addressed these issues by moving the declaration to cpy.cuh, and the pointers to the ggml_cuda_graph structure in the CUDA context.
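For context, a hedged sketch of how that per-context state might be organised; the struct and field names below are assumptions for illustration, not the actual definitions in cpy.cuh or the ggml_cuda_graph structure:

```cuda
// Illustrative only: per-context storage for the indirect copy destinations,
// so multiple CUDA contexts can be used at the same time. Names are
// assumptions, not the actual llama.cpp definitions.
struct ggml_cuda_graph_sketch {
    char ** dest_ptrs_d    = nullptr;  // device-side table of GGML_OP_CPY destination pointers
    int     dest_ptrs_size = 0;        // current capacity of the table
    int     cpy_node_index = -1;       // which table slot the next copy kernel should use
};
```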

@IMbackK (Collaborator) left a review comment:

I don't like the complexity of this, but I can't think of a way of doing better than this.

@slaren (Member) left a review comment:

  • This causes a crash or garbage output, which I suspect is caused by evaluations with batch size > 1 still trying to use the indirect pointers
  • There may also be issues due to using the synchronous cudaMalloc/cudaMemcpy, since everything else is run in a stream
  • The ggml_cuda_cpy_fn_ptrs and the function to retrieve them no longer seem to serve a purpose, so they should be removed
  • It breaks the HIP and MUSA builds

@agray3 (Contributor, Author) commented on Apr 1, 2025

> • This causes a crash or garbage output, which I suspect is caused by evaluations with batch size > 1 still trying to use the indirect pointers

I've not been able to reproduce this yet (it previously looked OK in my local tests), but I've now added a guard to ensure the indirection is only in use when CUDA graphs are active. Please let me know if you still see any issues.

> • There may also be issues due to using the synchronous cudaMalloc/cudaMemcpy, since everything else is run in a stream

Fixed. Note that for the alloc I'm using a stream sync rather than a stream-ordered alloc, since the latter is only supported from CUDA 11.2 onwards, but I don't think it will make much difference since this will only happen occasionally.
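As a rough sketch of this pattern (struct, field, and helper names are illustrative, not the actual llama.cpp code): the pointer table only needs reallocating when the graph gains more copy nodes, so a plain cudaMalloc guarded by a stream synchronization is acceptable, while the per-token update stays ordered on the stream.

```cuda
// Hedged sketch of the occasional reallocation plus stream-ordered update
// described above; names are illustrative.
#include <vector>
#include <cuda_runtime.h>

struct cpy_dest_table {
    char ** d_ptrs   = nullptr;  // device-side pointer table
    size_t  capacity = 0;
};

static void upload_dest_ptrs(cpy_dest_table & tab, const std::vector<char *> & host_ptrs,
                             cudaStream_t stream) {
    if (host_ptrs.size() > tab.capacity) {
        // rare path: only taken when the graph gains more copy nodes
        cudaStreamSynchronize(stream);  // nothing may still be reading the old table
        if (tab.d_ptrs != nullptr) {
            cudaFree(tab.d_ptrs);
        }
        cudaMalloc(&tab.d_ptrs, host_ptrs.size() * sizeof(char *));  // plain alloc, works before CUDA 11.2
        tab.capacity = host_ptrs.size();
    }
    // common path: the per-token pointer update stays ordered on the stream
    cudaMemcpyAsync(tab.d_ptrs, host_ptrs.data(), host_ptrs.size() * sizeof(char *),
                    cudaMemcpyHostToDevice, stream);
}
```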

> • The ggml_cuda_cpy_fn_ptrs and the function to retrieve them no longer seem to serve a purpose, so they should be removed

Good point, I've now removed all these.

> • It breaks the HIP and MUSA builds

Fixed.

slaren merged commit 3f9da22 into ggml-org:master on Apr 3, 2025. 48 checks passed.