Simple float CUDA port #376

Open · rogerallen wants to merge 43 commits into master

Conversation

rogerallen

This is my attempt at a simple port to CUDA. I'm hopeful this can serve as an example for anyone who wants to learn how to use CUDA for LLMs.

I was inspired by, and used code from, https://github.com/ankan-ban/llama2.cu as a starting point. When I came upon that code, I noticed the upstream repository had progressed and their llama2.cu no longer worked against it. So I restructured the CUDA code so that run.cu can be kept up to date against run.c via diff. So far, keeping up hasn't been too bad.

The ankan-ban repository also seems focused on making 16-bit float and 8-bit int work. That is very cool and I hope they continue, but I hoped there would be room for a straight float port. It is not my intent to step on any toes here & I'm sorry if this comes across that way. I just think this might mesh with the existing code better.

Even with a simple port like this, I see a significant performance increase. Using stories110M.bin, I get about a 5x speedup over the runomp app on my 14-core Intel i9 laptop with an NVIDIA RTX 4050 running WSL2. My older 12-core i9 Linux desktop with an NVIDIA RTX 3070 sees about 15x.

Both Linux & Windows builds are working.

To make the C and CUDA easy to compare, I extracted each function called inside the forward routine and wrapped it with a USE_CUDA define, so the two implementations sit side by side.
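To sketch the pattern concretely (this is illustrative, not copied verbatim from run.cu; in the CUDA branch the pointers are assumed to already live on the device):

```cuda
#include <math.h>

#ifdef USE_CUDA
// deliberately naive kernel for the sketch: every thread recomputes the sum
// of squares (correct but slow)
__global__ void rmsnorm_kernel(float* o, const float* x, const float* weight, int size) {
    float ss = 0.0f;
    for (int j = 0; j < size; j++) { ss += x[j] * x[j]; }
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    __syncthreads();  // all reads of x done before writes, in case o aliases x
    for (int i = threadIdx.x; i < size; i += blockDim.x) {
        o[i] = weight[i] * (ss * x[i]);
    }
}
void rmsnorm(float* o, float* x, float* weight, int size) {
    rmsnorm_kernel<<<1, 256>>>(o, x, weight, size);
}
#else
// essentially the C version from run.c
void rmsnorm(float* o, float* x, float* weight, int size) {
    float ss = 0.0f;
    for (int j = 0; j < size; j++) { ss += x[j] * x[j]; }
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    for (int j = 0; j < size; j++) { o[j] = weight[j] * (ss * x[j]); }
}
#endif
```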

I used cuBLAS to leverage that library's expertise for the SGEMV function. It adds some startup time overhead via cublasCreate, so I'm waffling on keeping that code at the moment. I might go back to the previous matmul kernel code from ankan-ban.
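Roughly, the cuBLAS path looks like this (a sketch; the handle name and exact call site are mine, and w, x, xout are assumed to be device pointers):

```c
#include <cublas_v2.h>

cublasHandle_t g_handle;   // created once at startup with cublasCreate(&g_handle)

// xout = W @ x, where W is (d, n) row-major, x is (n,), xout is (d,)
void matmul(float* xout, float* x, float* w, int n, int d) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major, so the row-major (d, n) W is seen as an (n, d)
    // matrix with lda = n; asking for the transposed product yields W @ x.
    cublasSgemv(g_handle, CUBLAS_OP_T, n, d,
                &alpha, w, n, x, 1, &beta, xout, 1);
}
```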

I tried to leave the rest of the code mostly untouched. But since nvcc is a C++ compiler, there were a few places where values had to be cast to avoid errors. To get the Windows build working, I also worked around one bit of C syntax that made cl.exe unhappy.
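For example, the kind of cast involved (an illustrative snippet, not a specific line from this PR):

```c
#include <stdlib.h>

void alloc_example(int dim) {
    // plain C accepts calloc's void* return implicitly; nvcc compiles this as
    // C++, which requires the explicit cast
    float* x = (float*)calloc(dim, sizeof(float));
    free(x);
}
```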

I'm not very familiar with GitHub pull requests, so bear with me if I have anything wrong.

Commit messages from this PR's history:

- builds & perf basically matches 'make run'; using float, not half, though, trying for an apples-to-apples comparison
- output is gibberish
- very, very slow
- doesn't trigger anything in compute-sanitizer
- found a < that needed to be a <= and used some llama2.cu code; need to grok why still
- still can't find issue with suspect code
- note why multi_head_attention_kernel uses llama2.cu code instead
- added some TODO items & notes
@karpathy
Owner

karpathy commented Sep 3, 2023

Thank you for the PR! I'm traveling right now so a bit slower on reply, but looking forward to taking a look

@rogerallen
Author

Found a straightforward way for anyone to see the perf impact of CUDA via Google Colab.

First, bring up a Google Colab notebook and select a GPU runtime.

Next, wget the necessary bits from the repo plus stories110M.bin:

!wget https://github.com/rogerallen/llama2.cu/raw/master/Makefile
!wget https://github.com/rogerallen/llama2.cu/raw/master/run.cu
!wget https://github.com/rogerallen/llama2.cu/raw/master/run.c
!wget https://github.com/rogerallen/llama2.cu/raw/master/tokenizer.bin
!wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

Build:

!make runomp runcuda

Run the CUDA and C versions:

!./runcuda stories110M.bin -s 111 -i "Mary had a little lamb"
!./run stories110M.bin -s 111 -i "Mary had a little lamb"

I get matching output & about 320-350 tok/s for CUDA and 20-30 for C.

@GilesBathgate

GilesBathgate commented Nov 24, 2023

@rogerallen I don't have much RAM on my GPU (10GB). For the llama_7b model I added some debug output: the first cudaMalloc is ~7GB and the second cudaMalloc is ~6GB.

I notice the comment:

// allocate & copy mmap data to the gpu first
// TODO: allocate & copy just a portion to the GPU if the weights are too big
// to fit in the GPU, then copy the data only as needed while running.

Can it be done? Will it work for my modest GPU?

@rogerallen
Author

I have not implemented this feature; I wrote that comment without really thinking the situation through. Considering that we would have to push all of the weights to the GPU on every forward() pass to generate a token, this probably isn't that interesting to do.
We would be limited by PCIe bandwidth: if the weights are ~13GB and your PCIe bandwidth is ~16GB/s, we'd be limited to about 1 token/second. So, while it would work, the performance will not be that interesting.
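To make the arithmetic explicit: (~16 GB/s) / (~13 GB transferred per token) ≈ 1.2 tokens/s as an upper bound, before counting any compute time.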

My suggestion would be to look at some of the other CUDA ports using FP16 & FP8 weights. Those should cut the weight size by 2x or 4x, hopefully fit in your GPU, and provide more interesting performance.

@GilesBathgate

GilesBathgate commented Nov 24, 2023

I tried run-q8.cu but it was just segfaulting, then tried using his quantise-q8-cuda branch to first quantize the weights. It now works without crashing but outputs garbage.

@GilesBathgate

This one works well! Pure CUDA inference for 4-bit AWQ quantized models.
