Simple float CUDA port #376

Open · rogerallen wants to merge 43 commits into master

Conversation

rogerallen

This is my attempt at a simple port to CUDA. I'm hopeful this can serve as an example for anyone who wants to learn how to use CUDA for LLMs.

I was inspired by, and used code from, https://github.com/ankan-ban/llama2.cu as a starting point. When I came upon that code, I noticed the upstream repository had progressed and their llama2.cu no longer worked against it. So I restructured the CUDA code so that run.cu can be kept up to date against run.c via diff. So far, keeping up hasn't been too bad.

The ankan-ban repository also seems focused on making 16-bit float and 8-bit int work. That is very cool and I hope they continue, but I hoped there would be room for a straight float port. It is not my intent to step on any toes here & I'm sorry if this comes across that way. I just think this might mesh with the existing code better.

Even with a simple port like this, I see a significant performance increase. Using stories110M.bin, I get about a 5x speedup over the runomp app on my 14-core Intel i9 laptop with an NVIDIA RTX 4050 running WSL2. My older 12-core i9 Linux desktop with an NVIDIA RTX 3070 sees about 15x.

Both Linux & Windows builds are working.

To make the C and CUDA easy to compare, I extracted each function called inside the forward routine and wrapped it with a USE_CUDA define, so the two implementations sit side by side.
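To sketch the pattern concretely (this is illustrative, not copied verbatim from run.cu; in the CUDA branch the pointers are assumed to already live on the device):

```cuda
#include <math.h>

#ifdef USE_CUDA
// deliberately naive kernel for the sketch: every thread recomputes the sum
// of squares (correct but slow)
__global__ void rmsnorm_kernel(float* o, const float* x, const float* weight, int size) {
    float ss = 0.0f;
    for (int j = 0; j < size; j++) { ss += x[j] * x[j]; }
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    __syncthreads();  // all reads of x done before writes, in case o aliases x
    for (int i = threadIdx.x; i < size; i += blockDim.x) {
        o[i] = weight[i] * (ss * x[i]);
    }
}
void rmsnorm(float* o, float* x, float* weight, int size) {
    rmsnorm_kernel<<<1, 256>>>(o, x, weight, size);
}
#else
// essentially the C version from run.c
void rmsnorm(float* o, float* x, float* weight, int size) {
    float ss = 0.0f;
    for (int j = 0; j < size; j++) { ss += x[j] * x[j]; }
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    for (int j = 0; j < size; j++) { o[j] = weight[j] * (ss * x[j]); }
}
#endif
```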

I used cuBLAS to leverage that library's expertise for the SGEMV function. It adds some startup time overhead via cublasCreate, so I'm waffling on keeping that code at the moment. I might go back to the previous matmul kernel code from ankan-ban.
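Roughly, the cuBLAS path looks like this (a sketch; the handle name and exact call site are mine, and w, x, xout are assumed to be device pointers):

```c
#include <cublas_v2.h>

cublasHandle_t g_handle;   // created once at startup with cublasCreate(&g_handle)

// xout = W @ x, where W is (d, n) row-major, x is (n,), xout is (d,)
void matmul(float* xout, float* x, float* w, int n, int d) {
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS is column-major, so the row-major (d, n) W is seen as an (n, d)
    // matrix with lda = n; asking for the transposed product yields W @ x.
    cublasSgemv(g_handle, CUBLAS_OP_T, n, d,
                &alpha, w, n, x, 1, &beta, xout, 1);
}
```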

I tried to leave the rest of the code mostly untouched. But since nvcc is a C++ compiler, there were a few places where values had to be cast to avoid errors. To get the Windows build working, I also worked around one bit of C syntax that made cl.exe unhappy.
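For example, the kind of cast involved (an illustrative snippet, not a specific line from this PR):

```c
#include <stdlib.h>

void alloc_example(int dim) {
    // plain C accepts calloc's void* return implicitly; nvcc compiles this as
    // C++, which requires the explicit cast
    float* x = (float*)calloc(dim, sizeof(float));
    free(x);
}
```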

I'm not very familiar with GitHub pull requests, so bear with me if I have anything wrong.

Commit messages from this PR's history:

- builds & perf basically matches 'make run'; using float, not half, though, trying for an apples-to-apples comparison
- output is gibberish
- very, very slow
- doesn't trigger anything in compute-sanitizer
- found a < that needed to be a <= and used some llama2.cu code; need to grok why still
- still can't find issue with suspect code
- note why multi_head_attention_kernel uses llama2.cu code instead
- added some TODO items & notes
@karpathy
Owner

karpathy commented Sep 3, 2023

Thank you for the PR! I'm traveling right now so a bit slower on reply, but looking forward to taking a look

@rogerallen
Author

Found a straightforward way for anyone to see the perf impact of CUDA via Google Colab.

First, bring up a Google Colab notebook and select a GPU runtime.

Next, wget the necessary bits from the repo plus stories110M.bin:

!wget https://github.com/rogerallen/llama2.cu/raw/master/Makefile
!wget https://github.com/rogerallen/llama2.cu/raw/master/run.cu
!wget https://github.com/rogerallen/llama2.cu/raw/master/run.c
!wget https://github.com/rogerallen/llama2.cu/raw/master/tokenizer.bin
!wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin

Build:

!make runomp runcuda

Run the CUDA and C versions:

!./runcuda stories110M.bin -s 111 -i "Mary had a little lamb"
!./run stories110M.bin -s 111 -i "Mary had a little lamb"

I get matching output & about 320-350 tok/s for CUDA and 20-30 for C.

@GilesBathgate

GilesBathgate commented Nov 24, 2023

@rogerallen I don't have much RAM on my GPU (10GB). For the llama_7b model I added some debug output: the first cudaMalloc is ~7GB and the second cudaMalloc is ~6GB.

I notice the comment:

// allocate & copy mmap data to the gpu first
// TODO: allocate & copy just a portion to the GPU if the weights are too big
// to fit in the GPU, then copy the data only as needed while running.

Can it be done? Will it work for my modest GPU?

@rogerallen
Author

I have not implemented this feature; I wrote that comment without really thinking the situation through. Considering that we would have to push all of the weights to the GPU on every forward() pass to generate a token, this probably isn't that interesting to do.
We would be limited by PCIe bandwidth: if the weights are ~13GB and your PCIe bandwidth is ~16GB/s, we'd be limited to about 1 token/second. So, while it would work, the performance will not be that interesting.
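To make the arithmetic explicit: (~16 GB/s) / (~13 GB transferred per token) ≈ 1.2 tokens/s as an upper bound, before counting any compute time.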

My suggestion would be to look at some of the other CUDA ports using FP16 & FP8 weights. Those should cut the weight size by 2x or 4x, hopefully fit in your GPU, and provide more interesting performance.

@GilesBathgate

GilesBathgate commented Nov 24, 2023

I tried run-q8.cu but it was just segfaulting, then tried using his quantise-q8-cuda branch to first quantize the weights. It now works without crashing but outputs garbage.

@GilesBathgate

This one works well! Pure CUDA inference for 4-bit AWQ quantized models.
