Heterogeneous Speculative Decoding (CPU + GPU) #5065
Conversation
Hi @cadedaniel. This PR extends speculative decoding so that the draft model can run on the CPU. I only replaced the proposal worker with CPUWorker. However, the performance is not acceptable because the draft model on the CPU is slow. I'd like to know whether we could make this asynchronous, so the CPU and the CUDA device run at the same time. The process could be: CPU runs 1st req -> async CUDA runs 1st req -> CPU runs 2nd req -> await CUDA finishing 1st req -> async CUDA runs 2nd req -> CPU runs 1st req ... That could be interesting and is perhaps the main benefit of heterogeneous speculative decoding. I was looking for guidance on how to run two requests at the same time, and whether it is possible to use the async LLM engine. Thx!
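The interleaving described above can be sketched with `asyncio`. This is a minimal simulation only: `cpu_draft` and `gpu_verify` are hypothetical stand-ins for the real draft/verify steps (they are not vLLM APIs), and sleeps model device latency. The point is just the overlap: the CPU drafts for the next request while the GPU is still verifying the previous one.

```python
import asyncio

# Hypothetical stand-ins for the real draft/verify steps; names and
# latencies are illustrative, not vLLM APIs.
async def cpu_draft(req_id: int, step: int) -> list[int]:
    await asyncio.sleep(0.01)   # simulate slow CPU draft-model latency
    return [step] * 2           # pretend we proposed 2 draft tokens

async def gpu_verify(req_id: int, proposal: list[int]) -> list[int]:
    await asyncio.sleep(0.005)  # simulate GPU target-model latency
    return proposal             # pretend all proposed tokens were accepted

async def pipeline(num_reqs: int, num_steps: int) -> dict[int, list[int]]:
    """Overlap CPU drafting of request i+1 with GPU verification of request i."""
    accepted: dict[int, list[int]] = {r: [] for r in range(num_reqs)}
    for step in range(num_steps):
        pending = None  # the in-flight GPU verification, if any
        for req in range(num_reqs):
            # CPU drafts while the previous request's GPU task runs.
            proposal = await cpu_draft(req, step)
            if pending is not None:
                prev_req, task = pending
                accepted[prev_req] += await task
            pending = (req, asyncio.create_task(gpu_verify(req, proposal)))
        prev_req, task = pending
        accepted[prev_req] += await task
    return accepted

result = asyncio.run(pipeline(num_reqs=2, num_steps=3))
print(result)
```

As Cade notes below, the catch is that a draft produced this way would not see the tokens the target model accepted in the overlapping step, so in practice the draft would be working from stale context.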
It's challenging to do this because even if you make the draft model async, it won't have the latest accepted tokens from the target model during drafting. That probably means we need a faster proposal method on CPU, like a high-quality ngram model.
Hi @cadedaniel. Thanks for your reply; what I meant was making the target model async. Anyway, we could review this PR first. When I ran speculative decoding on GPU (both target and draft model on the same CUDA device), I found it was very slow, so I checked the matched tokens and found issue #6285. Do you mind taking a look at that issue? It seems like a critical bug in speculative decoding. I think this PR will be ready for review once the issue is fixed. Thx!
Hi @cadedaniel. I think this PR should be ready for review since issue #6285 has been fixed. Do you mind reviewing it? Thx! The main remaining problem is that I compile both the CPU and GPU ops, so we need to figure out how to handle users who want to run on CPU only. I'd like to hear your advice :)
Hi @cadedaniel @LiuXiaoxuanPKU @bigPYJ1151. Once we figure out a solution for this issue, I can update the CMake file to pass the tests. Please let me know your opinion :)
Hello @jiqing-feng. I can review the PR if you can show a performance benefit in some configuration that users will benefit from. Can you demonstrate a performance improvement and the configuration used with this PR? See below for the current performance improvement on GPU models. Thanks. Cade
Sure! I am preparing the benchmark script and environments. I cannot run very large models such as 70B because I have limited compute resources, so llama-7b + llama-68m will be my choice. The performance data will be posted here once it's finished, thx!
Hi @cadedaniel. I tested this PR with llama-7b + llama-68m on one SPR (Intel 4th-gen Xeon) host with an NVIDIA 4090 card. The result is: I initialize the LLM as follows:

```python
llm = LLM(
    model=model,
    dtype=torch.float16,
    speculative_model=assistant_model,  # the draft model; must share the target model's vocabulary
    tensor_parallel_size=1,
    num_speculative_tokens=2 if assistant_model else None,  # the number of speculative tokens to score
    use_v2_block_manager=True if assistant_model else None,
    max_num_seqs=args.max_num_seqs,
    enforce_eager=True,
    cpu_draft_worker=True if assistant_model else None,
)
```

I only tested
I feel the improvements are too marginal to support this proposer, unfortunately.
Hi @cadedaniel. Do you mind starting to review this PR? I have cleaned up all the tests except the failing AMD tests. There are still 2 issues that need your help:
Hi @cadedaniel. This PR is ready to be reviewed; I have cleaned up all tests and added an end-to-end test for this feature. Please let me know your opinion. Thx!
This PR enables heterogeneous speculative decoding via the `cpu_draft_worker` parameter. To make it happen, some changes need to be reviewed: