
Heterogeneous Speculative Decoding (CPU + GPU) #5065

Open
wants to merge 63 commits into main

Conversation

jiqing-feng

@jiqing-feng jiqing-feng commented May 27, 2024

This PR enables heterogeneous speculative decoding via the new parameter cpu_draft_worker.

To make it happen, some changes need to be reviewed:

  1. Compile CPU ops in the CUDA backend.
  2. Tensor-driven dispatch for custom_ops (see the sketch after this list).
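
To illustrate the second point, here is a minimal sketch of what tensor-driven dispatch could look like; the dispatch_op helper and the op table below are only an illustration, not the actual custom_ops code in this PR:

import torch

# Illustrative only: each op has a CPU and a CUDA implementation registered in a
# table, and the call is routed by the device of the input tensor, so the CPU
# draft worker and the GPU target worker can share the same Python entry point.
# In a real build, the entries would call the compiled CPU/CUDA kernels.
_OP_TABLE = {
    "silu_and_mul": {
        "cpu": lambda x: torch.nn.functional.silu(x[..., : x.shape[-1] // 2]) * x[..., x.shape[-1] // 2 :],
        "cuda": lambda x: torch.nn.functional.silu(x[..., : x.shape[-1] // 2]) * x[..., x.shape[-1] // 2 :],
    },
}

def dispatch_op(op_name: str, x: torch.Tensor, *args, **kwargs):
    backend = "cuda" if x.is_cuda else "cpu"
    return _OP_TABLE[op_name][backend](x, *args, **kwargs)

# The same call works whether x lives on CPU (draft model) or CUDA (target model).
x = torch.randn(4, 16)
out = dispatch_op("silu_and_mul", x)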

@jiqing-feng
Author

jiqing-feng commented Jul 3, 2024

Hi @cadedaniel. This PR extends speculative decoding by enabling the draft model to run on the CPU. I only replaced the proposal worker with a CPUWorker. However, the performance is not acceptable because the draft model on the CPU is slow.

I'd like to know if we could do it with an async method, meaning the CPU and CUDA could run at the same time. The process could be: CPU runs 1st req -> async CUDA runs 1st req -> CPU runs 2nd req -> await CUDA finishing 1st req -> async CUDA runs 2nd req -> CPU runs 1st req ......
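
To make the idea concrete, here is a rough sketch of the interleaving I have in mind, with simulated workers (this is not the vLLM engine API, just the scheduling pattern):

import asyncio

# Purely illustrative: while the GPU verifies the draft for request i-1,
# the CPU drafts for request i.

async def cpu_draft(req_id: int) -> list[int]:
    await asyncio.sleep(0.01)            # stand-in for CPU draft-model forward passes
    return [req_id] * 4                  # fake proposal tokens

async def gpu_verify(req_id: int, draft: list[int]) -> list[int]:
    await asyncio.sleep(0.02)            # stand-in for GPU target-model scoring
    return draft[:2]                     # fake accepted tokens

async def interleaved_loop(num_reqs: int = 4):
    pending = None                       # (req_id, in-flight GPU verification task)
    for req_id in range(num_reqs):
        draft = await cpu_draft(req_id)              # CPU runs req i
        if pending is not None:
            prev_id, task = pending
            accepted = await task                    # await CUDA finishing req i-1
            print(f"req {prev_id}: accepted {accepted}")
        pending = (req_id, asyncio.create_task(gpu_verify(req_id, draft)))  # async CUDA runs req i
    prev_id, task = pending
    accepted = await task
    print(f"req {prev_id}: accepted {accepted}")

asyncio.run(interleaved_loop())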

It could be interesting, and perhaps the main benefit of heterogeneous speculative decoding.

I was looking for guidance on how to run two requests at the same time, and whether it is possible to use the async LLM engine.

Thx!

cc @LiuXiaoxuanPKU

@jiqing-feng jiqing-feng marked this pull request as ready for review July 4, 2024 08:36
@cadedaniel
Collaborator

It's challenging to do this because even if you get an async draft model, it won't have the latest accepted tokens from the target model during drafting. That probably means we need a faster proposal method on the CPU, like a high-quality n-gram model.
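
For example, something in the spirit of prompt-lookup / n-gram proposing. The sketch below is a rough illustration of the idea only, not vLLM's actual n-gram proposer:

# Search the existing context for the most recent earlier occurrence of the last
# n tokens and propose whatever followed it. This is cheap enough to run on the
# CPU without a draft model.
def ngram_propose(token_ids: list[int], n: int = 3, num_speculative_tokens: int = 4) -> list[int]:
    if len(token_ids) <= n:
        return []
    pattern = token_ids[-n:]
    # Scan backwards, excluding the trailing n tokens themselves.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == pattern:
            return token_ids[start + n:start + n + num_speculative_tokens]
    return []

# Example: the last trigram (7, 8, 9) appeared earlier, so propose what followed it.
ctx = [1, 2, 7, 8, 9, 4, 5, 6, 7, 8, 9]
print(ngram_propose(ctx))  # -> [4, 5, 6, 7]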

@jiqing-feng
Author

Hi @cadedaniel. Thanks for your reply; what I meant was an async target model. Anyway, we could review this PR first.

When I ran speculative decoding on GPU (both the target and the draft model on the same CUDA device), I found it was very slow, so I checked the matched tokens and found issue #6285. Do you mind taking a look at it? It seems like a critical bug in speculative decoding.

I think this PR will be ready for review once that issue is fixed. Thx!

cc @LiuXiaoxuanPKU

@jiqing-feng
Author

jiqing-feng commented Jul 11, 2024

Hi @cadedaniel. I think this PR should be ready for review since issue #6285 has been fixed. Do you mind reviewing it? Thx!

The main problem is that I compile both CPU and GPU ops, so we need to figure out what to do when users want to run on CPU only. I would like to hear your advice :)
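
One hypothetical direction, sketched below purely as an illustration (the VLLM_BUILD_CPU_DRAFT_OPS variable and the extension names are made up for this example, not vLLM's actual setup.py logic): gate the extra CPU-op extension behind an opt-in flag so CPU-only and CUDA-only users don't build kernels they don't need.

import os

# Hypothetical build-selection helper, for discussion only.
def select_extensions() -> list[str]:
    target = os.getenv("VLLM_TARGET_DEVICE", "cuda")
    build_cpu_draft = os.getenv("VLLM_BUILD_CPU_DRAFT_OPS", "0") == "1"

    extensions = []
    if target == "cpu":
        extensions.append("cpu_ops")           # CPU-only install: just the CPU kernels
    else:
        extensions.append("cuda_ops")          # CUDA kernels
        if build_cpu_draft:
            extensions.append("cpu_ops")       # opt-in CPU kernels for the CPU draft worker
    return extensions

print(select_extensions())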

@jiqing-feng changed the title from [WIP] Hete spec decode to Hete spec decode on Jul 18, 2024
@jiqing-feng
Author

jiqing-feng commented Jul 18, 2024

> Hi @cadedaniel. I think this PR should be ready for review since issue #6285 has been fixed. Do you mind reviewing it? Thx!
>
> The main problem is that I compile both CPU and GPU ops, so we need to figure out what to do when users want to run on CPU only. I would like to hear your advice :)

Hi @cadedaniel @LiuXiaoxuanPKU @bigPYJ1151

Once we figure out a solution for this issue, I could update the Cmake file to pass the tests. Please let me know your opinion :)

@cadedaniel
Collaborator

Hello @jiqing-feng. I can review the PR if you can show a performance benefit in some configuration that users will benefit from. Can you demonstrate a performance improvement and share the configuration used with this PR? See below for the current performance improvement on GPU models.

Thanks.

Cade

[chart: current speculative decoding performance improvement on GPU models]

@jiqing-feng
Author

jiqing-feng commented Jul 18, 2024

> Hello @jiqing-feng. I can review the PR if you can show a performance benefit in some configuration that users will benefit from. Can you demonstrate a performance improvement and share the configuration used with this PR? See below for the current performance improvement on GPU models.
>
> Thanks.
>
> Cade
>
> [chart: current speculative decoding performance improvement on GPU models]

Sure! I am preparing the benchmark script and environment. I cannot run very large models such as 70B because I have limited compute resources, so llama-7b + llama-68m will be my choice.

The performance data will be posted here once the runs are finished. Thx!

@jiqing-feng
Author

jiqing-feng commented Aug 14, 2024

Hi @cadedaniel

I tested this PR with llama-7b + llama-68m on one SPR (Intel 4th Gen Xeon) host with an NVIDIA 4090 card. The results are:
baseline: 59 tokens/s
num_speculative_tokens=1: 64 tokens/s
num_speculative_tokens=2: 64 tokens/s
num_speculative_tokens=3: 60 tokens/s
num_speculative_tokens=4: 60 tokens/s

I initialize the LLM with:

from vllm import LLM
import torch

# `model` and `assistant_model` are the target and draft model paths;
# `assistant_model` is None for the baseline run without speculation.
llm = LLM(
    model=model,
    dtype=torch.float16,
    speculative_model=assistant_model,  # The draft model. Must have the same vocabulary as the target model.
    tensor_parallel_size=1,
    num_speculative_tokens=2 if assistant_model else None,  # The number of speculative tokens to score.
    use_v2_block_manager=True if assistant_model else None,
    max_num_seqs=args.max_num_seqs,  # from the benchmark script's argparse args
    enforce_eager=True,
    cpu_draft_worker=True if assistant_model else None,
)
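
For reference, the tokens/s numbers above can be measured with something roughly like the snippet below, reusing the llm constructed above (the prompt and max_tokens here are placeholders, not my exact benchmark script):

import time
from vllm import SamplingParams

# Rough throughput measurement: total generated tokens divided by wall-clock time.
prompts = ["Explain speculative decoding in one paragraph."]
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s")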

I only tested max_num_seqs=1; do you think that is enough? If not, please let me know how to run the test and how to set the QPS as in your chart. Thx!

@cadedaniel
Collaborator

I feel the improvements are too marginal to support this proposer, unfortunately.

@jiqing-feng
Author

Hi @cadedaniel. Do you mind starting to review this PR? I have cleaned up all the tests except the failing AMD tests. There are still two issues where I need your help:

  1. Why do the AMD tests fail, and how can I reproduce them to debug?
  2. Where should I add tests, and what kind of tests should they be?

@jiqing-feng
Author

Hi @cadedaniel. This PR is ready for review; I have cleaned up all the tests and added an end-to-end test for this feature. Please let me know your opinion. Thx!
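
Roughly, an end-to-end check for this feature could look like the sketch below (the test name, model choices, and parameter values are illustrative, not the exact test added in this PR): with greedy sampling, output with the CPU draft worker enabled should match the plain GPU target model.

import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("num_speculative_tokens", [2])
def test_cpu_draft_worker_matches_baseline(num_speculative_tokens):
    prompt = ["The capital of France is"]
    params = SamplingParams(temperature=0.0, max_tokens=32)

    # Baseline: target model only, no speculation.
    baseline = LLM(model="JackFram/llama-68m", enforce_eager=True)
    baseline_out = baseline.generate(prompt, params)[0].outputs[0].text
    del baseline

    # Heterogeneous speculative decoding: draft model on CPU via the flag from this PR.
    spec = LLM(
        model="JackFram/llama-68m",
        speculative_model="JackFram/llama-68m",
        num_speculative_tokens=num_speculative_tokens,
        use_v2_block_manager=True,
        enforce_eager=True,
        cpu_draft_worker=True,
    )
    spec_out = spec.generate(prompt, params)[0].outputs[0].text

    assert spec_out == baseline_out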
