Heterogeneous Speculative Decoding (CPU + GPU) #5065
Conversation
Hi @cadedaniel. This PR extends speculative decoding so that the draft model can run on the CPU. I only replaced the proposal worker with CPUWorker. However, the performance is not acceptable because the draft model on the CPU is slow. I'd like to know whether we could make this asynchronous, so the CPU and the CUDA device run at the same time. The process could be: CPU runs 1st req -> async CUDA runs 1st req -> CPU runs 2nd req -> await CUDA finishing 1st req -> async CUDA runs 2nd req -> CPU runs 1st req ... That could be interesting and is perhaps the main benefit of heterogeneous speculative decoding. I was looking for guidance on how to run two requests at the same time, and whether it is possible to use the async LLM engine. Thx!
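The interleaving described above can be sketched with `asyncio`. This is a minimal simulation only: `cpu_draft` and `gpu_verify` are hypothetical stand-ins for the real draft/verify steps (they are not vLLM APIs), and sleeps model device latency. The point is just the overlap: the CPU drafts for the next request while the GPU is still verifying the previous one.

```python
import asyncio

# Hypothetical stand-ins for the real draft/verify steps; names and
# latencies are illustrative, not vLLM APIs.
async def cpu_draft(req_id: int, step: int) -> list[int]:
    await asyncio.sleep(0.01)   # simulate slow CPU draft-model latency
    return [step] * 2           # pretend we proposed 2 draft tokens

async def gpu_verify(req_id: int, proposal: list[int]) -> list[int]:
    await asyncio.sleep(0.005)  # simulate GPU target-model latency
    return proposal             # pretend all proposed tokens were accepted

async def pipeline(num_reqs: int, num_steps: int) -> dict[int, list[int]]:
    """Overlap CPU drafting of request i+1 with GPU verification of request i."""
    accepted: dict[int, list[int]] = {r: [] for r in range(num_reqs)}
    for step in range(num_steps):
        pending = None  # the in-flight GPU verification, if any
        for req in range(num_reqs):
            # CPU drafts while the previous request's GPU task runs.
            proposal = await cpu_draft(req, step)
            if pending is not None:
                prev_req, task = pending
                accepted[prev_req] += await task
            pending = (req, asyncio.create_task(gpu_verify(req, proposal)))
        prev_req, task = pending
        accepted[prev_req] += await task
    return accepted

result = asyncio.run(pipeline(num_reqs=2, num_steps=3))
print(result)
```

As Cade notes below, the catch is that a draft produced this way would not see the tokens the target model accepted in the overlapping step, so in practice the draft would be working from stale context.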
It's challenging to do this because even if you make the draft model async, it won't have the latest accepted tokens from the target model during drafting. That probably means we need a faster proposal method on CPU, like a high-quality ngram model.
Hi @cadedaniel. Thanks for your reply; what I meant was making the target model async. Anyway, we could review this PR first. When I ran speculative decoding on GPU (both target and draft model on the same CUDA device), I found it was very slow, so I checked the matched tokens and found issue #6285. Do you mind taking a look at that issue? It seems like a critical bug in speculative decoding. I think this PR will be ready for review once the issue is fixed. Thx!
Hi @cadedaniel. I think this PR should be ready for review since issue #6285 has been fixed. Do you mind reviewing it? Thx! The main remaining problem is that I compile both the CPU and GPU ops, so we need to figure out how to handle users who want to run on CPU only. I'd like to hear your advice :)
Hi @cadedaniel @LiuXiaoxuanPKU @bigPYJ1151. Once we figure out a solution for this issue, I can update the CMake file to pass the tests. Please let me know your opinion :)
Hello @jiqing-feng. I can review the PR if you can show a performance benefit in some configuration that users will benefit from. Can you demonstrate a performance improvement and the configuration used with this PR? See below for the current performance improvement on GPU models. Thanks. Cade
Sure! I am preparing the benchmark script and environments. I cannot run very large models such as 70B because I have limited compute resources, so llama-7b + llama-68m will be my choice. The performance data will be posted here once it's finished, thx!
Hi @cadedaniel. I tested this PR with llama-7b + llama-68m on one SPR (Intel 4th-gen Xeon) host with an NVIDIA 4090 card. The result is: I initialize the LLM as follows:

```python
llm = LLM(
    model=model,
    dtype=torch.float16,
    speculative_model=assistant_model,  # the draft model; must share the target model's vocabulary
    tensor_parallel_size=1,
    num_speculative_tokens=2 if assistant_model else None,  # the number of speculative tokens to score
    use_v2_block_manager=True if assistant_model else None,
    max_num_seqs=args.max_num_seqs,
    enforce_eager=True,
    cpu_draft_worker=True if assistant_model else None,
)
```

I only tested
I feel the improvements are too marginal to support this proposer, unfortunately.
Hi @cadedaniel. Do you mind starting to review this PR? I have cleaned up all the tests except the failing AMD tests. There are still 2 issues that need your help:
Hi @cadedaniel. This PR is ready to be reviewed; I have cleaned up all tests and added an end-to-end test for this feature. Please let me know your opinion. Thx!
This PR enables heterogeneous speculative decoding via the `cpu_draft_worker` parameter. To make it happen, some changes need to be reviewed: