Motivation
LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for the weights) allows running ~70B models on 48GB of VRAM with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.
This is very handy, since it's often easier to get a GPU (or dual GPUs) with 48GB of VRAM than it is to get 80GB+ GPUs.
Note that I've benchmarked the output quality/accuracy of the 4-bit prefix cache vs. an unquantized cache, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free performance boost.
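To sanity-check the 40,000-token figure, here's a back-of-the-envelope sizing sketch. The model dimensions are the published Llama-2-70B config; the 4 GiB cache budget is my own assumption, and the 4-bit figure ignores the small per-group scale/zero-point overhead of quantization:

```python
# Rough KV (prefix) cache sizing for a Llama-2-70B-style model.
LAYERS = 80
KV_HEADS = 8    # Llama-2-70B uses grouped-query attention (8 KV heads)
HEAD_DIM = 128

def kv_bytes_per_token(bytes_per_elem: float) -> int:
    # One K and one V vector per layer, per token.
    return int(2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem)

fp16 = kv_bytes_per_token(2.0)   # 327,680 bytes = 320 KiB/token
int4 = kv_bytes_per_token(0.5)   # 81,920 bytes = 80 KiB/token

# Assuming ~4 GiB of VRAM is left for the cache after the 4-bit AWQ
# weights (~35 GB) and activations:
budget = 4 * 1024**3
print(budget // fp16)  # ~13k tokens at fp16
print(budget // int4)  # ~52k tokens at 4-bit — consistent with "more than 40,000"
```

So the 4-bit cache holds 4x the tokens of an fp16 cache for the same budget, which is where the headroom for many concurrent users comes from.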
Today I wanted to try comparing SGLang performance to LMDeploy, but (for a 70B model on 48GB GPU) SGLang OOMs for even a small number of concurrent requests.
I'm testing with a Llama 2 AWQ model with a ~2k-token context and ~100-token outputs:
LMDeploy (handles 20 concurrent requests fine):
Using the latest (`openmmlab/lmdeploy:v0.6.0a0-cu12`) Docker image on a 48GB NVIDIA A40 GPU.

SGLang (OOM at >=4 concurrent requests):
Using the latest (`lmsysorg/sglang:v0.3.0-cu121`) Docker image on a 48GB NVIDIA A40 GPU.

For reference, here are some example OOM logs from SGLang that I'm seeing: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24
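The issue doesn't include the exact invocations used for the comparison, so the commands below are a sketch of how such a setup is typically launched; the model repo name is a hypothetical placeholder for whichever Llama 2 70B AWQ checkpoint was used:

```shell
# LMDeploy: 4-bit AWQ weights plus 4-bit quantized KV/prefix cache
# (--quant-policy 4 enables 4-bit KV cache quantization in LMDeploy).
docker run --gpus all -p 23333:23333 openmmlab/lmdeploy:v0.6.0a0-cu12 \
  lmdeploy serve api_server TheBloke/Llama-2-70B-Chat-AWQ \
  --model-format awq --quant-policy 4

# SGLang: AWQ weights, but no 4-bit KV-cache option as of v0.3.0.
docker run --gpus all -p 30000:30000 lmsysorg/sglang:v0.3.0-cu121 \
  python3 -m sglang.launch_server --model-path TheBloke/Llama-2-70B-Chat-AWQ \
  --quantization awq --host 0.0.0.0 --port 30000
```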
It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in a many-user scenario without a 4-bit quantized prefix cache.
Related resources
No response