[Feature] 4-bit quantized prefix cache #1374

Open
josephrocca opened this issue Sep 10, 2024 · 2 comments
Comments

@josephrocca (Contributor) commented Sep 10, 2024

Motivation

LMDeploy's 4-bit quantized prefix cache (along with 4-bit AWQ for the weights) makes it possible to run ~70B models on 48GB of VRAM with good performance in many-user scenarios. The prefix cache can hold more than 40,000 context tokens.

This is very handy, since it's often easier to get a GPU (or a pair of GPUs) with 48GB of VRAM than it is to get 80GB+ GPUs.
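
As a rough sanity check on those numbers, here's a back-of-the-envelope estimate of the KV-cache footprint per token. It assumes a Llama-2-70B-style GQA layout (80 layers, 8 KV heads, head dim 128) and ignores the small overhead for scales/zero-points in the quantized cache, so treat it as an approximation:

```bash
# Approximate KV-cache bytes per token for an assumed Llama-2-70B-style config:
# 80 layers, 8 KV heads (GQA), head dim 128; 2x for K and V.
layers=80; kv_heads=8; head_dim=128
echo "FP16 : $(( 2 * layers * kv_heads * head_dim * 2 )) bytes/token"   # 327680 ≈ 320 KiB
echo "4-bit: $(( 2 * layers * kv_heads * head_dim / 2 )) bytes/token"   # 81920  ≈ 80 KiB
```

At roughly 80 KiB per token, 40,000+ cached tokens fit in about 3 GiB, versus around 12 GiB for an FP16 cache, which is where the extra headroom on a 48GB card comes from.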

Note that I've benchmarked the output quality/accuracy of the 4-bit prefix cache against no quantization, and there was no significant accuracy drop on my internal benchmarks. For my use case, at least, it's a free performance boost.

Today I wanted to compare SGLang's performance against LMDeploy's, but (for a 70B model on a 48GB GPU) SGLang OOMs even with a small number of concurrent requests.

I'm testing with a Llama 2 70B AWQ model, ~2k-token contexts, and ~100-token outputs:

LMDeploy (handles 20 concurrent requests fine):

Using the latest (openmmlab/lmdeploy:v0.6.0a0-cu12) Docker image on a 48GB NVIDIA A40 GPU:

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --server-port 3000 --tp $(nvidia-smi -L | wc -l) --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level INFO

SGLang (OOM at >=4 concurrent requests):

Using the latest (lmsysorg/sglang:v0.3.0-cu121) Docker image on a 48GB NVIDIA A40 GPU:

python3 -m sglang.launch_server --model-path lmdeploy/llama2-chat-70b-4bit --context-length 8192 --host 0.0.0.0 --port 3000 --tp-size $(nvidia-smi -L | wc -l)
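
In both cases, concurrency can be driven against the OpenAI-compatible /v1/chat/completions endpoint that the two servers expose; a single request looks roughly like this (the prompt below is a placeholder for the actual ~2k-token benchmark input):

```bash
# Illustrative single request; the real test sends ~2k-token prompts with ~100-token outputs,
# many of them in parallel.
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "lmdeploy/llama2-chat-70b-4bit",
        "messages": [{"role": "user", "content": "<~2k-token prompt here>"}],
        "max_tokens": 100
      }'
```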

For reference, here are some example OOM logs from SGLang: https://gist.github.com/josephrocca/1c688e312f5d570ca9a4652485ff6a24

It would be great if SGLang could become competitive with LMDeploy in this type of scenario, and I think it's hard to compete in many-user scenarios without a 4-bit quantized prefix cache.

Related resources

No response

@zhyncs (Member) commented Sep 10, 2024

@josephrocca Could you try adjusting the --mem-fraction-static parameter?
https://github.com/sgl-project/sglang?tab=readme-ov-file#additional-server-arguments

@merrymercy (Contributor) commented Sep 10, 2024

This is a great feature, and we welcome contributions on it.
For your OOM issue, can you try tuning some of the parameters below?

### Avoid out-of-memory by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
If you see out of memory (OOM) errors, you can decrease these parameters.
If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try to decrease `--max-running-requests`.
You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
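
Applied to the launch command from the issue description, that could look something like this (the specific values are illustrative starting points to tune from, not recommendations):

```bash
# Illustrative values only -- lower them further if OOM persists.
python3 -m sglang.launch_server --model-path lmdeploy/llama2-chat-70b-4bit \
  --context-length 8192 --host 0.0.0.0 --port 3000 --tp-size $(nvidia-smi -L | wc -l) \
  --chunked-prefill-size 2048 \
  --max-running-requests 8 \
  --mem-fraction-static 0.8
```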
