Blog for vllm optimizations on Intel GPU #112
---
layout: post
title: "Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM"
author: "Intel vLLM Team"
---

GPUs in the [Intel® Arc™ Pro B-Series family](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) deliver powerful AI capabilities with a focus on accessibility and an exceptional price-to-performance ratio. Their large memory capacity and multi-GPU scalability make it possible to run the latest large, capable AI models locally, bringing advanced inference within reach of professionals who want to deploy Large Language Models (LLMs) without the premium costs typically associated with AI hardware.

vLLM is at the core of the software stack that enables fast, cost-effective LLM serving on Intel Arc Pro B-Series GPUs. Over the past few months, Intel developers have been actively collaborating with the vLLM community to enable and optimize key features, ensuring seamless performance with multi-GPU scaling and PCIe P2P data transfer on Intel Arc Pro B-Series GPUs.

Based on the vLLM v1 engine, Intel® Arc™ Pro B-Series GPUs provide key vLLM features and optimizations, including:

- Solid inference performance for DeepSeek-distilled Llama/Qwen models
- Long context lengths (>50K tokens) with good scaling across batch sizes
- Support for embedding, reranker, and pooling models
- Support for multi-modal models
- Per-layer online quantization to reduce the required GPU memory
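
As a concrete illustration of the last point: per-layer online quantization computes a single scale per layer at load time, so each layer's weights can be held in int8 rather than 16-bit floats. A minimal sketch of the idea (illustrative helper names, not vLLM's actual implementation):

```python
def quantize_layer(weights):
    """Symmetric int8 quantization with one scale per layer:
    the whole layer shares a scale derived from its max magnitude."""
    scale = max((abs(w) for w in weights), default=0.0) / 127.0
    if scale == 0.0:
        return [0] * len(weights), 1.0
    # clamp to the int8 range and round to the nearest step
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_layer(q, scale):
    """Recover approximate weights on the fly during the matmul."""
    return [v * scale for v in q]
```

The memory saving comes from storing `q` (1 byte per weight) plus one scale per layer, at the cost of a small rounding error bounded by half a quantization step.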

## Advanced Optimizations for MoE Models


### Optimization 1. Single kernel launched in persistent loop

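
The section body is collapsed in this diff, but the technique named in the heading is a standard one: rather than launching one kernel per tile of work, a fixed set of workgroups is launched once and stays resident, with each group claiming the next tile from a shared counter until the grid is exhausted. A host-side sketch of the pattern, with Python threads standing in for GPU workgroups (illustrative only, not the kernel source):

```python
import threading
from itertools import count

def run_persistent(num_groups, num_tiles, work_fn):
    """Launch a fixed set of resident workers once; each loops,
    claiming the next tile index from a shared counter until the
    whole grid is consumed (no per-tile launch overhead)."""
    counter = count()               # shared work pointer; next() is atomic under the GIL
    results = [None] * num_tiles
    def worker(group_id):
        while True:
            tile = next(counter)    # claim the next unit of work
            if tile >= num_tiles:
                return              # grid exhausted: the persistent loop exits
            results[tile] = work_fn(group_id, tile)
    threads = [threading.Thread(target=worker, args=(g,)) for g in range(num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

On a GPU the shared counter would be a global atomic, but the control flow is the same: one launch, many tiles.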

### Optimization 2. Dynamic balancing of computing groups

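
This section's body is also collapsed; as named, the idea is that computing groups claim work dynamically instead of receiving a fixed contiguous slice, which evens out the skewed per-expert token counts typical of MoE routing. A toy cost model comparing the two policies (a greedy least-loaded heuristic stands in for the on-device claiming loop; illustrative only):

```python
def static_assign(costs, num_groups):
    """Fixed split: group g gets an equal-sized contiguous slice of experts,
    regardless of how much work each expert actually carries."""
    per = (len(costs) + num_groups - 1) // num_groups
    return [sum(costs[g * per:(g + 1) * per]) for g in range(num_groups)]

def dynamic_assign(costs, num_groups):
    """Each expert's work goes to the currently least-loaded group,
    mimicking groups that claim the next chunk as they finish."""
    loads = [0] * num_groups
    for c in costs:
        loads[loads.index(min(loads))] += c
    return loads
```

With skewed costs such as `[8, 8, 1, 1, 1, 1, 1, 1]` split over two groups, the static slice leaves one group with most of the work, while dynamic assignment keeps the maximum load, and hence the kernel's finish time, much lower.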

### Optimization 3. Fast MXFP4 to BFLOAT16 algorithm with prepack for memory load efficiency

`Bitcast-bf16 ((x << 12) >> 6 & 0x81c0) * 2^126`