diff --git a/_posts/2025-01-14-struct-decode-intro.md b/_posts/2025-01-14-struct-decode-intro.md
index 6116b46..dabc4ee 100644
--- a/_posts/2025-01-14-struct-decode-intro.md
+++ b/_posts/2025-01-14-struct-decode-intro.md
@@ -2,6 +2,7 @@
 layout: post
 title: "Structured Decoding in vLLM: a gentle introduction"
 author: "Guest Post by BentoML and Red Hat"
+image: /assets/figures/struct-decode-intro/vllm-xgrammar-decode-time-per-output-token.png
 ---
 
 **TL/DR**:
diff --git a/_posts/2025-01-21-stack-release.md b/_posts/2025-01-21-stack-release.md
index 81c7248..3250bdc 100644
--- a/_posts/2025-01-21-stack-release.md
+++ b/_posts/2025-01-21-stack-release.md
@@ -1,8 +1,6 @@
 ---
 layout: post
-title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”"
-thumbnail-img: /assets/figures/stack/stack-thumbnail.png
-share-img: /assets/figures/stack/stack-thumbnail.png
+title: "High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack"
 author: LMCache Team
 image: /assets/figures/stack/stack-thumbnail.png
 ---
diff --git a/_posts/2025-02-24-ptpc-fp8-rocm.md b/_posts/2025-02-24-ptpc-fp8-rocm.md
index 5c76b7e..8ef998f 100644
--- a/_posts/2025-02-24-ptpc-fp8-rocm.md
+++ b/_posts/2025-02-24-ptpc-fp8-rocm.md
@@ -3,8 +3,6 @@ layout: post
 title: "PTPC-FP8: Boosting vLLM Performance on AMD ROCm"
 author: "AMD and Embedded LLM"
 image: /assets/figures/ptpc/PTPC-tumbnail.png
-thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png
-share-img: /assets/figures/ptpc/PTPC-tumbnail.png
 math: true
 ---
 
diff --git a/_posts/2025-04-05-llama4.md b/_posts/2025-04-05-llama4.md
index 42aca6a..a8e6df2 100644
--- a/_posts/2025-04-05-llama4.md
+++ b/_posts/2025-04-05-llama4.md
@@ -3,8 +3,6 @@ layout: post
 title: "Llama 4 in vLLM"
 author: "The vLLM Team"
 image: /assets/figures/llama4/perf.png
-thumbnail-img: /assets/figures/llama4/perf.png
-share-img: /assets/figures/llama4/perf.png
 ---
 
 We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:
diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md
index 88691b9..68c4f90 100644
--- a/_posts/2025-04-11-transformers-backend.md
+++ b/_posts/2025-04-11-transformers-backend.md
@@ -3,8 +3,6 @@ layout: post
 title: "Transformers backend integration in vLLM"
 author: "The Hugging Face Team"
 image: /assets/figures/transformers-backend/transformers-backend.png
-thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
-share-img: /assets/figures/transformers-backend/transformers-backend.png
 ---
 
 The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
diff --git a/_posts/2025-04-23-openrlhf-vllm.md b/_posts/2025-04-23-openrlhf-vllm.md
index 6b6e39d..c5e77ea 100644
--- a/_posts/2025-04-23-openrlhf-vllm.md
+++ b/_posts/2025-04-23-openrlhf-vllm.md
@@ -1,10 +1,8 @@
 ---
 layout: post
 title: "Accelerating RLHF with vLLM, Best Practice from OpenRLHF"
-author: "The OpenRLHF Team"
-image: /assets/figures/openrlhf-vllm/ray.png
-thumbnail-img: /assets/figures/openrlhf-vllm/ray.png
-share-img: /assets/figures/openrlhf-vllm/ray.png
+author: "The OpenRLHF Team"
+image: /assets/figures/openrlhf-vllm/ray.png
 ---
 
 As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead. This challenge is particularly pronounced with models that excel at complex reasoning tasks (such as OpenAI-o1 and DeepSeek-R1), where generating long chain-of-thought (CoT) outputs can account for up to 90% of total training time. These models must produce detailed, step-by-step reasoning that can span thousands of tokens, making inference significantly more time-consuming than the training phase itself. As a pioneering inference framework, vLLM provides a user-friendly interface for generating RLHF samples and updating model weights.
diff --git a/_posts/2025-06-30-minimax-m1.md b/_posts/2025-06-30-minimax-m1.md
index d49c0ca..0e0404a 100644
--- a/_posts/2025-06-30-minimax-m1.md
+++ b/_posts/2025-06-30-minimax-m1.md
@@ -2,8 +2,9 @@
 layout: post
 title: "MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference"
 author: "MiniMax"
-benchmark-img: /assets/figures/minimax-m1/benchmark.png
-moe-img: /assets/figures/minimax-m1/moe.png
+image: /assets/figures/minimax-m1/benchmark.png
+benchmark-img: /assets/figures/minimax-m1/benchmark.png
+moe-img: /assets/figures/minimax-m1/moe.png
 lightning_attention-img: /assets/figures/minimax-m1/lightning_attention.png
 ---
 
diff --git a/_posts/2025-09-11-qwen3-next.md b/_posts/2025-09-11-qwen3-next.md
index 7b75274..cb1eeea 100644
--- a/_posts/2025-09-11-qwen3-next.md
+++ b/_posts/2025-09-11-qwen3-next.md
@@ -3,8 +3,6 @@ layout: post
 title: "vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency"
 author: "The vLLM Team"
 image: /assets/figures/qwen3-next/qwen.png
-thumbnail-img: /assets/figures/qwen3-next/qwen.png
-share-img: /assets/figures/qwen3-next/qwen.png
 ---
 
 We’re excited to announce that **vLLM now supports Qwen3-Next**, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a **hybrid architecture with extreme efficiency for long context support**, and vLLM offers full support of its functionalities.
diff --git a/_posts/2025-09-16-vllm-meetup.md b/_posts/2025-09-16-vllm-meetup.md
index 4f9cd42..e329c61 100644
--- a/_posts/2025-09-16-vllm-meetup.md
+++ b/_posts/2025-09-16-vllm-meetup.md
@@ -1,7 +1,8 @@
 ---
 layout: post
 title: "The First vLLM Meetup in Korea"
-author: "vLLM Team"
+author: "vLLM Team"
+image: /assets/figures/vllm-meetup/image-3.png
 ---
 
diff --git a/_posts/2025-09-29-deepseek-v3-2.md b/_posts/2025-09-29-deepseek-v3-2.md
index c3983e2..cf43b75 100644
--- a/_posts/2025-09-29-deepseek-v3-2.md
+++ b/_posts/2025-09-29-deepseek-v3-2.md
@@ -1,10 +1,8 @@
 ---
 layout: post
 title: "DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action"
-author: "vLLM Team"
+author: "vLLM Team"
 image: /assets/figures/deepseek-v3-2/dsa-explained.png
-thumbnail-img: /assets/figures/deepseek-v3-2/dsa-explained.png
-share-img: /assets/figures/deepseek-v3-2/dsa-explained.png
 ---
 
 ### Introduction
diff --git a/_posts/2025-10-09-blackwell-inferencemax.md b/_posts/2025-10-09-blackwell-inferencemax.md
index 2c71c43..ba40ecf 100644
--- a/_posts/2025-10-09-blackwell-inferencemax.md
+++ b/_posts/2025-10-09-blackwell-inferencemax.md
@@ -1,7 +1,8 @@
----
-layout: post
-title: "SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference"
-author: "vLLM Team"
+---
+layout: post
+title: "SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference"
+author: "vLLM Team"
+image: /assets/figures/blackwell-inferencemax/gpt-oss-120b-1k-1k.png
 ---
 
 ### Introduction