1 change: 1 addition & 0 deletions _posts/2025-01-14-struct-decode-intro.md

```diff
@@ -2,6 +2,7 @@
 layout: post
 title: "Structured Decoding in vLLM: a gentle introduction"
 author: "Guest Post by BentoML and Red Hat"
+image: /assets/figures/struct-decode-intro/vllm-xgrammar-decode-time-per-output-token.png
 ---
 
 **TL/DR**:
```

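Net effect of this hunk: the post's front matter now carries a single `image:` key. A minimal sketch of the resulting header, reconstructed from the lines above; the idea that the blog theme uses `image` for both the thumbnail and the social-share card is an assumption on our part, not something this diff states:

```yaml
---
layout: post
title: "Structured Decoding in vLLM: a gentle introduction"
author: "Guest Post by BentoML and Red Hat"
# Assumed: the theme falls back to this key for thumbnail and share
# previews when thumbnail-img / share-img are not set.
image: /assets/figures/struct-decode-intro/vllm-xgrammar-decode-time-per-output-token.png
---
```
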
4 changes: 1 addition & 3 deletions _posts/2025-01-21-stack-release.md

```diff
@@ -1,8 +1,6 @@
 ---
 layout: post
-title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”"
-thumbnail-img: /assets/figures/stack/stack-thumbnail.png
-share-img: /assets/figures/stack/stack-thumbnail.png
+title: "High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack"
 author: LMCache Team
 image: /assets/figures/stack/stack-thumbnail.png
 ---
```

2 changes: 0 additions & 2 deletions _posts/2025-02-24-ptpc-fp8-rocm.md

```diff
@@ -3,8 +3,6 @@ layout: post
 title: "PTPC-FP8: Boosting vLLM Performance on AMD ROCm"
 author: "AMD and Embedded LLM"
 image: /assets/figures/ptpc/PTPC-tumbnail.png
-thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png
-share-img: /assets/figures/ptpc/PTPC-tumbnail.png
 math: true
 ---
 
```

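This commit, like several below, applies the inverse pattern: where a post declared the same figure three times, the redundant keys are dropped. A sketch of the before/after, written as two YAML documents using the PTPC front matter above; the claim that `thumbnail-img` and `share-img` default to `image` when omitted is an assumption about the theme, not shown in the diff:

```yaml
# Before: one figure declared under three keys.
---
image: /assets/figures/ptpc/PTPC-tumbnail.png
thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png
share-img: /assets/figures/ptpc/PTPC-tumbnail.png

# After: a single key; thumbnail-img and share-img are assumed to fall
# back to image when omitted.
---
image: /assets/figures/ptpc/PTPC-tumbnail.png
```
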
2 changes: 0 additions & 2 deletions _posts/2025-04-05-llama4.md

```diff
@@ -3,8 +3,6 @@ layout: post
 title: "Llama 4 in vLLM"
 author: "The vLLM Team"
 image: /assets/figures/llama4/perf.png
-thumbnail-img: /assets/figures/llama4/perf.png
-share-img: /assets/figures/llama4/perf.png
 ---
 
 We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:
```

2 changes: 0 additions & 2 deletions _posts/2025-04-11-transformers-backend.md

```diff
@@ -3,8 +3,6 @@ layout: post
 title: "Transformers backend integration in vLLM"
 author: "The Hugging Face Team"
 image: /assets/figures/transformers-backend/transformers-backend.png
-thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png
-share-img: /assets/figures/transformers-backend/transformers-backend.png
 ---
 
 The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index)
```

6 changes: 2 additions & 4 deletions _posts/2025-04-23-openrlhf-vllm.md

```diff
@@ -1,10 +1,8 @@
 ---
 layout: post
 title: "Accelerating RLHF with vLLM, Best Practice from OpenRLHF"
-author: "The OpenRLHF Team"
-image: /assets/figures/openrlhf-vllm/ray.png
-thumbnail-img: /assets/figures/openrlhf-vllm/ray.png
-share-img: /assets/figures/openrlhf-vllm/ray.png
+author: "The OpenRLHF Team"
+image: /assets/figures/openrlhf-vllm/ray.png
 ---
 
 As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead. This challenge is particularly pronounced with models that excel at complex reasoning tasks (such as OpenAI-o1 and DeepSeek-R1), where generating long chain-of-thought (CoT) outputs can account for up to 90% of total training time. These models must produce detailed, step-by-step reasoning that can span thousands of tokens, making inference significantly more time-consuming than the training phase itself. As a pioneering inference framework, vLLM provides a user-friendly interface for generating RLHF samples and updating model weights.
```

5 changes: 3 additions & 2 deletions _posts/2025-06-30-minimax-m1.md

```diff
@@ -2,8 +2,9 @@
 layout: post
 title: "MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference"
 author: "MiniMax"
-benchmark-img: /assets/figures/minimax-m1/benchmark.png
-moe-img: /assets/figures/minimax-m1/moe.png
+image: /assets/figures/minimax-m1/benchmark.png
+benchmark-img: /assets/figures/minimax-m1/benchmark.png
+moe-img: /assets/figures/minimax-m1/moe.png
 lightning_attention-img: /assets/figures/minimax-m1/lightning_attention.png
 ---
 
```

2 changes: 0 additions & 2 deletions _posts/2025-09-11-qwen3-next.md

```diff
@@ -3,8 +3,6 @@ layout: post
 title: "vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency"
 author: "The vLLM Team"
 image: /assets/figures/qwen3-next/qwen.png
-thumbnail-img: /assets/figures/qwen3-next/qwen.png
-share-img: /assets/figures/qwen3-next/qwen.png
 ---
 
 We’re excited to announce that **vLLM now supports Qwen3-Next**, the latest generation of foundation models from the Qwen team. Qwen3-Next introduces a **hybrid architecture with extreme efficiency for long context support**, and vLLM offers full support of its functionalities.
```

3 changes: 2 additions & 1 deletion _posts/2025-09-16-vllm-meetup.md

```diff
@@ -1,7 +1,8 @@
 ---
 layout: post
 title: "The First vLLM Meetup in Korea"
-author: "vLLM Team"
+author: "vLLM Team"
+image: /assets/figures/vllm-meetup/image-3.png
 ---
 
 <p align="center">
```

4 changes: 1 addition & 3 deletions _posts/2025-09-29-deepseek-v3-2.md

```diff
@@ -1,10 +1,8 @@
 ---
 layout: post
 title: "DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action"
-author: "vLLM Team"
+author: "vLLM Team"
 image: /assets/figures/deepseek-v3-2/dsa-explained.png
-thumbnail-img: /assets/figures/deepseek-v3-2/dsa-explained.png
-share-img: /assets/figures/deepseek-v3-2/dsa-explained.png
 ---
 
 ### Introduction
```

9 changes: 5 additions & 4 deletions _posts/2025-10-09-blackwell-inferencemax.md

```diff
@@ -1,7 +1,8 @@
 ---
-layout: post
-title: "SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference"
-author: "vLLM Team"
----
+layout: post
+title: "SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference"
+author: "vLLM Team"
+image: /assets/figures/blackwell-inferencemax/gpt-oss-120b-1k-1k.png
+---
 
 ### Introduction
```