Releases: vllm-project/vllm

v0.3.2

21 Feb 19:50
8fbd84b

Major Changes

This version adds support for the OLMo and Gemma models, as well as a seed parameter for reproducible sampling (see the sketch below).
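
A minimal sketch of the new seed parameter; the Gemma checkpoint name is an assumption for illustration, and any supported model works the same way:

```python
from vllm import LLM, SamplingParams

# Model name is a placeholder; an OLMo checkpoint would work the same way.
llm = LLM(model="google/gemma-7b")

# Passing the same seed makes sampling reproducible across calls.
params = SamplingParams(temperature=0.8, seed=42)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```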

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.3.2

v0.3.1

16 Feb 23:06
5f08050

Major Changes

This version fixes the following major bugs:

  • Memory leak with distributed execution (fixed by using CuPy for collective communication).
  • Broken support for Python 3.8.

It also includes many smaller bug fixes, listed below.

What's Changed

New Contributors

Full Changelog: v0.3.0...v0.3.1

v0.3.0

31 Jan 08:07
1af090b

Major Changes

  • Experimental multi-LoRA support (see the sketch after this list)
  • Experimental prefix caching support
  • FP8 KV cache support
  • Optimized MoE performance and DeepSeek MoE support
  • CI-tested PRs
  • Support for batch completion in the server
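
A hedged sketch of the experimental multi-LoRA support; the base model name, adapter name, and adapter path below are placeholders, not taken from these release notes:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora switches on LoRA adapter loading for the base model.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# Each request can target a different adapter; the adapter name, integer id,
# and local path here are hypothetical.
outputs = llm.generate(
    ["Write a SQL query that counts users."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```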

What's Changed

New Contributors

Full Changelog: v0.2.7...v0.3.0

v0.2.7

04 Jan 01:36
2e0b6e7

Major Changes

  • Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
  • Fix tensor parallelism support for Mixtral + GPTQ/AWQ (see the sketch below)
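
A minimal sketch of the fixed combination, a quantized Mixtral checkpoint served with tensor parallelism; the checkpoint name is an assumption:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",  # hypothetical GPTQ checkpoint
    quantization="gptq",
    tensor_parallel_size=2,  # shard the model across two GPUs
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```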

What's Changed

New Contributors

Full Changelog: v0.2.6...v0.2.7

v0.2.6

17 Dec 18:35
671af2b

Major Changes

  • Fast model execution with CUDA/HIP graphs (see the sketch after this list)
  • W4A16 GPTQ support (thanks to @chu-tianxiang)
  • Fix memory profiling with tensor parallelism
  • Fix *.bin weight loading for Mixtral models
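
With this release, forward passes are captured into CUDA/HIP graphs to cut kernel-launch overhead. A sketch of opting back into eager PyTorch execution (for example, while debugging), assuming the enforce_eager flag:

```python
from vllm import LLM

# Default: model execution is captured and replayed as CUDA/HIP graphs.
llm = LLM(model="facebook/opt-125m")

# Opt out: run eager-mode PyTorch instead, trading speed for debuggability.
llm_eager = LLM(model="facebook/opt-125m", enforce_eager=True)
```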

What's Changed

New Contributors

Full Changelog: v0.2.5...v0.2.6

v0.2.5

14 Dec 07:58
31c1f32

Major Changes

  • Optimize Mixtral performance with expert parallelism (thanks to @Yard1)
  • [BugFix] Fix input positions for long context with sliding window

What's Changed

Full Changelog: v0.2.4...v0.2.5

v0.2.4

11 Dec 19:50
4dd4b5c

Major Changes

What's Changed

New Contributors

Full Changelog: v0.2.3...v0.2.4

v0.2.3

03 Dec 20:30
0f90eff

Major Changes

  • Refactoring of Worker, InputMetadata, and Attention
  • Fix tensor parallelism support for AWQ models
  • Support Prometheus metrics (see the sketch after this list)
  • Fix Baichuan & Baichuan 2
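
A short sketch of reading the new Prometheus metrics; it assumes the OpenAI-compatible server is already running locally on its default port 8000 and exposes a /metrics endpoint:

```python
import requests

# Scrape the Prometheus-format metrics exposed by the running server.
resp = requests.get("http://localhost:8000/metrics")

# Print only vLLM's own metrics (names are prefixed with "vllm").
for line in resp.text.splitlines():
    if line.startswith("vllm"):
        print(line)
```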

What's Changed

New Contributors

Full Changelog: v0.2.2...v0.2.3

v0.2.2

19 Nov 05:58
c5f7740

Major Changes

  • Bump to PyTorch v2.1 + CUDA 12.1 (a vLLM build for CUDA 11.8 is also provided)
  • Extensive refactoring for better tensor parallelism and quantization support
  • New models: Yi, ChatGLM, Phi
  • Changes in the scheduler: from a 1D flattened input tensor to a 2D tensor
  • AWQ support for all models
  • Added the LogitsProcessor API (see the sketch after this list)
  • Preliminary support for SqueezeLLM
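
A minimal sketch of the LogitsProcessor API: a processor is a callable that receives the token ids generated so far and the logits for the next token, and returns (possibly modified) logits. The banned token id is a placeholder:

```python
from typing import List

import torch
from vllm import LLM, SamplingParams

BANNED_TOKEN_ID = 1234  # hypothetical token id to suppress

def ban_token(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Setting a logit to -inf makes that token impossible to sample.
    logits[BANNED_TOKEN_ID] = -float("inf")
    return logits

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(logits_processors=[ban_token])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```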

What's Changed

New Contributors

Full Changelog: v0.2.1...v0.2.2

v0.2.1.post1

17 Oct 16:31

This is an emergency release to fix a bug in tensor parallelism support.