Releases · vllm-project/vllm
v0.3.2
Major Changes
This version adds support for the OLMo and Gemma models, as well as a per-request seed parameter.
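For the new per-request seed, here is a minimal offline-inference sketch; the model name is a placeholder, and it assumes the seed in `SamplingParams` only affects sampling for that request:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Fixing the seed makes sampling for this request reproducible across runs,
# without affecting other requests served by the same engine.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32, seed=42)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```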
What's Changed
- Defensively copy `sampling_params` by @njhill in #2881
- multi-LoRA as extra models in OpenAI server by @jvmncs in #2775
- Add code-revision config argument for Hugging Face Hub by @mbm-ai in #2892
- [Minor] Small fix to make distributed init logic in worker looks cleaner by @zhuohan123 in #2905
- [Test] Add basic correctness test by @zhuohan123 in #2908
- Support OLMo models. by @Isotr0py in #2832
- Add warning to prevent changes to benchmark api server by @simon-mo in #2858
- Fix `vllm:prompt_tokens_total` metric calculation by @ronensc in #2869
- [ROCm] include gfx908 as supported by @jamestwhedbee in #2792
- [FIX] Fix beam search test by @zhuohan123 in #2930
- Make vLLM logging formatting optional by @Yard1 in #2877
- Add metrics to RequestOutput by @Yard1 in #2876
- Add Gemma model by @xiangxu-google in #2964
- Upgrade transformers to v4.38.0 by @WoosukKwon in #2965
- [FIX] Add Gemma model to the doc by @zhuohan123 in #2966
- [ROCm] Upgrade transformers to v4.38.0 by @WoosukKwon in #2967
- Support per-request seed by @njhill in #2514
- Bump up version to v0.3.2 by @zhuohan123 in #2968
New Contributors
- @jvmncs made their first contribution in #2775
- @mbm-ai made their first contribution in #2892
- @Isotr0py made their first contribution in #2832
- @jamestwhedbee made their first contribution in #2792
Full Changelog: v0.3.1...v0.3.2
v0.3.1
Major Changes
This version fixes the following major bugs:
- A memory leak with distributed execution (solved by using CuPy for collective communication).
- Broken Python 3.8 support.
It also includes the many smaller bug fixes listed below.
What's Changed
- Fixes assertion failure in prefix caching: the lora index mapping should respect `prefix_len`. by @sighingnow in #2688
- fix some bugs about parameter description by @zspo in #2689
- [Minor] Fix test_cache.py CI test failure by @pcmoritz in #2684
- Add unit test for Mixtral MoE layer by @pcmoritz in #2677
- Refactor Prometheus and Add Request Level Metrics by @rib-2 in #2316
- Add Internlm2 by @Leymore in #2666
- Fix compile error when using rocm by @zhaoyang-star in #2648
- fix python 3.8 syntax by @simon-mo in #2716
- Update README for meetup slides by @simon-mo in #2718
- Use revision when downloading the quantization config file by @Pernekhan in #2697
- remove hardcoded `device="cuda"` to support more device by @jikunshang in #2503
- fix length_penalty default value to 1.0 by @zspo in #2667
- Add one example to run batch inference distributed on Ray by @c21 in #2696
- docs: update langchain serving instructions by @mspronesti in #2736
- Set&Get llm internal tokenizer instead of the TokenizerGroup by @dancingpipi in #2741
- Remove eos tokens from output by default by @zcnrex in #2611
- add requirement: triton >= 2.1.0 by @whyiug in #2746
- [Minor] Fix benchmark_latency by @WoosukKwon in #2765
- [ROCm] Fix some kernels failed unit tests by @hongxiayang in #2498
- Set local logging level via env variable by @gardberg in #2774
- [ROCm] Fixup arch checks for ROCM by @dllehr-amd in #2627
- Add fused top-K softmax kernel for MoE by @WoosukKwon in #2769
- fix issue when model parameter is not a model id but path of the model. by @liuyhwangyh in #2489
- [Minor] More fix of test_cache.py CI test failure by @LiuXiaoxuanPKU in #2750
- [ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support by @hongxiayang in #2790
- Add documentation on how to do incremental builds by @pcmoritz in #2796
- [Ray] Integration compiled DAG off by default by @rkooo567 in #2471
- Disable custom all reduce by default by @WoosukKwon in #2808
- [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention by @hongxiayang in #2768
- Add documentation section about LoRA by @pcmoritz in #2834
- Refactor 2 awq gemm kernels into m16nXk32 by @zcnrex in #2723
- Serving Benchmark Refactoring by @ywang96 in #2433
- [CI] Ensure documentation build is checked in CI by @simon-mo in #2842
- Refactor llama family models by @esmeetu in #2637
- Revert "Refactor llama family models" by @pcmoritz in #2851
- Use CuPy for CUDA graphs by @WoosukKwon in #2811
- Remove Yi model definition, please use `LlamaForCausalLM` instead by @pcmoritz in #2854
- Add LoRA support for Mixtral by @tterrysun in #2831
- Migrate InternLMForCausalLM to LlamaForCausalLM by @pcmoritz in #2860
- Fix internlm after #2860 by @pcmoritz in #2861
- [Fix] Fix memory profiling when GPU is used by multiple processes by @WoosukKwon in #2863
- Fix docker python version by @NikolaBorisov in #2845
- Migrate AquilaForCausalLM to LlamaForCausalLM by @esmeetu in #2867
- Don't use cupy NCCL for AMD backends by @WoosukKwon in #2855
- Align LoRA code between Mistral and Mixtral (fixes #2875) by @pcmoritz in #2880
- [BugFix] Fix GC bug for `LLM` class by @WoosukKwon in #2882
- Fix decilm.py by @pcmoritz in #2883
- [ROCm] Dockerfile fix for flash-attention build by @hongxiayang in #2885
- Prefix Caching- fix t4 triton error by @caoshiyi in #2517
- Bump up to v0.3.1 by @WoosukKwon in #2887
New Contributors
- @sighingnow made their first contribution in #2688
- @rib-2 made their first contribution in #2316
- @Leymore made their first contribution in #2666
- @Pernekhan made their first contribution in #2697
- @jikunshang made their first contribution in #2503
- @c21 made their first contribution in #2696
- @zcnrex made their first contribution in #2611
- @whyiug made their first contribution in #2746
- @gardberg made their first contribution in #2774
- @dllehr-amd made their first contribution in #2627
- @rkooo567 made their first contribution in #2471
- @ywang96 made their first contribution in #2433
- @tterrysun made their first contribution in #2831
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Major Changes
- Experimental multi-lora support (example below)
- Experimental prefix caching support
- FP8 KV Cache support
- Optimized MoE performance and Deepseek MoE support
- CI tested PRs
- Support batch completion in server
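For the experimental multi-LoRA support, a minimal sketch of attaching an adapter to a single request; the base model name and adapter path are placeholders, and the API surface was still experimental at this point:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# LoRA serving must be enabled when the engine is constructed.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder base model

outputs = llm.generate(
    ["Translate to SQL: list all users created this week"],
    SamplingParams(temperature=0.0, max_tokens=64),
    # LoRARequest(adapter name, unique integer id, local path to adapter weights)
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```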
What's Changed
- Miner fix of type hint by @beginlner in #2340
- Build docker image with shared objects from "build" step by @payoto in #2237
- Ensure metrics are logged regardless of requests by @ichernev in #2347
- Changed scheduler to use deques instead of lists by @NadavShmayo in #2290
- Fix eager mode performance by @WoosukKwon in #2377
- [Minor] Remove unused code in attention by @WoosukKwon in #2384
- Add baichuan chat template jinjia file by @EvilPsyCHo in #2390
- [Speculative decoding 1/9] Optimized rejection sampler by @cadedaniel in #2336
- Fix ipv4 ipv6 dualstack by @yunfeng-scale in #2408
- [Minor] Rename phi_1_5 to phi by @WoosukKwon in #2385
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by @litone01 in #1011
- [Minor] Fix the format in quick start guide related to Model Scope by @zhuohan123 in #2425
- Add gradio chatbot for openai webserver by @arkohut in #2307
- [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by @chenxu2048 in #2371
- Allow setting fastapi root_path argument by @chiragjn in #2341
- Address Phi modeling update 2 by @huiwy in #2428
- Update a more user-friendly error message, offering more considerate advice for beginners, when using V100 GPU #1901 by @chuanzhubin in #2374
- Update quickstart.rst with small clarifying change (fix typo) by @nautsimon in #2369
- Aligning `top_p` and `top_k` Sampling by @chenxu2048 in #1885
- [Minor] Fix err msg by @WoosukKwon in #2431
- [Minor] Optimize cuda graph memory usage by @esmeetu in #2437
- [CI] Add Buildkite by @simon-mo in #2355
- Announce the second vLLM meetup by @WoosukKwon in #2444
- Allow buildkite to retry build on agent lost by @simon-mo in #2446
- Fix weigit loading for GQA with TP by @zhangch9 in #2379
- CI: make sure benchmark script exit on error by @simon-mo in #2449
- ci: retry on build failure as well by @simon-mo in #2457
- Add StableLM3B model by @ita9naiwa in #2372
- OpenAI refactoring by @FlorianJoncour in #2360
- [Experimental] Prefix Caching Support by @caoshiyi in #1669
- fix stablelm.py tensor-parallel-size bug by @YingchaoX in #2482
- Minor fix in prefill cache example by @JasonZhu1313 in #2494
- fix: fix some args desc by @zspo in #2487
- [Neuron] Add an option to build with neuron by @liangfu in #2065
- Don't download both safetensor and bin files. by @NikolaBorisov in #2480
- [BugFix] Fix abort_seq_group by @beginlner in #2463
- refactor completion api for readability by @simon-mo in #2499
- Support OpenAI API server in `benchmark_serving.py` by @hmellor in #2172
- Simplify broadcast logic for control messages by @zhuohan123 in #2501
- [Bugfix] fix load local safetensors model by @esmeetu in #2512
- Add benchmark serving to CI by @simon-mo in #2505
- Add `group` as an argument in broadcast ops by @GindaChen in #2522
- [Fix] Keep `scheduler.running` as deque by @njhill in #2523
- migrate pydantic from v1 to v2 by @joennlae in #2531
- [Speculative decoding 2/9] Multi-step worker for draft model by @cadedaniel in #2424
- Fix "Port could not be cast to integer value as " by @pcmoritz in #2545
- Add qwen2 by @JustinLin610 in #2495
- Fix progress bar and allow HTTPS in `benchmark_serving.py` by @hmellor in #2552
- Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by @JasonZhu1313 in #2553
- [Feature] Simple API token authentication by @taisazero in #1106
- Add multi-LoRA support by @Yard1 in #1804
- lint: format all python file instead of just source code by @simon-mo in #2567
- [Bugfix] fix crash if max_tokens=None by @NikolaBorisov in #2570
- Added `include_stop_str_in_output` and `length_penalty` parameters to OpenAI API by @galatolofederico in #2562
- [Doc] Fix the syntax error in the doc of supported_models. by @keli-wen in #2584
- Support Batch Completion in Server by @simon-mo in #2529
- fix names and license by @JustinLin610 in #2589
- [Fix] Use a correct device when creating OptionalCUDAGuard by @sh1ng in #2583
- [ROCm] add support to ROCm 6.0 and MI300 by @hongxiayang in #2274
- Support for Stable LM 2 by @dakotamahan-stability in #2598
- Don't build punica kernels by default by @pcmoritz in #2605
- AWQ: Up to 2.66x higher throughput by @casper-hansen in #2566
- Use head_dim in config if exists by @xiangxu-google in #2622
- Custom all reduce kernels by @hanzhi713 in #2192
- [Minor] Fix warning on Ray dependencies by @WoosukKwon in #2630
- Speed up Punica compilation by @WoosukKwon in #2632
- Small async_llm_engine refactor by @andoorve in #2618
- Update Ray version requirements by @simon-mo in #2636
- Support FP8-E5M2 KV Cache by @zhaoyang-star in #2279
- Fix error when tp > 1 by @zhaoyang-star in #2644
- No repeated IPC open by @hanzhi713 in #2642
- ROCm: Allow setting compilation target by @rlrs in #2581
- DeepseekMoE support with Fused MoE kernel by @zwd003 in #2453
- Fused MOE for Mixtral by @pcmoritz in #2542
- Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` by @HermitSun in #2664
- Add swap_blocks unit tests by @sh1ng in #2616
- Fix a small typo (tenosr -> tensor) by @pcmoritz in #2672
- [Minor] Fix false warning when TP=1 by @WoosukKwon in #2674
- Add quantized mixtral support by @WoosukKwon in #2673
- Bump up version to v0.3.0 by @zhuohan123 in #2656
New Contributors
- @payoto made their first contribution in #2237
- @NadavShmayo made their first contribution in #2290
- @EvilPsyCHo made their first contribution in #2390
- @litone01 made their first contribution in #1011
- @arkohut made their first contribution in #2307
- @chiragjn made their first contribution in #2341
- @huiwy made their first contribution in #2428
- @chuanzhubin made their first contribution in #2374
- @nautsimon made their first contribution in #2369
- @zhangch9 made their first contribution in #2379
- @ita9naiwa made their first contribution in #2372
- @caoshiyi made their first contribution in https://gi...
v0.2.7
Major Changes
- Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
- Fix tensor parallelism support for Mixtral + GPTQ/AWQ (example below)
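For the Mixtral + GPTQ/AWQ tensor-parallelism fix, a minimal sketch of loading an AWQ-quantized Mixtral checkpoint across two GPUs; the checkpoint name is illustrative:

```python
from vllm import LLM, SamplingParams

# A quantized Mixtral checkpoint sharded across 2 GPUs: the combination
# addressed by the tensor-parallelism fix in this release.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```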
What's Changed
- Minor fix for gpu-memory-utilization description by @SuhongMoon in #2162
- [BugFix] Raise error when max_model_len is larger than KV cache size by @WoosukKwon in #2163
- [BugFix] Fix RoPE kernel on long sequences by @WoosukKwon in #2164
- Add SSL arguments to API servers by @hmellor in #2109
- typo fix by @oushu1zhangxiangxuan1 in #2166
- [ROCm] Fixes for GPTQ on ROCm by @kliuae in #2180
- Update Help Text for --gpu-memory-utilization Argument by @SuhongMoon in #2183
- [Minor] Add warning on CUDA graph memory usage by @WoosukKwon in #2182
- Added DeciLM-7b and DeciLM-7b-instruct by @avideci in #2062
- [BugFix] Fix weight loading for Mixtral with TP by @WoosukKwon in #2208
- Make _prepare_sample non blocking and pin memory of CPU input buffers by @hanzhi713 in #2207
- Remove Sampler copy stream by @Yard1 in #2209
- Fix a broken link by @ronensc in #2222
- Disable Ray usage stats collection by @WoosukKwon in #2206
- [BugFix] Fix recovery logic for sequence group by @WoosukKwon in #2186
- Update installation instructions to include CUDA 11.8 xFormers by @skt7 in #2246
- Add "About" Heading to README.md by @blueceiling in #2260
- [BUGFIX] Do not return ignored sentences twice in async llm engine by @zhuohan123 in #2258
- [BUGFIX] Fix API server test by @zhuohan123 in #2270
- [BUGFIX] Fix the path of test prompts by @zhuohan123 in #2273
- [BUGFIX] Fix communication test by @zhuohan123 in #2285
- Add support GPT-NeoX Models without attention biases by @dalgarak in #2301
- [FIX] Fix kernel bug by @jeejeelee in #1959
- fix typo and remove unused code by @esmeetu in #2305
- Enable CUDA graph for GPTQ & SqueezeLLM by @WoosukKwon in #2318
- Fix Gradio example: remove deprecated parameter `concurrency_count` by @ronensc in #2315
- Use NCCL instead of ray for control-plane communication to remove serialization overhead by @zhuohan123 in #2221
- Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK by @ronensc in #2321
- [Minor] Revert the changes in test_cache by @WoosukKwon in #2335
- Bump up to v0.2.7 by @WoosukKwon in #2337
New Contributors
- @SuhongMoon made their first contribution in #2162
- @hmellor made their first contribution in #2109
- @oushu1zhangxiangxuan1 made their first contribution in #2166
- @kliuae made their first contribution in #2180
- @avideci made their first contribution in #2062
- @hanzhi713 made their first contribution in #2207
- @ronensc made their first contribution in #2222
- @skt7 made their first contribution in #2246
- @blueceiling made their first contribution in #2260
- @dalgarak made their first contribution in #2301
Full Changelog: v0.2.6...v0.2.7
v0.2.6
Major changes
- Fast model execution with CUDA/HIP graph
- W4A16 GPTQ support (thanks to @chu-tianxiang; example below)
- Fix memory profiling with tensor parallelism
- Fix *.bin weight loading for Mixtral models
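For the new W4A16 GPTQ support, a minimal loading sketch; the checkpoint name is illustrative. Note that this release temporarily enforces eager mode for GPTQ models, so the new CUDA-graph execution applies to unquantized models, and `enforce_eager=True` can be passed to opt out of CUDA graphs explicitly.

```python
from vllm import LLM, SamplingParams

# W4A16 GPTQ: 4-bit weights, fp16 activations.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")  # illustrative checkpoint
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```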
What's Changed
- Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in #2100
- Fix Dockerfile.rocm by @tjtanaa in #2101
- avoid multiple redefinition by @MitchellX in #1817
- Add a flag to include stop string in output text by @yunfeng-scale in #1976
- Add GPTQ support by @chu-tianxiang in #916
- [Docs] Add quantization support to docs by @WoosukKwon in #2135
- [ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in #2138
- simplify loading weights logic by @esmeetu in #2133
- Optimize model execution with CUDA graph by @WoosukKwon in #1926
- [Minor] Delete Llama tokenizer warnings by @WoosukKwon in #2146
- Fix all-reduce memory usage by @WoosukKwon in #2151
- Pin PyTorch & xformers versions by @WoosukKwon in #2155
- Remove dependency on CuPy by @WoosukKwon in #2152
- [Docs] Add CUDA graph support to docs by @WoosukKwon in #2148
- Temporarily enforce eager mode for GPTQ models by @WoosukKwon in #2154
- [Minor] Add more detailed explanation on `quantization` argument by @WoosukKwon in #2145
- [Minor] Fix xformers version by @WoosukKwon in #2158
- [Minor] Add Phi 2 to supported models by @WoosukKwon in #2159
- Make sampler less blocking by @Yard1 in #1889
- [Minor] Fix a typo in .pt weight support by @WoosukKwon in #2160
- Disable CUDA graph for SqueezeLLM by @WoosukKwon in #2161
- Bump up to v0.2.6 by @WoosukKwon in #2157
New Contributors
- @mezuzza made their first contribution in #2100
- @MitchellX made their first contribution in #1817
Full Changelog: v0.2.5...v0.2.6
v0.2.5
Major changes
- Optimize Mixtral performance with expert parallelism (thanks to @Yard1)
- [BugFix] Fix input positions for long context with sliding window
What's Changed
- Update Dockerfile to support Mixtral by @simon-mo in #2027
- Remove python 3.10 requirement by @WoosukKwon in #2040
- [CI/CD] Upgrade PyTorch version to v2.1.1 by @WoosukKwon in #2045
- Upgrade transformers version to 4.36.0 by @WoosukKwon in #2046
- Remove einops from dependencies by @WoosukKwon in #2049
- gqa added to mpt attn by @megha95 in #1938
- Update Dockerfile to build Megablocks by @simon-mo in #2042
- Fix peak memory profiling by @WoosukKwon in #2031
- Implement lazy model loader by @WoosukKwon in #2044
- [ROCm] Upgrade xformers version dependency for ROCm; update documentations by @tjtanaa in #2079
- Update installation instruction for CUDA 11.8 by @WoosukKwon in #2086
- [Docs] Add notes on ROCm-supported models by @WoosukKwon in #2087
- [BugFix] Fix input positions for long context with sliding window by @WoosukKwon in #2088
- Mixtral expert parallelism by @Yard1 in #2090
- Bump up to v0.2.5 by @WoosukKwon in #2095
Full Changelog: v0.2.4...v0.2.5
v0.2.4
Major changes
- Mixtral model support (officially from @mistralai)
- AMD GPU support (collaboration with @EmbeddedLLM)
What's Changed
- add custom server params by @esmeetu in #1868
- support ChatGLMForConditionalGeneration by @dancingpipi in #1932
- Save pytorch profiler output for latency benchmark by @Yard1 in #1871
- Fix typo in adding_model.rst by @petergtz in #1947
- Make InternLM follow `rope_scaling` in `config.json` by @theFool32 in #1956
- Fix quickstart.rst example by @gottlike in #1964
- Adding number of nvcc_threads during build as envar by @AguirreNicolas in #1893
- fix typo in getenv call by @dskhudia in #1972
- [Continuation] Merge EmbeddedLLM/vllm-rocm into vLLM main by @tjtanaa in #1836
- Fix Baichuan2-7B-Chat by @firebook in #1987
- [Docker] Add cuda arch list as build option by @simon-mo in #1950
- Fix for KeyError on Loading LLaMA by @imgaojun in #1978
- [Minor] Fix code style for baichuan by @WoosukKwon in #2003
- Fix OpenAI server completion_tokens referenced before assignment by @js8544 in #1996
- [Minor] Add comment on skipping rope caches by @WoosukKwon in #2004
- Replace head_mapping params with num_kv_heads to attention kernel. by @wbn03 in #1997
- Fix completion API echo and logprob combo by @simon-mo in #1992
- Mixtral 8x7B support by @pierrestock in #2011
- Minor fixes for Mixtral by @WoosukKwon in #2015
- Change load format for Mixtral by @WoosukKwon in #2028
- Update run_on_sky.rst by @eltociear in #2025
- Update requirements.txt for mixtral by @0-hero in #2029
- Revert #2029 by @WoosukKwon in #2030
- [Minor] Fix latency benchmark script by @WoosukKwon in #2035
- [Minor] Fix type annotation in Mixtral by @WoosukKwon in #2036
- Update README.md to add megablocks requirement for mixtral by @0-hero in #2033
- [Minor] Fix import error msg for megablocks by @WoosukKwon in #2038
- Bump up to v0.2.4 by @WoosukKwon in #2034
New Contributors
- @dancingpipi made their first contribution in #1932
- @petergtz made their first contribution in #1947
- @theFool32 made their first contribution in #1956
- @gottlike made their first contribution in #1964
- @AguirreNicolas made their first contribution in #1893
- @dskhudia made their first contribution in #1972
- @tjtanaa made their first contribution in #1836
- @firebook made their first contribution in #1987
- @imgaojun made their first contribution in #1978
- @js8544 made their first contribution in #1996
- @wbn03 made their first contribution in #1997
- @pierrestock made their first contribution in #2011
- @0-hero made their first contribution in #2029
Full Changelog: v0.2.3...v0.2.4
v0.2.3
Major changes
- Refactoring on Worker, InputMetadata, and Attention
- Fix TP support for AWQ models
- Support Prometheus metrics (example below)
- Fix Baichuan & Baichuan 2
What's Changed
- Add instructions to install vllm+cu118 by @WoosukKwon in #1717
- Documentation about official docker image by @simon-mo in #1709
- Fix the code block's format in deploying_with_docker page by @HermitSun in #1722
- Migrate linter from `pylint` to `ruff` by @simon-mo in #1665
- [FIX] Update the doc link in README.md by @zhuohan123 in #1730
- [BugFix] Fix a bug in loading safetensors by @WoosukKwon in #1732
- Fix hanging in the scheduler caused by long prompts by @chenxu2048 in #1534
- [Fix] Fix bugs in scheduler by @linotfan in #1727
- Rewrite torch.repeat_interleave to remove cpu synchronization by @beginlner in #1599
- fix RAM OOM when load large models in tensor parallel mode. by @boydfd in #1395
- [BugFix] Fix TP support for AWQ by @WoosukKwon in #1731
- [FIX] Fix the case when `input_is_parallel=False` for `ScaledActivation` by @zhuohan123 in #1737
- Add stop_token_ids in SamplingParams.repr by @chenxu2048 in #1745
- [DOCS] Add engine args documentation by @casper-hansen in #1741
- Set top_p=0 and top_k=-1 in greedy sampling by @beginlner in #1748
- Fix repetition penalty aligned with huggingface by @beginlner in #1577
- [build] Avoid building too many extensions by @ymwangg in #1624
- [Minor] Fix model docstrings by @WoosukKwon in #1764
- Added echo function to OpenAI API server. by @wanmok in #1504
- Init model on GPU to reduce CPU memory footprint by @beginlner in #1796
- Correct comments in parallel_state.py by @explainerauthors in #1818
- Fix OPT weight loading by @WoosukKwon in #1819
- [FIX] Fix class naming by @zhuohan123 in #1803
- Move the definition of BlockTable a few lines above so we could use it in BlockAllocator by @explainerauthors in #1791
- [FIX] Fix formatting error in main branch by @zhuohan123 in #1822
- [Fix] Fix RoPE in ChatGLM-32K by @WoosukKwon in #1841
- Better integration with Ray Serve by @FlorianJoncour in #1821
- Refactor Attention by @WoosukKwon in #1840
- [Docs] Add information about using shared memory in docker by @simon-mo in #1845
- Disable Logs Requests should Disable Logging of requests. by @MichaelMcCulloch in #1779
- Refactor worker & InputMetadata by @WoosukKwon in #1843
- Avoid multiple instantiations of the RoPE class by @jeejeeli in #1828
- [FIX] Fix docker build error (#1831) by @allenhaozi in #1832
- Add profile option to latency benchmark by @WoosukKwon in #1839
- Remove `max_num_seqs` in latency benchmark by @WoosukKwon in #1855
- Support max-model-len argument for throughput benchmark by @aisensiy in #1858
- Fix rope cache key error by @esmeetu in #1867
- docs: add instructions for Langchain by @mspronesti in #1162
- Support chat template and `echo` for chat API by @Tostino in #1756
- Fix Baichuan tokenizer error by @WoosukKwon in #1874
- Add weight normalization for Baichuan 2 by @WoosukKwon in #1876
- Fix the typo in SamplingParams' docstring. by @xukp20 in #1886
- [Docs] Update the AWQ documentation to highlight performance issue by @simon-mo in #1883
- Fix the broken sampler tests by @WoosukKwon in #1896
- Add Production Metrics in Prometheus format by @simon-mo in #1890
- Add PyTorch-native implementation of custom layers by @WoosukKwon in #1898
- Fix broken worker test by @WoosukKwon in #1900
- chore(examples-docs): upgrade to OpenAI V1 by @mspronesti in #1785
- Fix num_gpus when TP > 1 by @WoosukKwon in #1852
- Bump up to v0.2.3 by @WoosukKwon in #1903
New Contributors
- @boydfd made their first contribution in #1395
- @explainerauthors made their first contribution in #1818
- @FlorianJoncour made their first contribution in #1821
- @MichaelMcCulloch made their first contribution in #1779
- @jeejeeli made their first contribution in #1828
- @allenhaozi made their first contribution in #1832
- @aisensiy made their first contribution in #1858
- @xukp20 made their first contribution in #1886
Full Changelog: v0.2.2...v0.2.3
v0.2.2
Major changes
- Bump up to PyTorch v2.1 + CUDA 12.1 (vLLM+CUDA 11.8 is also provided)
- Extensive refactoring for better tensor parallelism & quantization support
- New models: Yi, ChatGLM, Phi
- Changes in scheduler: from 1D flattened input tensor to 2D tensor
- AWQ support for all models
- Added LogitsProcessor API (example below)
- Preliminary support for SqueezeLLM
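For the new LogitsProcessor API, a minimal sketch; the callback signature (previously generated token ids, raw logits for the next token) and the placeholder model are assumptions worth checking against this release:

```python
from vllm import LLM, SamplingParams

def ban_token_42(token_ids, logits):
    # Receives the token ids generated so far and the logits for the next token,
    # and returns (possibly modified) logits. Here we forbid an arbitrary token id.
    logits[42] = -float("inf")
    return logits

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```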
What's Changed
- Change scheduler & input tensor shape by @WoosukKwon in #1381
- Add Mistral 7B to `test_models` by @WoosukKwon in #1366
- fix typo by @WrRan in #1383
- Fix TP bug by @WoosukKwon in #1389
- Fix type hints by @lxrite in #1427
- remove useless statements by @WrRan in #1408
- Pin dependency versions by @thiagosalvatore in #1429
- SqueezeLLM Support by @chooper1 in #1326
- aquila model add rope_scaling by @Sanster in #1457
- fix: don't skip first special token. by @gesanqiu in #1497
- Support repetition_penalty by @beginlner in #1424
- Fix bias in InternLM by @WoosukKwon in #1501
- Delay GPU->CPU sync in sampling by @Yard1 in #1337
- Refactor LLMEngine demo script for clarity and modularity by @iongpt in #1413
- Fix logging issues by @Tostino in #1494
- Add py.typed so consumers of vLLM can get type checking by @jroesch in #1509
- vLLM always places spaces between special tokens by @blahblahasdf in #1373
- [Fix] Fix duplicated logging messages by @zhuohan123 in #1524
- Add dockerfile by @skrider in #1350
- Fix integer overflows in attention & cache ops by @WoosukKwon in #1514
- [Small] Formatter only checks lints in changed files by @cadedaniel in #1528
- Add `MptForCausalLM` key in model_loader by @wenfeiy-db in #1526
- [BugFix] Fix a bug when engine_use_ray=True and worker_use_ray=False and TP>1 by @beginlner in #1531
- Adding a health endpoint by @Fluder-Paradyne in #1540
- Remove `MPTConfig` by @WoosukKwon in #1529
- Force paged attention v2 for long contexts by @Yard1 in #1510
- docs: add description by @lots-o in #1553
- Added logits processor API to sampling params by @noamgat in #1469
- YaRN support implementation by @Yard1 in #1264
- Add Quantization and AutoAWQ to docs by @casper-hansen in #1235
- Support Yi model by @esmeetu in #1567
- ChatGLM2 Support by @GoHomeToMacDonal in #1261
- Upgrade to CUDA 12 by @zhuohan123 in #1527
- [Worker] Fix input_metadata.selected_token_indices in worker by @ymwangg in #1546
- Build CUDA11.8 wheels for release by @WoosukKwon in #1596
- Add Yi model to quantization support by @forpanyang in #1600
- Dockerfile: Upgrade Cuda to 12.1 by @GhaziSyed in #1609
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` by @irasin in #1617
- Fix cpu heavy code in async function _AsyncLLMEngine._run_workers_async by @dominik-schwabe in #1628
- Fix #1474 - gptj AssertionError : assert param_slice.shape == loaded_weight.shape by @lihuahua123 in #1631
- [Minor] Move RoPE selection logic to `get_rope` by @WoosukKwon in #1633
- Add DeepSpeed MII backend to benchmark script by @WoosukKwon in #1649
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models by @zhuohan123 in #1622
- Remove `MptConfig` by @megha95 in #1668
- feat(config): support parsing torch.dtype by @aarnphm in #1641
- Fix loading error when safetensors contains empty tensor by @twaka in #1687
- [Minor] Fix duplication of ignored seq group in engine step by @simon-mo in #1666
- [models] Microsoft Phi 1.5 by @maximzubkov in #1664
- [Fix] Update Supported Models List by @zhuohan123 in #1690
- Return usage for openai requests by @ichernev in #1663
- [Fix] Fix comm test by @zhuohan123 in #1691
- Update the adding-model doc according to the new refactor by @zhuohan123 in #1692
- Add 'not' to this annotation: "#FIXME(woosuk): Do not use internal method" by @linotfan in #1704
- Support Min P Sampler by @esmeetu in #1642
- Read quantization_config in hf config by @WoosukKwon in #1695
- Support download models from www.modelscope.cn by @liuyhwangyh in #1588
- follow up of #1687 when safetensors model contains 0-rank tensors by @twaka in #1696
- Add AWQ support for all models by @WoosukKwon in #1714
- Support fused add rmsnorm for LLaMA by @beginlner in #1667
- [Fix] Fix warning msg on quantization by @WoosukKwon in #1715
- Bump up the version to v0.2.2 by @WoosukKwon in #1689
New Contributors
- @lxrite made their first contribution in #1427
- @thiagosalvatore made their first contribution in #1429
- @chooper1 made their first contribution in #1326
- @beginlner made their first contribution in #1424
- @iongpt made their first contribution in #1413
- @Tostino made their first contribution in #1494
- @jroesch made their first contribution in #1509
- @skrider made their first contribution in #1350
- @cadedaniel made their first contribution in #1528
- @wenfeiy-db made their first contribution in #1526
- @Fluder-Paradyne made their first contribution in #1540
- @lots-o made their first contribution in #1553
- @noamgat made their first contribution in #1469
- @casper-hansen made their first contribution in #1235
- @GoHomeToMacDonal made their first contribution in #1261
- @ymwangg made their first contribution in #1546
- @forpanyang made their first contribution in #1600
- @GhaziSyed made their first contribution in #1609
- @irasin made their first contribution in #1617
- @dominik-schwabe made their first contribution in #1628
- @lihuahua123 made their first contribution in #1631
- @megha95 made their first contribution in #1668
- @aarnphm made their first contribution in #1641
- @simon-mo made their first contribution in #1666
- @maximzubkov made their first contribution in #1664
- @ichernev made their first contribution in #1663
- @linotfan made their first contribution in #1704
- @liuyhwangyh made their first contribution in #1588
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
This is an emergency release to fix a bug on tensor parallelism support.