Releases · vllm-project/vllm
v0.3.2
Major Changes
This version adds support for the OLMo and Gemma models, as well as a per-request seed parameter.
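For the new per-request seed, here is a minimal offline-inference sketch; the model name is a placeholder, and it assumes the seed in `SamplingParams` only affects sampling for that request:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Fixing the seed makes sampling for this request reproducible across runs,
# without affecting other requests served by the same engine.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32, seed=42)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```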
What's Changed
- Defensively copy `sampling_params` by @njhill in #2881
- multi-LoRA as extra models in OpenAI server by @jvmncs in #2775
- Add code-revision config argument for Hugging Face Hub by @mbm-ai in #2892
- [Minor] Small fix to make distributed init logic in worker looks cleaner by @zhuohan123 in #2905
- [Test] Add basic correctness test by @zhuohan123 in #2908
- Support OLMo models. by @Isotr0py in #2832
- Add warning to prevent changes to benchmark api server by @simon-mo in #2858
- Fix `vllm:prompt_tokens_total` metric calculation by @ronensc in #2869
- [ROCm] include gfx908 as supported by @jamestwhedbee in #2792
- [FIX] Fix beam search test by @zhuohan123 in #2930
- Make vLLM logging formatting optional by @Yard1 in #2877
- Add metrics to RequestOutput by @Yard1 in #2876
- Add Gemma model by @xiangxu-google in #2964
- Upgrade transformers to v4.38.0 by @WoosukKwon in #2965
- [FIX] Add Gemma model to the doc by @zhuohan123 in #2966
- [ROCm] Upgrade transformers to v4.38.0 by @WoosukKwon in #2967
- Support per-request seed by @njhill in #2514
- Bump up version to v0.3.2 by @zhuohan123 in #2968
New Contributors
- @jvmncs made their first contribution in #2775
- @mbm-ai made their first contribution in #2892
- @Isotr0py made their first contribution in #2832
- @jamestwhedbee made their first contribution in #2792
Full Changelog: v0.3.1...v0.3.2
v0.3.1
Major Changes
This version fixes the following major bugs:
- A memory leak with distributed execution (solved by using CuPy for collective communication).
- Broken Python 3.8 support.
It also includes the many smaller bug fixes listed below.
What's Changed
- Fixes assertion failure in prefix caching: the lora index mapping should respect `prefix_len`. by @sighingnow in #2688
- fix some bugs about parameter description by @zspo in #2689
- [Minor] Fix test_cache.py CI test failure by @pcmoritz in #2684
- Add unit test for Mixtral MoE layer by @pcmoritz in #2677
- Refactor Prometheus and Add Request Level Metrics by @rib-2 in #2316
- Add Internlm2 by @Leymore in #2666
- Fix compile error when using rocm by @zhaoyang-star in #2648
- fix python 3.8 syntax by @simon-mo in #2716
- Update README for meetup slides by @simon-mo in #2718
- Use revision when downloading the quantization config file by @Pernekhan in #2697
- remove hardcoded `device="cuda"` to support more device by @jikunshang in #2503
- fix length_penalty default value to 1.0 by @zspo in #2667
- Add one example to run batch inference distributed on Ray by @c21 in #2696
- docs: update langchain serving instructions by @mspronesti in #2736
- Set&Get llm internal tokenizer instead of the TokenizerGroup by @dancingpipi in #2741
- Remove eos tokens from output by default by @zcnrex in #2611
- add requirement: triton >= 2.1.0 by @whyiug in #2746
- [Minor] Fix benchmark_latency by @WoosukKwon in #2765
- [ROCm] Fix some kernels failed unit tests by @hongxiayang in #2498
- Set local logging level via env variable by @gardberg in #2774
- [ROCm] Fixup arch checks for ROCM by @dllehr-amd in #2627
- Add fused top-K softmax kernel for MoE by @WoosukKwon in #2769
- fix issue when model parameter is not a model id but path of the model. by @liuyhwangyh in #2489
- [Minor] More fix of test_cache.py CI test failure by @LiuXiaoxuanPKU in #2750
- [ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support by @hongxiayang in #2790
- Add documentation on how to do incremental builds by @pcmoritz in #2796
- [Ray] Integration compiled DAG off by default by @rkooo567 in #2471
- Disable custom all reduce by default by @WoosukKwon in #2808
- [ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention by @hongxiayang in #2768
- Add documentation section about LoRA by @pcmoritz in #2834
- Refactor 2 awq gemm kernels into m16nXk32 by @zcnrex in #2723
- Serving Benchmark Refactoring by @ywang96 in #2433
- [CI] Ensure documentation build is checked in CI by @simon-mo in #2842
- Refactor llama family models by @esmeetu in #2637
- Revert "Refactor llama family models" by @pcmoritz in #2851
- Use CuPy for CUDA graphs by @WoosukKwon in #2811
- Remove Yi model definition, please use `LlamaForCausalLM` instead by @pcmoritz in #2854
- Add LoRA support for Mixtral by @tterrysun in #2831
- Migrate InternLMForCausalLM to LlamaForCausalLM by @pcmoritz in #2860
- Fix internlm after #2860 by @pcmoritz in #2861
- [Fix] Fix memory profiling when GPU is used by multiple processes by @WoosukKwon in #2863
- Fix docker python version by @NikolaBorisov in #2845
- Migrate AquilaForCausalLM to LlamaForCausalLM by @esmeetu in #2867
- Don't use cupy NCCL for AMD backends by @WoosukKwon in #2855
- Align LoRA code between Mistral and Mixtral (fixes #2875) by @pcmoritz in #2880
- [BugFix] Fix GC bug for `LLM` class by @WoosukKwon in #2882
- Fix decilm.py by @pcmoritz in #2883
- [ROCm] Dockerfile fix for flash-attention build by @hongxiayang in #2885
- Prefix Caching- fix t4 triton error by @caoshiyi in #2517
- Bump up to v0.3.1 by @WoosukKwon in #2887
New Contributors
- @sighingnow made their first contribution in #2688
- @rib-2 made their first contribution in #2316
- @Leymore made their first contribution in #2666
- @Pernekhan made their first contribution in #2697
- @jikunshang made their first contribution in #2503
- @c21 made their first contribution in #2696
- @zcnrex made their first contribution in #2611
- @whyiug made their first contribution in #2746
- @gardberg made their first contribution in #2774
- @dllehr-amd made their first contribution in #2627
- @rkooo567 made their first contribution in #2471
- @ywang96 made their first contribution in #2433
- @tterrysun made their first contribution in #2831
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Major Changes
- Experimental multi-lora support (example below)
- Experimental prefix caching support
- FP8 KV Cache support
- Optimized MoE performance and Deepseek MoE support
- CI tested PRs
- Support batch completion in server
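For the experimental multi-LoRA support, a minimal sketch of attaching an adapter to a single request; the base model name and adapter path are placeholders, and the API surface was still experimental at this point:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# LoRA serving must be enabled when the engine is constructed.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder base model

outputs = llm.generate(
    ["Translate to SQL: list all users created this week"],
    SamplingParams(temperature=0.0, max_tokens=64),
    # LoRARequest(adapter name, unique integer id, local path to adapter weights)
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```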
What's Changed
- Miner fix of type hint by @beginlner in #2340
- Build docker image with shared objects from "build" step by @payoto in #2237
- Ensure metrics are logged regardless of requests by @ichernev in #2347
- Changed scheduler to use deques instead of lists by @NadavShmayo in #2290
- Fix eager mode performance by @WoosukKwon in #2377
- [Minor] Remove unused code in attention by @WoosukKwon in #2384
- Add baichuan chat template jinjia file by @EvilPsyCHo in #2390
- [Speculative decoding 1/9] Optimized rejection sampler by @cadedaniel in #2336
- Fix ipv4 ipv6 dualstack by @yunfeng-scale in #2408
- [Minor] Rename phi_1_5 to phi by @WoosukKwon in #2385
- [DOC] Add additional comments for LLMEngine and AsyncLLMEngine by @litone01 in #1011
- [Minor] Fix the format in quick start guide related to Model Scope by @zhuohan123 in #2425
- Add gradio chatbot for openai webserver by @arkohut in #2307
- [BUG] RuntimeError: deque mutated during iteration in abort_seq_group by @chenxu2048 in #2371
- Allow setting fastapi root_path argument by @chiragjn in #2341
- Address Phi modeling update 2 by @huiwy in #2428
- Update a more user-friendly error message, offering more considerate advice for beginners, when using V100 GPU #1901 by @chuanzhubin in #2374
- Update quickstart.rst with small clarifying change (fix typo) by @nautsimon in #2369
- Aligning `top_p` and `top_k` Sampling by @chenxu2048 in #1885
- [Minor] Fix err msg by @WoosukKwon in #2431
- [Minor] Optimize cuda graph memory usage by @esmeetu in #2437
- [CI] Add Buildkite by @simon-mo in #2355
- Announce the second vLLM meetup by @WoosukKwon in #2444
- Allow buildkite to retry build on agent lost by @simon-mo in #2446
- Fix weigit loading for GQA with TP by @zhangch9 in #2379
- CI: make sure benchmark script exit on error by @simon-mo in #2449
- ci: retry on build failure as well by @simon-mo in #2457
- Add StableLM3B model by @ita9naiwa in #2372
- OpenAI refactoring by @FlorianJoncour in #2360
- [Experimental] Prefix Caching Support by @caoshiyi in #1669
- fix stablelm.py tensor-parallel-size bug by @YingchaoX in #2482
- Minor fix in prefill cache example by @JasonZhu1313 in #2494
- fix: fix some args desc by @zspo in #2487
- [Neuron] Add an option to build with neuron by @liangfu in #2065
- Don't download both safetensor and bin files. by @NikolaBorisov in #2480
- [BugFix] Fix abort_seq_group by @beginlner in #2463
- refactor completion api for readability by @simon-mo in #2499
- Support OpenAI API server in `benchmark_serving.py` by @hmellor in #2172
- Simplify broadcast logic for control messages by @zhuohan123 in #2501
- [Bugfix] fix load local safetensors model by @esmeetu in #2512
- Add benchmark serving to CI by @simon-mo in #2505
- Add `group` as an argument in broadcast ops by @GindaChen in #2522
- [Fix] Keep `scheduler.running` as deque by @njhill in #2523
- migrate pydantic from v1 to v2 by @joennlae in #2531
- [Speculative decoding 2/9] Multi-step worker for draft model by @cadedaniel in #2424
- Fix "Port could not be cast to integer value as " by @pcmoritz in #2545
- Add qwen2 by @JustinLin610 in #2495
- Fix progress bar and allow HTTPS in `benchmark_serving.py` by @hmellor in #2552
- Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py by @JasonZhu1313 in #2553
- [Feature] Simple API token authentication by @taisazero in #1106
- Add multi-LoRA support by @Yard1 in #1804
- lint: format all python file instead of just source code by @simon-mo in #2567
- [Bugfix] fix crash if max_tokens=None by @NikolaBorisov in #2570
- Added `include_stop_str_in_output` and `length_penalty` parameters to OpenAI API by @galatolofederico in #2562
- [Doc] Fix the syntax error in the doc of supported_models. by @keli-wen in #2584
- Support Batch Completion in Server by @simon-mo in #2529
- fix names and license by @JustinLin610 in #2589
- [Fix] Use a correct device when creating OptionalCUDAGuard by @sh1ng in #2583
- [ROCm] add support to ROCm 6.0 and MI300 by @hongxiayang in #2274
- Support for Stable LM 2 by @dakotamahan-stability in #2598
- Don't build punica kernels by default by @pcmoritz in #2605
- AWQ: Up to 2.66x higher throughput by @casper-hansen in #2566
- Use head_dim in config if exists by @xiangxu-google in #2622
- Custom all reduce kernels by @hanzhi713 in #2192
- [Minor] Fix warning on Ray dependencies by @WoosukKwon in #2630
- Speed up Punica compilation by @WoosukKwon in #2632
- Small async_llm_engine refactor by @andoorve in #2618
- Update Ray version requirements by @simon-mo in #2636
- Support FP8-E5M2 KV Cache by @zhaoyang-star in #2279
- Fix error when tp > 1 by @zhaoyang-star in #2644
- No repeated IPC open by @hanzhi713 in #2642
- ROCm: Allow setting compilation target by @rlrs in #2581
- DeepseekMoE support with Fused MoE kernel by @zwd003 in #2453
- Fused MOE for Mixtral by @pcmoritz in #2542
- Fix 'Actor methods cannot be called directly' when using `--engine-use-ray` by @HermitSun in #2664
- Add swap_blocks unit tests by @sh1ng in #2616
- Fix a small typo (tenosr -> tensor) by @pcmoritz in #2672
- [Minor] Fix false warning when TP=1 by @WoosukKwon in #2674
- Add quantized mixtral support by @WoosukKwon in #2673
- Bump up version to v0.3.0 by @zhuohan123 in #2656
New Contributors
- @payoto made their first contribution in #2237
- @NadavShmayo made their first contribution in #2290
- @EvilPsyCHo made their first contribution in #2390
- @litone01 made their first contribution in #1011
- @arkohut made their first contribution in #2307
- @chiragjn made their first contribution in #2341
- @huiwy made their first contribution in #2428
- @chuanzhubin made their first contribution in #2374
- @nautsimon made their first contribution in #2369
- @zhangch9 made their first contribution in #2379
- @ita9naiwa made their first contribution in #2372
- @caoshiyi made their first contribution in https://gi...
v0.2.7
Major Changes
- Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
- Fix tensor parallelism support for Mixtral + GPTQ/AWQ (example below)
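For the Mixtral + GPTQ/AWQ tensor-parallelism fix, a minimal sketch of loading an AWQ-quantized Mixtral checkpoint across two GPUs; the checkpoint name is illustrative:

```python
from vllm import LLM, SamplingParams

# A quantized Mixtral checkpoint sharded across 2 GPUs: the combination
# addressed by the tensor-parallelism fix in this release.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```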
What's Changed
- Minor fix for gpu-memory-utilization description by @SuhongMoon in #2162
- [BugFix] Raise error when max_model_len is larger than KV cache size by @WoosukKwon in #2163
- [BugFix] Fix RoPE kernel on long sequences by @WoosukKwon in #2164
- Add SSL arguments to API servers by @hmellor in #2109
- typo fix by @oushu1zhangxiangxuan1 in #2166
- [ROCm] Fixes for GPTQ on ROCm by @kliuae in #2180
- Update Help Text for --gpu-memory-utilization Argument by @SuhongMoon in #2183
- [Minor] Add warning on CUDA graph memory usage by @WoosukKwon in #2182
- Added DeciLM-7b and DeciLM-7b-instruct by @avideci in #2062
- [BugFix] Fix weight loading for Mixtral with TP by @WoosukKwon in #2208
- Make _prepare_sample non blocking and pin memory of CPU input buffers by @hanzhi713 in #2207
- Remove Sampler copy stream by @Yard1 in #2209
- Fix a broken link by @ronensc in #2222
- Disable Ray usage stats collection by @WoosukKwon in #2206
- [BugFix] Fix recovery logic for sequence group by @WoosukKwon in #2186
- Update installation instructions to include CUDA 11.8 xFormers by @skt7 in #2246
- Add "About" Heading to README.md by @blueceiling in #2260
- [BUGFIX] Do not return ignored sentences twice in async llm engine by @zhuohan123 in #2258
- [BUGFIX] Fix API server test by @zhuohan123 in #2270
- [BUGFIX] Fix the path of test prompts by @zhuohan123 in #2273
- [BUGFIX] Fix communication test by @zhuohan123 in #2285
- Add support GPT-NeoX Models without attention biases by @dalgarak in #2301
- [FIX] Fix kernel bug by @jeejeelee in #1959
- fix typo and remove unused code by @esmeetu in #2305
- Enable CUDA graph for GPTQ & SqueezeLLM by @WoosukKwon in #2318
- Fix Gradio example: remove deprecated parameter `concurrency_count` by @ronensc in #2315
- Use NCCL instead of ray for control-plane communication to remove serialization overhead by @zhuohan123 in #2221
- Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK by @ronensc in #2321
- [Minor] Revert the changes in test_cache by @WoosukKwon in #2335
- Bump up to v0.2.7 by @WoosukKwon in #2337
New Contributors
- @SuhongMoon made their first contribution in #2162
- @hmellor made their first contribution in #2109
- @oushu1zhangxiangxuan1 made their first contribution in #2166
- @kliuae made their first contribution in #2180
- @avideci made their first contribution in #2062
- @hanzhi713 made their first contribution in #2207
- @ronensc made their first contribution in #2222
- @skt7 made their first contribution in #2246
- @blueceiling made their first contribution in #2260
- @dalgarak made their first contribution in #2301
Full Changelog: v0.2.6...v0.2.7
v0.2.6
Major changes
- Fast model execution with CUDA/HIP graph
- W4A16 GPTQ support (thanks to @chu-tianxiang; example below)
- Fix memory profiling with tensor parallelism
- Fix *.bin weight loading for Mixtral models
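For the new W4A16 GPTQ support, a minimal loading sketch; the checkpoint name is illustrative. Note that this release temporarily enforces eager mode for GPTQ models, so the new CUDA-graph execution applies to unquantized models, and `enforce_eager=True` can be passed to opt out of CUDA graphs explicitly.

```python
from vllm import LLM, SamplingParams

# W4A16 GPTQ: 4-bit weights, fp16 activations.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")  # illustrative checkpoint
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```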
What's Changed
- Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in #2100
- Fix Dockerfile.rocm by @tjtanaa in #2101
- avoid multiple redefinition by @MitchellX in #1817
- Add a flag to include stop string in output text by @yunfeng-scale in #1976
- Add GPTQ support by @chu-tianxiang in #916
- [Docs] Add quantization support to docs by @WoosukKwon in #2135
- [ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in #2138
- simplify loading weights logic by @esmeetu in #2133
- Optimize model execution with CUDA graph by @WoosukKwon in #1926
- [Minor] Delete Llama tokenizer warnings by @WoosukKwon in #2146
- Fix all-reduce memory usage by @WoosukKwon in #2151
- Pin PyTorch & xformers versions by @WoosukKwon in #2155
- Remove dependency on CuPy by @WoosukKwon in #2152
- [Docs] Add CUDA graph support to docs by @WoosukKwon in #2148
- Temporarily enforce eager mode for GPTQ models by @WoosukKwon in #2154
- [Minor] Add more detailed explanation on `quantization` argument by @WoosukKwon in #2145
- [Minor] Fix xformers version by @WoosukKwon in #2158
- [Minor] Add Phi 2 to supported models by @WoosukKwon in #2159
- Make sampler less blocking by @Yard1 in #1889
- [Minor] Fix a typo in .pt weight support by @WoosukKwon in #2160
- Disable CUDA graph for SqueezeLLM by @WoosukKwon in #2161
- Bump up to v0.2.6 by @WoosukKwon in #2157
New Contributors
- @mezuzza made their first contribution in #2100
- @MitchellX made their first contribution in #1817
Full Changelog: v0.2.5...v0.2.6
v0.2.5
Major changes
- Optimize Mixtral performance with expert parallelism (thanks to @Yard1)
- [BugFix] Fix input positions for long context with sliding window
What's Changed
- Update Dockerfile to support Mixtral by @simon-mo in #2027
- Remove python 3.10 requirement by @WoosukKwon in #2040
- [CI/CD] Upgrade PyTorch version to v2.1.1 by @WoosukKwon in #2045
- Upgrade transformers version to 4.36.0 by @WoosukKwon in #2046
- Remove einops from dependencies by @WoosukKwon in #2049
- gqa added to mpt attn by @megha95 in #1938
- Update Dockerfile to build Megablocks by @simon-mo in #2042
- Fix peak memory profiling by @WoosukKwon in #2031
- Implement lazy model loader by @WoosukKwon in #2044
- [ROCm] Upgrade xformers version dependency for ROCm; update documentations by @tjtanaa in #2079
- Update installation instruction for CUDA 11.8 by @WoosukKwon in #2086
- [Docs] Add notes on ROCm-supported models by @WoosukKwon in #2087
- [BugFix] Fix input positions for long context with sliding window by @WoosukKwon in #2088
- Mixtral expert parallelism by @Yard1 in #2090
- Bump up to v0.2.5 by @WoosukKwon in #2095
Full Changelog: v0.2.4...v0.2.5
v0.2.4
Major changes
- Mixtral model support (officially from @mistralai)
- AMD GPU support (collaboration with @EmbeddedLLM)
What's Changed
- add custom server params by @esmeetu in #1868
- support ChatGLMForConditionalGeneration by @dancingpipi in #1932
- Save pytorch profiler output for latency benchmark by @Yard1 in #1871
- Fix typo in adding_model.rst by @petergtz in #1947
- Make InternLM follow `rope_scaling` in `config.json` by @theFool32 in #1956
- Fix quickstart.rst example by @gottlike in #1964
- Adding number of nvcc_threads during build as envar by @AguirreNicolas in #1893
- fix typo in getenv call by @dskhudia in #1972
- [Continuation] Merge EmbeddedLLM/vllm-rocm into vLLM main by @tjtanaa in #1836
- Fix Baichuan2-7B-Chat by @firebook in #1987
- [Docker] Add cuda arch list as build option by @simon-mo in #1950
- Fix for KeyError on Loading LLaMA by @imgaojun in #1978
- [Minor] Fix code style for baichuan by @WoosukKwon in #2003
- Fix OpenAI server completion_tokens referenced before assignment by @js8544 in #1996
- [Minor] Add comment on skipping rope caches by @WoosukKwon in #2004
- Replace head_mapping params with num_kv_heads to attention kernel. by @wbn03 in #1997
- Fix completion API echo and logprob combo by @simon-mo in #1992
- Mixtral 8x7B support by @pierrestock in #2011
- Minor fixes for Mixtral by @WoosukKwon in #2015
- Change load format for Mixtral by @WoosukKwon in #2028
- Update run_on_sky.rst by @eltociear in #2025
- Update requirements.txt for mixtral by @0-hero in #2029
- Revert #2029 by @WoosukKwon in #2030
- [Minor] Fix latency benchmark script by @WoosukKwon in #2035
- [Minor] Fix type annotation in Mixtral by @WoosukKwon in #2036
- Update README.md to add megablocks requirement for mixtral by @0-hero in #2033
- [Minor] Fix import error msg for megablocks by @WoosukKwon in #2038
- Bump up to v0.2.4 by @WoosukKwon in #2034
New Contributors
- @dancingpipi made their first contribution in #1932
- @petergtz made their first contribution in #1947
- @theFool32 made their first contribution in #1956
- @gottlike made their first contribution in #1964
- @AguirreNicolas made their first contribution in #1893
- @dskhudia made their first contribution in #1972
- @tjtanaa made their first contribution in #1836
- @firebook made their first contribution in #1987
- @imgaojun made their first contribution in #1978
- @js8544 made their first contribution in #1996
- @wbn03 made their first contribution in #1997
- @pierrestock made their first contribution in #2011
- @0-hero made their first contribution in #2029
Full Changelog: v0.2.3...v0.2.4
v0.2.3
Major changes
- Refactoring on Worker, InputMetadata, and Attention
- Fix TP support for AWQ models
- Support Prometheus metrics (example below)
- Fix Baichuan & Baichuan 2
What's Changed
- Add instructions to install vllm+cu118 by @WoosukKwon in #1717
- Documentation about official docker image by @simon-mo in #1709
- Fix the code block's format in deploying_with_docker page by @HermitSun in #1722
- Migrate linter from `pylint` to `ruff` by @simon-mo in #1665
- [FIX] Update the doc link in README.md by @zhuohan123 in #1730
- [BugFix] Fix a bug in loading safetensors by @WoosukKwon in #1732
- Fix hanging in the scheduler caused by long prompts by @chenxu2048 in #1534
- [Fix] Fix bugs in scheduler by @linotfan in #1727
- Rewrite torch.repeat_interleave to remove cpu synchronization by @beginlner in #1599
- fix RAM OOM when load large models in tensor parallel mode. by @boydfd in #1395
- [BugFix] Fix TP support for AWQ by @WoosukKwon in #1731
- [FIX] Fix the case when `input_is_parallel=False` for `ScaledActivation` by @zhuohan123 in #1737
- Add stop_token_ids in SamplingParams.repr by @chenxu2048 in #1745
- [DOCS] Add engine args documentation by @casper-hansen in #1741
- Set top_p=0 and top_k=-1 in greedy sampling by @beginlner in #1748
- Fix repetition penalty aligned with huggingface by @beginlner in #1577
- [build] Avoid building too many extensions by @ymwangg in #1624
- [Minor] Fix model docstrings by @WoosukKwon in #1764
- Added echo function to OpenAI API server. by @wanmok in #1504
- Init model on GPU to reduce CPU memory footprint by @beginlner in #1796
- Correct comments in parallel_state.py by @explainerauthors in #1818
- Fix OPT weight loading by @WoosukKwon in #1819
- [FIX] Fix class naming by @zhuohan123 in #1803
- Move the definition of BlockTable a few lines above so we could use it in BlockAllocator by @explainerauthors in #1791
- [FIX] Fix formatting error in main branch by @zhuohan123 in #1822
- [Fix] Fix RoPE in ChatGLM-32K by @WoosukKwon in #1841
- Better integration with Ray Serve by @FlorianJoncour in #1821
- Refactor Attention by @WoosukKwon in #1840
- [Docs] Add information about using shared memory in docker by @simon-mo in #1845
- Disable Logs Requests should Disable Logging of requests. by @MichaelMcCulloch in #1779
- Refactor worker & InputMetadata by @WoosukKwon in #1843
- Avoid multiple instantiations of the RoPE class by @jeejeeli in #1828
- [FIX] Fix docker build error (#1831) by @allenhaozi in #1832
- Add profile option to latency benchmark by @WoosukKwon in #1839
- Remove `max_num_seqs` in latency benchmark by @WoosukKwon in #1855
- Support max-model-len argument for throughput benchmark by @aisensiy in #1858
- Fix rope cache key error by @esmeetu in #1867
- docs: add instructions for Langchain by @mspronesti in #1162
- Support chat template and `echo` for chat API by @Tostino in #1756
- Fix Baichuan tokenizer error by @WoosukKwon in #1874
- Add weight normalization for Baichuan 2 by @WoosukKwon in #1876
- Fix the typo in SamplingParams' docstring. by @xukp20 in #1886
- [Docs] Update the AWQ documentation to highlight performance issue by @simon-mo in #1883
- Fix the broken sampler tests by @WoosukKwon in #1896
- Add Production Metrics in Prometheus format by @simon-mo in #1890
- Add PyTorch-native implementation of custom layers by @WoosukKwon in #1898
- Fix broken worker test by @WoosukKwon in #1900
- chore(examples-docs): upgrade to OpenAI V1 by @mspronesti in #1785
- Fix num_gpus when TP > 1 by @WoosukKwon in #1852
- Bump up to v0.2.3 by @WoosukKwon in #1903
New Contributors
- @boydfd made their first contribution in #1395
- @explainerauthors made their first contribution in #1818
- @FlorianJoncour made their first contribution in #1821
- @MichaelMcCulloch made their first contribution in #1779
- @jeejeeli made their first contribution in #1828
- @allenhaozi made their first contribution in #1832
- @aisensiy made their first contribution in #1858
- @xukp20 made their first contribution in #1886
Full Changelog: v0.2.2...v0.2.3
v0.2.2
Major changes
- Bump up to PyTorch v2.1 + CUDA 12.1 (vLLM+CUDA 11.8 is also provided)
- Extensive refactoring for better tensor parallelism & quantization support
- New models: Yi, ChatGLM, Phi
- Changes in scheduler: from 1D flattened input tensor to 2D tensor
- AWQ support for all models
- Added LogitsProcessor API (example below)
- Preliminary support for SqueezeLLM
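For the new LogitsProcessor API, a minimal sketch; the callback signature (previously generated token ids, raw logits for the next token) and the placeholder model are assumptions worth checking against this release:

```python
from vllm import LLM, SamplingParams

def ban_token_42(token_ids, logits):
    # Receives the token ids generated so far and the logits for the next token,
    # and returns (possibly modified) logits. Here we forbid an arbitrary token id.
    logits[42] = -float("inf")
    return logits

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(max_tokens=16, logits_processors=[ban_token_42])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```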
What's Changed
- Change scheduler & input tensor shape by @WoosukKwon in #1381
- Add Mistral 7B to `test_models` by @WoosukKwon in #1366
- fix typo by @WrRan in #1383
- Fix TP bug by @WoosukKwon in #1389
- Fix type hints by @lxrite in #1427
- remove useless statements by @WrRan in #1408
- Pin dependency versions by @thiagosalvatore in #1429
- SqueezeLLM Support by @chooper1 in #1326
- aquila model add rope_scaling by @Sanster in #1457
- fix: don't skip first special token. by @gesanqiu in #1497
- Support repetition_penalty by @beginlner in #1424
- Fix bias in InternLM by @WoosukKwon in #1501
- Delay GPU->CPU sync in sampling by @Yard1 in #1337
- Refactor LLMEngine demo script for clarity and modularity by @iongpt in #1413
- Fix logging issues by @Tostino in #1494
- Add py.typed so consumers of vLLM can get type checking by @jroesch in #1509
- vLLM always places spaces between special tokens by @blahblahasdf in #1373
- [Fix] Fix duplicated logging messages by @zhuohan123 in #1524
- Add dockerfile by @skrider in #1350
- Fix integer overflows in attention & cache ops by @WoosukKwon in #1514
- [Small] Formatter only checks lints in changed files by @cadedaniel in #1528
- Add `MptForCausalLM` key in model_loader by @wenfeiy-db in #1526
- [BugFix] Fix a bug when engine_use_ray=True and worker_use_ray=False and TP>1 by @beginlner in #1531
- Adding a health endpoint by @Fluder-Paradyne in #1540
- Remove `MPTConfig` by @WoosukKwon in #1529
- Force paged attention v2 for long contexts by @Yard1 in #1510
- docs: add description by @lots-o in #1553
- Added logits processor API to sampling params by @noamgat in #1469
- YaRN support implementation by @Yard1 in #1264
- Add Quantization and AutoAWQ to docs by @casper-hansen in #1235
- Support Yi model by @esmeetu in #1567
- ChatGLM2 Support by @GoHomeToMacDonal in #1261
- Upgrade to CUDA 12 by @zhuohan123 in #1527
- [Worker] Fix input_metadata.selected_token_indices in worker by @ymwangg in #1546
- Build CUDA11.8 wheels for release by @WoosukKwon in #1596
- Add Yi model to quantization support by @forpanyang in #1600
- Dockerfile: Upgrade Cuda to 12.1 by @GhaziSyed in #1609
- config parser: add ChatGLM2 seq_length to `_get_and_verify_max_len` by @irasin in #1617
- Fix cpu heavy code in async function _AsyncLLMEngine._run_workers_async by @dominik-schwabe in #1628
- Fix #1474 - gptj AssertionError : assert param_slice.shape == loaded_weight.shape by @lihuahua123 in #1631
- [Minor] Move RoPE selection logic to `get_rope` by @WoosukKwon in #1633
- Add DeepSpeed MII backend to benchmark script by @WoosukKwon in #1649
- TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models by @zhuohan123 in #1622
- Remove `MptConfig` by @megha95 in #1668
- feat(config): support parsing torch.dtype by @aarnphm in #1641
- Fix loading error when safetensors contains empty tensor by @twaka in #1687
- [Minor] Fix duplication of ignored seq group in engine step by @simon-mo in #1666
- [models] Microsoft Phi 1.5 by @maximzubkov in #1664
- [Fix] Update Supported Models List by @zhuohan123 in #1690
- Return usage for openai requests by @ichernev in #1663
- [Fix] Fix comm test by @zhuohan123 in #1691
- Update the adding-model doc according to the new refactor by @zhuohan123 in #1692
- Add 'not' to this annotation: "#FIXME(woosuk): Do not use internal method" by @linotfan in #1704
- Support Min P Sampler by @esmeetu in #1642
- Read quantization_config in hf config by @WoosukKwon in #1695
- Support download models from www.modelscope.cn by @liuyhwangyh in #1588
- follow up of #1687 when safetensors model contains 0-rank tensors by @twaka in #1696
- Add AWQ support for all models by @WoosukKwon in #1714
- Support fused add rmsnorm for LLaMA by @beginlner in #1667
- [Fix] Fix warning msg on quantization by @WoosukKwon in #1715
- Bump up the version to v0.2.2 by @WoosukKwon in #1689
New Contributors
- @lxrite made their first contribution in #1427
- @thiagosalvatore made their first contribution in #1429
- @chooper1 made their first contribution in #1326
- @beginlner made their first contribution in #1424
- @iongpt made their first contribution in #1413
- @Tostino made their first contribution in #1494
- @jroesch made their first contribution in #1509
- @skrider made their first contribution in #1350
- @cadedaniel made their first contribution in #1528
- @wenfeiy-db made their first contribution in #1526
- @Fluder-Paradyne made their first contribution in #1540
- @lots-o made their first contribution in #1553
- @noamgat made their first contribution in #1469
- @casper-hansen made their first contribution in #1235
- @GoHomeToMacDonal made their first contribution in #1261
- @ymwangg made their first contribution in #1546
- @forpanyang made their first contribution in #1600
- @GhaziSyed made their first contribution in #1609
- @irasin made their first contribution in #1617
- @dominik-schwabe made their first contribution in #1628
- @lihuahua123 made their first contribution in #1631
- @megha95 made their first contribution in #1668
- @aarnphm made their first contribution in #1641
- @simon-mo made their first contribution in #1666
- @maximzubkov made their first contribution in #1664
- @ichernev made their first contribution in #1663
- @linotfan made their first contribution in #1704
- @liuyhwangyh made their first contribution in #1588
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
This is an emergency release to fix a bug on tensor parallelism support.