LighterLLM is a simplified version (under 2.5k lines of code) of LightLLM's first commit, designed for educational purposes: it helps you understand the core concepts behind an LLM inference server that handles concurrent requests.
All rights and original work belong to the LightLLM team. This is an educational adaptation of their excellent work. For the original project, please refer to README_ORIG.md and the LightLLM project.
We need a Linux machine with an NVIDIA GPU. First, let's set up a virtual environment and install the dependencies:
```bash
git clone https://github.com/larme/lighterllm.git
cd lighterllm && python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
```
Then we can run an offline model inference example:
```bash
python -m examples.naive_model_executor meta-llama/Llama-3.2-3B-Instruct
```
We can edit `examples/naive_model_executor.py` and print out information to study how the model executor works with a naive scheduler.
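As a rough mental model (this is an illustrative sketch, not the actual contents of the example file; `model.prefill` and `model.decode` are hypothetical placeholders), a naive scheduler prefills all prompts as one batch and then runs decode steps until every sequence finishes:

```python
def run_naive(model, tokenizer, prompts, max_new_tokens=64):
    # Prefill: run the full prompts through the model once and build the KV cache.
    input_ids = [tokenizer.encode(p) for p in prompts]
    next_tokens, kv_cache = model.prefill(input_ids)

    outputs = [[t] for t in next_tokens]
    finished = [t == tokenizer.eos_token_id for t in next_tokens]

    # Decode: one token per sequence per step, reusing the KV cache.
    for _ in range(max_new_tokens - 1):
        next_tokens, kv_cache = model.decode(next_tokens, kv_cache)
        for i, tok in enumerate(next_tokens):
            if finished[i]:
                continue
            outputs[i].append(tok)
            if tok == tokenizer.eos_token_id:
                finished[i] = True
        if all(finished):
            break

    return [tokenizer.decode(o) for o in outputs]
```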
We can also start a multiprocess server by running:
```bash
python -m lightllm.server.api_server --model_dir meta-llama/Llama-3.2-3B-Instruct --max_total_token_num 8192 --host 0.0.0.0
```
Then query the server by running:
```bash
curl 127.0.0.1:8000/generate \
  -X POST \
  -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":1024, "frequency_penalty":1}}' \
  -H 'Content-Type: application/json'
```
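Or, equivalently, query the same `/generate` endpoint from Python (requires the `requests` package):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "inputs": "What is AI?",
        "parameters": {"max_new_tokens": 1024, "frequency_penalty": 1},
    },
)
print(resp.json())
```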
I chose LightLLM as the foundation for this educational project because it incorporates most of the modern features essential for efficient LLM serving: continuous batching, KV caching, paged/token attention, and a multiprocess architecture. In that architecture, tokenization, model inference, and detokenization run in separate processes, so the model inference process can focus on model execution and maximize GPU utilization. In fact, I believe LightLLM was the first framework to use this multiprocess architecture; SGLang was inspired by LightLLM, and vLLM has since migrated to a similar approach.
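A rough sketch of that multiprocess idea (not LightLLM's actual code; the worker bodies are trivial placeholders) looks like this: each stage runs in its own process and the stages communicate through queues, so the GPU-bound inference process never blocks on CPU-side text handling.

```python
import multiprocessing as mp

def tokenize_worker(text_q, token_q):
    while True:
        text = text_q.get()
        token_q.put([ord(c) for c in text])            # placeholder for real tokenization

def inference_worker(token_q, output_q):
    while True:
        tokens = token_q.get()
        output_q.put(list(reversed(tokens)))           # placeholder for the model forward pass

def detokenize_worker(output_q, result_q):
    while True:
        tokens = output_q.get()
        result_q.put("".join(chr(t) for t in tokens))  # placeholder for detokenization

if __name__ == "__main__":
    text_q, token_q, output_q, result_q = (mp.Queue() for _ in range(4))
    workers = [
        mp.Process(target=tokenize_worker, args=(text_q, token_q), daemon=True),
        mp.Process(target=inference_worker, args=(token_q, output_q), daemon=True),
        mp.Process(target=detokenize_worker, args=(output_q, result_q), daemon=True),
    ]
    for w in workers:
        w.start()
    text_q.put("hello")
    print(result_q.get())  # the placeholder pipeline returns "olleh"
```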
Also, early versions of LightLLM's model executor kernels are written in OpenAI Triton, so they are easier to understand than CUDA kernels.
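To get a feel for why Triton kernels are approachable, here is a generic vector-add kernel (the standard Triton tutorial example, not a LightLLM kernel): the whole thing is Python-like code with explicit blocking and masking, instead of a separate CUDA toolchain.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard against out-of-bounds reads
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```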
I made the following changes:
- Simplified the model executor code:
  - Kept only Llama family model support
  - Removed tensor parallelism
  - Removed code for old Triton versions
- QoL improvements:
  - Llama 3.1/3.2 support
  - safetensors support
  - Auto-set `eos_id`
  - Hugging Face model tag support
- Offline model executor example and other study materials (WIP)