LighterLLM is a simplified version (under 2.5k lines of code) of LightLLM's first commit, designed for educational purposes: it helps you understand the core concepts behind an LLM inference server that handles concurrent requests.
All rights and original work belong to the LightLLM team. This is an educational adaptation of their excellent work. For the original project, please refer to README_ORIG.md and the LightLLM project.
We need a Linux machine with an NVIDIA GPU. First, let's set up a virtual environment and install the dependencies:
```bash
git clone https://github.com/larme/lighterllm.git
cd lighterllm && python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
```
Then we can run an offline model inference example:
```bash
python -m examples.naive_model_executor meta-llama/Llama-3.2-3B-Instruct
```
We can edit `examples/naive_model_executor.py` and print out information to study how the model executor works with a naive scheduler.
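As a rough mental model (this is an illustrative sketch, not the actual contents of the example file; `model.prefill` and `model.decode` are hypothetical placeholders), a naive scheduler prefills all prompts as one batch and then runs decode steps until every sequence finishes:

```python
def run_naive(model, tokenizer, prompts, max_new_tokens=64):
    # Prefill: run the full prompts through the model once and build the KV cache.
    input_ids = [tokenizer.encode(p) for p in prompts]
    next_tokens, kv_cache = model.prefill(input_ids)

    outputs = [[t] for t in next_tokens]
    finished = [t == tokenizer.eos_token_id for t in next_tokens]

    # Decode: one token per sequence per step, reusing the KV cache.
    for _ in range(max_new_tokens - 1):
        next_tokens, kv_cache = model.decode(next_tokens, kv_cache)
        for i, tok in enumerate(next_tokens):
            if finished[i]:
                continue
            outputs[i].append(tok)
            if tok == tokenizer.eos_token_id:
                finished[i] = True
        if all(finished):
            break

    return [tokenizer.decode(o) for o in outputs]
```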
We can also start a multiprocess server by running:
```bash
python -m lightllm.server.api_server --model_dir meta-llama/Llama-3.2-3B-Instruct --max_total_token_num 8192 --host 0.0.0.0
```
Then query the server by running:
```bash
curl 127.0.0.1:8000/generate \
  -X POST \
  -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":1024, "frequency_penalty":1}}' \
  -H 'Content-Type: application/json'
```
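Or, equivalently, query the same `/generate` endpoint from Python (requires the `requests` package):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "inputs": "What is AI?",
        "parameters": {"max_new_tokens": 1024, "frequency_penalty": 1},
    },
)
print(resp.json())
```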
I chose LightLLM as the foundation for this educational project because it incorporates most of the modern features essential for efficient LLM serving: continuous batching, KV caching, paged/token attention, and a multiprocess architecture. In that architecture, tokenization, model inference, and detokenization run in separate processes, so the model inference process can focus on model execution and maximize GPU utilization. In fact, I believe LightLLM was the first framework to use this multiprocess architecture; SGLang was inspired by LightLLM, and vLLM has since migrated to a similar approach.
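A rough sketch of that multiprocess idea (not LightLLM's actual code; the worker bodies are trivial placeholders) looks like this: each stage runs in its own process and the stages communicate through queues, so the GPU-bound inference process never blocks on CPU-side text handling.

```python
import multiprocessing as mp

def tokenize_worker(text_q, token_q):
    while True:
        text = text_q.get()
        token_q.put([ord(c) for c in text])            # placeholder for real tokenization

def inference_worker(token_q, output_q):
    while True:
        tokens = token_q.get()
        output_q.put(list(reversed(tokens)))           # placeholder for the model forward pass

def detokenize_worker(output_q, result_q):
    while True:
        tokens = output_q.get()
        result_q.put("".join(chr(t) for t in tokens))  # placeholder for detokenization

if __name__ == "__main__":
    text_q, token_q, output_q, result_q = (mp.Queue() for _ in range(4))
    workers = [
        mp.Process(target=tokenize_worker, args=(text_q, token_q), daemon=True),
        mp.Process(target=inference_worker, args=(token_q, output_q), daemon=True),
        mp.Process(target=detokenize_worker, args=(output_q, result_q), daemon=True),
    ]
    for w in workers:
        w.start()
    text_q.put("hello")
    print(result_q.get())  # the placeholder pipeline returns "olleh"
```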
Also, early versions of LightLLM's model executor kernels are written in OpenAI Triton, so they are easier to understand than CUDA kernels.
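To get a feel for why Triton kernels are approachable, here is a generic vector-add kernel (the standard Triton tutorial example, not a LightLLM kernel): the whole thing is Python-like code with explicit blocking and masking, instead of a separate CUDA toolchain.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard against out-of-bounds reads
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```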
I made the following changes:
- Simplified the model executor code:
  - Kept only Llama family model support
  - Removed tensor parallelism
  - Removed code for old Triton versions
- QoL improvements:
  - Llama 3.1/3.2 support
  - safetensors support
  - Auto-set `eos_id`
  - Hugging Face model tag support
- Offline model executor example and other study materials (WIP)