GitHub - jjyao/llama2.rs: A Rust implementation of Llama 2 inference engine.

A Rust implementation of a Llama 2 inference engine.

Instructions

First download the model:

wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin

and then run:

cargo run --release -- --checkpoint stories15M.bin --temperature 0.0 --steps 256 --prompt "Once upon a time" --tp-size=2

Key and value tensors of previous tokens are cached.
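
As a rough illustration of the caching idea, here is a minimal sketch of a per-layer key/value cache. The type `KvCache`, its methods, and the flat `[position][kv_dim]` layout are assumptions made for this example and are not taken from the repository's source.

```rust
/// Hypothetical key/value cache (illustrative only, not the repository's type).
pub struct KvCache {
    kv_dim: usize,
    max_seq_len: usize,
    // One flat buffer per layer, laid out as [position][kv_dim].
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

impl KvCache {
    pub fn new(n_layers: usize, max_seq_len: usize, kv_dim: usize) -> Self {
        let buf = vec![0.0f32; max_seq_len * kv_dim];
        KvCache {
            kv_dim,
            max_seq_len,
            keys: vec![buf.clone(); n_layers],
            values: vec![buf; n_layers],
        }
    }

    /// Store the key and value vectors computed for the token at `pos`.
    pub fn store(&mut self, layer: usize, pos: usize, k: &[f32], v: &[f32]) {
        assert!(pos < self.max_seq_len, "position exceeds cache capacity");
        assert_eq!(k.len(), self.kv_dim);
        assert_eq!(v.len(), self.kv_dim);
        let start = pos * self.kv_dim;
        self.keys[layer][start..start + self.kv_dim].copy_from_slice(k);
        self.values[layer][start..start + self.kv_dim].copy_from_slice(v);
    }

    /// Cached keys for positions 0..=pos, which attention at `pos` reads.
    pub fn keys_up_to(&self, layer: usize, pos: usize) -> &[f32] {
        &self.keys[layer][..(pos + 1) * self.kv_dim]
    }

    /// Cached values for positions 0..=pos.
    pub fn values_up_to(&self, layer: usize, pos: usize) -> &[f32] {
        &self.values[layer][..(pos + 1) * self.kv_dim]
    }
}
```

With a cache like this, each new token only needs its own key and value computed; attention then reads the stored entries for all earlier positions instead of recomputing them.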

Prefill and decode stages are separated.
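
The split between the two stages could look roughly like the sketch below. Everything here is an assumption for illustration: `generate`, the `forward(token, pos)` closure standing in for the model, greedy sampling (matching `--temperature 0.0`), and `steps` interpreted as the total sequence length including the prompt. This sketch feeds prompt tokens one at a time during prefill; the actual engine may process them differently.

```rust
/// Index of the largest logit (greedy sampling).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in logits.iter().enumerate() {
        if x > logits[best] {
            best = i;
        }
    }
    best
}

/// Hypothetical generation loop with separate prefill and decode stages.
/// `forward(token, pos)` is assumed to update the KV cache and return logits.
pub fn generate<F>(mut forward: F, prompt: &[usize], steps: usize) -> Vec<usize>
where
    F: FnMut(usize, usize) -> Vec<f32>,
{
    assert!(!prompt.is_empty(), "prompt must contain at least one token");
    let mut tokens = prompt.to_vec();

    // Prefill: run every prompt token through the model to fill the cache.
    // Only the logits of the last prompt token are needed afterwards.
    let mut logits = Vec::new();
    for (pos, &tok) in prompt.iter().enumerate() {
        logits = forward(tok, pos);
    }

    // Decode: pick one token at a time and feed it back in.
    while tokens.len() < steps {
        let next = argmax(&logits);
        tokens.push(next);
        if tokens.len() == steps {
            break; // no need to run the model for a token we will not use
        }
        logits = forward(next, tokens.len() - 1);
    }
    tokens
}
```

Keeping the stages apart makes it easy to treat them differently, for example by discarding intermediate logits during prefill while sampling from every step during decode.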

Each tensor-parallel (TP) worker runs on its own thread, and the all-reduce collective is implemented by message passing over channels.
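
One way to build a sum all-reduce from threads and channels is sketched below using `std::sync::mpsc`. The `Comm` type, the `run_workers` helper, and the star topology (rank 0 gathers the partial vectors, sums them, and sends the total back) are assumptions chosen for brevity; the repository may use a different communication pattern.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical per-worker communication endpoint (illustrative only).
pub enum Comm {
    Root {
        from_workers: Receiver<Vec<f32>>,
        to_workers: Vec<Sender<Vec<f32>>>,
    },
    Worker {
        to_root: Sender<Vec<f32>>,
        from_root: Receiver<Vec<f32>>,
    },
}

impl Comm {
    /// Build one endpoint per rank; index 0 is the root.
    pub fn create(tp_size: usize) -> Vec<Comm> {
        assert!(tp_size >= 1, "need at least one worker");
        let (to_root, from_workers) = channel();
        let mut to_workers = Vec::new();
        let mut workers = Vec::new();
        for _ in 1..tp_size {
            let (tx, rx) = channel();
            to_workers.push(tx);
            workers.push(Comm::Worker { to_root: to_root.clone(), from_root: rx });
        }
        let mut comms = vec![Comm::Root { from_workers, to_workers }];
        comms.extend(workers);
        comms
    }

    /// Replace `data` on every rank with the element-wise sum over all ranks.
    pub fn all_reduce_sum(&self, data: &mut [f32]) {
        match self {
            Comm::Root { from_workers, to_workers } => {
                // Gather one partial vector from each non-root worker.
                for _ in 0..to_workers.len() {
                    let part = from_workers.recv().expect("worker hung up");
                    for (acc, x) in data.iter_mut().zip(part) {
                        *acc += x;
                    }
                }
                // Send the total back to every non-root worker.
                for tx in to_workers {
                    tx.send(data.to_vec()).expect("worker hung up");
                }
            }
            Comm::Worker { to_root, from_root } => {
                to_root.send(data.to_vec()).expect("root hung up");
                let total = from_root.recv().expect("root hung up");
                data.copy_from_slice(&total);
            }
        }
    }
}

/// Spawn one thread per partial vector, all-reduce, and return each rank's result.
pub fn run_workers(partials: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
    let comms = Comm::create(partials.len());
    let handles: Vec<_> = comms
        .into_iter()
        .zip(partials)
        .map(|(comm, mut data)| {
            thread::spawn(move || {
                comm.all_reduce_sum(&mut data);
                data
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

In a tensor-parallel layer, each worker would compute a partial output from its shard of the weights and then call something like `all_reduce_sum` so that every thread continues with the full result.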
