A deterministic data & execution substrate that enables reliable training.
Lazarus makes training runs reproducible the same way lockfiles make builds reproducible.
Offline batch plans, reproducible batching, measurable efficiency — the stuff you wish every training stack shipped with.
Runs on macOS; designed for Apple Silicon first (MLX backend).
The core idea: The BatchPlan is the contract. Trainers enforce it; they don't invent it. Build plans offline, version them, verify them in CI/CD (fingerprints + schema validation), and replay them exactly across distributed workers that share the same plan artifact. BatchPlans are fingerprinted against the tokenizer and length cache, so you can detect drift when data or tokenization changes.
Dataset → Tokenizer → Length Cache → BatchPlan Artifact → Trainer (enforces) → Checkpoints
(the tokenizer and the length cache are both fingerprinted into the BatchPlan artifact)
Most training pipelines entangle data loading, batching, and execution inside the trainer, making runs hard to reproduce, debug, or scale. Lazarus separates planning from execution: batching decisions are made once, recorded as artifacts, and enforced consistently across runs and workers.
No installation needed - run directly with uvx:
# Encode text to see how a tokenizer splits it
uvx chuk-lazarus tokenizer encode -t "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Hello, world!"
# Run a health check on any tokenizer
uvx chuk-lazarus tokenizer doctor -t "gpt2"
# Compare how two tokenizers handle the same text
uvx chuk-lazarus tokenizer compare -t1 "gpt2" -t2 "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --text "Machine learning is amazing"

# Install with uv (recommended)
uv add chuk-lazarus
# Or with pip
pip install chuk-lazarus
# For OpenAI tokenizers (gpt-4, gpt-3.5-turbo, o1, etc.)
uv add "chuk-lazarus[openai]"
# For faster tokenization (optional MLX backend)
uv add "chuk-lazarus[fast]"
# For the inference server and Python client
uv add "chuk-lazarus[server]"

After installation, use the chuk-lazarus command directly:
chuk-lazarus tokenizer encode -t "gpt2" --text "Hello"

# Encode text - see token IDs and boundaries
chuk-lazarus tokenizer encode -t "gpt2" --text "The quick brown fox"
# Decode token IDs back to text
chuk-lazarus tokenizer decode -t "gpt2" --ids "464,2068,7586,21831"
# Search the vocabulary
chuk-lazarus tokenizer vocab -t "gpt2" --search "hello"
# Compare two tokenizers
chuk-lazarus tokenizer compare -t1 "gpt2" -t2 "meta-llama/Llama-2-7b" --text "Test"
# Health check with auto-fix
chuk-lazarus tokenizer doctor -t "model-name" --fix
# Generate fingerprint for compatibility
chuk-lazarus tokenizer fingerprint -t "gpt2" --save fingerprint.json

# Coverage analysis - UNK rate, tokens per word
chuk-lazarus tokenizer analyze coverage -t "gpt2" --file corpus.txt
# Fit score - tokenizer-dataset compatibility (0-100)
chuk-lazarus tokenizer analyze fit-score -t "gpt2" --file corpus.txt
# Efficiency analysis - tokens per sample, fragmentation
chuk-lazarus tokenizer analyze efficiency -t "gpt2" --file corpus.txt

# Build a length cache (tokenize once, reuse lengths)
chuk-lazarus data lengths build -d train.jsonl -t "gpt2" -o lengths.jsonl
# Build a batch plan for reproducible training
chuk-lazarus data batchplan build -l lengths.jsonl -e 3 -b 4096 -o batch_plan/ --predictable
# Show batch plan info
chuk-lazarus data batchplan info -p batch_plan/ --show-batches 5
# Analyze batching efficiency
chuk-lazarus data batching analyze --cache lengths.jsonl --bucket-edges 128,256,512
# Run comprehensive pipeline benchmark
chuk-lazarus bench --num-samples 1000
chuk-lazarus bench -d train.jsonl -t gpt2 --bucket-edges 128,256,512
# Benchmark reports are saved as JSON + markdown for tracking regressions:
# length histogram, bucket efficiency, pack vs pad comparison,
# throughput metrics, memory footprint, and actionable recommendations

BatchPlans are recommended for production and distributed training; streaming batching (below) is intended for online, exploratory, or RL-style workloads.
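As a rough illustration of what the batching analysis measures, here is a minimal stand-in (not the chuk-lazarus implementation) for bucketed padding waste over a set of sequence lengths:

```python
from bisect import bisect_left

def bucket_waste(lengths: list[int], edges: list[int]) -> float:
    """Fraction of tokens that would be padding if every sequence is
    padded up to the smallest bucket edge that fits it."""
    padded = real = 0
    for n in lengths:
        i = bisect_left(edges, n)
        cap = edges[i] if i < len(edges) else n  # over the last edge: no padding
        padded += cap
        real += n
    return 1 - real / padded

# lengths 100 and 200 with edges 128,256,512 pad up to 128 and 256:
print(f"{bucket_waste([100, 200], [128, 256, 512]):.3f}")  # → 0.219
```

Lower waste from finer bucket edges versus more, smaller batches is exactly the trade-off the bench report quantifies.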
Minimal end-to-end deterministic pipeline:
chuk-lazarus data lengths build -d train.jsonl -t gpt2 -o lengths.jsonl
chuk-lazarus data batchplan build -l lengths.jsonl -e 1 -b 4096 -o batch_plan/ --predictable
chuk-lazarus train sft --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --data train.jsonl --batch-plan batch_plan/

Stream training data from the puzzle arcade server for online/RL training:
# Connect to puzzle gym and collect episodes
chuk-lazarus gym run --host localhost --port 8023 --puzzles sudoku,binary --episodes 100

from chuk_lazarus.data.batching.streaming import (
    TelnetGymClient, TelnetClientConfig, PuzzleGame, PuzzleDifficulty,
    ReplayBuffer, ReplayBufferConfig, StreamSample, SampleSource,
)

# Connect to puzzle server
config = TelnetClientConfig(host="localhost", port=8023)
async with TelnetGymClient(config) as client:
    # Start a puzzle
    obs = await client.start_puzzle(PuzzleGame.SUDOKU, PuzzleDifficulty.EASY)

    # Get optimal moves as training data
    hint = await client.get_hint()
    print(f"Next move: {hint.message}")

    # Collect into replay buffer
    buffer = ReplayBuffer(ReplayBufferConfig(max_size=10000))
    sample = StreamSample(
        input_ids=tokenize(prompt),  # tokenize/prompt/loss_mask come from your pipeline
        loss_mask=loss_mask,
        source=SampleSource.GYM,
        difficulty=0.3,
    )
    buffer.add(sample)

Supported puzzles: Sudoku, KenKen, Kakuro, Binary, Futoshiki, Nonogram, Logic Grid, Killer Sudoku, Lights Out, Mastermind, Slitherlink, Bridges, Hitori, Shikaku, Hidato, Tents, Fillomino, Star Battle, Sokoban, Knapsack, Nurikabe, Minesweeper.
Training in Lazarus is driven entirely by precomputed BatchPlans. The trainer does not decide batching, sequencing, or token budgets — it enforces them.
Invariant: If two runs use the same BatchPlan artifact (including its fingerprints) and seed, Lazarus guarantees identical batch structure and ordering across runs and workers.
Identical means: same sample IDs per step, in the same order, with the same packing boundaries and token budgets. (Numerical results may differ slightly across hardware/kernel implementations; the batch schedule remains identical.)
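A cheap way to exercise this invariant in CI, sketched here against the plan.jsonl layout described later in this README, is to digest the schedule each worker sees and compare:

```python
import hashlib
import json

def schedule_digest(plan_path: str) -> str:
    """Hash the (step, samples, packing) schedule recorded in a plan.jsonl."""
    h = hashlib.sha256()
    with open(plan_path) as f:
        for line in f:
            rec = json.loads(line)
            key = [rec["step"], rec["samples"], rec.get("packing")]
            h.update(json.dumps(key, sort_keys=True).encode())
    return h.hexdigest()

# Two workers sharing one artifact must agree:
# assert schedule_digest("worker_a/plan.jsonl") == schedule_digest("worker_b/plan.jsonl")
```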
# Canonical deterministic training (always use --batch-plan)
chuk-lazarus train sft \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--data train.jsonl \
--batch-plan batch_plan/ \
--use-lora
# Dev convenience (builds plan on the fly; still fingerprints and saves it)
chuk-lazarus train sft \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--data train.jsonl \
--build-plan --predictable \
--use-lora
# Train with DPO
chuk-lazarus train dpo --model ./checkpoints/sft/final --data preferences.jsonl --batch-plan batch_plan/
# Generate synthetic training data
chuk-lazarus generate --type math --output ./data/lazarus
# Run inference
chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?"
# KV-direct stateful engine
chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is 2+2?" --engine kv_direct

Pre-fill a document into a windowed KV checkpoint library, then query any part of it at generation time — no re-reading the source:
# Prefill: tokenize and build a windowed checkpoint library
lazarus context prefill \
--model google/gemma-3-4b-it \
--input shakespeare.txt \
--checkpoint ./shakespeare_ctx/ \
--window-size 512
# Interrupted? Resume automatically from where it left off
lazarus context prefill \
--model google/gemma-3-4b-it \
--input shakespeare.txt \
--checkpoint ./shakespeare_ctx/
# Generate: query the library with compass routing
lazarus context generate \
--model google/gemma-3-4b-it \
--checkpoint ./shakespeare_ctx/ \
--prompt "What does Hamlet say about death?" \
--max-tokens 200
# Residual modes control routing precision vs disk usage
lazarus context prefill \
--model google/gemma-3-4b-it \
--input document.txt \
--checkpoint ./ctx/ \
--residual-mode darkspace --frame-bank ./frame_bank.npz

Selective phases — run only the parts you need with --phases:
# Just prefill windows (skip interval + compass extraction)
lazarus context prefill --model google/gemma-3-4b-it --input shakespeare.txt \
--checkpoint ./ctx/ --phases windows
# Recalibrate compass on an existing library (~22 min vs ~32+30+22 for full)
lazarus context prefill --model google/gemma-3-4b-it --input shakespeare.txt \
--checkpoint ./ctx/ --phases compass
# Run interval extraction and compass together
lazarus context prefill --model google/gemma-3-4b-it --input shakespeare.txt \
--checkpoint ./ctx/ --phases interval,compass
# Add pages to an existing library
lazarus context prefill --model google/gemma-3-4b-it --input shakespeare.txt \
--checkpoint ./ctx/ --store-pages --phases pages

Available phases: windows, interval, compass, darkspace, pages, all (default). When --phases doesn't include windows, the prefill loop is skipped — the engine loads the existing library and runs only the requested extraction passes.
How it works: The prefill command splits the source into fixed-size windows, runs a forward pass on each, and saves boundary KV checkpoints plus residual vectors for routing. At generation time, the compass router uses residual similarity to find the right windows for a query, injects their KV checkpoints, and generates — all without re-reading or re-processing the source text.
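The two halves of that flow can be sketched in a few lines. This illustrates the idea only — fixed windows plus similarity routing over stored residuals — and shares no code with the actual compass router:

```python
import numpy as np

def split_windows(tokens: list[int], window: int = 512) -> list[list[int]]:
    """Fixed-size windows over the token stream; the last may be short."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def route(query_vec: np.ndarray, window_vecs: np.ndarray, top_k: int = 2) -> list[int]:
    """Rank stored per-window residuals by cosine similarity to a query residual."""
    q = query_vec / np.linalg.norm(query_vec)
    w = window_vecs / np.linalg.norm(window_vecs, axis=1, keepdims=True)
    return [int(i) for i in np.argsort(w @ q)[::-1][:top_k]]
```

At generation time, the selected windows' KV checkpoints are injected rather than re-running prefill over the source.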
Library layout:
shakespeare_ctx/
├── manifest.json — model ID, window size, total tokens, num windows
├── tokens.bin — raw token IDs (uint32, little-endian)
├── windows.json — per-window metadata: offsets, counts, text preview
├── checkpoints.npz — boundary KV per window (inject without re-prefill)
├── residuals.npz — Markov state vectors at window boundaries (also an injection path)
├── interval_residuals.npz — 8 interior residuals per window (sub-window retrieval)
├── compass_residuals.npz — commitment-layer residuals for routing
└── compass_basis.npz — PCA basis for compass routing (auto-calibrated)
| Residual mode | What's saved | Use case |
|---|---|---|
| interval (default) | 8 samples/window + compass PCA | General-purpose, good balance of precision and disk |
| full | Every position + compass PCA | Maximum retrieval precision |
| darkspace | Whitened frame bank projections | Cross-corpus routing with pre-computed frame banks |
| none | Checkpoints + metadata only | Minimal disk, no routing |
Reliability features:
- Incremental saves every 5 minutes — only new windows are appended, not rewritten
- Memory eviction — saved checkpoints are freed from GPU memory, keeping usage constant
- Two-stage Ctrl-C — first interrupt saves gracefully, second hard-exits
- Automatic resume — partial libraries are detected and continued from the last window
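Automatic resume reduces to one question: how many windows are already on disk? A minimal sketch, assuming windows.json holds one record per completed window (an assumption about the layout above, not a documented contract):

```python
import json
import os

def resume_index(checkpoint_dir: str) -> int:
    """Index of the first window still to prefill; 0 for a fresh run."""
    path = os.path.join(checkpoint_dir, "windows.json")
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return len(json.load(f))
```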
Compress a document into a lightweight knowledge store using TF-IDF routing and 1D vector injection — then query or chat against it without re-reading the source:
# Build: tokenize, prefill windows, extract injection entries
lazarus knowledge build \
--model google/gemma-3-4b-it \
--input article.txt \
--output ./article_kb/
# Query: single-shot question with TF-IDF routing
lazarus knowledge query \
--model google/gemma-3-4b-it \
--store ./article_kb/ \
--prompt "Who won the competition?"
# Chat: interactive multi-turn conversation grounded in the document
lazarus knowledge chat \
--model google/gemma-3-4b-it \
--store ./article_kb/

How it works: The build command splits the document into fixed-size windows (default 512 tokens), runs a forward pass on each, and extracts injection entries via K-norm sampling — keeping only the tokens with the highest addressing energy in the model's attention. At query time, TF-IDF routing selects the best windows, their boundary residuals are injected as Markov state, and focused context replay provides the narrative grounding for generation.
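The routing step can be pictured with a toy bag-of-words TF-IDF scorer. This is a stand-in for intuition only and shares none of the store's code:

```python
import math
from collections import Counter

def tfidf_route(query: str, windows: list[str], top_k: int = 1) -> list[int]:
    """Score each window's word bag against the query; return best window indices."""
    docs = [Counter(w.lower().split()) for w in windows]
    n = len(docs)

    def idf(term: str) -> float:
        df = sum(1 for d in docs if term in d)  # document frequency
        return math.log((1 + n) / (1 + df)) + 1

    scores = []
    for d in docs:
        total = sum(d.values()) or 1
        scores.append(sum((d[t] / total) * idf(t) for t in query.lower().split()))
    return sorted(range(n), key=lambda i: -scores[i])[:top_k]
```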
Store layout:
article_kb/
├── manifest.json — model ID, window size, total tokens
├── kv_index.npz — K-vectors + token IDs + coefficients (518 bytes/entry)
├── final_residual.npy — document Markov state (~10 KB)
├── tokens.bin — token archive (replay fallback)
└── windows.json — window boundaries + preview text
Storage: A 370K-token document produces an ~12 MB store vs ~56 GB for a full KV cache — roughly 4,000× compression.
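Those figures are easy to sanity-check. The numbers below are the README's own (370K tokens, 12 MB store, 56 GB full cache, 518 bytes/entry), not re-measured here:

```python
# Store size vs full KV cache for the same 370K-token document.
tokens = 370_000
store_bytes = 12e6       # ~12 MB knowledge store
full_kv_bytes = 56e9     # ~56 GB full KV cache

entries = store_bytes / 518          # at 518 bytes per kv_index entry
ratio = full_kv_bytes / store_bytes
kept = entries / tokens
print(f"~{entries:,.0f} entries, ~{kept:.0%} of tokens kept, ~{ratio:,.0f}x smaller")
```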
The UnifiedPipeline auto-detects model family and provides a simplified API. One-liner setup, no boilerplate:
from chuk_lazarus.inference import UnifiedPipeline, UnifiedPipelineConfig, DType
# One-liner model loading - auto-detects family!
pipeline = UnifiedPipeline.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Simple chat API
result = pipeline.chat("What is the capital of France?")
print(result.text)
print(result.stats.summary) # "25 tokens in 0.42s (59.5 tok/s)"
print(f"Model family: {pipeline.family_type}") # ModelFamilyType.LLAMA

Key features:
- Auto-detection of model family from HuggingFace config
- Typed configuration with Pydantic (UnifiedPipelineConfig, GenerationConfig)
- Async support (UnifiedPipeline.from_pretrained_async)
- Chat history management (ChatHistory)
- Streaming generation (generate_stream)
- No magic strings - uses enums (DType, Role, ModelFamilyType)
- Engine selection via EngineMode (STANDARD, KV_DIRECT) — stateful generation with explicit KV store control
# Simplified inference examples
uv run python examples/inference/simple_inference.py --prompt "Write a haiku"
uv run python examples/inference/llama_inference.py --model smollm2-360m
uv run python examples/inference/granite_inference.py --model granite-3.1-2b
uv run python examples/inference/gemma_inference.py --chat

Run inference with specific model families:
# Llama family (TinyLlama, SmolLM2, Llama 2/3, Mistral)
uv run python examples/inference/llama_inference.py --model tinyllama
uv run python examples/inference/llama_inference.py --model smollm2-360m
uv run python examples/inference/llama_inference.py --list # Show all presets
# Gemma 3 (1B, 4B, 12B, 27B with 128K context)
uv run python examples/inference/gemma_inference.py --chat
uv run python examples/inference/gemma_inference.py --model gemma-3-4b
# Granite (IBM, dense and hybrid MoE variants)
uv run python examples/inference/granite_inference.py --model granite-3.1-2b
# Llama 4 Scout (Hybrid Mamba-Transformer MoE)
uv run python examples/inference/llama4_inference.py
# StarCoder2 (Code generation, 3B/7B/15B)
uv run python examples/inference/starcoder2_inference.py --prompt "def fibonacci(n):"
uv run python examples/inference/starcoder2_inference.py --interactive # Interactive mode
# Jamba (Hybrid Mamba-Transformer MoE, 256K context)
uv run python examples/inference/jamba_inference.py --test-tiny # Test without download
uv run python examples/inference/jamba_inference.py --list # Show models

Run function calling inference with Google's FunctionGemma model:
# Run FunctionGemma for tool use / function calling
uv run python examples/models/gemma/01_functiongemma_inference.py

FunctionGemma is a 270M parameter model optimized for on-device function calling, supporting:
- Tool use / API calling
- MCP (Model Context Protocol) integration
- Lightweight RAG pipelines
- On-device agents
See docs/inference.md for detailed inference documentation.
Serve any model as an OpenAI-compatible HTTP API — works with mcp-cli, LangChain, the openai SDK, and any other OpenAI-compatible client:
# Install server dependencies
uv add "chuk-lazarus[server]"
# Start the server (downloads model on first run)
lazarus serve --model google/gemma-3-4b-it
# With authentication
lazarus serve --model google/gemma-3-4b-it --api-key mysecret

The server starts at http://localhost:8080. OpenAI-compatible endpoints are at /v1:
# curl
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-4b-it", "messages": [{"role": "user", "content": "Hello!"}]}'
# openai SDK
python -c "
from openai import OpenAI
client = OpenAI(base_url='http://localhost:8080/v1', api_key='lazarus')
print(client.chat.completions.create(
model='google/gemma-3-4b-it',
messages=[{'role': 'user', 'content': 'Hello!'}]
).choices[0].message.content)
"

Or use the built-in Python client library:
from chuk_lazarus.client import LazarusClient, ChatMessage, ClientRole

with LazarusClient() as client:
    # Non-streaming
    response = client.chat(
        model="google/gemma-3-4b-it",
        messages=[ChatMessage(role=ClientRole.USER, content="Hello!")],
    )
    print(response.content)

    # Streaming
    for chunk in client.stream_chat(
        model="google/gemma-3-4b-it",
        messages=[ChatMessage(role=ClientRole.USER, content="Count to ten.")],
    ):
        print(chunk, end="", flush=True)

Use with mcp-cli:
# Start the server
lazarus-serve --model google/gemma-3-4b-it --api-key lazarus
# In another terminal
mcp-cli chat --provider lazarus --server time --model google/gemma-3-4b-it

See docs/server.md for the full server guide and docs/client.md for the client library.
Analyze model behavior using logit lens, ablation studies, attention visualization, and MoE expert identification:
# Run logit lens analysis - see how predictions evolve across layers
chuk-lazarus introspect analyze -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 -p "The capital of France is"
# Track specific tokens through layers
chuk-lazarus introspect analyze -m model -p "Hello" --track "world,there" --layer-strategy all
# Compare two models' predictions
chuk-lazarus introspect compare -m1 google/gemma-3-270m-it -m2 google/functiongemma-270m-it -p "Get the weather" --track "get_"
# Ablation study - find causal circuits
chuk-lazarus introspect ablate -m model -p "What's the weather?" -c function_call --layers 8-15
# Multi-layer ablation - test layers together
chuk-lazarus introspect ablate -m model -p "45 * 45 = " -c "2025" --layers 22,23 --multi
# Test if task type is baked into embeddings (RLVF hypothesis)
chuk-lazarus introspect embedding -m model
# Analyze operand encoding structure (holistic vs compositional)
chuk-lazarus introspect operand-directions -m model
# Test commutativity (lookup table vs algorithm)
chuk-lazarus introspect commutativity -m model
# Activation patching between prompts
chuk-lazarus introspect patch -m model --source "7*8=" --target "7+8="
# Low-level hook demonstration
chuk-lazarus introspect hooks -m model -p "Test" --layers 0,4,8 --capture-attention
# MoE Expert Analysis (for MoE models like GPT-OSS, Mixtral, Llama 4)
chuk-lazarus introspect moe-expert analyze -m openai/gpt-oss-20b
chuk-lazarus introspect moe-expert heatmap -m openai/gpt-oss-20b -p "def fib(n):"
chuk-lazarus introspect moe-expert pipeline -m openai/gpt-oss-20b --num-prompts 20
chuk-lazarus introspect moe-expert vocab-contrib -m openai/gpt-oss-20b --top-k 30
chuk-lazarus introspect moe-expert compression -m openai/gpt-oss-20b --threshold 0.8
# Circuit Graph Export
chuk-lazarus introspect circuit export -i ablation_results.json -o circuit.html -f html
chuk-lazarus introspect circuit export -i ablation_results.json -o circuit.dot -f dot

MoE Expert Identification - Discover what each expert specializes in:
from mlx_lm import load
from chuk_lazarus.introspection import ExpertIdentifier, identify_experts
# Load any MoE model
model, tokenizer = load("openai/gpt-oss-20b")
# Identify all experts in a layer
result = identify_experts(model, tokenizer, layer_idx=12)
print(result.summary())
# Results show expert specializations:
# CODE: Experts [1, 14, 22, 23, 27, 28]
# MATH: Experts [6, 7, 19, 24, 30, 31]
# CONTENT_WORDS: Experts [0, 2, 3, 4, 5, 8, 9, ...]
# NAMES: Experts [15, 26]
# Get detailed identity for specific expert
expert_6 = result.expert_identities[6]
print(expert_6.detailed_report())
# Expert 6: math (52% confidence)
# Top tokens: ['+', '2', 'x', '3', ...]
# Semantic clusters: ['numeric_values']

MoE Routing Analysis - Capture and analyze routing decisions:
from chuk_lazarus.introspection import MoEHooks, MoECaptureConfig

hooks = MoEHooks(model)
hooks.configure(MoECaptureConfig(
    capture_router_logits=True,
    capture_selected_experts=True,
))
logits = hooks.forward(input_ids)

# Analyze routing
utilization = hooks.get_expert_utilization(layer_idx=12)
print(f"Load balance: {utilization.load_balance_score:.2%}")
entropy = hooks.get_router_entropy(layer_idx=12)
print(f"Router confidence: {1 - entropy.normalized_entropy:.2%}")

Logit Lens and Ablation:
from chuk_lazarus.introspection import ModelAnalyzer, AnalysisConfig, LayerStrategy

# Async API for logit lens analysis
async with ModelAnalyzer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0") as analyzer:
    result = await analyzer.analyze("The capital of France is")
    print(result.predicted_token)  # " Paris"
    for layer in result.layer_predictions:
        print(f"Layer {layer.layer_idx}: {layer.top_token}")

    # Track token evolution
    config = AnalysisConfig(track_tokens=["Paris", " Paris"])
    result = await analyzer.analyze("The capital of France is", config)
    for evo in result.token_evolutions:
        print(f"{evo.token} emerges at layer {evo.emergence_layer}")

from chuk_lazarus.introspection import AblationStudy, AblationConfig
# Ablation studies - identify causal circuits
study = AblationStudy.from_pretrained("openai/gpt-oss-20b")
config = AblationConfig(max_new_tokens=15)
original = study.ablate_and_generate("45 * 45 = ", layers=[], config=config)
ablated = study.ablate_and_generate("45 * 45 = ", layers=[22, 23], config=config)
print(f"Original: {original}") # "2025..."
print(f"L22+L23 ablated: {ablated}")  # Broken output

See docs/introspection.md for detailed introspection documentation.
Visualize how the model "navigates" from a blank residual to a committed answer — rendered as a 3D sphere in your browser.
How it works:
The top-3 singular vectors of the fact unembedding directions (rows of lm_head.weight) form a 3D basis that spans "fact space". Every fact token is projected onto this basis and normalized to sit on the unit sphere surface. The residual stream at each layer is projected into the same space — its direction shows which fact the model is pointing toward, while its sharpness (concentration of logit-lens probabilities over fact tokens) is encoded as dot size and glow.
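The projection itself is small. A numpy sketch of the basis construction and sphere mapping just described (function names are illustrative, not the extractor's API):

```python
import numpy as np

def fact_basis(fact_dirs: np.ndarray) -> np.ndarray:
    """Top-3 right singular vectors of the fact unembedding rows: a 3D 'fact space' basis."""
    _, _, vt = np.linalg.svd(fact_dirs, full_matrices=False)
    return vt[:3]  # shape (3, hidden)

def to_sphere(hidden: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project a residual-stream vector into fact space, then normalize onto the unit sphere."""
    xyz = basis @ hidden
    return xyz / np.linalg.norm(xyz)
```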
Three phases are automatically detected from the mean sharpness curve:
| Phase | Layers | What's happening |
|---|---|---|
| Dark accumulation | 0 → b₁ | Residual is nearly orthogonal to all facts — dim speck wandering the sphere |
| Routing | b₁ → b₂ | Sharpness rising — dot brightens and moves toward the target landmark |
| Fact explosion | b₂ → end | Sharpness saturates — large bright orb locks onto the fact landmark |
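Boundary detection from the mean sharpness curve can be sketched as two threshold crossings. The thresholds below are illustrative defaults, not the tool's actual values:

```python
import numpy as np

def phase_boundaries(sharpness: np.ndarray, lo: float = 0.1, hi: float = 0.8) -> tuple[int, int]:
    """First layers where normalized sharpness crosses lo (routing begins)
    and hi (fact explosion begins)."""
    s = sharpness / sharpness.max()
    return int(np.argmax(s >= lo)), int(np.argmax(s >= hi))
```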
Step 1 — Extract activation data:
# Capital of France (default facts, 40 generation steps)
uv run python examples/inference/nav_map_extract.py \
--prompt "The largest planet in the solar system is" \
--target Jupiter \
--facts Jupiter Mars Venus Earth Sun Moon \
--steps 40 --output nav_map_jupiter.json
# Custom prompt, smaller model for faster iteration
uv run python examples/inference/nav_map_extract.py \
--model mlx-community/gemma-3-1b-it-bf16 \
--prompt "The capital of France is" --target Paris \
--steps 20 --output nav_map_paris.json

Step 2 — Open the visualizer:
open examples/inference/nav_map.html
# Then drag nav_map_*.json onto the page

Two view modes:
- Layer depth — fix a generation step, animate layer 0 → 33. Watch the residual crystallize: a tiny dim speck in early layers grows into a bright orb that arrives at the target fact landmark.
- Token drift — fix a layer (try 33 for the committed output), animate step 0 → N. The dot walks across the sphere token-by-token: bright near the target fact when predicting it, dimmer and wandering when generating filler tokens.
Controls:
| Key | Action |
|---|---|
| ← / → | Scrub the active axis (layer in Layer depth, step in Token drift) |
| ↑ / ↓ | Scrub the pinned axis |
| Space | Play / pause |
| d | Switch to Layer depth mode |
| t | Switch to Token drift mode |
| Mouse drag | Orbit the sphere |
Output format (nav_map_*.json, version 2.0):
{
"version": "2.0",
"model": "mlx-community/gemma-3-4b-it-bf16",
"phase_boundaries": [11, 22],
"facts": [{"token": "Jupiter", "xyz": [...], "color": "#E24B4A", "is_target": true}],
"frames": [{
"step": 0, "token": " is", "predicted_token": " Jupiter",
"layers": [{"layer": 0, "xyz": [...], "sharpness": 0.001, "top_token": "the", ...}]
}]
}

Note: --target and --facts must be single tokens in the model vocabulary. Multi-token entries are skipped with a warning. Check which facts survived in the terminal output before interpreting the map.
from chuk_lazarus.utils.tokenizer_loader import load_tokenizer
from chuk_lazarus.data.tokenizers.analyze import analyze_coverage, calculate_fit_score
from chuk_lazarus.data.tokenizers.fingerprint import compute_fingerprint
# Load any HuggingFace tokenizer
tokenizer = load_tokenizer("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Analyze coverage on your corpus
texts = ["Your training data...", "More examples..."]
coverage = analyze_coverage(texts, tokenizer)
print(f"UNK rate: {coverage.unk_rate:.2%}")
print(f"Tokens per word: {coverage.tokens_per_word:.2f}")
# Calculate fit score
fit = calculate_fit_score(texts, tokenizer)
print(f"Fit score: {fit.score}/100 ({fit.grade})")
# Generate fingerprint for compatibility checks
fp = compute_fingerprint(tokenizer)
print(f"Fingerprint: {fp.fingerprint}")

src/chuk_lazarus/
├── cli/ # Command-line interface
├── data/
│ ├── batching/ # Token-budget batching, packing, distributed planning
│ │ ├── core/ # Bucketing, sampling, metrics
│ │ ├── planning/ # Batch plans, predictability, packing
│ │ ├── generation/ # Batch I/O, length caching
│ │ ├── streaming/ # Gym integration, replay buffers
│ │ └── analyze/ # Efficiency analysis
│ ├── samples/ # Sample schema and validation
│ ├── tokenizers/ # Tokenizer toolkit (analysis, preprocessing, runtime)
│ └── generators/ # Synthetic data generation
├── models_v2/ # Composable model architecture
│ ├── core/ # Backend, config, enums, registry
│ ├── components/ # Attention, FFN, embeddings, SSM, recurrent
│ ├── blocks/ # Transformer, Mamba, recurrent, hybrid blocks
│ ├── backbones/ # Stacks of blocks with embeddings
│ ├── heads/ # LM, classifier, regression heads
│ ├── models/ # CausalLM, classifiers
│ ├── families/ # Llama, Mamba implementations
│ ├── adapters/ # LoRA adapters
│ └── losses/ # Loss functions (pure math)
├── training/ # BatchPlan-driven reference trainers (SFT, DPO, GRPO, PPO)
├── inference/ # Unified inference pipeline
│ ├── unified.py # UnifiedPipeline, EngineMode, make_engine()
│ ├── loader.py # HFLoader, DType, WeightConverter
│ ├── chat.py # ChatHistory, Role, format_chat_prompt
│ ├── generation.py # GenerationConfig, generate, generate_stream
│ └── context/ # Stateful KV generation, checkpointing, long-context engines
│ ├── protocols.py # ModelBackboneProtocol, TransformerLayerProtocol
│ ├── adapters/ # GemmaBackboneAdapter, LlamaBackboneAdapter
│ ├── kv_generator.py # KVDirectGenerator + make_kv_generator() factory
│ ├── kv_checkpoint.py # KVCheckpoint — save/resume prefill state to disk
│ ├── checkpoint_library.py # CheckpointLibrary — pre-filled knowledge bases
│ ├── bounded_engine.py # BoundedKVEngine (Mode 3, HOT/WARM/COLD tiers)
│ └── unlimited_engine.py # UnlimitedContextEngine (Mode 4, window chaining)
├── introspection/ # Model introspection and analysis
│ ├── analyzer.py # ModelAnalyzer async API with Pydantic models
│ ├── hooks.py # ModelHooks for capturing intermediate states
│ ├── logit_lens.py # Layer-by-layer prediction analysis
│ ├── attention.py # Attention pattern analysis
│ ├── moe.py # MoE introspection (routing, expert identification)
│ ├── ablation/ # Ablation studies for causal discovery
│ └── visualizers/ # Heatmaps and evolution plots
├── distributed/ # Distributed training utilities
└── utils/ # Utilities
| Module | Description |
|---|---|
| Models | Composable architecture: components, blocks, backbones, heads, families (Llama, Gemma, Granite) |
| Inference | UnifiedPipeline with auto-detection, chat history, streaming; KVDirectGenerator (model-agnostic via Gemma/Llama adapters) with make_kv_generator() factory; KVCheckpoint for resumable prefill and offline context loading |
| Introspection | Model analysis: logit lens, attention visualization, MoE expert identification, ablation studies |
| Tokenizers | Comprehensive toolkit for analysis, preprocessing, and runtime management |
| Batching | Token-budget batching, sequence packing, distributed batch planning |
| Streaming | Puzzle arcade integration, replay buffers, online learning |
| Training | BatchPlan-driven trainers — enforce, don't decide |
- Introspection: Logit lens, attention visualization, MoE expert identification, ablation studies, token evolution tracking
- Tokenizer Toolkit: Encode, decode, analyze, compare, fingerprint, and debug any tokenizer
- Character Tokenizer: Built-in character-level tokenizer for classification experiments
- Tokenizer Doctor: Health check with auto-fix for missing chat templates
- Chat Template Registry: 7 built-in formats (ChatML, Llama, Phi, Gemma, Zephyr, Vicuna, Alpaca)
- Batching Infrastructure: Token-budget batching, sequence packing (measurable via chuk-lazarus bench)
- BatchPlan Artifacts: Versioned, fingerprinted batch schedules for reproducibility and CI/CD
- Pipeline Benchmark: Pack vs pad comparison, throughput metrics, memory footprint analysis
- BatchPlan-Driven Training: Trainers enforce plans, not build them — deterministic by design
- Puzzle Arcade Integration: Stream training data from 24 puzzle types for online/RL learning
- Replay Buffers: Priority sampling, difficulty tracking, curriculum support
- Analysis: Coverage, entropy, efficiency, fit scoring, vocabulary induction
- Instrumentation: Histograms, OOV analysis, waste metrics, vocab comparison
What Lazarus is NOT:
- Not a trainer framework competing with Lightning/Accelerate
- Not a new optimizer zoo or model architecture lab
- Not a "magic trainer" that decides things for you
What Lazarus IS: A reproducible planning/execution substrate you can plug into anything.
BatchPlans are the core artifact. When you build a batch plan, Lazarus creates:
batch_plan/
├── plan.jsonl # Batch schedule: sample IDs, packing, token counts per step
├── metadata.json # Epochs, token budget, strategy, version info
├── fingerprints.json # Tokenizer + length cache fingerprints for drift detection
└── stats.json # Efficiency metrics: utilization, waste, packing ratio
Schema promise: The plan.jsonl format is stable. Each line is a JSON object:
{"step":0,"samples":[12,88,104],"tokens":4096,"packing":[[0,128],[128,256]]}

Fields: step (global index), samples (sample IDs), tokens (batch total), packing (boundaries).
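A consumer can enforce that schema with a few assertions. This is a sketch, not the chuk-lazarus validator:

```python
import json

def check_plan_record(line: str, expected_step: int) -> dict:
    """Validate one plan.jsonl line against the schema above."""
    rec = json.loads(line)
    assert rec["step"] == expected_step, "steps must be contiguous from 0"
    spans = rec["packing"]
    assert all(a < b for a, b in spans), "packing spans must be non-empty"
    assert all(spans[i][0] == spans[i - 1][1] for i in range(1, len(spans))), \
        "packing spans must tile the batch without gaps"
    assert rec["tokens"] >= spans[-1][1], "token budget must cover the packed length"
    return rec

# The example line above passes:
rec = check_plan_record(
    '{"step":0,"samples":[12,88,104],"tokens":4096,"packing":[[0,128],[128,256]]}', 0
)
```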
metadata.json includes:
- plan_format_version: Schema version for forward compatibility
- tool_version: Lazarus version that created the plan
- seed: Random seed used (if predictable mode)
- created_at: Timestamp
CI/CD validation:
# Validate a plan artifact before training (CI-friendly)
chuk-lazarus data batchplan validate -p batch_plan/ --strict

If the tokenizer or data changes, fingerprint mismatch is detected before training starts.
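The mismatch check amounts to comparing stable digests of the tokenizer and length-cache configuration. An illustrative sketch with made-up config fields, not the actual fingerprint algorithm:

```python
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable short digest of a config-like dict (key order doesn't matter)."""
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

# Fingerprint stored in the plan vs the tokenizer loaded at train time:
plan_fp = fingerprint({"vocab_size": 50257, "model_max_length": 1024})
drifted = plan_fp != fingerprint({"vocab_size": 50257, "model_max_length": 2048})
print("drift detected" if drifted else "fingerprints match")
```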
- Getting Started - Installation and quick reference
- CLI Reference - Command-line interface documentation
- Models Guide - Composable model architecture, components, LoRA adapters
- Inference Guide - Run inference with pretrained HuggingFace models
- Inference Server - OpenAI-compatible HTTP server for serving models
- Client Library - Python client for the inference server
- Introspection Guide - Logit lens, attention visualization, model analysis
- Tokenizers Guide - Comprehensive tokenizer toolkit
- Batching Guide - Token-budget batching, packing, distributed training
- Training Guide - BatchPlan-driven training
- API Reference - Python API documentation
| Family | Models | Notes |
|---|---|---|
| Llama | TinyLlama, Llama 2 (7B, 13B), Llama 3.1/3.2, Llama 4 Scout | Llama 4 uses Mamba-Transformer hybrid |
| SmolLM2 | 135M, 360M, 1.7B | No auth required, fast inference |
| Mistral | 7B Instruct v0.3 | Sliding window attention |
| Gemma | Gemma 3 (270M, 1B, 4B, 12B, 27B), FunctionGemma | 128K context, function calling |
| Granite | 3.0/3.1 (2B, 8B), 4.0 Tiny (1B, 1.5B MoE) | IBM, dense and MoE variants |
| StarCoder2 | 3B, 7B, 15B | Code generation |
| Jamba | v0.1, 1.5 Mini (52B), 1.5 Large (398B) | AI21 hybrid Mamba-Transformer MoE, 256K context |
| Mamba | 130M, 370M, 790M, 1.4B, 2.8B | Pure SSM architecture |
Support for OpenAI's tokenizers via tiktoken:
uvx "chuk-lazarus[openai]" tokenizer encode -t "gpt-4" --text "Hello, world!"
uvx "chuk-lazarus[openai]" tokenizer compare -t1 "gpt-4" -t2 "gpt-4o" --text "Test"

Supported: gpt-4, gpt-4-turbo, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, o1, o1-mini, o3-mini
MIT