GuideLLM v0.3.1

Overview

A minor release focused on container build and tagging stability, UI polish and terminology alignment, improved OpenAI backend robustness and configurability, clearer JSON output, and new documentation (a llama.cpp usage guide and a vLLM simulator walkthrough). Workflows now produce versioned artifacts and maintain the latest and stable tags automatically.

To get started, install with:

pip install guidellm[recommended]==0.3.1

Or from source with:

pip install 'guidellm[recommended] @ git+https://github.com/vllm-project/guidellm.git@v0.3.1'

What's New

  • Recommended Extras Group: Install OpenAI tokenizer dependencies via guidellm[recommended] (tiktoken, blobfile)
  • llama.cpp Guide: New docs covering llama-server, model aliasing, and metadata handling
  • vLLM Simulator Example: Step-by-step “first benchmark” walkthrough with sample output images (see the example commands after this list)
  • Container Maintenance Workflow: Scheduled cleanup of old PR images; auto-retag latest and stable
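
As a rough illustration of pairing the two new guides, the commands below start llama.cpp's server with a model alias and then point a short GuideLLM run at it. The flags, data spec, and port are illustrative assumptions; follow the linked guides for exact usage:

# serve a GGUF model locally under an alias (llama-server flags assumed; see the llama.cpp guide)
llama-server -m ./model.gguf --alias my-model --port 8080

# run a short benchmark against the local OpenAI-compatible endpoint (options illustrative; see the walkthrough)
guidellm benchmark --target http://localhost:8080 --rate-type sweep --max-seconds 30 --data "prompt_tokens=256,output_tokens=128"

The same guidellm benchmark invocation works against the vLLM simulator from the new walkthrough; only the --target URL changes.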

What's Changed

  • UI Polish: Clearer labels (e.g., “Time Per Request”, “Measured RPS (Mean)”) and slider text
  • Versioned Reports: PROD/STAGING report URLs pinned to versioned UI builds
  • Container Build System: New top-level Containerfile using a Fedora minimal Python base image and PDM; build type selected via GUIDELLM_BUILD_TYPE (example below)
  • Metrics JSON Output: UTF-8 encoding with pretty-printed, indented JSON
  • Endpoint Max Tokens Keys: Output-token limit now governed per-endpoint via GUIDELLM__OPENAI__MAX_OUTPUT_KEY

What's Fixed

  • Streaming Robustness: Safely handle missing delta.content for chat streams
  • Endpoint Token Keys: Configurable max output key per endpoint (max_tokens vs max_completion_tokens)
  • CI Stability: Fixes to RC tagging, GH Pages publish paths, and workflow typos; disable dry-run for image cleanup

Compatibility Notes

  • Python: 3.9–3.13
  • OS: Linux and macOS
  • Dependencies: Optional extras via guidellm[recommended]; currently includes packages for OpenAI's tokenizer but may expand in the future
  • Breaking: Previously, all endpoints used both max_tokens and max_completion_tokens to bound output, which caused errors with some servers
    • The key is now controlled per endpoint (defaulting to max_tokens for legacy completions and max_completion_tokens for chat/completions); see the override example after these notes
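
If a particular server rejects one of these keys, the per-endpoint default can be overridden through GuideLLM's settings environment variables. The JSON shape and endpoint name below are assumptions for illustration; consult the backend configuration docs for the exact format:

export GUIDELLM__OPENAI__MAX_OUTPUT_KEY='{"chat_completions": "max_tokens"}'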

Changelog

  • UI & Presentation
    • #386: Update TPOT to ITL across labels and code
    • #298: Update RPS slider label
    • #301: Fix GH Pages UI publish path (src/ui/out)
    • #317: Correct type hint to fix Pydantic serialization warning
  • Backend
    • #399: Make max_tokens/max_completion_tokens key configurable per endpoint
    • #316: Handle missing content in streaming delta
  • Containers & CI
    • #254: Overhaul container image and CI (new top-level Containerfile, PDM build)
    • #379: Container CI bugfix and disable dry-run on image cleaner
    • #310: Use versioned builds (and version-pinned report links)
    • #389: Fix container RC tag
    • #398: Fix container RC tag (Attempt 2)
    • #400: Fix failing CI
    • #401: Fix typo in CI
    • #301: Correct UI src path in workflows (publish_dir)
  • Output & Tooling
    • #372: Pretty-print and UTF-8 encode metrics JSON files
  • Documentation
    • #318: Add documentation on how to use with llama.cpp
    • #328: Add “first benchmark testing example” (vLLM simulator)
  • Packaging
    • #313: Add recommended extras group

Changelog link: v0.3.0...v0.3.1