Releases: auroraGPT-ANL/inference-gateway
Inference Gateway for FIRST v0.1.0 - Initial Public Release
This is the initial public release of the FIRST (Federated Inference Resource Scheduling Toolkit) Inference Gateway.
Key Features:
- OpenAI-Compatible API: Provides a familiar interface (`/v1/chat/completions`) for interacting with large language models on HPC systems.
- Globus Integration: Leverages Globus Auth for secure user authentication and authorization, and Globus Compute for orchestrating inference tasks on remote compute resources (HPC clusters, workstations).
- Federated & Direct Endpoint Routing: Supports routing requests to a specific backend endpoint or automatically selecting one from a pool of federated resources.
- Flexible Backend Support: Designed to work with various inference servers, with initial support and examples focused on vLLM.
- Deployment Options: Includes instructions and configurations for deployment with Docker (recommended) or on bare metal.
- Comprehensive Setup Guide: A detailed README.md covering prerequisites, gateway setup, backend setup (including Globus Compute function registration and endpoint configuration), and verification steps.
- Authentication Helper: A CLI script (`inference-auth-token.py`) that simplifies obtaining Globus access tokens for API interaction.
- Basic Monitoring: An optional Docker Compose setup for Prometheus and Grafana.
- Benchmarking Script: A tool to load-test deployed endpoints.
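Because the gateway exposes an OpenAI-style endpoint, any standard HTTP client can talk to it. Below is a minimal sketch in Python using only the standard library; the gateway URL and model name are illustrative assumptions, not values shipped with this release, and the bearer token would come from the bundled `inference-auth-token.py` helper:

```python
import json
import urllib.request

# Hypothetical gateway URL -- substitute your own deployment's address.
GATEWAY_URL = "https://gateway.example.org/v1/chat/completions"

def build_chat_request(token: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for the gateway.

    `token` is a Globus access token, e.g. one obtained with the
    inference-auth-token.py helper script.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending the request requires a running gateway and a valid token:
# with urllib.request.urlopen(build_chat_request(token, "my-model", "Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client library should work the same way, pointed at the gateway's base URL with the Globus token supplied as the API key.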
This release establishes the core functionality for securely exposing LLM inference capabilities from diverse compute resources via a standardized API. It's suitable for teams looking to provide managed access to LLMs running on institutional clusters or powerful local machines.