Inference Routing

Inference routing gives sandboxed agents access to LLM APIs through a single, explicit endpoint: inference.local. There is no implicit catch-all interception for arbitrary hosts. Requests are routed only when the process targets inference.local via HTTPS and the request matches a supported inference API pattern.

All inference execution happens locally inside the sandbox via the openshell-router crate. The gateway is control-plane only: it stores configuration and delivers resolved route bundles to sandboxes over gRPC.

Architecture Overview

sequenceDiagram
    participant Agent as Agent Process
    participant Proxy as Sandbox Proxy
    participant Router as openshell-router
    participant Gateway as Gateway (gRPC)
    participant Backend as Inference Backend

    Note over Gateway,Router: Control plane (startup + periodic refresh)
    Gateway->>Router: GetInferenceBundle (routes, credentials)

    Note over Agent,Backend: Data plane (per-request)
    Agent->>Proxy: CONNECT inference.local:443
    Proxy->>Proxy: TLS terminate (MITM)
    Proxy->>Proxy: Parse HTTP, detect pattern
    Proxy->>Router: proxy_with_candidates()
    Router->>Router: Select route by protocol
    Router->>Router: Rewrite auth + model
    Router->>Backend: HTTPS request
    Backend->>Router: Response headers + body stream
    Router->>Proxy: StreamingProxyResponse (headers first)
    Proxy->>Agent: HTTP/1.1 headers (chunked TE)
    loop Each body chunk
        Router->>Proxy: chunk via next_chunk()
        Proxy->>Agent: Chunked-encoded frame
    end
    Proxy->>Agent: Chunk terminator (0\r\n\r\n)

Provider Profiles

File: crates/openshell-core/src/inference.rs

InferenceProviderProfile is the single source of truth for provider-specific inference knowledge: default endpoint, supported protocols, credential key lookup order, auth header style, and default headers.

Three profiles are defined:

Provider	Default Base URL	Protocols	Auth	Default Headers
`openai`	`https://api.openai.com/v1`	`openai_chat_completions`, `openai_completions`, `openai_responses`, `model_discovery`	`Authorization: Bearer`	(none)
`anthropic`	`https://api.anthropic.com/v1`	`anthropic_messages`, `model_discovery`	`x-api-key`	`anthropic-version: 2023-06-01`
`nvidia`	`https://integrate.api.nvidia.com/v1`	`openai_chat_completions`, `openai_completions`, `openai_responses`, `model_discovery`	`Authorization: Bearer`	(none)

Each profile also defines credential_key_names (e.g. ["OPENAI_API_KEY"]) and base_url_config_keys (e.g. ["OPENAI_BASE_URL"]) used by the gateway to resolve credentials and endpoint overrides from provider records.

Unknown provider types return None from profile_for() and default to Bearer auth with no default headers via auth_for_provider_type().

Vendor Auth Projection Boundaries

Inference provider profiles describe backend request shape, not full sandbox tool auth projection. The next phase keeps these concerns separate:

Anthropic path: claude code and any other Anthropic-facing tool may rely on provider records that discover ANTHROPIC_API_KEY or related fields, but the decision to expose those values to the child process belongs to the tool adapter contract in the sandbox layer.
GitHub / Copilot path: GitHub token discovery (GITHUB_TOKEN, GH_TOKEN) is not enough on its own to define a GitHub Copilot model-access contract. Any Copilot-backed model flow must explicitly document which endpoints, token shapes, and tool adapters are supported.
Fail-closed rule: if a tool/vendor combination is not explicitly documented, OpenShell should not silently treat generic provider discovery as authorization to project those credentials into a sandbox child process.

In other words, provider profiles remain the source of truth for upstream inference protocol handling once traffic is routed, but tool adapters remain the source of truth for what vendor-facing auth material can enter the child process before routing begins.

Control Plane (Gateway)

File: crates/openshell-server/src/inference.rs

The gateway implements the Inference gRPC service defined in proto/inference.proto.

Cluster inference set/get

SetClusterInference takes a provider_name and model_id. It:

Validates that both fields are non-empty.
Fetches the named provider record from the store.
Validates the provider by resolving its route (checking that the provider type is supported and has a usable API key).
By default, performs a lightweight provider-shaped probe against the resolved upstream endpoint (for example, a tiny chat/messages request with max_tokens: 1) to confirm the endpoint is reachable and accepts the expected auth/request shape. --no-verify disables this probe when the endpoint is not up yet.
Builds a managed route spec that stores only provider_name and model_id. The spec intentionally leaves base_url, api_key, and protocols empty -- these are resolved dynamically at bundle time from the provider record.
Upserts the route with name inference.local. Version starts at 1 and increments monotonically on each update.

GetClusterInference returns provider_name, model_id, and version for the managed route. Returns NOT_FOUND if cluster inference is not configured.

Bundle delivery

GetInferenceBundle resolves the managed route at request time:

Loads the inference.local route from the store.
Looks up the referenced provider record by provider_name.
Resolves endpoint, API key, protocols, and provider type from the provider record using the InferenceProviderProfile registry.
If the provider's config map contains a base URL override key (e.g. OPENAI_BASE_URL), that value overrides the profile default.
Returns a GetInferenceBundleResponse with the resolved route(s), a revision hash (DefaultHasher over route fields), and generated_at_ms timestamp.

Because resolution happens at request time, credential rotation and endpoint changes on the provider record take effect on the next bundle fetch without re-running SetClusterInference.

An empty route list is valid and indicates cluster inference is not yet configured.

Proto definitions

File: proto/inference.proto

Key messages:

SetClusterInferenceRequest -- provider_name + model_id + optional no_verify override, with verification enabled by default
SetClusterInferenceResponse -- provider_name + model_id + version
GetInferenceBundleResponse -- repeated ResolvedRoute routes + revision + generated_at_ms
ResolvedRoute -- name, base_url, protocols, api_key, model_id, provider_type

Data Plane (Sandbox)

Files:

crates/openshell-sandbox/src/proxy.rs -- proxy interception, inference context, request routing
crates/openshell-sandbox/src/l7/inference.rs -- pattern detection, HTTP parsing, response formatting
crates/openshell-sandbox/src/lib.rs -- inference context initialization, route refresh
crates/openshell-sandbox/src/grpc_client.rs -- fetch_inference_bundle()

In cluster mode, the sandbox starts a background refresh loop as soon as the inference context is created. The loop polls the gateway every 5 seconds by default (OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS override) and uses the bundle revision hash to skip no-op cache writes.

Interception flow

The proxy handles only CONNECT requests to inference.local. Non-CONNECT requests (any method, any host) are rejected with 403.

When a CONNECT inference.local:443 arrives:

Proxy responds 200 Connection Established.
handle_inference_interception() TLS-terminates the client connection using the sandbox CA (MITM).
Raw HTTP requests are parsed from the TLS tunnel using try_parse_http_request() (supports Content-Length and chunked transfer encoding).
Each parsed request is passed to route_inference_request().
The tunnel supports HTTP keep-alive: multiple requests can be processed sequentially.
Buffer starts at 64 KiB (INITIAL_INFERENCE_BUF) and grows up to 10 MiB (MAX_INFERENCE_BUF). Requests exceeding the max get 413 Payload Too Large.

Request classification

File: crates/openshell-sandbox/src/l7/inference.rs -- default_patterns() and detect_inference_pattern()

Supported built-in patterns:

Method	Path	Protocol	Kind
`POST`	`/v1/chat/completions`	`openai_chat_completions`	`chat_completion`
`POST`	`/v1/completions`	`openai_completions`	`completion`
`POST`	`/v1/responses`	`openai_responses`	`responses`
`POST`	`/v1/messages`	`anthropic_messages`	`messages`
`GET`	`/v1/models`	`model_discovery`	`models_list`
`GET`	`/v1/models/*`	`model_discovery`	`models_get`

Query strings are stripped before matching. Path matching is exact for most patterns; /v1/models/* matches any sub-path (e.g. /v1/models/gpt-4.1). Absolute-form URIs (e.g. https://inference.local/v1/chat/completions) are normalized to path-only form by normalize_inference_path() before detection.

If no pattern matches, the proxy returns 403 Forbidden with {"error": "connection not allowed by policy"}.

Route cache

InferenceContext holds a Router, the pattern list, and an Arc<RwLock<Vec<ResolvedRoute>>> route cache.
In cluster mode, spawn_route_refresh() polls GetInferenceBundle every 30 seconds (ROUTE_REFRESH_INTERVAL_SECS). On failure, stale routes are kept.
In file mode (--inference-routes), routes load once at startup from YAML. No refresh task is spawned.
In cluster mode, an empty initial bundle still enables the inference context so the refresh task can pick up later configuration.

Bundle-to-route conversion

bundle_to_resolved_routes() in lib.rs converts proto ResolvedRoute messages to router ResolvedRoute structs. Auth header style and default headers are derived from provider_type using openshell_core::inference::auth_for_provider_type().

Router Behavior

Files:

crates/openshell-router/src/lib.rs -- Router, proxy_with_candidates(), proxy_with_candidates_streaming()
crates/openshell-router/src/backend.rs -- proxy_to_backend(), proxy_to_backend_streaming(), URL construction
crates/openshell-router/src/config.rs -- RouteConfig, ResolvedRoute, YAML loading

Route selection

proxy_with_candidates() finds the first route whose protocols list contains the detected source protocol (normalized to lowercase). If no route matches, returns RouterError::NoCompatibleRoute.

Request rewriting

proxy_to_backend() rewrites outgoing requests:

Auth injection: Uses the route's AuthHeader -- either Authorization: Bearer <key> or a custom header (e.g. x-api-key: <key> for Anthropic).
Header stripping: Removes authorization, x-api-key, host, and any header names that will be set from route defaults.
Default headers: Applies route-level default headers (e.g. anthropic-version: 2023-06-01) unless the client already sent them.
Model rewrite: Parses the request body as JSON and replaces the model field with the route's configured model. Non-JSON bodies are forwarded unchanged.
URL construction: build_backend_url() appends the request path to the route endpoint. If the endpoint already ends with /v1 and the request path starts with /v1/, the duplicate prefix is deduplicated.

Header sanitization

Before forwarding inference requests, the proxy strips sensitive and hop-by-hop headers from both requests and responses:

Request: authorization, x-api-key, host, content-length, and hop-by-hop headers (connection, keep-alive, proxy-authenticate, proxy-authorization, proxy-connection, te, trailer, transfer-encoding, upgrade).
Response: content-length and hop-by-hop headers.

Response streaming

The router supports two response modes:

Buffered (proxy_with_candidates()): Reads the entire upstream response body into memory before returning a ProxyResponse { status, headers, body: Bytes }. Used by mock routes and in-process system inference calls where latency is not a concern.
Streaming (proxy_with_candidates_streaming()): Returns a StreamingProxyResponse as soon as response headers arrive from the backend. The body is exposed as a StreamingBody enum with a next_chunk() method that yields Option<Bytes> incrementally.

StreamingBody has two variants:

Variant	Source	Behavior
`Live(reqwest::Response)`	Real HTTP backend	Calls `response.chunk()` to yield each body fragment as it arrives from the network
`Buffered(Option<Bytes>)`	Mock routes or fallback	Yields the entire body on the first call, then `None`

The sandbox proxy (route_inference_request() in proxy.rs) uses the streaming path for all inference requests:

Calls proxy_with_candidates_streaming() to get headers immediately.
Formats and sends the HTTP/1.1 response header with Transfer-Encoding: chunked via format_http_response_header().
Loops on body.next_chunk(), wrapping each fragment in HTTP chunked encoding via format_chunk().
Sends the chunk terminator (0\r\n\r\n) via format_chunk_terminator().

This eliminates full-body buffering for streaming responses (SSE). Time-to-first-byte is determined by the backend's first chunk latency rather than the full generation time.

Mock routes

File: crates/openshell-router/src/mock.rs

Routes with mock:// scheme endpoints return canned responses without making HTTP requests. Mock responses are protocol-aware (OpenAI chat completion, OpenAI completion, Anthropic messages, or generic JSON). Mock routes include an x-openshell-mock: true response header.

HTTP client

The router uses a reqwest::Client with a 60-second timeout. Timeouts and connection failures map to RouterError::UpstreamUnavailable.

Standalone Route File

File: crates/openshell-router/src/config.rs

Standalone sandboxes can load static routes from YAML via --inference-routes:

routes:
  - route: inference.local
    endpoint: http://localhost:1234/v1
    model: local-model
    protocols: [openai_chat_completions]
    api_key: lm-studio
    # Or reference an environment variable:
    # api_key_env: OPENAI_API_KEY

Fields:

route -- route name (informational)
endpoint -- backend base URL
model -- model ID to force on outgoing requests
protocols -- list of supported protocol strings
provider_type -- optional; determines auth style and default headers via InferenceProviderProfile
api_key -- inline API key (mutually exclusive with api_key_env)
api_key_env -- environment variable name containing the API key

Validation at load time requires either api_key or api_key_env to resolve, and at least one protocol. Protocols are normalized (lowercased, trimmed, deduplicated).

Error Model

Status	Condition
`403`	Request on `inference.local` does not match a recognized inference API pattern
`503`	Pattern matched but route cache is empty (cluster inference not configured)
`400`	No compatible route for the detected source protocol
`401`	Upstream returned unauthorized
`502`	Upstream protocol error or internal router error
`503`	Upstream unavailable (timeout or connection failure)
`413`	Request body exceeds 10 MiB buffer limit

System Inference Route

In addition to the user-facing inference.local route, the gateway supports a second managed route named sandbox-system for platform system functions (e.g. an embedded agent harness for policy analysis).

Key differences from user inference

Aspect	User (`inference.local`)	System (`sandbox-system`)
Consumer	Agent code inside sandbox	Supervisor binary only
Access	Proxy-intercepted CONNECT	In-process API on `InferenceContext`
Network surface	HTTPS to `inference.local:443`	None -- function call
Route cache	`InferenceContext.routes`	`InferenceContext.system_routes`

In-process API

InferenceContext::system_inference() provides the supervisor with direct access to inference using the system routes. It calls Router::proxy_with_candidates() with the system route cache -- the same backend proxy logic used for user inference, but without any CONNECT/TLS overhead.

ctx.system_inference(
    "openai_chat_completions",
    "POST",
    "/v1/chat/completions",
    headers,
    body,
).await

Access control

The system route is not exposed through the CONNECT proxy. The supervisor runs in the host network namespace and calls the router directly. User processes are in an isolated sandbox network namespace and cannot reach the in-process API.

Bundle delivery

Both routes are included in GetInferenceBundleResponse.routes (which is repeated ResolvedRoute). The sandbox partitions routes by ResolvedRoute.name during bundle_to_resolved_routes(): routes named sandbox-system go to the system cache, everything else goes to the user cache. Both caches are refreshed on the same polling interval.

Storage

The system route is stored as a separate InferenceRoute record in the gateway store with name = "sandbox-system". The SetClusterInferenceRequest.route_name field selects which route to target (empty string defaults to inference.local).

CLI Surface

Cluster inference commands:

openshell inference set --provider <name> --model <id> -- configures user-facing cluster inference
openshell inference set --system --provider <name> --model <id> -- configures system inference
openshell inference get -- displays both user and system inference configuration
openshell inference get --system -- displays only the system inference configuration

The --provider flag references a provider record name (not a provider type). The provider must already exist in the cluster and have a supported inference type (openai, anthropic, or nvidia).

Inference writes verify by default. --no-verify is the explicit opt-out for endpoints that are not up yet.

Provider Discovery

Files:

crates/openshell-providers/src/lib.rs -- ProviderRegistry, ProviderPlugin trait
crates/openshell-providers/src/providers/openai.rs -- OpenaiProvider
crates/openshell-providers/src/providers/anthropic.rs -- AnthropicProvider
crates/openshell-providers/src/providers/nvidia.rs -- NvidiaProvider

Provider discovery and inference routing are separate concerns:

ProviderPlugin (in openshell-providers) handles credential discovery -- scanning environment variables to find API keys.
InferenceProviderProfile (in openshell-core) handles how to use discovered credentials to make inference API calls.

The openai, anthropic, and nvidia provider plugins each discover credentials from their canonical environment variable (OPENAI_API_KEY, ANTHROPIC_API_KEY, NVIDIA_API_KEY). These credentials are stored in provider records and looked up by the gateway at bundle resolution time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inference Routing

Architecture Overview

Provider Profiles

Vendor Auth Projection Boundaries

Control Plane (Gateway)

Cluster inference set/get

Bundle delivery

Proto definitions

Data Plane (Sandbox)

Interception flow

Request classification

Route cache

Bundle-to-route conversion

Router Behavior

Route selection

Request rewriting

Header sanitization

Response streaming

Mock routes

HTTP client

Standalone Route File

Error Model

System Inference Route

Key differences from user inference

In-process API

Access control

Bundle delivery

Storage

CLI Surface

Provider Discovery

FilesExpand file tree

inference-routing.md

Latest commit

History

inference-routing.md

File metadata and controls

Inference Routing

Architecture Overview

Provider Profiles

Vendor Auth Projection Boundaries

Control Plane (Gateway)

Cluster inference set/get

Bundle delivery

Proto definitions

Data Plane (Sandbox)

Interception flow

Request classification

Route cache

Bundle-to-route conversion

Router Behavior

Route selection

Request rewriting

Header sanitization

Response streaming

Mock routes

HTTP client

Standalone Route File

Error Model

System Inference Route

Key differences from user inference

In-process API

Access control

Bundle delivery

Storage

CLI Surface

Provider Discovery