Inference routing gives sandboxed agents access to LLM APIs through a single, explicit endpoint: inference.local. There is no implicit catch-all interception for arbitrary hosts. Requests are routed only when the process targets inference.local via HTTPS and the request matches a supported inference API pattern.
All inference execution happens locally inside the sandbox via the openshell-router crate. The gateway is control-plane only: it stores configuration and delivers resolved route bundles to sandboxes over gRPC.
sequenceDiagram
participant Agent as Agent Process
participant Proxy as Sandbox Proxy
participant Router as openshell-router
participant Gateway as Gateway (gRPC)
participant Backend as Inference Backend
Note over Gateway,Router: Control plane (startup + periodic refresh)
Gateway->>Router: GetInferenceBundle (routes, credentials)
Note over Agent,Backend: Data plane (per-request)
Agent->>Proxy: CONNECT inference.local:443
Proxy->>Proxy: TLS terminate (MITM)
Proxy->>Proxy: Parse HTTP, detect pattern
Proxy->>Router: proxy_with_candidates()
Router->>Router: Select route by protocol
Router->>Router: Rewrite auth + model
Router->>Backend: HTTPS request
Backend->>Router: Response headers + body stream
Router->>Proxy: StreamingProxyResponse (headers first)
Proxy->>Agent: HTTP/1.1 headers (chunked TE)
loop Each body chunk
Router->>Proxy: chunk via next_chunk()
Proxy->>Agent: Chunked-encoded frame
end
Proxy->>Agent: Chunk terminator (0\r\n\r\n)
File: crates/openshell-core/src/inference.rs
InferenceProviderProfile is the single source of truth for provider-specific inference knowledge: default endpoint, supported protocols, credential key lookup order, auth header style, and default headers.
Three profiles are defined:
| Provider | Default Base URL | Protocols | Auth | Default Headers |
|---|---|---|---|---|
| openai | https://api.openai.com/v1 | openai_chat_completions, openai_completions, openai_responses, model_discovery | Authorization: Bearer | (none) |
| anthropic | https://api.anthropic.com/v1 | anthropic_messages, model_discovery | x-api-key | anthropic-version: 2023-06-01 |
| nvidia | https://integrate.api.nvidia.com/v1 | openai_chat_completions, openai_completions, openai_responses, model_discovery | Authorization: Bearer | (none) |
Each profile also defines credential_key_names (e.g. ["OPENAI_API_KEY"]) and base_url_config_keys (e.g. ["OPENAI_BASE_URL"]) used by the gateway to resolve credentials and endpoint overrides from provider records.
Unknown provider types return None from profile_for() and default to Bearer auth with no default headers via auth_for_provider_type().
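The lookup and fallback behavior described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the struct fields, the `AuthHeader` variant names, and both function signatures are assumptions; only the function names, URLs, header names, and credential keys come from this document.

```rust
/// How the resolved API key is attached to an upstream request.
/// Variant names are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum AuthHeader {
    /// `Authorization: Bearer <key>`
    Bearer,
    /// A provider-specific header name, e.g. `x-api-key`.
    Custom(&'static str),
}

/// Trimmed-down stand-in for the real profile struct (field names assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct InferenceProviderProfile {
    pub default_base_url: &'static str,
    pub protocols: &'static [&'static str],
    pub auth: AuthHeader,
    pub default_headers: &'static [(&'static str, &'static str)],
    pub credential_key_names: &'static [&'static str],
}

const OPENAI_STYLE_PROTOCOLS: &[&str] = &[
    "openai_chat_completions",
    "openai_completions",
    "openai_responses",
    "model_discovery",
];

/// Returns the profile for a known provider type, or `None` otherwise.
pub fn profile_for(provider_type: &str) -> Option<InferenceProviderProfile> {
    match provider_type {
        "openai" => Some(InferenceProviderProfile {
            default_base_url: "https://api.openai.com/v1",
            protocols: OPENAI_STYLE_PROTOCOLS,
            auth: AuthHeader::Bearer,
            default_headers: &[],
            credential_key_names: &["OPENAI_API_KEY"],
        }),
        "anthropic" => Some(InferenceProviderProfile {
            default_base_url: "https://api.anthropic.com/v1",
            protocols: &["anthropic_messages", "model_discovery"],
            auth: AuthHeader::Custom("x-api-key"),
            default_headers: &[("anthropic-version", "2023-06-01")],
            credential_key_names: &["ANTHROPIC_API_KEY"],
        }),
        "nvidia" => Some(InferenceProviderProfile {
            default_base_url: "https://integrate.api.nvidia.com/v1",
            protocols: OPENAI_STYLE_PROTOCOLS,
            auth: AuthHeader::Bearer,
            default_headers: &[],
            credential_key_names: &["NVIDIA_API_KEY"],
        }),
        _ => None,
    }
}

/// Unknown provider types fall back to Bearer auth with no default headers.
pub fn auth_for_provider_type(provider_type: &str) -> (AuthHeader, Vec<(String, String)>) {
    match profile_for(provider_type) {
        Some(profile) => {
            let headers = profile
                .default_headers
                .iter()
                .map(|(name, value)| (name.to_string(), value.to_string()))
                .collect();
            (profile.auth, headers)
        }
        None => (AuthHeader::Bearer, Vec::new()),
    }
}
```

Keeping the fallback in `auth_for_provider_type()` rather than in `profile_for()` lets callers still distinguish "unknown provider" from "known provider that happens to use Bearer auth".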
Inference provider profiles describe backend request shape, not full sandbox tool auth projection. The next phase keeps these concerns separate:
- Anthropic path: claude code and any other Anthropic-facing tool may rely on provider records that discover ANTHROPIC_API_KEY or related fields, but the decision to expose those values to the child process belongs to the tool adapter contract in the sandbox layer.
- GitHub / Copilot path: GitHub token discovery (GITHUB_TOKEN, GH_TOKEN) is not enough on its own to define a GitHub Copilot model-access contract. Any Copilot-backed model flow must explicitly document which endpoints, token shapes, and tool adapters are supported.
- Fail-closed rule: if a tool/vendor combination is not explicitly documented, OpenShell should not silently treat generic provider discovery as authorization to project those credentials into a sandbox child process.
In other words, provider profiles remain the source of truth for upstream inference protocol handling once traffic is routed, but tool adapters remain the source of truth for what vendor-facing auth material can enter the child process before routing begins.
File: crates/openshell-server/src/inference.rs
The gateway implements the Inference gRPC service defined in proto/inference.proto.
SetClusterInference takes a provider_name and model_id. It:
- Validates that both fields are non-empty.
- Fetches the named provider record from the store.
- Validates the provider by resolving its route (checking that the provider type is supported and has a usable API key).
- By default, performs a lightweight provider-shaped probe against the resolved upstream endpoint (for example, a tiny chat/messages request with max_tokens: 1) to confirm the endpoint is reachable and accepts the expected auth/request shape. --no-verify disables this probe when the endpoint is not up yet.
- Builds a managed route spec that stores only provider_name and model_id. The spec intentionally leaves base_url, api_key, and protocols empty -- these are resolved dynamically at bundle time from the provider record.
- Upserts the route with name inference.local. Version starts at 1 and increments monotonically on each update.
GetClusterInference returns provider_name, model_id, and version for the managed route. Returns NOT_FOUND if cluster inference is not configured.
GetInferenceBundle resolves the managed route at request time:
- Loads the inference.local route from the store.
- Looks up the referenced provider record by provider_name.
- Resolves endpoint, API key, protocols, and provider type from the provider record using the InferenceProviderProfile registry.
- If the provider's config map contains a base URL override key (e.g. OPENAI_BASE_URL), that value overrides the profile default.
- Returns a GetInferenceBundleResponse with the resolved route(s), a revision hash (DefaultHasher over route fields), and generated_at_ms timestamp.
Because resolution happens at request time, credential rotation and endpoint changes on the provider record take effect on the next bundle fetch without re-running SetClusterInference.
An empty route list is valid and indicates cluster inference is not yet configured.
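Two pieces of the resolution step lend themselves to a short sketch: the base URL override and the revision hash. The code below is illustrative only. `resolve_base_url` and `bundle_revision` are hypothetical helper names, the `u64` revision type is an assumption, and the struct simply mirrors the ResolvedRoute fields listed for the proto message.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Mirrors the ResolvedRoute fields named in this document.
#[derive(Debug, Clone, Hash, PartialEq)]
pub struct ResolvedRoute {
    pub name: String,
    pub base_url: String,
    pub protocols: Vec<String>,
    pub api_key: String,
    pub model_id: String,
    pub provider_type: String,
}

/// Hypothetical helper: the first override key present in the provider's
/// config map wins; otherwise the profile default is used.
pub fn resolve_base_url(
    config: &HashMap<String, String>,
    override_keys: &[&str],
    default_base_url: &str,
) -> String {
    override_keys
        .iter()
        .find_map(|key| config.get(*key))
        .cloned()
        .unwrap_or_else(|| default_base_url.to_string())
}

/// Hypothetical helper: hash every field of every route so that any change
/// (rotated key, new endpoint, different model) yields a new revision.
pub fn bundle_revision(routes: &[ResolvedRoute]) -> u64 {
    let mut hasher = DefaultHasher::new();
    routes.hash(&mut hasher);
    hasher.finish()
}
```

Because the API key is part of the hashed fields in this sketch, rotating a credential on the provider record changes the revision, which is what lets a sandbox detect the change on its next bundle fetch.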
File: proto/inference.proto
Key messages:
- SetClusterInferenceRequest -- provider_name + model_id + optional no_verify override, with verification enabled by default
- SetClusterInferenceResponse -- provider_name + model_id + version
- GetInferenceBundleResponse -- repeated ResolvedRoute routes + revision + generated_at_ms
- ResolvedRoute -- name, base_url, protocols, api_key, model_id, provider_type
Files:
- crates/openshell-sandbox/src/proxy.rs -- proxy interception, inference context, request routing
- crates/openshell-sandbox/src/l7/inference.rs -- pattern detection, HTTP parsing, response formatting
- crates/openshell-sandbox/src/lib.rs -- inference context initialization, route refresh
- crates/openshell-sandbox/src/grpc_client.rs -- fetch_inference_bundle()
In cluster mode, the sandbox starts a background refresh loop as soon as the inference context is created. The loop polls the gateway every 5 seconds by default (OPENSHELL_ROUTE_REFRESH_INTERVAL_SECS override) and uses the bundle revision hash to skip no-op cache writes.
The proxy handles only CONNECT requests to inference.local. Non-CONNECT requests (any method, any host) are rejected with 403.
When a CONNECT inference.local:443 arrives:
- Proxy responds 200 Connection Established.
- handle_inference_interception() TLS-terminates the client connection using the sandbox CA (MITM).
- Raw HTTP requests are parsed from the TLS tunnel using try_parse_http_request() (supports Content-Length and chunked transfer encoding).
- Each parsed request is passed to route_inference_request().
- The tunnel supports HTTP keep-alive: multiple requests can be processed sequentially.
- Buffer starts at 64 KiB (INITIAL_INFERENCE_BUF) and grows up to 10 MiB (MAX_INFERENCE_BUF). Requests exceeding the max get 413 Payload Too Large.
File: crates/openshell-sandbox/src/l7/inference.rs -- default_patterns() and detect_inference_pattern()
Supported built-in patterns:
| Method | Path | Protocol | Kind |
|---|---|---|---|
| POST | /v1/chat/completions | openai_chat_completions | chat_completion |
| POST | /v1/completions | openai_completions | completion |
| POST | /v1/responses | openai_responses | responses |
| POST | /v1/messages | anthropic_messages | messages |
| GET | /v1/models | model_discovery | models_list |
| GET | /v1/models/* | model_discovery | models_get |
Query strings are stripped before matching. Path matching is exact for most patterns; /v1/models/* matches any sub-path (e.g. /v1/models/gpt-4.1). Absolute-form URIs (e.g. https://inference.local/v1/chat/completions) are normalized to path-only form by normalize_inference_path() before detection.
If no pattern matches, the proxy returns 403 Forbidden with {"error": "connection not allowed by policy"}.
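The matching rules above (strip the query string, normalize absolute-form URIs, exact match except for the `/v1/models/*` wildcard) can be sketched as below. This is an assumption-laden illustration: the `InferencePattern` struct, the `pat` helper, the `DEFAULT_PATTERNS` constant, and the function signatures are invented here; only the function names and the pattern table come from this document.

```rust
/// One row of the pattern table. A path ending in "/*" matches any sub-path.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct InferencePattern {
    pub method: &'static str,
    pub path: &'static str,
    pub protocol: &'static str,
    pub kind: &'static str,
}

const fn pat(
    method: &'static str,
    path: &'static str,
    protocol: &'static str,
    kind: &'static str,
) -> InferencePattern {
    InferencePattern { method, path, protocol, kind }
}

pub const DEFAULT_PATTERNS: [InferencePattern; 6] = [
    pat("POST", "/v1/chat/completions", "openai_chat_completions", "chat_completion"),
    pat("POST", "/v1/completions", "openai_completions", "completion"),
    pat("POST", "/v1/responses", "openai_responses", "responses"),
    pat("POST", "/v1/messages", "anthropic_messages", "messages"),
    pat("GET", "/v1/models", "model_discovery", "models_list"),
    pat("GET", "/v1/models/*", "model_discovery", "models_get"),
];

/// Reduce an absolute-form target (scheme://host/path) to its path.
/// Targets that already start with '/' are returned unchanged.
pub fn normalize_inference_path(target: &str) -> &str {
    if target.starts_with('/') {
        return target;
    }
    match target.find("://") {
        Some(idx) => {
            let after_scheme = &target[idx + 3..];
            match after_scheme.find('/') {
                Some(slash) => &after_scheme[slash..],
                None => "/",
            }
        }
        None => target,
    }
}

/// Return the first pattern matching the method and query-stripped path.
pub fn detect_inference_pattern(method: &str, target: &str) -> Option<InferencePattern> {
    let path = normalize_inference_path(target);
    let path = path.split('?').next().unwrap_or(path);
    DEFAULT_PATTERNS.iter().copied().find(|p| {
        p.method == method
            && match p.path.strip_suffix("/*") {
                // Wildcard: require a non-empty segment after the prefix.
                Some(prefix) => path
                    .strip_prefix(prefix)
                    .map_or(false, |rest| rest.starts_with('/') && rest.len() > 1),
                None => path == p.path,
            }
    })
}
```

A `None` result corresponds to the 403 response described above; the caller decides how to render it.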
- InferenceContext holds a Router, the pattern list, and an Arc<RwLock<Vec<ResolvedRoute>>> route cache.
- In cluster mode, spawn_route_refresh() polls GetInferenceBundle every 30 seconds (ROUTE_REFRESH_INTERVAL_SECS). On failure, stale routes are kept.
- In file mode (--inference-routes), routes load once at startup from YAML. No refresh task is spawned.
- In cluster mode, an empty initial bundle still enables the inference context so the refresh task can pick up later configuration.
bundle_to_resolved_routes() in lib.rs converts proto ResolvedRoute messages to router ResolvedRoute structs. Auth header style and default headers are derived from provider_type using openshell_core::inference::auth_for_provider_type().
Files:
- crates/openshell-router/src/lib.rs -- Router, proxy_with_candidates(), proxy_with_candidates_streaming()
- crates/openshell-router/src/backend.rs -- proxy_to_backend(), proxy_to_backend_streaming(), URL construction
- crates/openshell-router/src/config.rs -- RouteConfig, ResolvedRoute, YAML loading
proxy_with_candidates() finds the first route whose protocols list contains the detected source protocol (normalized to lowercase). If no route matches, returns RouterError::NoCompatibleRoute.
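The selection rule is small enough to show in isolation. In this sketch `select_route` is a hypothetical helper (the real logic lives inside proxy_with_candidates()), the struct fields follow the YAML route fields described in this document, and modelling `NoCompatibleRoute` as a unit variant is an assumption.

```rust
/// Router-side route, reduced to the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedRoute {
    pub name: String,
    pub endpoint: String,
    pub model: String,
    pub protocols: Vec<String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum RouterError {
    /// No candidate route lists the detected source protocol.
    NoCompatibleRoute,
}

/// First-match selection: candidates are scanned in order and the first one
/// whose protocol list contains the lowercased source protocol wins.
pub fn select_route<'a>(
    candidates: &'a [ResolvedRoute],
    source_protocol: &str,
) -> Result<&'a ResolvedRoute, RouterError> {
    let wanted = source_protocol.to_ascii_lowercase();
    candidates
        .iter()
        .find(|route| route.protocols.iter().any(|p| *p == wanted))
        .ok_or(RouterError::NoCompatibleRoute)
}
```

Because selection is first-match, the order of routes in the candidate list matters when more than one route supports the same protocol.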
proxy_to_backend() rewrites outgoing requests:
- Auth injection: Uses the route's AuthHeader -- either Authorization: Bearer <key> or a custom header (e.g. x-api-key: <key> for Anthropic).
- Header stripping: Removes authorization, x-api-key, host, and any header names that will be set from route defaults.
- Default headers: Applies route-level default headers (e.g. anthropic-version: 2023-06-01) unless the client already sent them.
- Model rewrite: Parses the request body as JSON and replaces the model field with the route's configured model. Non-JSON bodies are forwarded unchanged.
- URL construction: build_backend_url() appends the request path to the route endpoint. If the endpoint already ends with /v1 and the request path starts with /v1/, the duplicate /v1 prefix is dropped so it appears only once.
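The URL construction rule can be illustrated with a few lines of string handling. The signature below is an assumption, and trimming a trailing slash from the endpoint is an extra convenience of this sketch that the document does not describe.

```rust
/// Join a route endpoint and a request path, avoiding a doubled "/v1".
pub fn build_backend_url(endpoint: &str, request_path: &str) -> String {
    // Assumption of this sketch: tolerate a trailing slash on the endpoint.
    let base = endpoint.trim_end_matches('/');
    let path = if base.ends_with("/v1") && request_path.starts_with("/v1/") {
        // Drop the leading "/v1" so it is not repeated after the endpoint.
        &request_path["/v1".len()..]
    } else {
        request_path
    };
    format!("{}{}", base, path)
}
```

With the OpenAI default endpoint, `build_backend_url("https://api.openai.com/v1", "/v1/chat/completions")` yields `https://api.openai.com/v1/chat/completions`, while an endpoint without the `/v1` suffix keeps the request path as sent.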
Before forwarding inference requests, the proxy strips sensitive and hop-by-hop headers from both requests and responses:
- Request: authorization, x-api-key, host, content-length, and hop-by-hop headers (connection, keep-alive, proxy-authenticate, proxy-authorization, proxy-connection, te, trailer, transfer-encoding, upgrade).
- Response: content-length and hop-by-hop headers.
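A header filter along these lines would implement the two lists above. The function and constant names are hypothetical, and representing headers as `(String, String)` pairs is a simplification of whatever header type the proxy actually uses.

```rust
/// Hop-by-hop headers named in this document (lowercase).
const HOP_BY_HOP: &[&str] = &[
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "proxy-connection",
    "te",
    "trailer",
    "transfer-encoding",
    "upgrade",
];

/// Extra headers removed from requests before forwarding.
const REQUEST_ONLY: &[&str] = &["authorization", "x-api-key", "host", "content-length"];

/// Extra headers removed from responses before relaying.
const RESPONSE_ONLY: &[&str] = &["content-length"];

/// Keep every header whose lowercased name is in neither blocklist.
fn strip(headers: &[(String, String)], extra: &[&str]) -> Vec<(String, String)> {
    headers
        .iter()
        .filter(|(name, _)| {
            let lower = name.to_ascii_lowercase();
            let blocked = |list: &[&str]| list.iter().any(|h| *h == lower);
            !blocked(HOP_BY_HOP) && !blocked(extra)
        })
        .cloned()
        .collect()
}

pub fn sanitize_request_headers(headers: &[(String, String)]) -> Vec<(String, String)> {
    strip(headers, REQUEST_ONLY)
}

pub fn sanitize_response_headers(headers: &[(String, String)]) -> Vec<(String, String)> {
    strip(headers, RESPONSE_ONLY)
}
```

Dropping content-length in both directions is consistent with the proxy re-framing bodies itself: the model rewrite can change the request body size, and responses are re-emitted with chunked transfer encoding.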
The router supports two response modes:
- Buffered (proxy_with_candidates()): Reads the entire upstream response body into memory before returning a ProxyResponse { status, headers, body: Bytes }. Used by mock routes and in-process system inference calls where latency is not a concern.
- Streaming (proxy_with_candidates_streaming()): Returns a StreamingProxyResponse as soon as response headers arrive from the backend. The body is exposed as a StreamingBody enum with a next_chunk() method that yields Option<Bytes> incrementally.
StreamingBody has two variants:
| Variant | Source | Behavior |
|---|---|---|
| Live(reqwest::Response) | Real HTTP backend | Calls response.chunk() to yield each body fragment as it arrives from the network |
| Buffered(Option<Bytes>) | Mock routes or fallback | Yields the entire body on the first call, then None |
The sandbox proxy (route_inference_request() in proxy.rs) uses the streaming path for all inference requests:
- Calls proxy_with_candidates_streaming() to get headers immediately.
- Formats and sends the HTTP/1.1 response header with Transfer-Encoding: chunked via format_http_response_header().
- Loops on body.next_chunk(), wrapping each fragment in HTTP chunked encoding via format_chunk().
- Sends the chunk terminator (0\r\n\r\n) via format_chunk_terminator().
This eliminates full-body buffering for streaming responses (SSE). Time-to-first-byte is determined by the backend's first chunk latency rather than the full generation time.
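The chunk framing itself follows the standard HTTP/1.1 chunked transfer coding: a hexadecimal length, CRLF, the data, CRLF, and a zero-length chunk to finish. The sketch below shows that framing with assumed signatures for format_chunk() and format_chunk_terminator(); `encode_body` is a hypothetical stand-in for the proxy's loop over next_chunk(), operating on an in-memory iterator instead of a live stream.

```rust
/// Frame one body fragment as an HTTP/1.1 chunk: "<hex len>\r\n<data>\r\n".
pub fn format_chunk(data: &[u8]) -> Vec<u8> {
    let mut frame = format!("{:x}\r\n", data.len()).into_bytes();
    frame.extend_from_slice(data);
    frame.extend_from_slice(b"\r\n");
    frame
}

/// The zero-length chunk that ends a chunked body (no trailers).
pub fn format_chunk_terminator() -> &'static [u8] {
    b"0\r\n\r\n"
}

/// Hypothetical driver: frame each fragment in order, then terminate.
pub fn encode_body<I>(chunks: I) -> Vec<u8>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut out = Vec::new();
    for chunk in chunks {
        // An empty fragment must be skipped: a zero-length chunk would be
        // read by the client as the end of the body.
        if chunk.is_empty() {
            continue;
        }
        out.extend_from_slice(&format_chunk(&chunk));
    }
    out.extend_from_slice(format_chunk_terminator());
    out
}
```

In the real proxy each frame would be written to the client socket as soon as it is produced rather than accumulated, which is what keeps time-to-first-byte tied to the backend's first chunk.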
File: crates/openshell-router/src/mock.rs
Routes with mock:// scheme endpoints return canned responses without making HTTP requests. Mock responses are protocol-aware (OpenAI chat completion, OpenAI completion, Anthropic messages, or generic JSON). Mock routes include an x-openshell-mock: true response header.
The router uses a reqwest::Client with a 60-second timeout. Timeouts and connection failures map to RouterError::UpstreamUnavailable.
File: crates/openshell-router/src/config.rs
Standalone sandboxes can load static routes from YAML via --inference-routes:
routes:
  - route: inference.local
    endpoint: http://localhost:1234/v1
    model: local-model
    protocols: [openai_chat_completions]
    api_key: lm-studio
    # Or reference an environment variable:
    # api_key_env: OPENAI_API_KEY

Fields:

- route -- route name (informational)
- endpoint -- backend base URL
- model -- model ID to force on outgoing requests
- protocols -- list of supported protocol strings
- provider_type -- optional; determines auth style and default headers via InferenceProviderProfile
- api_key -- inline API key (mutually exclusive with api_key_env)
- api_key_env -- environment variable name containing the API key
Validation at load time requires either api_key or api_key_env to resolve, and at least one protocol. Protocols are normalized (lowercased, trimmed, deduplicated).
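The load-time checks can be sketched as two small functions. Both names and signatures are hypothetical, the error type is simplified to `String`, and preserving first-seen order during deduplication is an assumption of this sketch.

```rust
/// Lowercase, trim, and deduplicate protocol names, keeping first-seen order.
pub fn normalize_protocols<S: AsRef<str>>(raw: &[S]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for item in raw {
        let normalized = item.as_ref().trim().to_ascii_lowercase();
        if !normalized.is_empty() && !out.contains(&normalized) {
            out.push(normalized);
        }
    }
    out
}

/// Resolve the API key from either the inline value or an environment variable.
pub fn resolve_api_key(
    api_key: Option<&str>,
    api_key_env: Option<&str>,
) -> Result<String, String> {
    match (api_key, api_key_env) {
        (Some(_), Some(_)) => {
            Err("api_key and api_key_env are mutually exclusive".to_string())
        }
        (Some(key), None) => Ok(key.to_string()),
        (None, Some(var)) => std::env::var(var)
            .map_err(|_| format!("environment variable {} is not set", var)),
        (None, None) => Err("either api_key or api_key_env is required".to_string()),
    }
}

/// A route is loadable only if its key resolves and it has a protocol.
pub fn validate_route(
    protocols: &[&str],
    api_key: Option<&str>,
    api_key_env: Option<&str>,
) -> Result<(Vec<String>, String), String> {
    let key = resolve_api_key(api_key, api_key_env)?;
    let protocols = normalize_protocols(protocols);
    if protocols.is_empty() {
        return Err("at least one protocol is required".to_string());
    }
    Ok((protocols, key))
}
```

Normalizing at load time means route selection can compare protocol strings directly without repeating the cleanup on every request.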
| Status | Condition |
|---|---|
| 403 | Request on inference.local does not match a recognized inference API pattern |
| 503 | Pattern matched but route cache is empty (cluster inference not configured) |
| 400 | No compatible route for the detected source protocol |
| 401 | Upstream returned unauthorized |
| 502 | Upstream protocol error or internal router error |
| 503 | Upstream unavailable (timeout or connection failure) |
| 413 | Request body exceeds 10 MiB buffer limit |
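The table above amounts to a mapping from failure condition to status code. The enum below is purely illustrative: `InferenceFailure` and most of its variant names are invented for this sketch, and the real code may spread these cases across the proxy and RouterError rather than a single type.

```rust
/// Hypothetical union of the failure conditions listed in the table.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InferenceFailure {
    PatternNotRecognized,
    RouteCacheEmpty,
    NoCompatibleRoute,
    UpstreamUnauthorized,
    UpstreamProtocol,
    Internal,
    UpstreamUnavailable,
    BodyTooLarge,
}

/// Map each failure to the HTTP status returned to the agent.
pub fn status_code(failure: InferenceFailure) -> u16 {
    match failure {
        InferenceFailure::PatternNotRecognized => 403,
        InferenceFailure::RouteCacheEmpty => 503,
        InferenceFailure::NoCompatibleRoute => 400,
        InferenceFailure::UpstreamUnauthorized => 401,
        InferenceFailure::UpstreamProtocol | InferenceFailure::Internal => 502,
        InferenceFailure::UpstreamUnavailable => 503,
        InferenceFailure::BodyTooLarge => 413,
    }
}
```

Note that 503 covers two distinct situations, an unconfigured cluster and an unreachable backend, so a client cannot tell them apart from the status code alone.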
In addition to the user-facing inference.local route, the gateway supports a second managed route named sandbox-system for platform system functions (e.g. an embedded agent harness for policy analysis).
| Aspect | User (inference.local) | System (sandbox-system) |
|---|---|---|
| Consumer | Agent code inside sandbox | Supervisor binary only |
| Access | Proxy-intercepted CONNECT | In-process API on InferenceContext |
| Network surface | HTTPS to inference.local:443 | None -- function call |
| Route cache | InferenceContext.routes | InferenceContext.system_routes |
InferenceContext::system_inference() provides the supervisor with direct access to inference using the system routes. It calls Router::proxy_with_candidates() with the system route cache -- the same backend proxy logic used for user inference, but without any CONNECT/TLS overhead.
ctx.system_inference(
    "openai_chat_completions",
    "POST",
    "/v1/chat/completions",
    headers,
    body,
).await

The system route is not exposed through the CONNECT proxy. The supervisor runs in the host network namespace and calls the router directly. User processes are in an isolated sandbox network namespace and cannot reach the in-process API.
Both routes are included in GetInferenceBundleResponse.routes (which is repeated ResolvedRoute). The sandbox partitions routes by ResolvedRoute.name during bundle_to_resolved_routes(): routes named sandbox-system go to the system cache, everything else goes to the user cache. Both caches are refreshed on the same polling interval.
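The partitioning step can be sketched with `Iterator::partition`. The `partition_routes` helper, the `SYSTEM_ROUTE_NAME` constant, and the reduced struct are assumptions for illustration; in the real code this happens inside bundle_to_resolved_routes().

```rust
/// Route reduced to the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedRoute {
    pub name: String,
    pub model: String,
}

pub const SYSTEM_ROUTE_NAME: &str = "sandbox-system";

/// Split a bundle into (user routes, system routes) by route name.
/// Anything not named "sandbox-system" is treated as a user route.
pub fn partition_routes(routes: Vec<ResolvedRoute>) -> (Vec<ResolvedRoute>, Vec<ResolvedRoute>) {
    let (system, user): (Vec<_>, Vec<_>) = routes
        .into_iter()
        .partition(|route| route.name == SYSTEM_ROUTE_NAME);
    (user, system)
}
```

Treating every non-system name as a user route means an unexpected route name never lands in the system cache by accident.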
The system route is stored as a separate InferenceRoute record in the gateway store with name = "sandbox-system". The SetClusterInferenceRequest.route_name field selects which route to target (empty string defaults to inference.local).
Cluster inference commands:
- openshell inference set --provider <name> --model <id> -- configures user-facing cluster inference
- openshell inference set --system --provider <name> --model <id> -- configures system inference
- openshell inference get -- displays both user and system inference configuration
- openshell inference get --system -- displays only the system inference configuration
The --provider flag references a provider record name (not a provider type). The provider must already exist in the cluster and have a supported inference type (openai, anthropic, or nvidia).
Inference writes verify by default. --no-verify is the explicit opt-out for endpoints that are not up yet.
Files:
- crates/openshell-providers/src/lib.rs -- ProviderRegistry, ProviderPlugin trait
- crates/openshell-providers/src/providers/openai.rs -- OpenaiProvider
- crates/openshell-providers/src/providers/anthropic.rs -- AnthropicProvider
- crates/openshell-providers/src/providers/nvidia.rs -- NvidiaProvider
Provider discovery and inference routing are separate concerns:
- ProviderPlugin (in openshell-providers) handles credential discovery -- scanning environment variables to find API keys.
- InferenceProviderProfile (in openshell-core) handles how to use discovered credentials to make inference API calls.
The openai, anthropic, and nvidia provider plugins each discover credentials from their canonical environment variable (OPENAI_API_KEY, ANTHROPIC_API_KEY, NVIDIA_API_KEY). These credentials are stored in provider records and looked up by the gateway at bundle resolution time.