feat(proxy): lazy input_schema discovery for sandbox-supplied-URL MCP providers (closes #135)#136
Conversation
… providers (closes #135) For `handler = "mcp"` providers whose upstream URL is sandbox-supplied (the `mcp_url_env` / `X-Ati-Upstream-Url` model from #124), the proxy cannot discover tools at boot — the URL isn't proxy-resident. So these providers MUST declare their tools statically in the manifest, and `GET /tools/:name` served the static entry verbatim. If the static entry had no `input_schema`, the agent saw `input_schema: null`. Operators worked around this by hand-mirroring the MCP server's schemas into the manifest and keeping them from drifting. This change lazily fetches the schemas on demand. When a tool info / list / MCP tools-list request resolves a static entry whose `input_schema` is `None` AND the provider is `handler = "mcp"` with `mcp_url_env` set AND the request carries a validated `X-Ati-Upstream-Url`, the proxy runs one `tools/list` against the upstream and merges the discovered `inputSchema` into the response. Results are cached per `(provider, upstream_url)` for the process lifetime — both positive (the schema map) and negative (the dial failed) — so a steady-state of `ati tool info` calls only pays the discovery cost once per sandbox URL. Three call sites learn the trick: - `GET /tools` (REST listing — `ati tool list`) - `GET /tools/:name` (REST single-tool — `ati tool info`) - `POST /mcp` → `tools/list` (MCP JSON-RPC — agents) The REST list paths batch by provider so a multi-tool provider with N tools only triggers one upstream tools/list per request, not N. Graceful-degradation contract: - Header absent → static schema (today's behavior: usually `null`). - Header present but URL outside the allowlist → static schema. The REST handlers must NOT surface 4xx for a misconfigured sandbox URL on a tool-info call (the `/call` path still does, unchanged). - Header present + URL valid but upstream unreachable → static schema; the negative result is cached so the next request doesn't retry a known-broken upstream. OpenAPI-handler tools are untouched (they always carry a schema from the spec). Local-mode MCP is untouched (the proxy already discovers those at boot). Static schemas on `[[tools]]` entries are untouched (the fast path returns them without dialing). Tests in `tests/lazy_schema_discovery_test.rs`: - `tool_info_merges_upstream_schema_when_static_is_none` — the acceptance criterion from the issue body. - `tool_info_without_header_falls_back_to_static_null` — no regression when the header is absent. - `tool_info_keeps_null_when_upstream_omits_the_tool` — partial upstream responses don't fabricate schemas. - `tool_info_silently_falls_back_when_url_outside_allowlist` — the REST handlers never surface 4xx on allowlist miss. - `lazy_schema_cache_avoids_redundant_upstream_calls` — second call on the same (provider, url) is served from cache; the upstream saw at most one `tools/list`. - `mcp_tools_list_merges_upstream_schema` — the MCP JSON-RPC path also serves the merged schema. ProxyState gains one new field (`lazy_schema_cache`) and a type alias (`LazySchemaCache`). 33 existing test fixtures updated to initialize the new field; the bulk update follows the same idiom as `upstream_url_allowlists`.
Greptile SummaryThis PR adds lazy
Confidence Score: 4/5Safe to merge with the minor batching fix applied; the fallback-to-null contract guarantees no new 5xx surface on the info endpoints. The core logic is sound: allowlist validation happens before discovery, the mutex is never held across an await, and the negative-result cache prevents hammering a broken upstream. When lazy_fetch_schemas returns None (upstream failure), the provider is not recorded in schemas_by_provider, so N-1 remaining tools from that provider each acquire the mutex for a redundant cache-read rather than being short-circuited — a concrete but non-correctness-affecting defect replicated in both list handlers. src/proxy/server.rs — the two batching loops (handle_tools_list and the tools/list MCP arm) should record a sentinel for failed providers. Important Files Changed
Sequence DiagramsequenceDiagram
participant Client
participant Proxy as ATI Proxy
participant Cache as lazy_schema_cache
participant Allowlist as extract_upstream_url
participant MCP as Sandbox MCP Upstream
Client->>Proxy: GET /tools/:name (X-Ati-Upstream-Url: url)
Proxy->>Allowlist: resolve_upstream_override(provider, url)
alt URL rejected or absent
Allowlist-->>Proxy: None
Proxy-->>Client: 200 input_schema null
else URL allowed
Allowlist-->>Proxy: Allow(url)
Proxy->>Cache: lock get((provider, url))
alt Cache hit
Cache-->>Proxy: Some(map) or None
Proxy-->>Client: 200 input_schema from map or null
else Cache miss
Cache-->>Proxy: absent
Proxy->>MCP: "connect_with_gen -> initialize -> tools/list"
MCP-->>Proxy: Vec of Tool with inputSchema
Proxy->>Cache: lock insert((provider, url), map or None)
Proxy-->>Client: 200 input_schema from map or null
end
end
Reviews (1): Last reviewed commit: "feat(proxy): lazy input_schema discovery..." | Re-trigger Greptile |
| if let Some(map) = lazy_fetch_schemas(&state, provider, &upstream_url).await { | ||
| schemas_by_provider.insert(provider.name.clone(), map); | ||
| } | ||
| } | ||
|
|
||
| let tools: Vec<Value> = filtered |
There was a problem hiding this comment.
Batching short-circuit doesn't fire on upstream failure. When
lazy_fetch_schemas returns None (connect or list failed), the provider is not recorded in schemas_by_provider. For a provider with N tools that all lack a static schema, the first tool caches the negative result, but the remaining N-1 tools still re-enter the loop and individually call lazy_fetch_schemas — each grabbing the mutex for a cache read. Replacing the if let Some with unwrap_or_default() inserts an empty map for the failed provider, so the contains_key guard on subsequent tools fires correctly and the N-1 extra mutex acquisitions are avoided.
| if let Some(map) = lazy_fetch_schemas(&state, provider, &upstream_url).await { | |
| schemas_by_provider.insert(provider.name.clone(), map); | |
| } | |
| } | |
| let tools: Vec<Value> = filtered | |
| let map = lazy_fetch_schemas(&state, provider, &upstream_url) | |
| .await | |
| .unwrap_or_default(); | |
| schemas_by_provider.insert(provider.name.clone(), map); | |
| } | |
| let tools: Vec<Value> = filtered |
| if let Some(map) = lazy_fetch_schemas(&state, provider, &url).await { | ||
| schemas_by_provider.insert(provider.name.clone(), map); | ||
| } | ||
| } | ||
|
|
||
| let mcp_tools: Vec<Value> = visible_tools |
There was a problem hiding this comment.
Same batching short-circuit gap in the MCP JSON-RPC
tools/list arm. If lazy_fetch_schemas returns None, subsequent tools from the same provider bypass the contains_key guard and each acquire the mutex for an individual negative-cache hit.
| if let Some(map) = lazy_fetch_schemas(&state, provider, &url).await { | |
| schemas_by_provider.insert(provider.name.clone(), map); | |
| } | |
| } | |
| let mcp_tools: Vec<Value> = visible_tools | |
| let map = lazy_fetch_schemas(&state, provider, &url) | |
| .await | |
| .unwrap_or_default(); | |
| schemas_by_provider.insert(provider.name.clone(), map); | |
| } | |
| let mcp_tools: Vec<Value> = visible_tools |
| /// Build an axum Router wired with the issue #135 manifest pattern: | ||
| /// MCP provider with `mcp_url_env`, two statically named tools (one | ||
| /// without schema, one without — both meant to be filled by lazy | ||
| /// discovery), and a hostname-glob allowlist that admits the wiremock | ||
| /// server's host. |
There was a problem hiding this comment.
Docstring typo: "one without schema, one without" appears to be a copy-paste error. The test body and
tool_info_keeps_null_when_upstream_omits_the_tool make clear the intent is to verify both the "present upstream schema" and "absent upstream schema" paths.
| /// Build an axum Router wired with the issue #135 manifest pattern: | |
| /// MCP provider with `mcp_url_env`, two statically named tools (one | |
| /// without schema, one without — both meant to be filled by lazy | |
| /// discovery), and a hostname-glob allowlist that admits the wiremock | |
| /// server's host. | |
| /// Build an axum Router wired with the issue #135 manifest pattern: | |
| /// MCP provider with `mcp_url_env`, two statically named tools (one | |
| /// with an upstream inputSchema, one without — no static schemas on | |
| /// either, so the issue #135 lazy-discovery path fires for both), | |
| /// and a hostname-glob allowlist that admits the wiremock server's host. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Three P2 findings on PR #136: 1. `handle_tools_list`'s per-provider batching loop did not record a provider entry when `lazy_fetch_schemas` returned `None` (dial or tools/list failed). The `schemas_by_provider.contains_key` guard wouldn't short-circuit for sibling tools, so an N-tool provider would re-call `extract_upstream_url` + `lazy_fetch_schemas` N times per request when the upstream was unreachable — the inner negative-result cache absorbed the network cost but every sibling still walked the mutex. 2. Same gap in the MCP JSON-RPC `tools/list` arm. 3. `build_app` docstring said "one without schema, one without", should have read "both without schema". Fix shape for (1)+(2): the per-provider decision now records an entry on EVERY path through the loop — including the URL-missing and discovery-failed paths — using `HashMap::new()` as the sentinel for "tried and got nothing". The value-pull loop treats an empty map the same as a real miss (no schema for that tool name), so behavior is unchanged; sibling tools just short-circuit on contains_key. New test `tools_list_batches_one_dial_per_provider_per_request` asserts the success-path short-circuit fires: a 2-tool provider on a single `GET /tools` request triggers exactly 1 upstream tools/list, not 2. The existing serial-cache test still covers the cross-request short-circuit.
… providers (closes #135) [0.7 backport] (#141) * feat(proxy): lazy input_schema discovery for sandbox-supplied-URL MCP providers (closes #135) For `handler = "mcp"` providers whose upstream URL is sandbox-supplied (the `mcp_url_env` / `X-Ati-Upstream-Url` model from #124), the proxy cannot discover tools at boot — the URL isn't proxy-resident. So these providers MUST declare their tools statically in the manifest, and `GET /tools/:name` served the static entry verbatim. If the static entry had no `input_schema`, the agent saw `input_schema: null`. Operators worked around this by hand-mirroring the MCP server's schemas into the manifest and keeping them from drifting. This change lazily fetches the schemas on demand. When a tool info / list / MCP tools-list request resolves a static entry whose `input_schema` is `None` AND the provider is `handler = "mcp"` with `mcp_url_env` set AND the request carries a validated `X-Ati-Upstream-Url`, the proxy runs one `tools/list` against the upstream and merges the discovered `inputSchema` into the response. Results are cached per `(provider, upstream_url)` for the process lifetime — both positive (the schema map) and negative (the dial failed) — so a steady-state of `ati tool info` calls only pays the discovery cost once per sandbox URL. Three call sites learn the trick: - `GET /tools` (REST listing — `ati tool list`) - `GET /tools/:name` (REST single-tool — `ati tool info`) - `POST /mcp` → `tools/list` (MCP JSON-RPC — agents) The REST list paths batch by provider so a multi-tool provider with N tools only triggers one upstream tools/list per request, not N. Graceful-degradation contract: - Header absent → static schema (today's behavior: usually `null`). - Header present but URL outside the allowlist → static schema. The REST handlers must NOT surface 4xx for a misconfigured sandbox URL on a tool-info call (the `/call` path still does, unchanged). - Header present + URL valid but upstream unreachable → static schema; the negative result is cached so the next request doesn't retry a known-broken upstream. OpenAPI-handler tools are untouched (they always carry a schema from the spec). Local-mode MCP is untouched (the proxy already discovers those at boot). Static schemas on `[[tools]]` entries are untouched (the fast path returns them without dialing). Tests in `tests/lazy_schema_discovery_test.rs`: - `tool_info_merges_upstream_schema_when_static_is_none` — the acceptance criterion from the issue body. - `tool_info_without_header_falls_back_to_static_null` — no regression when the header is absent. - `tool_info_keeps_null_when_upstream_omits_the_tool` — partial upstream responses don't fabricate schemas. - `tool_info_silently_falls_back_when_url_outside_allowlist` — the REST handlers never surface 4xx on allowlist miss. - `lazy_schema_cache_avoids_redundant_upstream_calls` — second call on the same (provider, url) is served from cache; the upstream saw at most one `tools/list`. - `mcp_tools_list_merges_upstream_schema` — the MCP JSON-RPC path also serves the merged schema. ProxyState gains one new field (`lazy_schema_cache`) and a type alias (`LazySchemaCache`). 33 existing test fixtures updated to initialize the new field; the bulk update follows the same idiom as `upstream_url_allowlists`. * fix(proxy): address Greptile findings on lazy schema discovery Three P2 findings on PR #136: 1. `handle_tools_list`'s per-provider batching loop did not record a provider entry when `lazy_fetch_schemas` returned `None` (dial or tools/list failed). The `schemas_by_provider.contains_key` guard wouldn't short-circuit for sibling tools, so an N-tool provider would re-call `extract_upstream_url` + `lazy_fetch_schemas` N times per request when the upstream was unreachable — the inner negative-result cache absorbed the network cost but every sibling still walked the mutex. 2. Same gap in the MCP JSON-RPC `tools/list` arm. 3. `build_app` docstring said "one without schema, one without", should have read "both without schema". Fix shape for (1)+(2): the per-provider decision now records an entry on EVERY path through the loop — including the URL-missing and discovery-failed paths — using `HashMap::new()` as the sentinel for "tried and got nothing". The value-pull loop treats an empty map the same as a real miss (no schema for that tool name), so behavior is unchanged; sibling tools just short-circuit on contains_key. New test `tools_list_batches_one_dial_per_provider_per_request` asserts the success-path short-circuit fires: a 2-tool provider on a single `GET /tools` request triggers exactly 1 upstream tools/list, not 2. The existing serial-cache test still covers the cross-request short-circuit. * test(lazy-schema): strip 0.8-only ProxyState fields from new test file The new tests/lazy_schema_discovery_test.rs file shipped on main as part of #136 constructs ProxyState with the 0.8 field set (db, passthrough, sig_verify, key_store, admin_token). 0.7 doesn't have those fields. Strip them so the test compiles on 0.7 — the rest of the test surface (the public proxy /tools, /tools/:name, and /mcp endpoints + the lazy discovery overlay) is unaffected. All 7 lazy-schema tests pass on the 0.7 ProxyState shape: tool_info_merges_upstream_schema_when_static_is_none tool_info_without_header_falls_back_to_static_null tool_info_keeps_null_when_upstream_omits_the_tool tool_info_silently_falls_back_when_url_outside_allowlist lazy_schema_cache_avoids_redundant_upstream_calls tools_list_batches_one_dial_per_provider_per_request mcp_tools_list_merges_upstream_schema --------- Co-authored-by: claudio-michel[bot] <1592301+claudio-michel[bot]@users.noreply.github.com>
Greptile P2 finding on PR #141 (the 0.7 backport, but the same code ships in #136 here on main): `lazy_fetch_schemas` drops the cache mutex between the cache-miss check and the upstream dial. Two cold-cache requests for the same `(provider, upstream_url)` can race; if both run to completion and one of them fails while the other succeeds, the failing racer landing SECOND would overwrite the cached `Some(map)` with `None`. The cache then permanently serves `null` schemas for that (provider, url) pair until the proxy restarts — equivalent to the bug #135 set out to fix in the first place, just gated on a race. Fix: a 6-line write guard on the cache mutex that enforces "first-success-wins, never demote": existing | new | action -------- |--------- |-------- absent | Some(_) | insert (cold-cache success) absent | None | insert (cold-cache failure → fast short-circuit) Some(_) | Some(_) | leave existing (idempotent) Some(_) | None | leave existing (the bug — don't demote) None | Some(_) | upgrade (retry succeeded; promote) None | None | leave existing (idempotent) Two regression tests pin the invariant: - `cache_write_never_demotes_positive_entry_to_none`: writes a Some(map), then a None for the same key, asserts the Some is preserved. Directly exercises the write guard's no-demote rule. - `cache_write_upgrades_none_to_some`: companion in the opposite direction — a transient failure (cached None) MUST be upgradable when a later dial succeeds. Without this companion test the no-demote rule could regress into a too-restrictive "never overwrite" policy that locks in transient failures permanently. Both tests construct a `LazySchemaCache` directly and exercise the write rule against it; we can't drive two concurrent oneshot calls through axum to reproduce the race in-test, but the write rule itself is the load-bearing invariant. If both tests pass, the race fix is correct regardless of how the racing dial paths land at the cache. 9/9 lazy-schema tests green. Default test suite green, clippy clean across default/sentry/all-features, fmt clean.
Summary
Closes #135. For
handler = "mcp"providers whose upstream URL is sandbox-supplied (themcp_url_env/X-Ati-Upstream-Urlmodel from #124), the proxy cannot discover tools at boot. So operators declare tool names statically; the gap was the schemas, which surfaced asinput_schema: nulluntil they were hand-mirrored.This PR has the proxy lazily fetch missing schemas on demand and cache them per
(provider, upstream_url). The change runs on three surfaces —GET /tools,GET /tools/:name, and MCP JSON-RPCtools/list— so bothati tool infoand live agents benefit.Acceptance criteria from #135 — all green
ati tool info <provider>:<tool>returns the liveinputSchemafor a sandbox-URL MCP provider whose manifest declares the tool name but noinput_schema. →tool_info_merges_upstream_schema_when_static_is_nonetool_info_without_header_falls_back_to_static_null(header absent = today's behavior) + the existing test suite (1211/1211 green)tool_info_silently_falls_back_when_url_outside_allowlist+ negative-result caching avoids retrying a known-broken upstreamWhat I shipped
New code (all in
src/proxy/server.rs):ProxyState::lazy_schema_cache: LazySchemaCache— keyed(provider, url), valueOption<HashMap<tool_name, Value>>. Negative results cached too.lazy_fetch_schemas(state, provider, url)— runs onetools/listviamcp_client::McpClient::connect_with_gen(same path the/callhandler uses, so allowlist semantics are unified). Caches positive + negative.extract_upstream_url(state, provider, headers)— REST-handler counterpart to the JSON-RPCresolve_upstream_override. An allowlist miss silently becomesNone(no 4xx on tool-info).resolve_lazy_input_schema(state, provider, tool, headers)— single helper used byhandle_tool_info; fast-paths the common case (static schema present).GET /toolslist handler and the MCPtools/listJSON-RPC arm batch by provider so a multi-tool provider only triggers one upstream tools/list per request.Test coverage in
tests/lazy_schema_discovery_test.rs(six tests):tools/list→ keep static nulltools/listalso serves the merged schemaThe tests stand up a wiremock server that impersonates a Streamable HTTP MCP server (initialize / notifications/initialized / tools/list).
Test plan
ati tool info parcha_tools:firecrawl_parse_documentagainst a Parcha sandbox returns the full schema (issue Proxy: lazy per-call input_schema discovery for sandbox-supplied-URL MCP providers #135 concrete repro)[tools.input_schema]mirrors can be removed once this landsWhat this does NOT do (deliberately)
mcp_url_env+ per-provider allowlist do all the wiring.handler = "mcp"+mcp_url_env./callpath. Already does lazy discovery viacall.rs::execute_local's "tool not in static index" code; this PR is for the info/list surface only./callis 403; on/tools/:nameit stays 200 with the static fallback. The two paths have different threat models (call dispatches credentials; info just describes).🤖 Generated with Claude Code