Skip to content

feat(proxy): lazy input_schema discovery for sandbox-supplied-URL MCP providers (closes #135)#136

Merged
miguelrios merged 2 commits into
mainfrom
fix/lazy-schema-discovery
Jun 5, 2026
Merged

feat(proxy): lazy input_schema discovery for sandbox-supplied-URL MCP providers (closes #135)#136
miguelrios merged 2 commits into
mainfrom
fix/lazy-schema-discovery

Conversation

@claudio-michel

@claudio-michel claudio-michel Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #135. For handler = "mcp" providers whose upstream URL is sandbox-supplied (the mcp_url_env / X-Ati-Upstream-Url model from #124), the proxy cannot discover tools at boot. So operators declare tool names statically; the gap was the schemas, which surfaced as input_schema: null until they were hand-mirrored.

This PR has the proxy lazily fetch missing schemas on demand and cache them per (provider, upstream_url). The change runs on three surfaces — GET /tools, GET /tools/:name, and MCP JSON-RPC tools/list — so both ati tool info and live agents benefit.

Acceptance criteria from #135 — all green

  • ati tool info <provider>:<tool> returns the live inputSchema for a sandbox-URL MCP provider whose manifest declares the tool name but no input_schema. → tool_info_merges_upstream_schema_when_static_is_none
  • No regression for OpenAPI-handler tools (schema from spec) or local-mode MCP (already works). → tool_info_without_header_falls_back_to_static_null (header absent = today's behavior) + the existing test suite (1211/1211 green)
  • Boot still succeeds when the upstream URL is unresolvable (falls back to the static entry / null, as today). → tool_info_silently_falls_back_when_url_outside_allowlist + negative-result caching avoids retrying a known-broken upstream

What I shipped

New code (all in src/proxy/server.rs):

  • ProxyState::lazy_schema_cache: LazySchemaCache — keyed (provider, url), value Option<HashMap<tool_name, Value>>. Negative results cached too.
  • lazy_fetch_schemas(state, provider, url) — runs one tools/list via mcp_client::McpClient::connect_with_gen (same path the /call handler uses, so allowlist semantics are unified). Caches positive + negative.
  • extract_upstream_url(state, provider, headers) — REST-handler counterpart to the JSON-RPC resolve_upstream_override. An allowlist miss silently becomes None (no 4xx on tool-info).
  • resolve_lazy_input_schema(state, provider, tool, headers) — single helper used by handle_tool_info; fast-paths the common case (static schema present).
  • The REST GET /tools list handler and the MCP tools/list JSON-RPC arm batch by provider so a multi-tool provider only triggers one upstream tools/list per request.

Test coverage in tests/lazy_schema_discovery_test.rs (six tests):

  1. Merge upstream schema when static is none — the acceptance criterion
  2. No header → fall back to static null (today's behavior)
  3. Tool absent from upstream tools/list → keep static null
  4. URL outside allowlist → 200 + null (REST handlers must not 4xx)
  5. Cache: second request served without re-dialing upstream
  6. MCP JSON-RPC tools/list also serves the merged schema

The tests stand up a wiremock server that impersonates a Streamable HTTP MCP server (initialize / notifications/initialized / tools/list).

Test plan

cargo test               # 1211 / 1211 green
cargo clippy --all-features --lib --tests -- -D warnings  # clean
cargo fmt --all --check  # clean

What this does NOT do (deliberately)

  • Manifest schema change. None needed — existing mcp_url_env + per-provider allowlist do all the wiring.
  • Touch local-mode MCP or stdio transport. Boot-time discovery still handles those; lazy path only fires for handler = "mcp" + mcp_url_env.
  • Touch the /call path. Already does lazy discovery via call.rs::execute_local's "tool not in static index" code; this PR is for the info/list surface only.
  • Open up the REST handlers to 4xx behavior. An allowlist miss on /call is 403; on /tools/:name it stays 200 with the static fallback. The two paths have different threat models (call dispatches credentials; info just describes).

🤖 Generated with Claude Code

… providers (closes #135)

For `handler = "mcp"` providers whose upstream URL is sandbox-supplied
(the `mcp_url_env` / `X-Ati-Upstream-Url` model from #124), the proxy
cannot discover tools at boot — the URL isn't proxy-resident. So
these providers MUST declare their tools statically in the manifest,
and `GET /tools/:name` served the static entry verbatim. If the static
entry had no `input_schema`, the agent saw `input_schema: null`.
Operators worked around this by hand-mirroring the MCP server's
schemas into the manifest and keeping them from drifting.

This change lazily fetches the schemas on demand. When a tool info /
list / MCP tools-list request resolves a static entry whose
`input_schema` is `None` AND the provider is `handler = "mcp"` with
`mcp_url_env` set AND the request carries a validated
`X-Ati-Upstream-Url`, the proxy runs one `tools/list` against the
upstream and merges the discovered `inputSchema` into the response.
Results are cached per `(provider, upstream_url)` for the process
lifetime — both positive (the schema map) and negative (the dial
failed) — so a steady-state of `ati tool info` calls only pays the
discovery cost once per sandbox URL.

Three call sites learn the trick:
- `GET /tools` (REST listing — `ati tool list`)
- `GET /tools/:name` (REST single-tool — `ati tool info`)
- `POST /mcp` → `tools/list` (MCP JSON-RPC — agents)

The REST list paths batch by provider so a multi-tool provider with
N tools only triggers one upstream tools/list per request, not N.

Graceful-degradation contract:
- Header absent → static schema (today's behavior: usually `null`).
- Header present but URL outside the allowlist → static schema. The
  REST handlers must NOT surface 4xx for a misconfigured sandbox URL
  on a tool-info call (the `/call` path still does, unchanged).
- Header present + URL valid but upstream unreachable → static
  schema; the negative result is cached so the next request doesn't
  retry a known-broken upstream.

OpenAPI-handler tools are untouched (they always carry a schema from
the spec). Local-mode MCP is untouched (the proxy already discovers
those at boot). Static schemas on `[[tools]]` entries are untouched
(the fast path returns them without dialing).

Tests in `tests/lazy_schema_discovery_test.rs`:
- `tool_info_merges_upstream_schema_when_static_is_none` — the
  acceptance criterion from the issue body.
- `tool_info_without_header_falls_back_to_static_null` — no
  regression when the header is absent.
- `tool_info_keeps_null_when_upstream_omits_the_tool` — partial
  upstream responses don't fabricate schemas.
- `tool_info_silently_falls_back_when_url_outside_allowlist` — the
  REST handlers never surface 4xx on allowlist miss.
- `lazy_schema_cache_avoids_redundant_upstream_calls` — second call
  on the same (provider, url) is served from cache; the upstream
  saw at most one `tools/list`.
- `mcp_tools_list_merges_upstream_schema` — the MCP JSON-RPC path
  also serves the merged schema.

ProxyState gains one new field (`lazy_schema_cache`) and a type
alias (`LazySchemaCache`). 33 existing test fixtures updated to
initialize the new field; the bulk update follows the same idiom
as `upstream_url_allowlists`.
@greptile-apps

greptile-apps Bot commented Jun 1, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds lazy inputSchema discovery for handler = "mcp" + mcp_url_env providers, where the proxy can't know tool schemas at boot because the upstream URL is sandbox-supplied. On a cache miss, the three info surfaces (GET /tools, GET /tools/:name, and MCP JSON-RPC tools/list) now dial the upstream via the existing connect_with_gen path, cache the result per (provider, upstream_url) for the process lifetime, and fall back silently to null on any failure.

  • src/proxy/server.rs: adds LazySchemaCache to ProxyState; implements lazy_fetch_schemas, extract_upstream_url, and resolve_lazy_input_schema; applies per-provider batching in the two list handlers and the single-tool info handler.
  • tests/lazy_schema_discovery_test.rs: six new wiremock-backed integration tests covering all acceptance criteria from Proxy: lazy per-call input_schema discovery for sandbox-supplied-URL MCP providers #135.
  • Remaining test files: mechanical addition of lazy_schema_cache field initialization to ProxyState struct literals.

Confidence Score: 4/5

Safe to merge with the minor batching fix applied; the fallback-to-null contract guarantees no new 5xx surface on the info endpoints.

The core logic is sound: allowlist validation happens before discovery, the mutex is never held across an await, and the negative-result cache prevents hammering a broken upstream. When lazy_fetch_schemas returns None (upstream failure), the provider is not recorded in schemas_by_provider, so N-1 remaining tools from that provider each acquire the mutex for a redundant cache-read rather than being short-circuited — a concrete but non-correctness-affecting defect replicated in both list handlers.

src/proxy/server.rs — the two batching loops (handle_tools_list and the tools/list MCP arm) should record a sentinel for failed providers.

Important Files Changed

Filename Overview
src/proxy/server.rs Core change: adds LazySchemaCache to ProxyState, implements lazy_fetch_schemas/extract_upstream_url/resolve_lazy_input_schema, applies lazy discovery to all three info surfaces. Logic correct; mutex never held across awaits; allowlist validated before discovery. Minor: failed providers not recorded in schemas_by_provider, causing redundant per-tool cache lookups.
tests/lazy_schema_discovery_test.rs Six new wiremock-backed integration tests covering all #135 acceptance criteria. build_app docstring has a typo describing both tools as 'without schema'.
tests/common/mod.rs Mechanical addition of lazy_schema_cache field to shared test-app builders.
tests/proxy_server_test.rs Mechanical addition of lazy_schema_cache field to inline ProxyState constructions.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Proxy as ATI Proxy
    participant Cache as lazy_schema_cache
    participant Allowlist as extract_upstream_url
    participant MCP as Sandbox MCP Upstream

    Client->>Proxy: GET /tools/:name (X-Ati-Upstream-Url: url)
    Proxy->>Allowlist: resolve_upstream_override(provider, url)
    alt URL rejected or absent
        Allowlist-->>Proxy: None
        Proxy-->>Client: 200 input_schema null
    else URL allowed
        Allowlist-->>Proxy: Allow(url)
        Proxy->>Cache: lock get((provider, url))
        alt Cache hit
            Cache-->>Proxy: Some(map) or None
            Proxy-->>Client: 200 input_schema from map or null
        else Cache miss
            Cache-->>Proxy: absent
            Proxy->>MCP: "connect_with_gen -> initialize -> tools/list"
            MCP-->>Proxy: Vec of Tool with inputSchema
            Proxy->>Cache: lock insert((provider, url), map or None)
            Proxy-->>Client: 200 input_schema from map or null
        end
    end
Loading

Reviews (1): Last reviewed commit: "feat(proxy): lazy input_schema discovery..." | Re-trigger Greptile

Comment thread src/proxy/server.rs Outdated
Comment on lines +2067 to +2072
if let Some(map) = lazy_fetch_schemas(&state, provider, &upstream_url).await {
schemas_by_provider.insert(provider.name.clone(), map);
}
}

let tools: Vec<Value> = filtered

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Batching short-circuit doesn't fire on upstream failure. When lazy_fetch_schemas returns None (connect or list failed), the provider is not recorded in schemas_by_provider. For a provider with N tools that all lack a static schema, the first tool caches the negative result, but the remaining N-1 tools still re-enter the loop and individually call lazy_fetch_schemas — each grabbing the mutex for a cache read. Replacing the if let Some with unwrap_or_default() inserts an empty map for the failed provider, so the contains_key guard on subsequent tools fires correctly and the N-1 extra mutex acquisitions are avoided.

Suggested change
if let Some(map) = lazy_fetch_schemas(&state, provider, &upstream_url).await {
schemas_by_provider.insert(provider.name.clone(), map);
}
}
let tools: Vec<Value> = filtered
let map = lazy_fetch_schemas(&state, provider, &upstream_url)
.await
.unwrap_or_default();
schemas_by_provider.insert(provider.name.clone(), map);
}
let tools: Vec<Value> = filtered

Comment thread src/proxy/server.rs Outdated
Comment on lines 1805 to 1810
if let Some(map) = lazy_fetch_schemas(&state, provider, &url).await {
schemas_by_provider.insert(provider.name.clone(), map);
}
}

let mcp_tools: Vec<Value> = visible_tools

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Same batching short-circuit gap in the MCP JSON-RPC tools/list arm. If lazy_fetch_schemas returns None, subsequent tools from the same provider bypass the contains_key guard and each acquire the mutex for an individual negative-cache hit.

Suggested change
if let Some(map) = lazy_fetch_schemas(&state, provider, &url).await {
schemas_by_provider.insert(provider.name.clone(), map);
}
}
let mcp_tools: Vec<Value> = visible_tools
let map = lazy_fetch_schemas(&state, provider, &url)
.await
.unwrap_or_default();
schemas_by_provider.insert(provider.name.clone(), map);
}
let mcp_tools: Vec<Value> = visible_tools

Comment on lines +80 to +84
/// Build an axum Router wired with the issue #135 manifest pattern:
/// MCP provider with `mcp_url_env`, two statically named tools (one
/// without schema, one without — both meant to be filled by lazy
/// discovery), and a hostname-glob allowlist that admits the wiremock
/// server's host.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Docstring typo: "one without schema, one without" appears to be a copy-paste error. The test body and tool_info_keeps_null_when_upstream_omits_the_tool make clear the intent is to verify both the "present upstream schema" and "absent upstream schema" paths.

Suggested change
/// Build an axum Router wired with the issue #135 manifest pattern:
/// MCP provider with `mcp_url_env`, two statically named tools (one
/// without schema, one without — both meant to be filled by lazy
/// discovery), and a hostname-glob allowlist that admits the wiremock
/// server's host.
/// Build an axum Router wired with the issue #135 manifest pattern:
/// MCP provider with `mcp_url_env`, two statically named tools (one
/// with an upstream inputSchema, one without — no static schemas on
/// either, so the issue #135 lazy-discovery path fires for both),
/// and a hostname-glob allowlist that admits the wiremock server's host.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Three P2 findings on PR #136:

1. `handle_tools_list`'s per-provider batching loop did not record a
   provider entry when `lazy_fetch_schemas` returned `None` (dial or
   tools/list failed). The `schemas_by_provider.contains_key` guard
   wouldn't short-circuit for sibling tools, so an N-tool provider
   would re-call `extract_upstream_url` + `lazy_fetch_schemas` N
   times per request when the upstream was unreachable — the inner
   negative-result cache absorbed the network cost but every sibling
   still walked the mutex.

2. Same gap in the MCP JSON-RPC `tools/list` arm.

3. `build_app` docstring said "one without schema, one without",
   should have read "both without schema".

Fix shape for (1)+(2): the per-provider decision now records an entry
on EVERY path through the loop — including the URL-missing and
discovery-failed paths — using `HashMap::new()` as the sentinel for
"tried and got nothing". The value-pull loop treats an empty map the
same as a real miss (no schema for that tool name), so behavior is
unchanged; sibling tools just short-circuit on contains_key.

New test `tools_list_batches_one_dial_per_provider_per_request`
asserts the success-path short-circuit fires: a 2-tool provider on a
single `GET /tools` request triggers exactly 1 upstream tools/list,
not 2. The existing serial-cache test still covers the cross-request
short-circuit.
@miguelrios miguelrios merged commit e66e940 into main Jun 5, 2026
6 checks passed
claudio-michel Bot added a commit that referenced this pull request Jun 5, 2026
… providers (closes #135) [0.7 backport] (#141)

* feat(proxy): lazy input_schema discovery for sandbox-supplied-URL MCP providers (closes #135)

For `handler = "mcp"` providers whose upstream URL is sandbox-supplied
(the `mcp_url_env` / `X-Ati-Upstream-Url` model from #124), the proxy
cannot discover tools at boot — the URL isn't proxy-resident. So
these providers MUST declare their tools statically in the manifest,
and `GET /tools/:name` served the static entry verbatim. If the static
entry had no `input_schema`, the agent saw `input_schema: null`.
Operators worked around this by hand-mirroring the MCP server's
schemas into the manifest and keeping them from drifting.

This change lazily fetches the schemas on demand. When a tool info /
list / MCP tools-list request resolves a static entry whose
`input_schema` is `None` AND the provider is `handler = "mcp"` with
`mcp_url_env` set AND the request carries a validated
`X-Ati-Upstream-Url`, the proxy runs one `tools/list` against the
upstream and merges the discovered `inputSchema` into the response.
Results are cached per `(provider, upstream_url)` for the process
lifetime — both positive (the schema map) and negative (the dial
failed) — so a steady-state of `ati tool info` calls only pays the
discovery cost once per sandbox URL.

Three call sites learn the trick:
- `GET /tools` (REST listing — `ati tool list`)
- `GET /tools/:name` (REST single-tool — `ati tool info`)
- `POST /mcp` → `tools/list` (MCP JSON-RPC — agents)

The REST list paths batch by provider so a multi-tool provider with
N tools only triggers one upstream tools/list per request, not N.

Graceful-degradation contract:
- Header absent → static schema (today's behavior: usually `null`).
- Header present but URL outside the allowlist → static schema. The
  REST handlers must NOT surface 4xx for a misconfigured sandbox URL
  on a tool-info call (the `/call` path still does, unchanged).
- Header present + URL valid but upstream unreachable → static
  schema; the negative result is cached so the next request doesn't
  retry a known-broken upstream.

OpenAPI-handler tools are untouched (they always carry a schema from
the spec). Local-mode MCP is untouched (the proxy already discovers
those at boot). Static schemas on `[[tools]]` entries are untouched
(the fast path returns them without dialing).

Tests in `tests/lazy_schema_discovery_test.rs`:
- `tool_info_merges_upstream_schema_when_static_is_none` — the
  acceptance criterion from the issue body.
- `tool_info_without_header_falls_back_to_static_null` — no
  regression when the header is absent.
- `tool_info_keeps_null_when_upstream_omits_the_tool` — partial
  upstream responses don't fabricate schemas.
- `tool_info_silently_falls_back_when_url_outside_allowlist` — the
  REST handlers never surface 4xx on allowlist miss.
- `lazy_schema_cache_avoids_redundant_upstream_calls` — second call
  on the same (provider, url) is served from cache; the upstream
  saw at most one `tools/list`.
- `mcp_tools_list_merges_upstream_schema` — the MCP JSON-RPC path
  also serves the merged schema.

ProxyState gains one new field (`lazy_schema_cache`) and a type
alias (`LazySchemaCache`). 33 existing test fixtures updated to
initialize the new field; the bulk update follows the same idiom
as `upstream_url_allowlists`.

* fix(proxy): address Greptile findings on lazy schema discovery

Three P2 findings on PR #136:

1. `handle_tools_list`'s per-provider batching loop did not record a
   provider entry when `lazy_fetch_schemas` returned `None` (dial or
   tools/list failed). The `schemas_by_provider.contains_key` guard
   wouldn't short-circuit for sibling tools, so an N-tool provider
   would re-call `extract_upstream_url` + `lazy_fetch_schemas` N
   times per request when the upstream was unreachable — the inner
   negative-result cache absorbed the network cost but every sibling
   still walked the mutex.

2. Same gap in the MCP JSON-RPC `tools/list` arm.

3. `build_app` docstring said "one without schema, one without",
   should have read "both without schema".

Fix shape for (1)+(2): the per-provider decision now records an entry
on EVERY path through the loop — including the URL-missing and
discovery-failed paths — using `HashMap::new()` as the sentinel for
"tried and got nothing". The value-pull loop treats an empty map the
same as a real miss (no schema for that tool name), so behavior is
unchanged; sibling tools just short-circuit on contains_key.

New test `tools_list_batches_one_dial_per_provider_per_request`
asserts the success-path short-circuit fires: a 2-tool provider on a
single `GET /tools` request triggers exactly 1 upstream tools/list,
not 2. The existing serial-cache test still covers the cross-request
short-circuit.

* test(lazy-schema): strip 0.8-only ProxyState fields from new test file

The new tests/lazy_schema_discovery_test.rs file shipped on main as
part of #136 constructs ProxyState with the 0.8 field set (db,
passthrough, sig_verify, key_store, admin_token). 0.7 doesn't have
those fields. Strip them so the test compiles on 0.7 — the rest of
the test surface (the public proxy /tools, /tools/:name, and /mcp
endpoints + the lazy discovery overlay) is unaffected.

All 7 lazy-schema tests pass on the 0.7 ProxyState shape:
  tool_info_merges_upstream_schema_when_static_is_none
  tool_info_without_header_falls_back_to_static_null
  tool_info_keeps_null_when_upstream_omits_the_tool
  tool_info_silently_falls_back_when_url_outside_allowlist
  lazy_schema_cache_avoids_redundant_upstream_calls
  tools_list_batches_one_dial_per_provider_per_request
  mcp_tools_list_merges_upstream_schema

---------

Co-authored-by: claudio-michel[bot] <1592301+claudio-michel[bot]@users.noreply.github.com>
@claudio-michel claudio-michel Bot mentioned this pull request Jun 5, 2026
6 tasks
claudio-michel Bot pushed a commit that referenced this pull request Jun 5, 2026
Greptile P2 finding on PR #141 (the 0.7 backport, but the same code
ships in #136 here on main):

`lazy_fetch_schemas` drops the cache mutex between the cache-miss
check and the upstream dial. Two cold-cache requests for the same
`(provider, upstream_url)` can race; if both run to completion and one
of them fails while the other succeeds, the failing racer landing
SECOND would overwrite the cached `Some(map)` with `None`. The cache
then permanently serves `null` schemas for that (provider, url) pair
until the proxy restarts — equivalent to the bug #135 set out to fix
in the first place, just gated on a race.

Fix: a 6-line write guard on the cache mutex that enforces
"first-success-wins, never demote":

    existing | new      | action
    -------- |--------- |--------
    absent   | Some(_)  | insert (cold-cache success)
    absent   | None     | insert (cold-cache failure → fast short-circuit)
    Some(_)  | Some(_)  | leave existing (idempotent)
    Some(_)  | None     | leave existing (the bug — don't demote)
    None     | Some(_)  | upgrade (retry succeeded; promote)
    None     | None     | leave existing (idempotent)

Two regression tests pin the invariant:

- `cache_write_never_demotes_positive_entry_to_none`: writes a
  Some(map), then a None for the same key, asserts the Some is
  preserved. Directly exercises the write guard's no-demote rule.
- `cache_write_upgrades_none_to_some`: companion in the opposite
  direction — a transient failure (cached None) MUST be upgradable
  when a later dial succeeds. Without this companion test the
  no-demote rule could regress into a too-restrictive "never
  overwrite" policy that locks in transient failures permanently.

Both tests construct a `LazySchemaCache` directly and exercise the
write rule against it; we can't drive two concurrent oneshot calls
through axum to reproduce the race in-test, but the write rule itself
is the load-bearing invariant. If both tests pass, the race fix is
correct regardless of how the racing dial paths land at the cache.

9/9 lazy-schema tests green. Default test suite green, clippy clean
across default/sentry/all-features, fmt clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proxy: lazy per-call input_schema discovery for sandbox-supplied-URL MCP providers

1 participant