fix: return specific error for stale E2EE attestation keys #521

Merged
Evrard-Nil merged 5 commits into main from
fix/e2ee-stale-pubkey-error-message
Apr 1, 2026

Conversation

@Evrard-Nil Evrard-Nil commented Mar 31, 2026

Summary

  • Return HTTP 421 with "The encryption key is no longer valid" when E2EE pubkey routing fails to find a matching provider
  • Previously returned generic 502 "model unavailable" — indistinguishable from a real backend outage
  • Downgrade log level from ERROR to WARN (expected client behavior, not a server error)

Context

In production, inference-proxy signing keys are deterministic (derived from dstack KMS using the model name). They don't change on restart — only when the CVM app identity (compose hash) changes.

This fix targets the case where X-Model-Pub-Key doesn't match any active provider in cloud-api's pubkey mapping (e.g., the client sends a key from a decommissioned CVM, or a CVM has re-registered with a new compose hash).

Note: The 87 "All providers failed for model with public key" errors seen in prod logs are a different code path — those occur when the provider IS found by pubkey but the connection to it fails. Those are addressed by PR #520 (connection retry). This PR fixes the separate "No provider found for model with public key" path, which currently returns a misleading 502.

Reproduction steps

# Send a request with a fake/invalid public key (not matching any provider)
curl -s -X POST "https://cloud-api.near.ai/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Model-Pub-Key: 0000000000000000000000000000000000000000000000000000000000000000" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 5
  }'

# Before fix: {"error": {"message": "The model is currently unavailable...", "type": "bad_gateway"}}
# After fix:  {"error": {"message": "The encryption key is no longer valid. Please refresh your attestation report and retry.", "type": "provider_error"}}

See repro_e2ee_stale_pubkey.sh (gitignored) for the full reproduction script.

Test plan

  • cargo check compiles cleanly
  • All 188 unit tests pass (cargo test --lib --bins)
  • Deploy to staging:
    • Verify fake pubkey returns 421 with attestation refresh message
    • Verify valid pubkey requests still work normally
    • Verify Datadog shows WARN (not ERROR) for unmatched pubkey failures

🤖 Generated with Claude Code

When a client sends X-Model-Pub-Key that doesn't match any active
provider (typically after backend restart with new signing keys),
return HTTP 421 with "encryption key is no longer valid" instead of
a generic 502 "model unavailable".

This allows E2EE clients to detect stale attestation and auto-refresh
without user intervention, instead of appearing as a backend outage.
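
As a rough illustration of the client behavior this enables, the sketch below decides what to do with a response status. The names `NextStep` and `next_step`, and the retry-once policy, are assumptions made for this example and are not part of the PR or of any client library.

```rust
/// Hypothetical client-side decision after an E2EE request completes.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// The request succeeded; nothing more to do.
    Done,
    /// The pubkey is stale: fetch a fresh attestation report, then retry.
    RefreshAttestationAndRetry,
    /// Surface the failure to the caller with the original status code.
    Fail(u16),
}

/// Assumed policy: refresh at most once per request so that a persistent
/// 421 cannot cause an endless refresh loop.
fn next_step(status: u16, already_refreshed: bool) -> NextStep {
    match status {
        200..=299 => NextStep::Done,
        421 if !already_refreshed => NextStep::RefreshAttestationAndRetry,
        other => NextStep::Fail(other),
    }
}
```

With the previous generic 502, a client following this policy could not tell a stale key from a real outage and would have to fail immediately.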
Copilot AI review requested due to automatic review settings March 31, 2026 03:46
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:46 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces specific error handling for E2EE public key routing failures within the completion service. When an error message contains "with public key", the system now logs a warning regarding potential stale attestations and returns a 421 status code, prompting the client to refresh their attestation report. I have no feedback to provide as there were no review comments.

claude bot commented Mar 31, 2026

Review

This is a clean, well-motivated fix. The 421 status code is semantically appropriate, and downgrading from ERROR to WARN makes sense for an expected operational condition.

One issue worth addressing:

Missing unit test for new branch

The file has a dedicated test suite for map_provider_error (see test_map_provider_error_model_not_found_string, test_map_provider_error_connection_error_becomes_502, etc.), but the new with public key branch is not covered. Given the string-matching approach used, a regression test would make this more robust:

#[test]
fn test_map_provider_error_stale_pubkey_becomes_421() {
    let error = inference_providers::CompletionError::CompletionError(
        "No provider found for model test-model with public key 'abcdef...'. Encryption requires routing to the specific provider with this public key.".to_string(),
    );
    let result = CompletionServiceImpl::map_provider_error("test-model", &error, "test");
    assert!(matches!(
        result,
        ports::CompletionError::ProviderError { status_code: 421, .. }
    ));
}

Minor note: The msg.contains("with public key") match is fragile — if the upstream error message in inference_provider_pool/mod.rs:615 ever changes, this silently falls through to the generic 502. A dedicated CompletionError variant in the inference_providers crate would be the more robust long-term solution, but given the error originates in a single place, the string match is acceptable for now.
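
To make the fragility concrete, here is a minimal standalone sketch of substring-based classification. `status_from_message` is a hypothetical helper written for this illustration, not the actual `map_provider_error`, and the reworded message in the usage note is invented.

```rust
/// Hypothetical reduction of the string-matching approach: classify a
/// provider error purely by looking for a fixed substring in its message.
fn status_from_message(msg: &str) -> u16 {
    if msg.contains("with public key") {
        421 // treated as a stale E2EE pubkey
    } else {
        502 // everything else looks like a backend outage
    }
}
```

The message used in the suggested unit test maps to 421, but an equivalent rewording such as "No provider matches pubkey 'abcdef...' for model test-model" would map to 502 without any compiler or test failure unless a regression test pins the wording.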

Everything else looks good: the 421 propagates correctly through StatusCode::from_u16 in common.rs:15, maps to provider_error type in conversions.rs:247, the provider_message log only contains model ID + public key prefix (no customer PII), and the responses route doesn't use pubkey routing so no coverage gap there.

⚠️ Approve after adding the unit test.

Copilot AI left a comment

Pull request overview

Adjusts chat completion error mapping so E2EE public-key routing failures (stale client attestation keys after backend restart) return a specific, actionable HTTP error instead of a generic “model unavailable” response.

Changes:

  • Detect pubkey-routing failures in map_provider_error and return HTTP 421 with an attestation-refresh message.
  • Downgrade logging for this scenario from ERROR to WARN to reflect expected/benign failures.

- Add CompletionError::NoPubKeyProvider variant instead of relying on
  brittle string matching ("with public key")
- Provider pool returns NoPubKeyProvider when E2EE pubkey routing fails
- map_provider_error matches on the variant directly
- completion_stream_error_category returns "stale_pubkey" for metrics
  (distinguishable from generic inference errors)
- Add unit test for NoPubKeyProvider → 421 mapping
- Handle new variant in sanitize_completion_error and StopReason
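
The variant-based approach described in this commit can be sketched roughly as follows. The types are simplified stand-ins: the real `CompletionError` lives in the inference_providers crate and its fields may differ, while `MappedError`, the `"inference_error"` category label, and the shortened 502 message are assumptions made for this example.

```rust
/// Simplified, hypothetical stand-in for the provider error type.
#[allow(dead_code)]
#[derive(Debug)]
enum CompletionError {
    /// Generic provider failure carrying a free-form message.
    CompletionError(String),
    /// E2EE pubkey routing found no active provider for the client's key.
    NoPubKeyProvider,
}

/// Hypothetical result of mapping a provider error to an HTTP response.
#[derive(Debug, PartialEq)]
struct MappedError {
    status_code: u16,
    message: String,
    /// Metrics label; "stale_pubkey" comes from the commit message,
    /// "inference_error" is an assumed placeholder for the generic case.
    category: &'static str,
}

/// Match on the variant directly instead of inspecting message text, so a
/// reworded upstream message cannot change the resulting status code.
fn map_provider_error(error: &CompletionError) -> MappedError {
    match error {
        CompletionError::NoPubKeyProvider => MappedError {
            status_code: 421,
            message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(),
            category: "stale_pubkey",
        },
        CompletionError::CompletionError(_) => MappedError {
            status_code: 502,
            message: "model unavailable".to_string(),
            category: "inference_error",
        },
    }
}
```

Because the `match` is exhaustive, adding another variant later forces every mapping site to handle it at compile time, which the substring check could not guarantee.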
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 20:16 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env April 1, 2026 13:58 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil merged commit 77eeb99 into main Apr 1, 2026
2 checks passed