fix: return specific error for stale E2EE attestation keys #521

Open
Evrard-Nil wants to merge 2 commits into main from fix/e2ee-stale-pubkey-error-message

@Evrard-Nil Evrard-Nil commented Mar 31, 2026

Summary

  • Return HTTP 421 with "The encryption key is no longer valid" when E2EE pubkey routing fails to find a matching provider
  • Previously returned generic 502 "model unavailable" — indistinguishable from a real backend outage
  • Downgrade log level from ERROR to WARN (expected client behavior, not a server error)

Context

In production, inference-proxy signing keys are deterministic (derived from dstack KMS using the model name). They don't change on restart — only when the CVM app identity (compose hash) changes.

This fix targets the case where X-Model-Pub-Key doesn't match any active provider in cloud-api's pubkey mapping (e.g., the client sends a key from a decommissioned CVM, or a CVM has re-registered with a new compose hash).

Note: The 87 "All providers failed for model with public key" errors seen in prod logs are a different code path — those occur when the provider IS found by pubkey but the connection to it fails. Those are addressed by PR #520 (connection retry). This PR fixes the separate "No provider found for model with public key" path, which currently returns a misleading 502.

Reproduction steps

# Send a request with a fake/invalid public key (not matching any provider)
curl -s -X POST "https://cloud-api.near.ai/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Model-Pub-Key: 0000000000000000000000000000000000000000000000000000000000000000" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 5
  }'

# Before fix: {"error": {"message": "The model is currently unavailable...", "type": "bad_gateway"}}
# After fix:  {"error": {"message": "The encryption key is no longer valid. Please refresh your attestation report and retry.", "type": "provider_error"}}

See repro_e2ee_stale_pubkey.sh (gitignored) for the full reproduction script.

Test plan

  • cargo check compiles cleanly
  • All 188 unit tests pass (cargo test --lib --bins)
  • Deploy to staging:
    • Verify fake pubkey returns 421 with attestation refresh message
    • Verify valid pubkey requests still work normally
    • Verify Datadog shows WARN (not ERROR) for unmatched pubkey failures

🤖 Generated with Claude Code

When a client sends X-Model-Pub-Key that doesn't match any active
provider (typically after backend restart with new signing keys),
return HTTP 421 with "encryption key is no longer valid" instead of
a generic 502 "model unavailable".

This allows E2EE clients to detect stale attestation and auto-refresh
without user intervention, instead of appearing as a backend outage.
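
The auto-refresh behaviour described above could look roughly like the following on the client side. This is a minimal sketch, not code from this PR: the `E2eeTransport` trait, `send_with_refresh`, and `FakeTransport` are hypothetical names introduced for illustration, and the only detail taken from the PR is that a stale key surfaces as HTTP 421.

```rust
/// Outcome of one request attempt, reduced to what the retry logic needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Response {
    pub status: u16,
    pub body: String,
}

/// Hypothetical client-side abstraction; not part of cloud-api.
pub trait E2eeTransport {
    /// Fetch a fresh attestation report and return the model public key in it.
    fn refresh_pub_key(&mut self) -> String;
    /// Send one request with `pub_key` in the X-Model-Pub-Key header.
    fn send(&mut self, pub_key: &str) -> Response;
}

/// Status code this PR returns for an unmatched public key.
pub const STALE_KEY_STATUS: u16 = 421;

/// Send a request; on 421, refresh the cached key once and retry once.
/// Any other status (including a second 421) is returned to the caller.
pub fn send_with_refresh<T: E2eeTransport>(
    transport: &mut T,
    cached_key: &mut String,
) -> Response {
    let first = transport.send(cached_key.as_str());
    if first.status != STALE_KEY_STATUS {
        return first;
    }
    *cached_key = transport.refresh_pub_key();
    transport.send(cached_key.as_str())
}

/// In-memory stand-in used only to exercise the retry logic.
pub struct FakeTransport {
    pub current_key: String,
    pub sends: u32,
}

impl E2eeTransport for FakeTransport {
    fn refresh_pub_key(&mut self) -> String {
        self.current_key.clone()
    }

    fn send(&mut self, pub_key: &str) -> Response {
        self.sends += 1;
        if pub_key == self.current_key {
            Response { status: 200, body: "ok".to_string() }
        } else {
            Response { status: 421, body: "stale key".to_string() }
        }
    }
}
```

Retrying only once keeps a client from looping if the refreshed key is rejected as well; whether a real client should back off or surface the error at that point is a design choice this sketch leaves open.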
Copilot AI review requested due to automatic review settings March 31, 2026 03:46
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:46 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces specific error handling for E2EE public key routing failures within the completion service. When an error message contains "with public key", the system now logs a warning regarding potential stale attestations and returns a 421 status code, prompting the client to refresh their attestation report. I have no feedback to provide as there were no review comments.

claude bot commented Mar 31, 2026

Review

This is a clean, well-motivated fix. The 421 status code is semantically appropriate, and downgrading from ERROR to WARN makes sense for an expected operational condition.

One issue worth addressing:

Missing unit test for new branch

The file has a dedicated test suite for map_provider_error (see test_map_provider_error_model_not_found_string, test_map_provider_error_connection_error_becomes_502, etc.), but the new with public key branch is not covered. Given the string-matching approach used, a regression test would make this more robust:

#[test]
fn test_map_provider_error_stale_pubkey_becomes_421() {
    let error = inference_providers::CompletionError::CompletionError(
        "No provider found for model test-model with public key 'abcdef...'. Encryption requires routing to the specific provider with this public key.".to_string(),
    );
    let result = CompletionServiceImpl::map_provider_error("test-model", &error, "test");
    assert!(matches!(
        result,
        ports::CompletionError::ProviderError { status_code: 421, .. }
    ));
}

Minor note: The msg.contains("with public key") match is fragile — if the upstream error message in inference_provider_pool/mod.rs:615 ever changes, this silently falls through to the generic 502. A dedicated CompletionError variant in the inference_providers crate would be the more robust long-term solution, but given the error originates in a single place, the string match is acceptable for now.
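
For illustration, the dedicated-variant approach mentioned above might look something like this. It is a sketch under stated assumptions: `PoolError` and `DomainError` are simplified stand-ins for `inference_providers::CompletionError` and `ports::CompletionError`, the `NoProviderForPubKey` variant is hypothetical, and the 502 message is placeholder text rather than the real wording.

```rust
/// Simplified stand-in for `inference_providers::CompletionError`.
#[derive(Debug, Clone, PartialEq)]
pub enum PoolError {
    /// Free-form message, as the crate produces today.
    Message(String),
    /// Hypothetical structured variant for pubkey routing misses.
    NoProviderForPubKey { model: String, pub_key_prefix: String },
}

/// Simplified stand-in for `ports::CompletionError`.
#[derive(Debug, Clone, PartialEq)]
pub enum DomainError {
    InvalidModel(String),
    ProviderError { status_code: u16, message: String },
}

/// Client-facing message introduced by this PR.
pub const STALE_KEY_MESSAGE: &str =
    "The encryption key is no longer valid. Please refresh your attestation report and retry.";

/// Map pool errors to domain errors by variant rather than by message text.
pub fn map_pool_error(err: &PoolError) -> DomainError {
    match err {
        // The stale-key case no longer depends on the wording of any message.
        PoolError::NoProviderForPubKey { .. } => DomainError::ProviderError {
            status_code: 421,
            message: STALE_KEY_MESSAGE.to_string(),
        },
        PoolError::Message(msg) if msg.contains("not found in any configured provider") => {
            DomainError::InvalidModel(msg.clone())
        }
        // Placeholder text; the real 502 message is not reproduced here.
        PoolError::Message(_) => DomainError::ProviderError {
            status_code: 502,
            message: "model unavailable (placeholder)".to_string(),
        },
    }
}
```

With a variant like this, rewording the upstream message would not change the status code the client sees, and the compiler would flag any match that forgets the new case.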

Everything else looks good: the 421 propagates correctly through StatusCode::from_u16 in common.rs:15, maps to provider_error type in conversions.rs:247, the provider_message log only contains model ID + public key prefix (no customer PII), and the responses route doesn't use pubkey routing so no coverage gap there.

⚠️ Approve after adding the unit test.

Copilot AI left a comment

Pull request overview

Adjusts chat completion error mapping so E2EE public-key routing failures (stale client attestation keys after backend restart) return a specific, actionable HTTP error instead of a generic “model unavailable” response.

Changes:

  • Detect pubkey-routing failures in map_provider_error and return HTTP 421 with an attestation-refresh message.
  • Downgrade logging for this scenario from ERROR to WARN to reflect expected/benign failures.

Comment on lines 741 to +746
inference_providers::CompletionError::CompletionError(msg) => {
    if msg.contains("not found in any configured provider") {
        ports::CompletionError::InvalidModel(msg.clone())
    } else if msg.contains("with public key") {
        // E2EE pubkey routing failure: the client's cached attestation key
        // doesn't match any active provider (likely backend restarted with new keys).

Copilot AI Mar 31, 2026

This branch relies on substring-matching provider error text ("with public key") to detect the stale E2EE key case. That’s brittle: if the upstream message wording changes, this will silently fall back to the generic 502 path. Consider propagating a structured error/variant from the provider pool for pubkey routing failures, or at least matching a more specific/unique substring (e.g., the full "No provider found for model" prefix) via a shared constant.
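
A narrower match via shared constants, as suggested above, could be sketched as follows. The constant and function names are hypothetical, and the sketch assumes the routing-failure message begins with "No provider found for model", as in the example message quoted earlier in this conversation. It also shows why the narrower match may matter: the "All providers failed for model with public key" log line quoted in the PR description contains "with public key" too, so if that is the literal error text, a bare substring check would map connection failures to 421 as well.

```rust
/// Assumed prefix of the pubkey routing-miss message (taken from the
/// example message quoted in this PR conversation).
pub const NO_PROVIDER_PREFIX: &str = "No provider found for model";

/// Marker shared by the pubkey-related provider messages.
pub const PUBKEY_MARKER: &str = "with public key";

/// True only for the "no provider matched this public key" case.
/// Hypothetical helper; not part of the PR.
pub fn is_stale_pubkey_error(msg: &str) -> bool {
    msg.starts_with(NO_PROVIDER_PREFIX) && msg.contains(PUBKEY_MARKER)
}
```

If the real message is wrapped in additional context before it reaches `map_provider_error`, `starts_with` would need to become `contains`; sharing the constants with the code that builds the message keeps the two sides in sync either way.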

Comment on lines 741 to +756
inference_providers::CompletionError::CompletionError(msg) => {
    if msg.contains("not found in any configured provider") {
        ports::CompletionError::InvalidModel(msg.clone())
    } else if msg.contains("with public key") {
        // E2EE pubkey routing failure: the client's cached attestation key
        // doesn't match any active provider (likely backend restarted with new keys).
        tracing::warn!(
            model,
            provider_message = %msg,
            "E2EE pubkey routing failed during {} (stale attestation?)",
            operation
        );
        ports::CompletionError::ProviderError {
            status_code: 421,
            message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(),
        }

Copilot AI Mar 31, 2026

New stale-pubkey mapping behavior isn’t covered by the existing map_provider_error unit tests in this file. Please add a test that feeds a CompletionError::CompletionError containing the pubkey-routing message (e.g., "No provider found for model ... with public key") and asserts the returned ProviderError has status_code 421 and the attestation-refresh message.

Comment on lines +753 to +756
        ports::CompletionError::ProviderError {
            status_code: 421,
            message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(),
        }

Copilot AI Mar 31, 2026

Returning this as ports::CompletionError::ProviderError means it will still be tagged as ERROR_TYPE_INFERENCE_ERROR by record_error() (since metrics don’t include status_code), so stale-key events will remain indistinguishable from real provider/infrastructure failures in cloud_api.request.errors. If the goal is to separate these operationally, consider introducing a dedicated domain error variant (or tagging ProviderError metrics with status_code) so dashboards/alerts can filter stale-key errors independently.
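
One way to tag metrics by status code, sketched under assumptions: `DomainError` is a simplified stand-in for `ports::CompletionError`, `error_type_tag` is a hypothetical helper, and the tag strings are illustrative only. The actual value of ERROR_TYPE_INFERENCE_ERROR and the signature of `record_error()` are not shown in this PR.

```rust
/// Simplified stand-in for `ports::CompletionError`.
#[derive(Debug, Clone, PartialEq)]
pub enum DomainError {
    InvalidModel(String),
    ProviderError { status_code: u16, message: String },
}

/// Choose a metrics tag for an error. Tag strings are illustrative and
/// do not correspond to constants in the real codebase.
pub fn error_type_tag(err: &DomainError) -> &'static str {
    match err {
        DomainError::InvalidModel(_) => "invalid_model",
        // Stale E2EE key: expected client behaviour, tracked separately.
        DomainError::ProviderError { status_code: 421, .. } => "stale_e2ee_key",
        // Everything else from the provider counts as an inference error.
        DomainError::ProviderError { .. } => "inference_error",
    }
}
```

Deriving the tag from the status code avoids adding a new domain variant, at the cost of giving 421 a special meaning inside the metrics layer; a dedicated variant would make that meaning explicit in the type instead.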
