fix: return specific error for stale E2EE attestation keys by Evrard-Nil · Pull Request #521 · nearai/cloud-api

Evrard-Nil · 2026-03-31T03:46:05Z

Summary

Return HTTP 421 with "The encryption key is no longer valid" when E2EE pubkey routing fails to find a matching provider
Previously returned generic 502 "model unavailable" — indistinguishable from a real backend outage
Downgrade log level from ERROR to WARN (expected client behavior, not a server error)

Context

In production, inference-proxy signing keys are deterministic (derived from dstack KMS using the model name). They don't change on restart — only when the CVM app identity (compose hash) changes.

This fix targets the case where X-Model-Pub-Key doesn't match any active provider in cloud-api's pubkey mapping (e.g., the client sends a key from a decommissioned CVM, or a key cached before a CVM re-registration with a new compose hash).

Note: The 87 "All providers failed for model with public key" errors seen in prod logs are a different code path — those occur when the provider IS found by pubkey but the connection to it fails. Those are addressed by PR #520 (connection retry). This PR fixes the separate "No provider found for model with public key" path, which currently returns a misleading 502.

Reproduction steps

# Send a request with a fake/invalid public key (not matching any provider)
curl -s -X POST "https://cloud-api.near.ai/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Model-Pub-Key: 0000000000000000000000000000000000000000000000000000000000000000" \
  -d '{
    "model": "zai-org/GLM-5-FP8",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 5
  }'

# Before fix: {"error": {"message": "The model is currently unavailable...", "type": "bad_gateway"}}
# After fix:  {"error": {"message": "The encryption key is no longer valid. Please refresh your attestation report and retry.", "type": "provider_error"}}

See repro_e2ee_stale_pubkey.sh (gitignored) for the full reproduction script.

Test plan

cargo check compiles cleanly
All 188 unit tests pass (cargo test --lib --bins)
Deploy to staging:
- Verify fake pubkey returns 421 with attestation refresh message
- Verify valid pubkey requests still work normally
- Verify Datadog shows WARN (not ERROR) for unmatched pubkey failures

🤖 Generated with Claude Code

When a client sends an X-Model-Pub-Key that doesn't match any active provider (for example, a key from a decommissioned CVM, or one cached before a CVM re-registration with a new compose hash), return HTTP 421 with "The encryption key is no longer valid" instead of a generic 502 "model unavailable". This lets E2EE clients detect a stale attestation and refresh it automatically, rather than treating the failure as a backend outage.
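A minimal client-side sketch of the auto-refresh behavior this enables, assuming the client can re-fetch its attestation report on demand. The `Response` struct, the function name, and the `send` and `refresh` closures are hypothetical stand-ins for a real HTTP client and attestation fetch; none of them come from this PR.

```rust
/// Hypothetical minimal response type; a real client would use its HTTP library's type.
pub struct Response {
    pub status: u16,
    pub body: String,
}

/// Send a request with the cached pubkey. On HTTP 421, refresh the
/// attestation once and retry with the new key. Returns the final
/// response together with the key that was used for it.
pub fn send_with_attestation_refresh<S, R>(
    mut pubkey: String,
    mut send: S,
    mut refresh: R,
) -> (Response, String)
where
    S: FnMut(&str) -> Response,
    R: FnMut() -> String,
{
    let first = send(&pubkey);
    if first.status != 421 {
        // Success, or a failure unrelated to the encryption key
        // (a 502 still means a real backend problem).
        return (first, pubkey);
    }
    // 421: the cached key matches no active provider, so refresh and retry once.
    pubkey = refresh();
    let second = send(&pubkey);
    (second, pubkey)
}
```

Retrying only once keeps a client from looping if the refreshed key is rejected as well; in that case the second 421 is returned to the caller unchanged.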

gemini-code-assist

Code Review

This pull request introduces specific error handling for E2EE public key routing failures within the completion service. When an error message contains "with public key", the system now logs a warning regarding potential stale attestations and returns a 421 status code, prompting the client to refresh their attestation report. I have no feedback to provide as there were no review comments.

claude · 2026-03-31T03:48:26Z

Review

This is a clean, well-motivated fix. The 421 status code is semantically appropriate, and downgrading from ERROR to WARN makes sense for an expected operational condition.

One issue worth addressing:

Missing unit test for new branch

The file has a dedicated test suite for map_provider_error (see test_map_provider_error_model_not_found_string, test_map_provider_error_connection_error_becomes_502, etc.), but the new with public key branch is not covered. Given the string-matching approach used, a regression test would make this more robust:

#[test]
fn test_map_provider_error_stale_pubkey_becomes_421() {
    let error = inference_providers::CompletionError::CompletionError(
        "No provider found for model test-model with public key 'abcdef...'. Encryption requires routing to the specific provider with this public key.".to_string(),
    );
    let result = CompletionServiceImpl::map_provider_error("test-model", &error, "test");
    assert!(matches!(
        result,
        ports::CompletionError::ProviderError { status_code: 421, .. }
    ));
}

Minor note: The msg.contains("with public key") match is fragile — if the upstream error message in inference_provider_pool/mod.rs:615 ever changes, this silently falls through to the generic 502. A dedicated CompletionError variant in the inference_providers crate would be the more robust long-term solution, but given the error originates in a single place, the string match is acceptable for now.
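A rough sketch of what the dedicated-variant approach could look like. The `PoolError` and `ApiError` enums below are simplified, hypothetical stand-ins for `inference_providers::CompletionError` and `ports::CompletionError`; the variant name `NoProviderForPublicKey` and the 502 placeholder text are assumptions, not part of this PR.

```rust
/// Hypothetical stand-in for the provider-pool error type.
#[derive(Debug, Clone, PartialEq)]
pub enum PoolError {
    /// Free-form message, the shape the current string match relies on.
    Message(String),
    /// Hypothetical structured variant for "no provider holds this pubkey".
    NoProviderForPublicKey { model: String, pubkey_prefix: String },
}

/// Hypothetical stand-in for the service-level error type.
#[derive(Debug, Clone, PartialEq)]
pub enum ApiError {
    InvalidModel(String),
    ProviderError { status_code: u16, message: String },
}

pub const STALE_KEY_MESSAGE: &str =
    "The encryption key is no longer valid. Please refresh your attestation report and retry.";

pub fn map_pool_error(err: &PoolError) -> ApiError {
    match err {
        // Matching on the variant keeps the 421 mapping independent of message wording.
        PoolError::NoProviderForPublicKey { .. } => ApiError::ProviderError {
            status_code: 421,
            message: STALE_KEY_MESSAGE.to_string(),
        },
        PoolError::Message(msg) if msg.contains("not found in any configured provider") => {
            ApiError::InvalidModel(msg.clone())
        }
        // Everything else stays on the generic 502 path (placeholder text).
        PoolError::Message(_) => ApiError::ProviderError {
            status_code: 502,
            message: "model unavailable (placeholder)".to_string(),
        },
    }
}
```

With a variant like this, rewording the upstream message can no longer change which status code the client receives.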

Everything else looks good: the 421 propagates correctly through StatusCode::from_u16 in common.rs:15, maps to provider_error type in conversions.rs:247, the provider_message log only contains model ID + public key prefix (no customer PII), and the responses route doesn't use pubkey routing so no coverage gap there.

⚠️ Approve after adding the unit test.

Copilot

Pull request overview

Adjusts chat completion error mapping so E2EE public-key routing failures (stale client attestation keys after backend restart) return a specific, actionable HTTP error instead of a generic “model unavailable” response.

Changes:

Detect pubkey-routing failures in map_provider_error and return HTTP 421 with an attestation-refresh message.
Downgrade logging for this scenario from ERROR to WARN to reflect expected/benign failures.

Copilot · 2026-03-31T03:49:35Z

crates/services/src/completions/mod.rs

            inference_providers::CompletionError::CompletionError(msg) => {
                if msg.contains("not found in any configured provider") {
                    ports::CompletionError::InvalidModel(msg.clone())
+                } else if msg.contains("with public key") {
+                    // E2EE pubkey routing failure: the client's cached attestation key
+                    // doesn't match any active provider (likely backend restarted with new keys).


This branch relies on substring-matching provider error text ("with public key") to detect the stale E2EE key case. That’s brittle: if the upstream message wording changes, this will silently fall back to the generic 502 path. Consider propagating a structured error/variant from the provider pool for pubkey routing failures, or at least matching a more specific/unique substring (e.g., the full "No provider found for model" prefix) via a shared constant.
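One way the shared-constant variant of this suggestion might look, as a sketch only. The constant names and both helper functions are hypothetical, and the message shape is taken from the example string in the review comment above rather than from inference_provider_pool itself.

```rust
/// Hypothetical constants shared by the code that builds the message
/// and the code that detects it, so the two cannot drift apart.
pub const NO_PROVIDER_PREFIX: &str = "No provider found for model";
pub const PUBKEY_MARKER: &str = "with public key";

/// Builds the routing-failure message on the producer side.
pub fn pubkey_routing_error(model: &str, pubkey_prefix: &str) -> String {
    format!(
        "{} {} {} '{}...'",
        NO_PROVIDER_PREFIX, model, PUBKEY_MARKER, pubkey_prefix
    )
}

/// Detects the routing-failure message on the consumer side.
/// Requiring the prefix as well as the marker narrows the match.
pub fn is_pubkey_routing_failure(msg: &str) -> bool {
    msg.starts_with(NO_PROVIDER_PREFIX) && msg.contains(PUBKEY_MARKER)
}
```

The narrower check also matters if the "All providers failed for model with public key" message quoted in the Context section reaches the same match arm (an assumption; the PR does not show that code path). That message contains "with public key" too, so a bare `contains` would map a connection failure to 421, while the prefix check leaves it on the 502 path.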

Copilot · 2026-03-31T03:49:35Z

crates/services/src/completions/mod.rs

            inference_providers::CompletionError::CompletionError(msg) => {
                if msg.contains("not found in any configured provider") {
                    ports::CompletionError::InvalidModel(msg.clone())
+                } else if msg.contains("with public key") {
+                    // E2EE pubkey routing failure: the client's cached attestation key
+                    // doesn't match any active provider (likely backend restarted with new keys).
+                    tracing::warn!(
+                        model,
+                        provider_message = %msg,
+                        "E2EE pubkey routing failed during {} (stale attestation?)",
+                        operation
+                    );
+                    ports::CompletionError::ProviderError {
+                        status_code: 421,
+                        message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(),
+                    }


New stale-pubkey mapping behavior isn’t covered by the existing map_provider_error unit tests in this file. Please add a test that feeds a CompletionError::CompletionError containing the pubkey-routing message (e.g., "No provider found for model ... with public key") and asserts the returned ProviderError has status_code 421 and the attestation-refresh message.

Copilot · 2026-03-31T03:49:35Z

crates/services/src/completions/mod.rs

+                    ports::CompletionError::ProviderError {
+                        status_code: 421,
+                        message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(),
+                    }


Returning this as ports::CompletionError::ProviderError means it will still be tagged as ERROR_TYPE_INFERENCE_ERROR by record_error() (since metrics don’t include status_code), so stale-key events will remain indistinguishable from real provider/infrastructure failures in cloud_api.request.errors. If the goal is to separate these operationally, consider introducing a dedicated domain error variant (or tagging ProviderError metrics with status_code) so dashboards/alerts can filter stale-key errors independently.
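A small sketch of the second option, tagging ProviderError metrics with the status code. The enum is a simplified stand-in for `ports::CompletionError`, and the `error_tags` helper and tag names and values are hypothetical; the real `record_error()` and its tag vocabulary are not shown in this PR.

```rust
/// Simplified, hypothetical stand-in for the service-level error type.
#[derive(Debug)]
pub enum CompletionError {
    ProviderError { status_code: u16, message: String },
    InvalidModel(String),
}

/// Returns metric tags for an error. Tag names and values are placeholders.
pub fn error_tags(err: &CompletionError) -> Vec<(&'static str, String)> {
    match err {
        CompletionError::ProviderError { status_code, .. } => vec![
            // Same error type as before, so existing dashboards keep working...
            ("error_type", "inference_error".to_string()),
            // ...plus the status code, so 421 can be filtered from 502.
            ("status_code", status_code.to_string()),
        ],
        CompletionError::InvalidModel(_) => {
            vec![("error_type", "invalid_model".to_string())]
        }
    }
}
```

Adding a tag rather than a new error type keeps existing alerts intact while letting a dashboard exclude `status_code:421` from outage counts. A dedicated domain error variant would be the stricter alternative.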

Copilot AI review requested due to automatic review settings March 31, 2026 03:46

Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:46 — with GitHub Actions Inactive

Copilot started reviewing on behalf of Evrard-Nil March 31, 2026 03:46 View session

chore: gitignore repro scripts

5962082

Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

gemini-code-assist bot reviewed Mar 31, 2026

Copilot AI reviewed Mar 31, 2026

PierreLeGuen approved these changes Mar 31, 2026

Evrard-Nil wants to merge 2 commits into main from fix/e2ee-stale-pubkey-error-message
