fix: return specific error for stale E2EE attestation keys#521
fix: return specific error for stale E2EE attestation keys#521Evrard-Nil wants to merge 2 commits intomainfrom
Conversation
When a client sends X-Model-Pub-Key that doesn't match any active provider (typically after backend restart with new signing keys), return HTTP 421 with "encryption key is no longer valid" instead of a generic 502 "model unavailable". This allows E2EE clients to detect stale attestation and auto-refresh without user intervention, instead of appearing as a backend outage.
There was a problem hiding this comment.
Code Review
This pull request introduces specific error handling for E2EE public key routing failures within the completion service. When an error message contains "with public key", the system now logs a warning regarding potential stale attestations and returns a 421 status code, prompting the client to refresh their attestation report. I have no feedback to provide as there were no review comments.
ReviewThis is a clean, well-motivated fix. The 421 status code is semantically appropriate, and downgrading from ERROR to WARN makes sense for an expected operational condition. One issue worth addressing: Missing unit test for new branchThe file has a dedicated test suite for #[test]
fn test_map_provider_error_stale_pubkey_becomes_421() {
let error = inference_providers::CompletionError::CompletionError(
"No provider found for model test-model with public key 'abcdef...'. Encryption requires routing to the specific provider with this public key.".to_string(),
);
let result = CompletionServiceImpl::map_provider_error("test-model", &error, "test");
assert!(matches!(
result,
ports::CompletionError::ProviderError { status_code: 421, .. }
));
}Minor note: The Everything else looks good: the 421 propagates correctly through |
There was a problem hiding this comment.
Pull request overview
Adjusts chat completion error mapping so E2EE public-key routing failures (stale client attestation keys after backend restart) return a specific, actionable HTTP error instead of a generic “model unavailable” response.
Changes:
- Detect pubkey-routing failures in
map_provider_errorand return HTTP 421 with an attestation-refresh message. - Downgrade logging for this scenario from ERROR to WARN to reflect expected/benign failures.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| inference_providers::CompletionError::CompletionError(msg) => { | ||
| if msg.contains("not found in any configured provider") { | ||
| ports::CompletionError::InvalidModel(msg.clone()) | ||
| } else if msg.contains("with public key") { | ||
| // E2EE pubkey routing failure: the client's cached attestation key | ||
| // doesn't match any active provider (likely backend restarted with new keys). |
There was a problem hiding this comment.
This branch relies on substring-matching provider error text ("with public key") to detect the stale E2EE key case. That’s brittle: if the upstream message wording changes, this will silently fall back to the generic 502 path. Consider propagating a structured error/variant from the provider pool for pubkey routing failures, or at least matching a more specific/unique substring (e.g., the full "No provider found for model" prefix) via a shared constant.
| inference_providers::CompletionError::CompletionError(msg) => { | ||
| if msg.contains("not found in any configured provider") { | ||
| ports::CompletionError::InvalidModel(msg.clone()) | ||
| } else if msg.contains("with public key") { | ||
| // E2EE pubkey routing failure: the client's cached attestation key | ||
| // doesn't match any active provider (likely backend restarted with new keys). | ||
| tracing::warn!( | ||
| model, | ||
| provider_message = %msg, | ||
| "E2EE pubkey routing failed during {} (stale attestation?)", | ||
| operation | ||
| ); | ||
| ports::CompletionError::ProviderError { | ||
| status_code: 421, | ||
| message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(), | ||
| } |
There was a problem hiding this comment.
New stale-pubkey mapping behavior isn’t covered by the existing map_provider_error unit tests in this file. Please add a test that feeds a CompletionError::CompletionError containing the pubkey-routing message (e.g., "No provider found for model ... with public key") and asserts the returned ProviderError has status_code 421 and the attestation-refresh message.
| ports::CompletionError::ProviderError { | ||
| status_code: 421, | ||
| message: "The encryption key is no longer valid. Please refresh your attestation report and retry.".to_string(), | ||
| } |
There was a problem hiding this comment.
Returning this as ports::CompletionError::ProviderError means it will still be tagged as ERROR_TYPE_INFERENCE_ERROR by record_error() (since metrics don’t include status_code), so stale-key events will remain indistinguishable from real provider/infrastructure failures in cloud_api.request.errors. If the goal is to separate these operationally, consider introducing a dedicated domain error variant (or tagging ProviderError metrics with status_code) so dashboards/alerts can filter stale-key errors independently.
Summary
Context
In production, inference-proxy signing keys are deterministic (derived from dstack KMS using the model name). They don't change on restart — only when the CVM app identity (compose hash) changes.
This fix targets the case where
X-Model-Pub-Keydoesn't match any active provider in cloud-api's pubkey mapping (e.g., client sends a key from a decomissioned CVM, or after a CVM re-registration with a new compose hash).Note: The 87 "All providers failed for model with public key" errors seen in prod logs are a different code path — those occur when the provider IS found by pubkey but the connection to it fails. Those are addressed by PR #520 (connection retry). This PR fixes the separate "No provider found for model with public key" path, which currently returns a misleading 502.
Reproduction steps
See
repro_e2ee_stale_pubkey.sh(gitignored) for the full reproduction script.Test plan
cargo checkcompiles cleanlycargo test --lib --bins)🤖 Generated with Claude Code