
feat: api key model affinity routing #509

Open

hanakannzashi wants to merge 7 commits into main from feat/api-key-model-affinity-routing

Conversation

@hanakannzashi
Contributor

@hanakannzashi hanakannzashi commented Mar 23, 2026

Implement shared API key/model affinity routing for vLLM-backed chat completions using Postgres.

This change adds a short-lived (api_key_id, canonical_model_name) -> provider_url binding so requests from different cloud-api instances can reuse the same model node and improve backend cache locality without introducing Redis. The existing routing strategy remains the fallback selector; affinity is a higher-level shared hint.

Key changes:

  • add a Postgres-backed affinity table and repository
  • apply affinity only to vLLM chat completions, while keeping external providers and model_pub_key requests on the existing path
  • prefer a bound provider_url when it is available locally, otherwise fall back to the current routing strategy
  • serialize concurrent misses with pg_advisory_xact_lock, while keeping cache hits lock-free
  • refresh bindings asynchronously after successful completions so response latency is not gated on a DB upsert
  • add repository e2e coverage for concurrent miss behavior and hit-path lock avoidance
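Since the bindings are keyed on (api_key_id, canonical_model_name) and misses are serialized with pg_advisory_xact_lock, the pair has to be reduced to a single bigint lock key. A minimal sketch of one way to derive such a key, using a deterministic FNV-1a hash (the function names are illustrative, not the PR's actual implementation):

```rust
// Hypothetical sketch: derive a stable i64 advisory-lock key from
// (api_key_id, canonical_model_name). pg_advisory_xact_lock takes a bigint,
// so the key must be deterministic across processes; FNV-1a is one option.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

fn advisory_lock_key(api_key_id: &str, model: &str) -> i64 {
    // A NUL separator avoids ambiguity between ("ab","c") and ("a","bc").
    let mut buf = Vec::with_capacity(api_key_id.len() + model.len() + 1);
    buf.extend_from_slice(api_key_id.as_bytes());
    buf.push(0);
    buf.extend_from_slice(model.as_bytes());
    fnv1a_64(&buf) as i64
}

fn main() {
    let k1 = advisory_lock_key("key-123", "llama-3-70b");
    let k2 = advisory_lock_key("key-123", "llama-3-70b");
    assert_eq!(k1, k2); // deterministic across calls and processes
    assert_ne!(k1, advisory_lock_key("key-124", "llama-3-70b"));
    println!("lock key: {k1}");
}
```

Note that a process-local hasher such as std's `DefaultHasher` would not work here, because its seed varies per process and the lock key must agree across cloud-api instances.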

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces API key model affinity routing, which allows the system to route requests to preferred inference providers based on the API key and model being used. This can improve performance by ensuring that requests are routed to the most appropriate provider for a given user and model. The changes include adding a new database table to store the affinity bindings, implementing advisory locks to prevent race conditions, and modifying the inference provider pool to prioritize the preferred provider URL.

Highlights

  • API Key Model Affinity: Introduces a mechanism for routing requests to preferred inference providers based on API key and model, improving performance and reliability.
  • Database Schema: Adds a new table api_key_model_affinity to store the preferred provider URL for each API key and model combination.
  • Concurrency Handling: Implements advisory locks to prevent race conditions when multiple requests try to create the same affinity binding.
  • Provider Selection: Modifies the inference provider pool to prioritize the preferred provider URL when available.

@claude

claude bot commented Mar 23, 2026

PR Review: API Key Model Affinity Routing

The overall design is sound. A few critical issues:

  1. No FK constraint on api_key_id in V0049__add_api_key_model_affinity.sql: add REFERENCES api_keys(id) ON DELETE CASCADE.
  2. Raw pointer identity via Arc::as_ptr in inference_provider_pool/mod.rs is fragile due to allocator address reuse: use an explicit u64 stable ID instead.
  3. model_name logged at warn level alongside api_key_id in completions/mod.rs violates the privacy policy.

Minor: raise API_KEY_MODEL_AFFINITY_REFRESH_TIMEOUT_SECS from 2s to 5s. The double-checked locking and health partitioning logic are correct. Please address items 1 and 3 as blockers.

@claude

claude bot commented Mar 23, 2026

Full review detail follows (split due to formatting constraints):

@claude

claude bot commented Mar 23, 2026

Issue 1 - No FK constraint and no expired row cleanup. File: crates/database/src/migrations/sql/V0049__add_api_key_model_affinity.sql. api_key_id has no FK constraint. If an API key is deleted, its affinity rows are never cleaned up (expired rows persist as dead weight). There is also no periodic cleanup of expired rows - over time this table will accumulate stale entries. Suggested fix: add REFERENCES api_keys(id) ON DELETE CASCADE to api_key_id. Even without a separate cleanup job, the FK + cascade ensures orphaned rows do not accumulate after key deletion.
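A sketch of what the suggested constraint could look like. This is illustrative, not the PR's actual migration: the column types and the assumption that api_keys has a primary key named id are guesses.

```sql
-- Hypothetical revision of V0049; only the names quoted in the review
-- (table, api_key_id, provider_url) come from the PR, the rest is assumed.
CREATE TABLE api_key_model_affinity (
    api_key_id UUID NOT NULL REFERENCES api_keys (id) ON DELETE CASCADE,
    canonical_model_name TEXT NOT NULL,
    provider_url TEXT NOT NULL,
    expires_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (api_key_id, canonical_model_name)
);
```

With the cascade in place, deleting an API key removes its affinity rows in the same transaction, so no separate cleanup job is needed for that case (expired-but-live keys would still need TTL-based eviction).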

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces API key model affinity, allowing specific API keys to be consistently routed to the same inference provider for a given model. Key changes include adding a new api_key_model_affinity database table and a PgApiKeyModelAffinityRepository to manage these bindings, which uses advisory locks for concurrent creation. The CompletionServiceImpl now integrates this affinity logic, determining when to use preferred provider URLs and scheduling background refreshes for active bindings. The InferenceProviderPool was updated to store provider URLs and prioritize a preferred provider during selection. Review comments suggest improving the robustness of background affinity refresh tasks by considering tokio::spawn_blocking with timeouts for critical operations, adding comments to explain the advisory_lock_key generation logic, and enhancing tracing messages with specific error details for better debugging.

@claude

claude bot commented Mar 23, 2026

Issue 2 - Raw pointer identity for provider_urls map is fragile. File: crates/services/src/inference_provider_pool/mod.rs. The provider_ptr function uses raw allocation addresses (Arc::as_ptr) as stable identifiers. If a provider Arc is dropped during a discovery cycle and a new one is allocated at the same address (allocator reuse), the new provider would inherit the old URL in provider_urls before the next provider_url_updates write. The failure mode is brief incorrect routing (corrected on the next discovery cycle), but it is a subtle invariant to rely on. Consider using an explicit stable ID (a u64 counter assigned at provider creation, stored in the provider struct and accessible via a trait method). This eliminates the aliasing risk entirely.
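The suggested stable-ID approach can be sketched with a process-wide atomic counter. This is a minimal illustration, not the PR's provider type; the struct and field names are assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Monotonic counter: ids are handed out once and never reused, unlike
// heap addresses, which the allocator can recycle after a drop.
static NEXT_PROVIDER_ID: AtomicU64 = AtomicU64::new(1);

// Hypothetical provider struct carrying a process-unique id assigned at creation.
struct Provider {
    id: u64,
    url: String,
}

impl Provider {
    fn new(url: impl Into<String>) -> Arc<Self> {
        Arc::new(Self {
            id: NEXT_PROVIDER_ID.fetch_add(1, Ordering::Relaxed),
            url: url.into(),
        })
    }
}

fn main() {
    let a = Provider::new("http://node-a:8000");
    let b = Provider::new("http://node-b:8000");
    assert_ne!(a.id, b.id);
    drop(a);
    // Even if this allocation lands at a's old address, its id is fresh,
    // so a provider_urls map keyed by id cannot alias stale entries.
    let c = Provider::new("http://node-c:8000");
    assert_ne!(b.id, c.id);
    println!("ids: {} {}", b.id, c.url.len());
}
```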

@claude

claude bot commented Mar 23, 2026

Issue 3 - model_name logged at warn level (production-visible) may violate privacy policy. File: crates/services/src/completions/mod.rs - schedule_provider_affinity_refresh. The warn-level log includes model_name alongside api_key_id. Per project policy, logging metadata that reveals customer activity is forbidden. The pairing of api_key_id + model_name reveals which models a specific customer is using. While model names are public, the association with an API key ID is customer activity metadata. Suggest removing model_name from warn-level logs (it can remain at debug level for troubleshooting).
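One way to enforce this is to construct the log message per level, so the api_key_id + model_name pairing can only appear at debug verbosity. A minimal sketch under that assumption (the function and enum here are illustrative, not the PR's code, which presumably uses the tracing macros directly):

```rust
// Hypothetical helper: warn-level output omits model_name so production
// logs never pair customer identity with model usage.
enum Level {
    Warn,
    Debug,
}

fn affinity_refresh_failure_msg(level: Level, api_key_id: &str, model_name: &str) -> String {
    match level {
        // warn: production-visible, no customer-activity pairing
        Level::Warn => format!("affinity refresh failed for api_key_id={api_key_id}"),
        // debug: full detail, intended for local troubleshooting only
        Level::Debug => format!(
            "affinity refresh failed for api_key_id={api_key_id} model_name={model_name}"
        ),
    }
}

fn main() {
    let warn = affinity_refresh_failure_msg(Level::Warn, "key-123", "llama-3-70b");
    assert!(!warn.contains("llama-3-70b"));
    let debug = affinity_refresh_failure_msg(Level::Debug, "key-123", "llama-3-70b");
    assert!(debug.contains("llama-3-70b"));
    println!("{warn}");
}
```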

@claude

claude bot commented Mar 23, 2026

Minor note: API_KEY_MODEL_AFFINITY_REFRESH_TIMEOUT_SECS = 2 may cause noisy logs. If the DB is under load, the 2-second refresh timeout will fire frequently, generating warn-level log spam. Since the refresh is best-effort and has no user-visible impact on failure, consider raising the timeout to 5s, or downgrading the timeout branch to debug level. The double-checked locking pattern in get_or_create_active_provider_url is correct. Health-based partitioning correctly overrides stale affinity for demoted providers (partitioning runs after the preferred reorder, so a failed preferred provider ends up at the back). The e2e tests for concurrent lock behavior are well-designed. Issues found - please address the FK constraint (1), privacy logging (3), and consider the pointer identity concern (2).
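The ordering property described above (health partitioning runs after the preferred reorder, so a demoted preferred provider falls to the back) can be sketched as follows. This is a simplified model, not the pool's actual code; the function signature is an assumption:

```rust
// Hypothetical sketch: move the preferred URL to the front, then stably
// partition healthy providers before unhealthy ones. Because the health
// partition runs last, an unhealthy preferred provider still ends up
// behind every healthy one.
fn order_providers(
    mut providers: Vec<(&'static str, bool)>, // (url, healthy)
    preferred: Option<&str>,
) -> Vec<&'static str> {
    if let Some(p) = preferred {
        if let Some(i) = providers.iter().position(|(url, _)| *url == p) {
            let hit = providers.remove(i);
            providers.insert(0, hit);
        }
    }
    // Iterator::partition is stable: relative order is kept within each group.
    let (healthy, unhealthy): (Vec<_>, Vec<_>) =
        providers.into_iter().partition(|(_, ok)| *ok);
    healthy.into_iter().chain(unhealthy).map(|(url, _)| url).collect()
}

fn main() {
    let pool = vec![("a", true), ("b", true), ("c", false)];
    // A healthy preferred provider is tried first.
    assert_eq!(order_providers(pool.clone(), Some("b")), vec!["b", "a", "c"]);
    // An unhealthy preferred provider is demoted behind all healthy ones,
    // so stale affinity cannot pin traffic to a failed node.
    assert_eq!(order_providers(pool, Some("c")), vec!["a", "b", "c"]);
}
```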

Contributor

Copilot AI left a comment


Pull request overview

Adds shared, Postgres-backed API key ↔ model affinity routing as a best-effort hint to improve cache locality for vLLM chat completions across multiple cloud-api instances (while preserving the existing routing strategy as fallback).

Changes:

  • Introduces api_key_model_affinity table + PgApiKeyModelAffinityRepository with advisory-lock serialization for concurrent misses.
  • Extends InferenceProviderPool to track discovered provider URLs and to prioritize a preferred URL during provider ordering.
  • Wires affinity selection + async refresh into chat completion (streaming and non-streaming) paths, and adds E2E repository coverage for hit/miss locking behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:

  • crates/services/src/inference_provider_pool/mod.rs: Tracks provider URL per provider instance and allows preferred URL prioritization; exposes provider URL selection for affinity creation.
  • crates/services/src/completions/ports.rs: Adds new ports for the affinity repository and provider URL selection.
  • crates/services/src/completions/mod.rs: Uses affinity (when eligible) for chat completions and schedules a best-effort async refresh after success.
  • crates/database/src/repositories/mod.rs: Registers the new repository module/export.
  • crates/database/src/repositories/api_key_model_affinity.rs: Implements the Postgres repository with TTL, upsert, and pg_advisory_xact_lock for concurrent misses.
  • crates/database/src/migrations/sql/V0049__add_api_key_model_affinity.sql: Adds the api_key_model_affinity table.
  • crates/database/src/lib.rs: Re-exports the new Postgres repository type.
  • crates/api/tests/e2e_repositories.rs: Adds E2E coverage for concurrent-miss lock serialization and hit-path lock avoidance.
  • crates/api/src/lib.rs: Wires the new affinity repository into CompletionServiceImpl initialization.


@hanakannzashi hanakannzashi temporarily deployed to Cloud API test env March 25, 2026 02:42 — with GitHub Actions Inactive
# Conflicts:
#	crates/services/src/completions/mod.rs
#	crates/services/src/inference_provider_pool/mod.rs
@hanakannzashi hanakannzashi temporarily deployed to Cloud API test env March 30, 2026 06:22 — with GitHub Actions Inactive
