
feat: api key model affinity routing #509

Open

hanakannzashi wants to merge 7 commits into main from feat/api-key-model-affinity-routing

Conversation

@hanakannzashi
Contributor

@hanakannzashi hanakannzashi commented Mar 23, 2026

Implement shared API key/model affinity routing for vLLM-backed chat completions using Postgres.

This change adds a short-lived (api_key_id, canonical_model_name) -> provider_url binding so requests from different cloud-api instances can reuse the same model node and improve backend cache locality without introducing Redis. The existing routing strategy remains the fallback selector; affinity is a higher-level shared hint.

Key changes:

  • add a Postgres-backed affinity table and repository
  • apply affinity only to vLLM chat completions, while keeping external providers and model_pub_key requests on the existing path
  • prefer a bound provider_url when it is available locally, otherwise fall back to the current routing strategy
  • serialize concurrent misses with pg_advisory_xact_lock, while keeping cache hits lock-free
  • refresh bindings asynchronously after successful completions so response latency is not gated on a DB upsert
  • add repository e2e coverage for concurrent miss behavior and hit-path lock avoidance
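Since the bindings are keyed on (api_key_id, canonical_model_name) and misses are serialized with pg_advisory_xact_lock, the pair has to be reduced to a single bigint lock key. A minimal sketch of one way to derive such a key, using a deterministic FNV-1a hash (the function names are illustrative, not the PR's actual implementation):

```rust
// Hypothetical sketch: derive a stable i64 advisory-lock key from
// (api_key_id, canonical_model_name). pg_advisory_xact_lock takes a bigint,
// so the key must be deterministic across processes; FNV-1a is one option.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

fn advisory_lock_key(api_key_id: &str, model: &str) -> i64 {
    // A NUL separator avoids ambiguity between ("ab","c") and ("a","bc").
    let mut buf = Vec::with_capacity(api_key_id.len() + model.len() + 1);
    buf.extend_from_slice(api_key_id.as_bytes());
    buf.push(0);
    buf.extend_from_slice(model.as_bytes());
    fnv1a_64(&buf) as i64
}

fn main() {
    let k1 = advisory_lock_key("key-123", "llama-3-70b");
    let k2 = advisory_lock_key("key-123", "llama-3-70b");
    assert_eq!(k1, k2); // deterministic across calls and processes
    assert_ne!(k1, advisory_lock_key("key-124", "llama-3-70b"));
    println!("lock key: {k1}");
}
```

Note that a process-local hasher such as std's `DefaultHasher` would not work here, because its seed varies per process and the lock key must agree across cloud-api instances.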

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces API key model affinity routing, which allows the system to route requests to preferred inference providers based on the API key and model being used. This can improve performance by ensuring that requests are routed to the most appropriate provider for a given user and model. The changes include adding a new database table to store the affinity bindings, implementing advisory locks to prevent race conditions, and modifying the inference provider pool to prioritize the preferred provider URL.

Highlights

  • API Key Model Affinity: Introduces a mechanism for routing requests to preferred inference providers based on API key and model, improving performance and reliability.
  • Database Schema: Adds a new table api_key_model_affinity to store the preferred provider URL for each API key and model combination.
  • Concurrency Handling: Implements advisory locks to prevent race conditions when multiple requests try to create the same affinity binding.
  • Provider Selection: Modifies the inference provider pool to prioritize the preferred provider URL when available.

@claude

claude bot commented Mar 23, 2026

PR Review: API Key Model Affinity Routing

The overall design is sound. A few critical issues:

  1. No FK constraint on api_key_id in V0049__add_api_key_model_affinity.sql: add REFERENCES api_keys(id) ON DELETE CASCADE.
  2. Raw pointer identity via Arc::as_ptr in inference_provider_pool/mod.rs is fragile due to allocator address reuse: use an explicit u64 stable ID instead.
  3. model_name logged at warn level alongside api_key_id in completions/mod.rs violates the privacy policy.

Minor: raise API_KEY_MODEL_AFFINITY_REFRESH_TIMEOUT_SECS from 2s to 5s. The double-checked locking and health partitioning logic are correct. Please address items 1 and 3 as blockers.

@claude

claude bot commented Mar 23, 2026

Full review detail follows (split due to formatting constraints):

@claude

claude bot commented Mar 23, 2026

Issue 1 - No FK constraint and no expired row cleanup. File: crates/database/src/migrations/sql/V0049__add_api_key_model_affinity.sql. api_key_id has no FK constraint. If an API key is deleted, its affinity rows are never cleaned up (expired rows persist as dead weight). There is also no periodic cleanup of expired rows - over time this table will accumulate stale entries. Suggested fix: add REFERENCES api_keys(id) ON DELETE CASCADE to api_key_id. Even without a separate cleanup job, the FK + cascade ensures orphaned rows do not accumulate after key deletion.
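A sketch of what the suggested constraint could look like. This is illustrative, not the PR's actual migration: the column types and the assumption that api_keys has a primary key named id are guesses.

```sql
-- Hypothetical revision of V0049; only the names quoted in the review
-- (table, api_key_id, provider_url) come from the PR, the rest is assumed.
CREATE TABLE api_key_model_affinity (
    api_key_id UUID NOT NULL REFERENCES api_keys (id) ON DELETE CASCADE,
    canonical_model_name TEXT NOT NULL,
    provider_url TEXT NOT NULL,
    expires_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (api_key_id, canonical_model_name)
);
```

With the cascade in place, deleting an API key removes its affinity rows in the same transaction, so no separate cleanup job is needed for that case (expired-but-live keys would still need TTL-based eviction).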

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces API key model affinity, allowing specific API keys to be consistently routed to the same inference provider for a given model. Key changes include adding a new api_key_model_affinity database table and a PgApiKeyModelAffinityRepository to manage these bindings, which uses advisory locks for concurrent creation. The CompletionServiceImpl now integrates this affinity logic, determining when to use preferred provider URLs and scheduling background refreshes for active bindings. The InferenceProviderPool was updated to store provider URLs and prioritize a preferred provider during selection. Review comments suggest improving the robustness of background affinity refresh tasks by considering tokio::spawn_blocking with timeouts for critical operations, adding comments to explain the advisory_lock_key generation logic, and enhancing tracing messages with specific error details for better debugging.

@claude

claude bot commented Mar 23, 2026

Issue 2 - Raw pointer identity for provider_urls map is fragile. File: crates/services/src/inference_provider_pool/mod.rs. The provider_ptr function uses raw allocation addresses (Arc::as_ptr) as stable identifiers. If a provider Arc is dropped during a discovery cycle and a new one is allocated at the same address (allocator reuse), the new provider would inherit the old URL in provider_urls before the next provider_url_updates write. The failure mode is brief incorrect routing (corrected on the next discovery cycle), but it is a subtle invariant to rely on. Consider using an explicit stable ID (a u64 counter assigned at provider creation, stored in the provider struct and accessible via a trait method). This eliminates the aliasing risk entirely.
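The suggested stable-ID approach can be sketched with a process-wide atomic counter. This is a minimal illustration, not the PR's provider type; the struct and field names are assumptions:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Monotonic counter: ids are handed out once and never reused, unlike
// heap addresses, which the allocator can recycle after a drop.
static NEXT_PROVIDER_ID: AtomicU64 = AtomicU64::new(1);

// Hypothetical provider struct carrying a process-unique id assigned at creation.
struct Provider {
    id: u64,
    url: String,
}

impl Provider {
    fn new(url: impl Into<String>) -> Arc<Self> {
        Arc::new(Self {
            id: NEXT_PROVIDER_ID.fetch_add(1, Ordering::Relaxed),
            url: url.into(),
        })
    }
}

fn main() {
    let a = Provider::new("http://node-a:8000");
    let b = Provider::new("http://node-b:8000");
    assert_ne!(a.id, b.id);
    drop(a);
    // Even if this allocation lands at a's old address, its id is fresh,
    // so a provider_urls map keyed by id cannot alias stale entries.
    let c = Provider::new("http://node-c:8000");
    assert_ne!(b.id, c.id);
    println!("ids: {} {}", b.id, c.url.len());
}
```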

@claude

claude bot commented Mar 23, 2026

Issue 3 - model_name logged at warn level (production-visible) may violate privacy policy. File: crates/services/src/completions/mod.rs - schedule_provider_affinity_refresh. The warn-level log includes model_name alongside api_key_id. Per project policy, logging metadata that reveals customer activity is forbidden. The pairing of api_key_id + model_name reveals which models a specific customer is using. While model names are public, the association with an API key ID is customer activity metadata. Suggest removing model_name from warn-level logs (it can remain at debug level for troubleshooting).
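One way to enforce this is to construct the log message per level, so the api_key_id + model_name pairing can only appear at debug verbosity. A minimal sketch under that assumption (the function and enum here are illustrative, not the PR's code, which presumably uses the tracing macros directly):

```rust
// Hypothetical helper: warn-level output omits model_name so production
// logs never pair customer identity with model usage.
enum Level {
    Warn,
    Debug,
}

fn affinity_refresh_failure_msg(level: Level, api_key_id: &str, model_name: &str) -> String {
    match level {
        // warn: production-visible, no customer-activity pairing
        Level::Warn => format!("affinity refresh failed for api_key_id={api_key_id}"),
        // debug: full detail, intended for local troubleshooting only
        Level::Debug => format!(
            "affinity refresh failed for api_key_id={api_key_id} model_name={model_name}"
        ),
    }
}

fn main() {
    let warn = affinity_refresh_failure_msg(Level::Warn, "key-123", "llama-3-70b");
    assert!(!warn.contains("llama-3-70b"));
    let debug = affinity_refresh_failure_msg(Level::Debug, "key-123", "llama-3-70b");
    assert!(debug.contains("llama-3-70b"));
    println!("{warn}");
}
```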

@claude

claude bot commented Mar 23, 2026

Minor note: API_KEY_MODEL_AFFINITY_REFRESH_TIMEOUT_SECS = 2 may cause noisy logs. If the DB is under load, the 2-second refresh timeout will fire frequently, generating warn-level log spam. Since the refresh is best-effort and has no user-visible impact on failure, consider raising the timeout to 5s, or downgrading the timeout branch to debug level. The double-checked locking pattern in get_or_create_active_provider_url is correct. Health-based partitioning correctly overrides stale affinity for demoted providers (partitioning runs after the preferred reorder, so a failed preferred provider ends up at the back). The e2e tests for concurrent lock behavior are well-designed. Issues found - please address the FK constraint (1), privacy logging (3), and consider the pointer identity concern (2).
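The ordering property described above (health partitioning runs after the preferred reorder, so a demoted preferred provider falls to the back) can be sketched as follows. This is a simplified model, not the pool's actual code; the function signature is an assumption:

```rust
// Hypothetical sketch: move the preferred URL to the front, then stably
// partition healthy providers before unhealthy ones. Because the health
// partition runs last, an unhealthy preferred provider still ends up
// behind every healthy one.
fn order_providers(
    mut providers: Vec<(&'static str, bool)>, // (url, healthy)
    preferred: Option<&str>,
) -> Vec<&'static str> {
    if let Some(p) = preferred {
        if let Some(i) = providers.iter().position(|(url, _)| *url == p) {
            let hit = providers.remove(i);
            providers.insert(0, hit);
        }
    }
    // Iterator::partition is stable: relative order is kept within each group.
    let (healthy, unhealthy): (Vec<_>, Vec<_>) =
        providers.into_iter().partition(|(_, ok)| *ok);
    healthy.into_iter().chain(unhealthy).map(|(url, _)| url).collect()
}

fn main() {
    let pool = vec![("a", true), ("b", true), ("c", false)];
    // A healthy preferred provider is tried first.
    assert_eq!(order_providers(pool.clone(), Some("b")), vec!["b", "a", "c"]);
    // An unhealthy preferred provider is demoted behind all healthy ones,
    // so stale affinity cannot pin traffic to a failed node.
    assert_eq!(order_providers(pool, Some("c")), vec!["a", "b", "c"]);
}
```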

Contributor

Copilot AI left a comment


Pull request overview

Adds shared, Postgres-backed API key ↔ model affinity routing as a best-effort hint to improve cache locality for vLLM chat completions across multiple cloud-api instances (while preserving the existing routing strategy as fallback).

Changes:

  • Introduces api_key_model_affinity table + PgApiKeyModelAffinityRepository with advisory-lock serialization for concurrent misses.
  • Extends InferenceProviderPool to track discovered provider URLs and to prioritize a preferred URL during provider ordering.
  • Wires affinity selection + async refresh into chat completion (streaming and non-streaming) paths, and adds E2E repository coverage for hit/miss locking behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:

  • crates/services/src/inference_provider_pool/mod.rs: Tracks provider URL per provider instance and allows preferred URL prioritization; exposes provider URL selection for affinity creation.
  • crates/services/src/completions/ports.rs: Adds new ports for the affinity repository and provider URL selection.
  • crates/services/src/completions/mod.rs: Uses affinity (when eligible) for chat completions and schedules a best-effort async refresh after success.
  • crates/database/src/repositories/mod.rs: Registers the new repository module/export.
  • crates/database/src/repositories/api_key_model_affinity.rs: Implements the Postgres repository with TTL, upsert, and pg_advisory_xact_lock for concurrent misses.
  • crates/database/src/migrations/sql/V0049__add_api_key_model_affinity.sql: Adds the api_key_model_affinity table.
  • crates/database/src/lib.rs: Re-exports the new Postgres repository type.
  • crates/api/tests/e2e_repositories.rs: Adds E2E coverage for concurrent-miss lock serialization and hit-path lock avoidance.
  • crates/api/src/lib.rs: Wires the new affinity repository into CompletionServiceImpl initialization.


@hanakannzashi hanakannzashi temporarily deployed to Cloud API test env March 25, 2026 02:42 — with GitHub Actions Inactive
# Conflicts:
#	crates/services/src/completions/mod.rs
#	crates/services/src/inference_provider_pool/mod.rs
@hanakannzashi hanakannzashi temporarily deployed to Cloud API test env March 30, 2026 06:22 — with GitHub Actions Inactive
