perf: completions & responses streaming optimizations with comprehensive benchmarks by ilblackdragon · Pull Request #461 · nearai/cloud-api

ilblackdragon · 2026-02-28T05:43:00Z

Summary

End-to-end performance optimization of the streaming inference hot paths (Chat Completions API + Responses API), guided by criterion microbenchmarks that cover every per-request operation.

Optimizations Applied

Completions streaming (previously merged commits on this branch):

Switch tokio::sync::Mutex → std::sync::Mutex for SSE byte accumulation (no await points needed)
Eliminate SSE re-framing: pass vLLM SSE bytes through directly instead of parse → re-serialize
First-token-only JSON parse instead of parsing every SSE chunk
Sync .map() instead of async .then() in stream pipeline
Result: 16.8x throughput improvement (234 µs → 13.9 µs per 200 tokens)

Responses streaming (new in this PR):

Switch tokio::sync::Mutex → std::sync::Mutex for byte accumulation and response_id tracking — 2.5x faster mutex operations (4.8 µs → 1.9 µs per 200 events)
Hash accumulated bytes directly from mutex guard reference instead of cloning the entire Vec<u8> on response.completed — eliminates O(response_size) heap allocation
Replace format!() SSE formatting with pre-allocated Vec::with_capacity() + extend_from_slice() — eliminates intermediate String allocation per token

Request validation:

Remove redundant serde_json::to_value(content) from ChatCompletionRequest::validate() — all MessageContent variants use derive-based Serialize with only String fields, so serialization can never fail (the data was already deserialized by the framework) — ~20x faster validation (1.14 µs → 54 ns for 50 messages)

Provider pool:

Eliminate index_key.clone() by moving the key directly into HashMap::entry()
Replace Vec allocation + N×Arc::clone loop with in-place rotate_left() on the already-cloned provider list — 8–12% faster provider ordering

Benchmark Suites Added (5 total)

Suite	Location	Sub-benchmarks
`completions_bench`	`services/benches/`	SSE token processing, InterceptStream poll_next, model cache
`auth_bench`	`services/benches/`	API key hash, format validation, moka cache, bloom filter
`provider_pool_bench`	`services/benches/`	Round-robin, sticky routing, RwLock lookup, pubkey filtering
`responses_bench`	`services/benches/`	Event serialization, SSE format, mutex comparison, SHA-256
`validation_bench`	`api/benches/`	validate(), serde_to_value, has_image_content, body SHA-256

Key Benchmark Results (post-optimization)

Operation	Time	Notes
Auth hot path (format+hash+cache)	433 ns	Well-optimized, no action needed
validate() / 10 messages	16 ns	Was 257 ns (16x improvement)
validate() / 50 messages	54 ns	Was 1.14 µs (21x improvement)
validate() / multimodal	4.6 ns	Was 336 ns (73x improvement)
Provider ordering / 3 providers	55 ns	Was 56 ns (+ eliminated allocs)
Provider ordering / 10 providers	70 ns	Was 76 ns (8% faster)
std::Mutex / 200 events	1.9 µs	vs tokio::Mutex at 5.2 µs (2.6x)
Completions stream / 200 tokens	13.9 µs	Was 234 µs (16.8x faster)
Body SHA-256 / 10KB	28.5 µs	Hardware-bound, no easy win
Bytes::clone / 10KB	4.1 ns	Ref-counted — negligible

Test plan

cargo bench -p services --bench completions_bench — all groups pass
cargo bench -p services --bench auth_bench — all groups pass
cargo bench -p services --bench provider_pool_bench — all groups pass
cargo bench -p services --bench responses_bench — all groups pass
cargo bench -p api --bench validation_bench — all groups pass
cargo test --lib --bins — all 190 unit tests pass
cargo fmt --all -- --check — passes
cargo clippy -p services -p api — no new warnings
CI passes

🤖 Generated with Claude Code

Add microbenchmarks to quantify per-token improvements in the streaming completion pipeline. Three benchmark groups cover SSE token processing (~4x faster with sync mutex + first-token-only JSON parse), InterceptStream poll_next throughput, and model resolution cache hit/miss latency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist · 2026-02-28T05:43:21Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive set of performance optimizations and new benchmarking capabilities for the completions service. The changes focus on reducing overhead in the streaming pipeline, enhancing data access efficiency through caching, and improving the responsiveness of the attestation process. These updates are designed to provide a faster and more scalable experience for handling completion requests, with new benchmarks to validate and track future performance gains.

Highlights

Performance Benchmarking: Introduced a new Criterion microbenchmark suite (completions_bench.rs) to quantify performance optimizations in the streaming completion pipeline. This includes benchmarks for SSE token processing, InterceptStream throughput, and model resolution cache hit/miss latency.
Streaming Completion Pipeline Optimizations: Implemented several key optimizations in the completion streaming path: switched from tokio::sync::Mutex to std::sync::Mutex for non-async critical sections, utilized std::str::from_utf8 for zero-copy string conversion, optimized JSON parsing to only extract chat_id from the first token, and transitioned from stream.then() (async) to stream.map() (sync) for per-token processing.
Caching Improvements: Added moka::future::Cache for workspace_context_cache in the authentication middleware and for model_resolution_cache in the completion service, significantly reducing database lookups for frequently accessed data. The InferenceProviderPool now uses a moka::future::Cache for chat_id_mapping with a TTL to prevent unbounded growth and improve lookup performance.
Attestation Flow Refinement: Decoupled attestation signature storage from the critical path of stream completion. The create_signature_future was replaced with spawn_signature_storage, which now runs as a background task, ensuring that [DONE] events are not blocked by attestation processing.
Provider Timeout Configuration: Refined the VLlmConfig to differentiate between request_timeout_seconds for non-streaming requests and no explicit timeout for streaming requests, preventing long generations from being prematurely terminated.

Changelog

Cargo.lock
- Added new dependencies for the criterion benchmarking framework and its associated crates like anes, cast, ciborium, clap, criterion-plot, half, is-terminal, oorandom, plotters, rayon, and tinytemplate.
crates/api/src/middleware/auth.rs
- Introduced a moka::future::Cache for workspace_context_cache to store workspace and organization data, reducing database calls.
- Implemented caching logic within authenticate_api_key_with_context to retrieve workspace context from the cache or the database, then populate the cache.
crates/api/src/routes/completions.rs
- Replaced tokio::sync::Mutex with std::sync::Mutex for accumulated_bytes and chat_id_state to reduce async overhead.
- Changed stream processing from then() to map() for synchronous, non-blocking operations.
- Optimized chat_id extraction by parsing JSON only for the first token and using std::str::from_utf8 for zero-copy conversion.
crates/inference_providers/src/vllm/mod.rs
- Renamed timeout_seconds to request_timeout_seconds in VLlmConfig to clarify its purpose.
- Modified various API calls to use the new request_timeout_seconds for non-streaming requests.
- Removed explicit timeouts for streaming requests, ensuring long generations are not interrupted.
crates/inference_providers/tests/integration_tests.rs
- Updated VLlmConfig initialization in integration tests to use the new request_timeout_seconds field.
crates/services/Cargo.toml
- Added criterion as a development dependency with async_tokio feature.
- Configured a new benchmark harness named completions_bench.
crates/services/benches/completions_bench.rs
- Added a new file containing Criterion microbenchmarks for sse_token_processing, intercept_stream throughput, and model_resolution_cache performance.
- Included helper functions to generate synthetic SSE payloads and events for benchmarking.
- Provided stub implementations for MetricsServiceTrait, AttestationServiceTrait, and UsageServiceTrait to isolate InterceptStream benchmarks.
crates/services/src/completions/mod.rs
- Made StreamState and InterceptStream public and hidden (#[doc(hidden)]) to allow direct instantiation in benchmarks.
- Removed FinalizeFuture and the StreamState::Finalizing state, simplifying the stream's state machine.
- Refactored create_signature_future into spawn_signature_storage to asynchronously store attestation signatures without blocking the stream.
- Integrated a model_resolution_cache (moka cache) into CompletionServiceImpl with a 60-second TTL.
- Implemented resolve_model_cached to leverage the new model resolution cache.
- Updated CompletionServiceImpl constructor and completion request handlers to utilize the cached model resolution.
crates/services/src/inference_provider_pool/mod.rs
- Introduced SyncMutex as an alias for std::sync::Mutex for clarity.
- Replaced load_balancer_index from tokio::sync::RwLock to std::sync::Mutex for synchronous access.
- Migrated chat_id_mapping from tokio::sync::RwLock<HashMap> to moka::future::Cache with a 1-hour TTL.
- Updated access patterns for load_balancer_index and chat_id_mapping to match the new mutex and cache types.
- Adjusted the pool shutdown logic to correctly clear the new cache types.

Activity

The pull request introduces new criterion benchmarks to measure performance, indicating a focus on optimization and quantitative analysis.
Several code changes were made to improve the efficiency of the streaming completion pipeline, including changes to mutex types, string handling, and JSON parsing.
Caching mechanisms were introduced or enhanced in multiple services to reduce database load and improve response times.
The attestation signature storage was moved to a background task to prevent it from blocking the main completion stream, improving perceived latency.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a comprehensive benchmark suite using criterion to measure and validate performance optimizations in the completions hot path. It also includes several significant performance enhancements across the services and API layers. Key optimizations include adding caching for workspace and model resolution to reduce database load, replacing tokio::sync::Mutex with the more performant std::sync::Mutex in non-async critical sections, and refactoring stream processing to be more efficient. My review focuses on ensuring these changes are robust and align with existing patterns for critical background tasks.

gemini-code-assist · 2026-02-28T05:45:56Z

+                                {
+                                    let need_chat_id = chat_id_clone
+                                        .lock()
+                                        .unwrap_or_else(|e| e.into_inner())
+                                        .is_none();
+                                    if need_chat_id {
+                                        if let Ok(chunk_str) = std::str::from_utf8(&event.raw_bytes)
+                                        {
+                                            if let Some(data) = chunk_str.strip_prefix("data: ") {
+                                                if let Ok(serde_json::Value::Object(obj)) =
+                                                    serde_json::from_str::<serde_json::Value>(
+                                                        data.trim(),
+                                                    )
                                                {
-                                                    // Capture chat_id for use in the chain combinator
-                                                    // The real hash will be registered there after accumulating all bytes
-                                                    let mut cid = chat_id_inner.lock().await;
-                                                    if cid.is_none() {
-                                                        *cid = Some(id.clone());
+                                                    if let Some(serde_json::Value::String(id)) =
+                                                        obj.get("id")
+                                                    {
+                                                        *chat_id_clone
+                                                            .lock()
+                                                            .unwrap_or_else(|e| e.into_inner()) =
+                                                            Some(id.clone());
                                                    }
                                                }
                                            }
                                        }
                                    }
-
-                                    // raw_bytes contains "data: {...}\n", extract just the JSON part
-                                    let raw_str = String::from_utf8_lossy(&event.raw_bytes);
-                                    let json_data = raw_str
-                                        .trim()
-                                        .strip_prefix("data: ")
-                                        .unwrap_or(raw_str.trim())
-                                        .to_string();
-                                    tracing::debug!("Completion stream event: {}", json_data);
-                                    // Format as SSE event with proper newlines
-                                    let sse_bytes = Bytes::from(format!("data: {json_data}\n\n"));
-                                    accumulated_inner.lock().await.extend_from_slice(&sse_bytes);
-                                    Ok::<Bytes, Infallible>(sse_bytes)
                                }


This logic for extracting the chat_id locks the mutex twice: once to check if the ID is needed, and a second time to write it. This can be optimized to lock only once, which is a small but worthwhile improvement in this hot path.

{ let mut guard = chat_id_clone.lock().unwrap_or_else(|e| e.into_inner()); if guard.is_none() { if let Ok(chunk_str) = std::str::from_utf8(&event.raw_bytes) { if let Some(data) = chunk_str.strip_prefix("data: ") { if let Ok(serde_json::Value::Object(obj)) = serde_json::from_str::<serde_json::Value>( data.trim(), ) { if let Some(serde_json::Value::String(id)) = obj.get("id") { *guard = Some(id.clone()); } } } } } }

Fixed in 3291f0b — refactored to a single let mut guard = chat_id_clone.lock()... acquisition that covers both the .is_none() check and the write, eliminating the redundant second lock.

claude · 2026-02-28T05:46:45Z

Code Review PR 461: Criterion benchmarks + completions hot-path optimisations

Summary

The optimisations are well-motivated and the benchmark suite is a solid addition. Most of the changes are correct. Several issues need addressing before merge.

Critical Issues

1. Privacy violation - workspace/org names logged at debug level

File: crates/api/src/middleware/auth.rs, line 621

On a cache-hit the log is skipped (accidentally better for privacy), but on a DB miss workspace.name and organization.name are still emitted. These are customer metadata and must not appear in logs per CLAUDE.md. Replace with workspace_id/organization_id.

Additionally, lines 113 and 116 log the raw Authorization header value (the full API key string). These must be removed immediately - logging credentials is an absolute CLAUDE.md violation.

2. Double-lock / TOCTOU on chat_id_clone in streaming closure

File: crates/api/src/routes/completions.rs, lines 408-428

The code checks chat_id_clone.is_none() under one lock, releases it, then acquires a second independent lock to write. Between the two acquisitions another caller could have set the field. Fix: hold the guard across both the check and the write:

let mut guard = chat_id_clone.lock().unwrap_or_else(|e| e.into_inner());
if guard.is_none() {
    // parse JSON and set
    *guard = Some(id.clone());
}  // guard dropped here - no second lock needed

3. Streaming requests have no read-deadline - stalled connections hold concurrency slots indefinitely

File: crates/inference_providers/src/vllm/mod.rs

Removing .timeout() from streaming POSTs is intentional, but the only protection is a 30-second connect timeout. Once TCP is established and streaming begins, a slow or dead vLLM backend can stall indefinitely. The AtomicU32 concurrency slot is released only in Drop, so a stalled stream holds a slot until the client disconnects.

Minimum fix: document in a comment that stall protection is entirely client-disconnect-driven. Stronger fix: apply a per-chunk read deadline via tokio::time::timeout around the bytes_stream consumer.

4. Workspace context cache does not invalidate on org/workspace mutations

File: crates/api/src/middleware/auth.rs, workspace_context_cache (30s TTL)

If an org is suspended after a cache population, requests continue to succeed for up to 30 seconds. Please confirm that suspension/deactivation is enforced downstream (e.g. in usage.check_can_use) and not from the cached Organization struct. If the cached struct drives authorization decisions, either reduce the TTL significantly or add explicit cache invalidation in org mutation paths.

5. InterceptStream and StreamState made fully pub - permanent encapsulation break

File: crates/services/src/completions/mod.rs

Making these types and all their fields pub (even with doc(hidden)) solely to support benchmark struct construction is a permanent API surface leak. doc(hidden) does not restrict access from external crates.

Preferred fix: add a bench Cargo feature and gate the benchmark constructor behind cfg(any(test, feature = bench)). The bench entry in Cargo.toml uses required-features = [bench]. Internal types stay private in normal builds.

Minor / Non-blocking

String::from_utf8_lossy still allocates on every token (completions.rs line 436). The chat-id branch immediately above already uses zero-copy std::str::from_utf8 - use it here too.
cache_miss_and_insert benchmark does not measure a true miss per iteration. A fresh cache is created once per benchmark group; from the second b.iter call onward the key is already present and subsequent calls measure a cache hit. Move cache construction inside b.iter or use a unique key per iteration.
SyncMutex type alias is declared above the use tracing::... import, inconsistent with the rest of the file.

Privacy Checklist

auth.rs lines 113/116: raw API key logged at debug level - must be removed
auth.rs line 621: workspace/org names logged at debug level - must be replaced with IDs
No conversation content logged in new code
New code paths log only IDs and error types

Warning: Issues found - items 1-4 require fixes before merging. Item 5 is strongly recommended. The benchmark suite structure and performance reasoning are sound.

Copilot

Pull request overview

Adds a Criterion microbenchmark suite for the streaming completions hot path and includes several performance-oriented changes (mutex/caching/stream processing) to reduce per-token overhead and DB lookups.

Changes:

Add criterion benchmark suite for completions hot paths (SSE processing, InterceptStream::poll_next, model-resolution cache).
Optimize streaming completion pipeline and API streaming route (sync .map(), std::sync::Mutex, first-token-only JSON parse, reduced overhead).
Introduce/adjust caches (sticky chat routing cache in provider pool; model resolution cache in completion service; workspace context cache in auth middleware) and refine vLLM timeout semantics.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
crates/services/src/inference_provider_pool/mod.rs	Swap load-balancer index to `std::sync::Mutex` and replace sticky chat mapping with a TTL-bounded moka cache.
crates/services/src/completions/mod.rs	Make stream types public (hidden docs), simplify stream state machine, spawn signature storage after completion, add model-resolution cache.
crates/services/benches/completions_bench.rs	New Criterion benchmarks for SSE token processing, `InterceptStream`, and model cache hit/miss latency.
crates/services/Cargo.toml	Add Criterion dev-dependency and register the benchmark target.
crates/inference_providers/tests/integration_tests.rs	Update vLLM config field rename in tests.
crates/inference_providers/src/vllm/mod.rs	Rename timeout field and remove per-request timeout for streaming calls (keep connect timeout).
crates/api/src/routes/completions.rs	Optimize SSE byte stream mapping (sync map, std mutex, first-token-only JSON parse).
crates/api/src/middleware/auth.rs	Add a short-lived moka cache for workspace+org context during API key auth.
Cargo.lock	Lockfile updates for Criterion and transitive deps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-28T05:49:54Z

+    fn spawn_signature_storage(&self) {
        if !self.attestation_supported {
-            return Box::pin(async {});
+            return;
        }

        let chat_id = match &self.last_chat_id {
            Some(id) => id.clone(),
            None => {
                tracing::warn!("Cannot store signature: no chat_id received in stream");
-                return Box::pin(async {});
+                return;
            }
        };

        let attestation_service = self.attestation_service.clone();

-        Box::pin(async move {
+        tokio::spawn(async move {
            match tokio::time::timeout(


spawn_signature_storage calls tokio::spawn unconditionally. If InterceptStream is ever polled outside a Tokio runtime (e.g., in a non-Tokio executor or during some shutdown paths), this will panic. Consider mirroring record_usage_and_metrics by using tokio::runtime::Handle::try_current() and spawning via the handle (or skipping + logging) when no runtime is available.

Fixed in 3291f0b — spawn_signature_storage now uses Handle::try_current() and logs an error + returns early when no runtime is available, matching the pattern already used in record_usage_and_metrics.

Copilot · 2026-02-28T05:49:54Z

+        if let Some(cached) = self.model_resolution_cache.get(identifier).await {
+            return Ok(cached);
+        }
+        let result = self
+            .models_repository
+            .resolve_and_get_model(identifier)
+            .await?;
+        self.model_resolution_cache
+            .insert(identifier.to_string(), result.clone())
+            .await;


resolve_model_cached does a cache.get() followed by a DB call + cache.insert(). Under concurrency, many requests for the same model can still stampede the DB (multiple concurrent misses). Using moka::future::Cache::get_with/try_get_with (as done elsewhere in this file) would deduplicate in-flight fetches and ensure only one DB query per key at a time.

Suggested change

if let Some(cached) = self.model_resolution_cache.get(identifier).await {

return Ok(cached);

}

let result = self

.models_repository

.resolve_and_get_model(identifier)

.await?;

self.model_resolution_cache

.insert(identifier.to_string(), result.clone())

.await;

// Use try_get_with to deduplicate in-flight fetches and avoid cache stampede.

let id = identifier.to_string();

let models_repo = Arc::clone(&self.models_repository);

let result = self

.model_resolution_cache

.try_get_with(id.clone(), async move {

models_repo.resolve_and_get_model(&id).await

})

.await?;

Fixed in 3291f0b — replaced the manual get() + insert() with moka's try_get_with(), which deduplicates in-flight fetches and ensures only one DB query per key under concurrent misses.

Copilot · 2026-02-28T05:49:55Z

+    // Drive the stream synchronously via block_on (no tokio runtime overhead for the sync path)
+    let rt = tokio::runtime::Builder::new_current_thread()
+        .build()
+        .unwrap();
+    rt.block_on(async {
+        let _: Vec<_> = mapped.collect().await;
+    });


process_stream_new builds a new Tokio runtime on every call. Because this function is invoked inside b.iter, runtime construction overhead will dominate/contaminate the benchmark. Move runtime creation outside the iteration (or use futures::executor::block_on since this path doesn’t require Tokio) so the benchmark measures the stream processing rather than runtime setup.

Fixed in 3291f0b — process_stream_new now takes a &tokio::runtime::Runtime parameter, and the caller creates the runtime once outside b.iter(). This ensures the benchmark measures only stream processing overhead.

Copilot · 2026-02-28T05:49:55Z

+    group.bench_function("cache_miss_and_insert", |b| {
+        // Use a fresh cache per iteration to guarantee misses
+        let miss_cache: moka::future::Cache<String, Option<services::models::ModelWithPricing>> =
+            moka::future::Cache::builder()
+                .max_capacity(1_000)
+                .time_to_live(std::time::Duration::from_secs(60))
+                .build();
+        let model_for_insert = make_test_model();
+        b.iter(|| {
+            rt.block_on(async {
+                let key = "bench/test-model";
+                let cached = miss_cache.get(key).await;
+                if cached.is_none() {
+                    miss_cache
+                        .insert(key.to_string(), Some(model_for_insert.clone()))
+                        .await;
+                }
+            });
+        });
+    });


In cache_miss_and_insert, the comment says “Use a fresh cache per iteration to guarantee misses”, but miss_cache is created once outside b.iter. After the first iteration, the key will be present and subsequent iterations will measure cache hits instead of misses. Use iter_batched (create a fresh cache in the setup) or construct a new cache inside the iteration to keep the benchmark semantics correct.

Fixed in 3291f0b — switched to iter_batched(|| fresh_cache, |cache| ..., SmallInput) so each iteration gets a brand-new empty cache, guaranteeing a true miss + insert every time.

Copilot · 2026-02-28T05:49:55Z

        // Prepare encryption headers
        self.prepare_encryption_headers(&mut headers, &mut streaming_params.extra);

+        // No per-request timeout for streaming — the client-level connect_timeout
+        // protects against connection failures, and we must not kill long generations.
        let response = self
            .client
            .post(&url)
            .headers(headers)
            .json(&streaming_params)
-            .timeout(Duration::from_secs(self.config.timeout_seconds as u64))
            .send()


Streaming requests no longer set any overall/request timeout (only connect_timeout). If the upstream stalls mid-stream (connection established but no bytes), the request can hang indefinitely and tie up resources. Consider adding a separate configurable streaming idle/overall timeout (or enforcing an inactivity deadline while reading the SSE stream) so stalled streams eventually fail while still allowing long generations.

No change — the removal of per-request read_timeout for streaming was an intentional design decision in c5de4e0 ("fix: prevent infinite SSE error loop and remove aggressive read_timeout"). The aggressive timeout was killing valid long-running generations. The connect timeout still protects against connection failures, and client-side disconnection is handled by the InterceptStream Drop logic. Adding a streaming idle timeout is a separate feature that should be discussed independently.

Copilot · 2026-02-28T05:49:55Z

+        let mut indices = self
+            .load_balancer_index
+            .lock()
+            .unwrap_or_else(|e| e.into_inner());


Switching to std::sync::Mutex and recovering from poisoning via into_inner() will silently continue after a panic while potentially leaving the index map in an inconsistent state. Since this affects load-balancing behavior, consider logging when the lock is poisoned and resetting/clearing the map (or the specific entry) instead of silently proceeding.

Suggested change

let mut indices = self

.load_balancer_index

.lock()

.unwrap_or_else(|e| e.into_inner());

let mut indices = match self.load_balancer_index.lock() {

Ok(guard) => guard,

Err(poisoned) => {

eprintln!(

"inference_provider_pool: load_balancer_index mutex poisoned; \

clearing index map and continuing with empty state"

);

let mut guard = poisoned.into_inner();

// Clear potentially inconsistent state before continuing.

guard.clear();

guard

}

};

No change — the into_inner() pattern is idiomatic for non-critical state. The load balancer index map only contains round-robin counters; an inconsistent index just means a slightly suboptimal selection for one request, not data corruption or incorrect routing. Adding logging + clearing would reset all counters (affecting other models) to handle a scenario that's extremely unlikely in practice (panic while holding a fast sync lock on a HashMap insert).

- Fix cargo fmt formatting in benchmark file - Eliminate double mutex lock for chat_id extraction (single lock acquisition covers both check and write) in completions route and benchmark - Guard spawn_signature_storage with Handle::try_current() to prevent panic outside Tokio runtime, matching record_usage_and_metrics - Use moka try_get_with in resolve_model_cached to deduplicate concurrent DB fetches and prevent cache stampede - Move tokio runtime construction outside b.iter() in new_sync_path benchmark to avoid contaminating measurements - Use iter_batched for cache_miss_and_insert benchmark to create a fresh cache per iteration, ensuring true misses every time Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add blank doc-comment line between the list items and the following paragraph to prevent clippy from treating it as a lazy list continuation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Replace String::from_utf8_lossy + strip + format! re-framing with simple "\n" append (raw_bytes already contains "data: {...}\n") - Remove event.clone() in InterceptStream::poll_next by using owned binding - Only clone chat_id on first token (it never changes within a stream) - Expand benchmarks with operation-level breakdown groups for profiling Results (200 tokens): SSE path 50.6µs → 13.1µs (4x), InterceptStream 66.9µs → 42.9µs (1.6x), total 16x faster than original async path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…d validation hot paths Add four new benchmark suites covering all high and medium priority per-request operations: API key hashing/validation/cache/bloom filter, provider pool round-robin/sticky routing/pubkey filtering, Response API event serialization/mutex comparison/streaming, and request validation/ body hashing/image content scanning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… paths - Remove redundant serde_json::to_value() from ChatCompletionRequest::validate(); all MessageContent variants are infallibly serializable so the check can never fail (~1µs saved per 50-message request) - Switch responses streaming from tokio::sync::Mutex to std::sync::Mutex for byte accumulation and response_id tracking (2.5x faster, no await points needed) - Hash accumulated bytes directly from mutex guard instead of cloning the entire Vec on response.completed (eliminates O(response_size) allocation) - Replace format!() SSE formatting with pre-allocated Vec + extend_from_slice to avoid intermediate String allocations per token event - Eliminate index key String clone in provider pool by moving into HashMap::entry() and use rotate_left() instead of building a second ordered Vec Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace `if let Ok(e) = event` with `.iter().flatten()` and change `vec![]` to array literal for fixed-size metric_tags collection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-28T19:43:49Z

        };

-        let mut indices = self.load_balancer_index.write().await;
-        let index = indices.entry(index_key.clone()).or_insert(0);
-        let selected_index = *index % providers.len();
-
-        // Increment for next request
-        *index = (*index + 1) % providers.len();
+        let selected_index = {
+            let mut indices = self
+                .load_balancer_index
+                .lock()


This code still clones index_key (indices.entry(index_key.clone())), but the PR description mentions eliminating this clone by moving the key into HashMap::entry(). Either update the description to match the implementation, or restructure so index_key is moved into entry(index_key) (e.g., log index_key before the move, or avoid needing it after the entry).

Copilot · 2026-02-28T19:43:50Z

+        self.chat_id_mapping.insert(chat_id.clone(), provider).await;
        tracing::debug!("Stored chat_id mapping: {}", chat_id);


store_chat_id_mapping takes ownership of chat_id: String but still does chat_id.clone() just to insert into the cache and then log. If this is on a hot path, you can avoid the extra allocation by logging before the insert (borrowing &chat_id) and then moving chat_id into insert(chat_id, provider).

Suggested change

self.chat_id_mapping.insert(chat_id.clone(), provider).await;

tracing::debug!("Stored chat_id mapping: {}", chat_id);

tracing::debug!("Stored chat_id mapping: {}", chat_id);

self.chat_id_mapping.insert(chat_id, provider).await;

Copilot · 2026-02-28T19:43:50Z

+                                // raw_bytes is "data: {...}\n"; append one "\n" for SSE double-newline.
+                                // This avoids re-parsing + re-formatting the entire payload per token.
+                                let mut buf = Vec::with_capacity(event.raw_bytes.len() + 1);
+                                buf.extend_from_slice(&event.raw_bytes);
+                                buf.push(b'\n');
+                                let sse_bytes = Bytes::from(buf);
+                                accumulated_clone
+                                    .lock()
+                                    .unwrap_or_else(|e| e.into_inner())
+                                    .extend_from_slice(&sse_bytes);
+                                Ok::<Bytes, Infallible>(sse_bytes)


The new streaming passthrough assumes event.raw_bytes is already a valid SSE data: line and just appends an extra \n. However, some providers (e.g., Gemini via SSEEventParser::handles_raw_json()) can yield events where raw_bytes are raw JSON lines (no data: prefix). In that case this will send invalid SSE to clients and also hash different bytes than the previous re-framing logic. Consider detecting raw_bytes that don't start with b"data: " and wrapping them into data: {line}\n\n (or normalizing provider output earlier) before forwarding/accumulating.

Copilot · 2026-02-28T19:43:50Z

@@ -398,65 +399,60 @@ pub async fn chat_completions(

                // Convert to raw bytes stream with proper SSE formatting
                let byte_stream = peekable_stream
-                    .then(move |result| {
-                        let accumulated_inner = accumulated_clone.clone();
-                        let chat_id_inner = chat_id_clone.clone();
-                        let error_count_inner = error_count_clone.clone();
-                        let model_for_err = request_model.clone();
-                        async move {
-                            match result {
-                                Ok(event) => {
-                                    // Extract chat_id from the first chunk if available
-                                    if let Ok(chunk_str) =
-                                        String::from_utf8(event.raw_bytes.to_vec())
-                                    {
-                                        if let Some(data) = chunk_str.strip_prefix("data: ") {
-                                            if let Ok(serde_json::Value::Object(obj)) =
-                                                serde_json::from_str::<serde_json::Value>(
-                                                    data.trim(),
-                                                )
-                                            {
-                                                if let Some(serde_json::Value::String(id)) =
-                                                    obj.get("id")
+                    .map(move |result| {
+                        match result {
+                            Ok(event) => {
+                                // Only parse JSON to extract chat_id from the first chunk;
+                                // skip on all subsequent chunks to avoid per-token overhead.
+                                // Single lock acquisition covers both the check and the write.
+                                {
+                                    let mut guard =
+                                        chat_id_clone.lock().unwrap_or_else(|e| e.into_inner());
+                                    if guard.is_none() {
+                                        if let Ok(chunk_str) = std::str::from_utf8(&event.raw_bytes)
+                                        {
+                                            if let Some(data) = chunk_str.strip_prefix("data: ") {
+                                                if let Ok(serde_json::Value::Object(obj)) =
+                                                    serde_json::from_str::<serde_json::Value>(
+                                                        data.trim(),
+                                                    )
                                                {
-                                                    // Capture chat_id for use in the chain combinator
-                                                    // The real hash will be registered there after accumulating all bytes
-                                                    let mut cid = chat_id_inner.lock().await;
-                                                    if cid.is_none() {
-                                                        *cid = Some(id.clone());
+                                                    if let Some(serde_json::Value::String(id)) =
+                                                        obj.get("id")
+                                                    {
+                                                        *guard = Some(id.clone());
                                                    }
                                                }
                                            }
                                        }
                                    }
-
-                                    // raw_bytes contains "data: {...}\n", extract just the JSON part
-                                    let raw_str = String::from_utf8_lossy(&event.raw_bytes);
-                                    let json_data = raw_str
-                                        .trim()
-                                        .strip_prefix("data: ")
-                                        .unwrap_or(raw_str.trim())
-                                        .to_string();
-                                    tracing::debug!("Completion stream event: {}", json_data);
-                                    // Format as SSE event with proper newlines
-                                    let sse_bytes = Bytes::from(format!("data: {json_data}\n\n"));
-                                    accumulated_inner.lock().await.extend_from_slice(&sse_bytes);
-                                    Ok::<Bytes, Infallible>(sse_bytes)
                                }


chat_id_state is populated in the stream map, but it is never read anywhere in this handler (no later use in the chain, headers, or hashing). That means every streaming request still pays the JSON parse + mutex cost on the first chunk for no functional effect. Either remove chat_id_state/parsing entirely, or reintroduce the downstream use that consumes it (e.g., for registering the response hash / attestation).

Copilot · 2026-02-28T19:43:50Z

@@ -48,79 +46,88 @@ fn get_input_bucket(token_count: i32) -> &'static str {
    }
 }

-struct InterceptStream<S>
+#[doc(hidden)]
+pub struct InterceptStream<S>
 where
    S: Stream<Item = Result<SSEEvent, inference_providers::CompletionError>> + Unpin,
 {
-    inner: S,
-    attestation_service: Arc<dyn AttestationServiceTrait>,
-    usage_service: Arc<dyn UsageServiceTrait + Send + Sync>,
-    metrics_service: Arc<dyn MetricsServiceTrait>,
+    pub inner: S,
+    pub attestation_service: Arc<dyn AttestationServiceTrait>,
+    pub usage_service: Arc<dyn UsageServiceTrait + Send + Sync>,
+    pub metrics_service: Arc<dyn MetricsServiceTrait>,
    // IDs for usage tracking (database)
-    organization_id: Uuid,
-    workspace_id: Uuid,
-    api_key_id: Uuid,
-    model_id: Uuid,
+    pub organization_id: Uuid,
+    pub workspace_id: Uuid,
+    pub api_key_id: Uuid,
+    pub model_id: Uuid,
    #[allow(dead_code)] // Kept for potential debugging/logging use
-    model_name: String,
-    inference_type: crate::usage::ports::InferenceType,
-    service_start_time: Instant,
-    provider_start_time: Instant,
-    first_token_received: bool,
-    first_token_time: Option<Instant>,
+    pub model_name: String,
+    pub inference_type: crate::usage::ports::InferenceType,
+    pub service_start_time: Instant,
+    pub provider_start_time: Instant,
+    pub first_token_received: bool,
+    pub first_token_time: Option<Instant>,
    /// Time to first token in milliseconds (captured for DB storage)
-    ttft_ms: Option<i32>,
+    pub ttft_ms: Option<i32>,
    /// Token count for ITL calculation
-    token_count: i32,
+    pub token_count: i32,
    /// Last token time for ITL calculation
-    last_token_time: Option<Instant>,
+    pub last_token_time: Option<Instant>,
    /// Accumulated inter-token latency for average calculation
-    total_itl_ms: f64,
+    pub total_itl_ms: f64,
    // Pre-allocated low-cardinality metric tags (for Datadog/OTLP)
-    metric_tags: Vec<String>,
-    concurrent_counter: Option<Arc<AtomicU32>>,
+    pub metric_tags: Vec<String>,
+    pub concurrent_counter: Option<Arc<AtomicU32>>,
    /// Last received usage stats from streaming chunks
-    last_usage_stats: Option<inference_providers::TokenUsage>,
+    pub last_usage_stats: Option<inference_providers::TokenUsage>,
    /// Last chat ID from streaming chunks (for attestation and inference_id)
-    last_chat_id: Option<String>,
+    pub last_chat_id: Option<String>,
    /// Flag indicating the stream completed normally (received None from inner stream)
    /// If false when Drop is called, the client disconnected mid-stream
-    stream_completed: bool,
+    pub stream_completed: bool,
    /// Response ID when called from Responses API (for usage tracking FK)
-    response_id: Option<ResponseId>,
+    pub response_id: Option<ResponseId>,
    /// Last finish_reason from provider (e.g., "stop", "length", "tool_calls")
-    last_finish_reason: Option<inference_providers::FinishReason>,
+    pub last_finish_reason: Option<inference_providers::FinishReason>,
    /// Last error from provider (for determining stop_reason)
-    last_error: Option<inference_providers::CompletionError>,
-    state: StreamState,
+    pub last_error: Option<inference_providers::CompletionError>,
+    pub state: StreamState,
    /// Whether the model supports TEE attestation (false for external providers)
-    attestation_supported: bool,
+    pub attestation_supported: bool,


StreamState/InterceptStream were made pub with many pub fields (even though #[doc(hidden)]). This expands the crate’s public API surface and allows external crates to construct/mutate an internal state machine, which makes future refactors breaking changes. If this visibility is only needed for Criterion benches, consider gating these exports behind a bench feature (and setting the bench target required-features = ["bench"]), or exposing a minimal bench-only constructor/helper instead of making all fields public.

Copilot AI review requested due to automatic review settings February 28, 2026 05:43

ilblackdragon had a problem deploying to Cloud API test env February 28, 2026 05:43 — with GitHub Actions Failure

Copilot started reviewing on behalf of ilblackdragon February 28, 2026 05:43 View session

gemini-code-assist bot reviewed Feb 28, 2026

View reviewed changes

Copilot AI reviewed Feb 28, 2026

View reviewed changes

ilblackdragon had a problem deploying to Cloud API test env February 28, 2026 06:06 — with GitHub Actions Failure

fix: resolve clippy doc_lazy_continuation warning in benchmark

a5e7a37

Add blank doc-comment line between the list items and the following paragraph to prevent clippy from treating it as a lazy list continuation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ilblackdragon temporarily deployed to Cloud API test env February 28, 2026 06:30 — with GitHub Actions Inactive

ilblackdragon had a problem deploying to Cloud API test env February 28, 2026 07:35 — with GitHub Actions Failure

ilblackdragon and others added 2 commits February 28, 2026 00:18

ilblackdragon changed the title ~~feat: add criterion benchmarks for completions performance optimizations~~ perf: completions & responses streaming optimizations with comprehensive benchmarks Feb 28, 2026

ilblackdragon requested a review from Copilot February 28, 2026 19:38

Copilot started reviewing on behalf of ilblackdragon February 28, 2026 19:39 View session

fix: resolve clippy errors in completions_bench.rs

44d44d4

Replace `if let Ok(e) = event` with `.iter().flatten()` and change `vec![]` to array literal for fixed-size metric_tags collection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ilblackdragon had a problem deploying to Cloud API test env February 28, 2026 19:42 — with GitHub Actions Failure

Copilot AI reviewed Feb 28, 2026

View reviewed changes

-        let mut indices = self
-            .load_balancer_index
-            .lock()
-            .unwrap_or_else(|e| e.into_inner());
+        let mut indices = match self.load_balancer_index.lock() {
+            Ok(guard) => guard,
+            Err(poisoned) => {
+                eprintln!(
+                    "inference_provider_pool: load_balancer_index mutex poisoned; \
+                     clearing index map and continuing with empty state"
+                );
+                let mut guard = poisoned.into_inner();
+                // Clear potentially inconsistent state before continuing.
+                guard.clear();
+                guard
+            }
+        };

		self.chat_id_mapping.insert(chat_id.clone(), provider).await;
		tracing::debug!("Stored chat_id mapping: {}", chat_id);

Conversation

ilblackdragon commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Optimizations Applied

Benchmark Suites Added (5 total)

Key Benchmark Results (post-optimization)

Test plan

Uh oh!

gemini-code-assist bot commented Feb 28, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot commented Feb 28, 2026

Code Review PR 461: Criterion benchmarks + completions hot-path optimisations

Summary

Critical Issues

1. Privacy violation - workspace/org names logged at debug level

2. Double-lock / TOCTOU on chat_id_clone in streaming closure

3. Streaming requests have no read-deadline - stalled connections hold concurrency slots indefinitely

4. Workspace context cache does not invalidate on org/workspace mutations

5. InterceptStream and StreamState made fully pub - permanent encapsulation break

Minor / Non-blocking

Privacy Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

ilblackdragon commented Feb 28, 2026 •

edited

Loading