perf: attestation optimizations + persistent GPU worker + e2e benchmarks#70
Conversation
…ilities

- aws-lc-rs 1.15.4 → 1.16.2 (pulls aws-lc-sys 0.37.0 → 0.39.0)
  Fixes RUSTSEC-2026-0044 through 0048 (X.509 bypass, PKCS7 bypasses, AES-CCM timing, CRL scope check)
- rustls-webpki 0.103.9 → 0.103.10
  Fixes RUSTSEC-2026-0049 (CRL Distribution Point matching bypass)
Summary of Changes

This pull request addresses critical security vulnerabilities by upgrading key cryptographic and TLS-related dependencies. The updates ensure the application benefits from the latest security patches, maintaining the integrity of certificate validation and cryptographic operations.
Code Review
This pull request updates aws-lc-rs and rustls-webpki dependencies to address several security vulnerabilities. The changes are confined to Cargo.lock, reflecting the updated versions of these transitive dependencies. The changes are consistent with the stated goal of resolving the listed RUSTSEC advisories. No issues were found in this update.
Attestation generation optimization:

- Parallelize TDX quote + GPU evidence collection with tokio::try_join! (previously sequential — these are independent: TDX talks to the dstack socket, GPU evidence spawns a Python subprocess)
- Cache dstack info() with OnceCell (static data that never changes during the process lifetime — was re-fetched on every attestation)

New benchmark suite (benches/e2e.rs):

- End-to-end JSON completion flow (1/5/20 messages)
- End-to-end streaming completion flow (5/20/50 chunks)
- Attestation cache operations (hit/miss/set/semaphore)
- Attestation response serialization
- Request body processing pipeline (SHA256, JSON parse, reserialize)
- Response signing full pipeline (parse → hash → sign → cache)
- Streaming SSE parse+hash pipeline
- Auth token constant-time comparison
- JSON body round-trip (parse, modify, reserialize)
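The proxy itself is Rust (tokio::try_join!), but the pattern is language-agnostic. A minimal asyncio sketch of the same idea — run the two independent collection steps concurrently and propagate the first failure — with hypothetical stand-in functions (the real calls talk to the dstack socket and the GPU worker):

```python
import asyncio

async def fetch_tdx_quote() -> dict:
    # hypothetical stand-in: in the real proxy this talks to the dstack socket
    await asyncio.sleep(0.01)
    return {"quote": "tdx"}

async def collect_gpu_evidence() -> dict:
    # hypothetical stand-in: in the real proxy this drives the Python GPU worker
    await asyncio.sleep(0.01)
    return {"evidence": ["gpu0"]}

async def generate_attestation() -> dict:
    # gather() with the default return_exceptions=False raises the first
    # exception it sees, which is the behavior tokio::try_join! gives the
    # Rust side; both steps run concurrently instead of back-to-back.
    quote, gpu = await asyncio.gather(fetch_tdx_quote(), collect_gpu_evidence())
    return {**quote, **gpu}
```

Because the two steps overlap, the attestation latency approaches max(tdx, gpu) rather than their sum.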
Replace subprocess-per-call with a long-running Python process that keeps
the interpreter, verifier module imports, and NVML driver initialized
across requests. Communication via JSON lines over stdin/stdout pipes.
Before: each attestation spawns python3, imports verifier/cc_admin,
calls nvmlInit(), collects evidence, exits. ~0.5-2s overhead per call
just from Python startup + module loading + NVML initialization.
After: worker spawns once, stays alive, processes nonce requests via
pipe. Only the actual GPU evidence collection time remains (~1-5s
depending on GPU load). Python startup + import + nvmlInit amortized
to zero after first call.
Design:
- GpuEvidenceWorker struct manages the child process lifecycle
- Worker sends {"ready": true} on stdout after initialization
- Requests: {"nonce": "<hex>", "no_gpu_mode": bool}
- Responses: {"ok": true, "evidence": [...]} or {"ok": false, "error": "..."}
- Auto-restart on worker death with one retry
- Falls back to subprocess-per-call if worker can't spawn
- All access serialized by existing gpu_semaphore (NVML constraint)
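The protocol bullets above can be sketched as a minimal worker loop. This is an illustration under the stated protocol, not the real gpu_evidence_worker.py (which also initializes NVML and redirects stdout); collect_evidence is a hypothetical stand-in for the NVIDIA verifier call:

```python
import json
import sys

def collect_evidence(nonce: str, no_gpu_mode: bool) -> list:
    # hypothetical stand-in for the real verifier/cc_admin evidence collection
    return [{"nonce": nonce, "no_gpu_mode": no_gpu_mode}]

def serve(inp, out):
    """One JSON object per line in, one per line out."""
    # signal readiness once initialization is done
    out.write(json.dumps({"ready": True}) + "\n")
    out.flush()
    for line in inp:
        req = json.loads(line)
        try:
            evidence = collect_evidence(req["nonce"], req.get("no_gpu_mode", False))
            resp = {"ok": True, "evidence": evidence}
        except Exception as exc:  # report errors in-band, never crash the loop
            resp = {"ok": False, "error": str(exc)}
        out.write(json.dumps(resp) + "\n")
        out.flush()

if __name__ == "__main__":
    serve(sys.stdin, sys.stdout)
```

Keeping the loop alive is what amortizes Python startup, module imports, and nvmlInit to zero after the first call.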
Async benchmark for attestation and completion endpoints with configurable concurrency and duration.

Tests:
- Attestation cached (no nonce) — measures cache hit path
- Attestation fresh (with nonce) — forces GPU evidence + TDX quote
- Chat completion (non-streaming and streaming)

Reports p50/p90/p99 latencies, throughput, error rates.

Usage: uv run scripts/bench_live.py <endpoint> -c 20 -d 60
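The p50/p90/p99 figures come from the per-request latency samples; one common way to compute them is the nearest-rank percentile, sketched below (the exact method bench_live.py uses is not shown in this PR):

```python
import math

def percentile(latencies: list, p: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    if not latencies:
        raise ValueError("no samples")
    ordered = sorted(latencies)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]
```

Nearest-rank always returns an observed sample (no interpolation), which keeps tail percentiles honest for skewed latency distributions.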
…ption

The NVIDIA verifier library (cc_admin) prints info messages directly to stdout (e.g. "Number of GPUs available : 8"), corrupting the JSON line protocol. Fix by dup'ing the real stdout fd at startup, redirecting sys.stdout to stderr, and using the saved fd for protocol messages.

Also fix fallback: when the worker spawns but evidence collection fails on both attempts, it now falls back to subprocess instead of returning an error.
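A minimal sketch of that fd dance, assuming it runs once at worker startup (helper names are hypothetical; this is not the actual gpu_evidence_worker.py code):

```python
import json
import os
import sys

def open_protocol_channel():
    """Reserve the real stdout for protocol messages.

    Duplicates the stdout fd into a private line-buffered writer, then
    points sys.stdout at stderr so any library print() (e.g. cc_admin's
    "Number of GPUs available : 8") lands on stderr instead of
    corrupting the JSON line stream.
    """
    saved_fd = os.dup(sys.stdout.fileno())         # private copy of real stdout
    proto = os.fdopen(saved_fd, "w", buffering=1)  # line-buffered protocol writer
    sys.stdout = sys.stderr                        # stray prints go to stderr now
    return proto

def send(proto, obj):
    # one JSON object per line on the saved fd
    proto.write(json.dumps(obj) + "\n")
    proto.flush()
```

After this, only send() ever writes to the original stdout, so the Rust side can parse every line it reads as JSON.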
…ialization
Three improvements for attestation latency:
1. Cache pre-serialized JSON bytes instead of structs
- Cache hit now returns bytes::Bytes directly (zero-copy clone)
- Eliminates report.clone() + AttestationResponse construction +
serde_json::to_value + Json() serialization on every cached request
- 297KB response was being fully re-serialized on every hit
2. Remove wide semaphore — use worker Mutex only
- Previously: semaphore wrapped entire generate_attestation_inner()
(TDX quote + GPU evidence + dstack info)
- Now: only GPU evidence is serialized (via worker Mutex)
- Concurrent fresh attestation requests can overlap TDX quotes
3. Return AttestationResult enum from generate_attestation
- CachedBytes: pre-serialized bytes sent directly to client
- Fresh: report that needs one-time serialization
- Route handler branches on variant, avoids redundant work
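The bytes-caching idea in item 1 is language-agnostic; a minimal Python sketch of a single-slot cache that serializes once at insert time (the real code caches bytes::Bytes in Rust, where clone() is a cheap refcount bump, and is not shown here):

```python
import json

class AttestationCache:
    """Single-slot cache holding the response pre-serialized as bytes.

    set() pays the serialization cost once; every hit returns the same
    immutable bytes object, so a ~297KB report is never re-serialized
    per request.
    """

    def __init__(self):
        self._bytes = None

    def set(self, report: dict) -> None:
        self._bytes = json.dumps(report).encode("utf-8")

    def get(self):
        # None on a miss; otherwise the pre-serialized response bytes
        return self._bytes
```

A cache hit is now a pointer handoff instead of a struct clone plus a full JSON serialization.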
Summary
Attestation performance overhaul + security dependency updates. Tested on live endpoints (gpu07, gpu23, gpu11).
Security fixes
Attestation optimizations
- tokio::try_join! instead of sequential
- OnceCell, static data fetched once per process
- bytes::Bytes directly on cache hit, no clone/serialize of 297KB struct

Live benchmark results (gpu07, concurrency=20, 60s)
Cached attestation bytes optimization not yet benchmarked on live (this PR).
New files
- gpu_evidence_worker.py — persistent Python worker (JSON line protocol over stdin/stdout)
- benches/e2e.rs — end-to-end benchmarks (attestation, completions, proxy overhead)
- scripts/bench_live.py — live endpoint benchmark tool (uv run scripts/bench_live.py <url>)

Worker design
- sys.stdout redirected to stderr to prevent NVIDIA verifier library prints from corrupting the JSON protocol

Test plan