Summary
Follow-up from #3970 (onboarding OAuth single-flight). The single-flight guard added there fixes the reported sequential vector but is not atomic under concurrent starts.
The race
api/oauth.py::_pending_oauth_flow_for(provider, hermes_home) acquires _OAUTH_FLOWS_LOCK, checks for a live pending flow, then releases the lock. The caller then does the device-code request and inserts the new flow. So:
check (lock held) → release → request device code → insert flow → spawn worker
Two simultaneous unauthenticated POST /api/onboarding/oauth/start requests for the same provider/home can both pass the existence check before either inserts → both request device codes, both spawn polling workers, both leave pending flows. Codex reproduced this with a concurrent two-thread start against the #3970 head.
- Codex path (
start_onboarding_oauth_flow, oauth.py:~749): larger window (includes the _request_codex_user_code() network round-trip).
- Anthropic path (
_start_anthropic_flow, oauth.py:~705): same shape, smaller window.
Severity
Low. #3970 already reduces accumulation from O(total requests) → O(burst-concurrency within one ~15-min flow lifetime), the reported 25-requests→25-workers vector is fully fixed, and the 300s terminal purge cleans stragglers. This is a tightening of the guard's stated "single-flight" contract under concurrency, not a re-opening of the original DoS.
Fix
Make check → reserve → spawn atomic per (canonical_provider, canonical_hermes_home):
- A per-key start lock (
dict[key, threading.Lock] guarded by a meta-lock, or a single coarse start lock) held across the existence check, device-code request, flow insertion, and worker spawn; or
- Insert a placeholder
pending flow under the same _OAUTH_FLOWS_LOCK acquisition as the existence check (before the device-code request), rolling it back on failure.
Add a regression test driving two concurrent starts for the same provider/home, asserting exactly one worker spawn / one flow_id.
Refs #3970.
Summary
Follow-up from #3970 (onboarding OAuth single-flight). The single-flight guard added there fixes the reported sequential vector but is not atomic under concurrent starts.
The race
api/oauth.py::_pending_oauth_flow_for(provider, hermes_home)acquires_OAUTH_FLOWS_LOCK, checks for a live pending flow, then releases the lock. The caller then does the device-code request and inserts the new flow. So:Two simultaneous unauthenticated
POST /api/onboarding/oauth/startrequests for the same provider/home can both pass the existence check before either inserts → both request device codes, both spawn polling workers, both leave pending flows. Codex reproduced this with a concurrent two-thread start against the #3970 head.start_onboarding_oauth_flow, oauth.py:~749): larger window (includes the_request_codex_user_code()network round-trip)._start_anthropic_flow, oauth.py:~705): same shape, smaller window.Severity
Low. #3970 already reduces accumulation from O(total requests) → O(burst-concurrency within one ~15-min flow lifetime), the reported 25-requests→25-workers vector is fully fixed, and the 300s terminal purge cleans stragglers. This is a tightening of the guard's stated "single-flight" contract under concurrency, not a re-opening of the original DoS.
Fix
Make check → reserve → spawn atomic per
(canonical_provider, canonical_hermes_home):dict[key, threading.Lock]guarded by a meta-lock, or a single coarse start lock) held across the existence check, device-code request, flow insertion, and worker spawn; orpendingflow under the same_OAUTH_FLOWS_LOCKacquisition as the existence check (before the device-code request), rolling it back on failure.Add a regression test driving two concurrent starts for the same provider/home, asserting exactly one worker spawn / one
flow_id.Refs #3970.