Skip to content

Onboarding OAuth start: make single-flight atomic under concurrent starts (follow-up to #3970) #3972

@nesquena-hermes

Description

@nesquena-hermes

Summary

Follow-up from #3970 (onboarding OAuth single-flight). The single-flight guard added there fixes the reported sequential vector but is not atomic under concurrent starts.

The race

api/oauth.py::_pending_oauth_flow_for(provider, hermes_home) acquires _OAUTH_FLOWS_LOCK, checks for a live pending flow, then releases the lock. The caller then does the device-code request and inserts the new flow. So:

check (lock held) → release → request device code → insert flow → spawn worker

Two simultaneous unauthenticated POST /api/onboarding/oauth/start requests for the same provider/home can both pass the existence check before either inserts → both request device codes, both spawn polling workers, both leave pending flows. Codex reproduced this with a concurrent two-thread start against the #3970 head.

  • Codex path (start_onboarding_oauth_flow, oauth.py:~749): larger window (includes the _request_codex_user_code() network round-trip).
  • Anthropic path (_start_anthropic_flow, oauth.py:~705): same shape, smaller window.

Severity

Low. #3970 already reduces accumulation from O(total requests) → O(burst-concurrency within one ~15-min flow lifetime), the reported 25-requests→25-workers vector is fully fixed, and the 300s terminal purge cleans stragglers. This is a tightening of the guard's stated "single-flight" contract under concurrency, not a re-opening of the original DoS.

Fix

Make check → reserve → spawn atomic per (canonical_provider, canonical_hermes_home):

  • A per-key start lock (dict[key, threading.Lock] guarded by a meta-lock, or a single coarse start lock) held across the existence check, device-code request, flow insertion, and worker spawn; or
  • Insert a placeholder pending flow under the same _OAUTH_FLOWS_LOCK acquisition as the existence check (before the device-code request), rolling it back on failure.

Add a regression test driving two concurrent starts for the same provider/home, asserting exactly one worker spawn / one flow_id.

Refs #3970.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingsprint-candidateStrong candidate for next sprint

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions