Skip to content

perf(providers): per-home probe-worker pool + scoped credential invalidation (#3787)#3940

Open
rodboev wants to merge 5 commits into
nesquena:masterfrom
rodboev:pr/probe-worker-pool-contention
Open

perf(providers): per-home probe-worker pool + scoped credential invalidation (#3787)#3940
rodboev wants to merge 5 commits into
nesquena:masterfrom
rodboev:pr/probe-worker-pool-contention

Conversation

@rodboev

@rodboev rodboev commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Thinking Path

  • Reviewed _AccountUsageProbeWorker.fetch() at �pi/providers.py:1259: non-blocking _lock.acquire(blocking=False) immediately falls back to a cold subprocess spawn when the single worker is busy, producing O(N) cold spawns under N concurrent probes for the same home.
  • Reviewed _close_account_usage_probe_workers_async() at line 1468 and invalidate_account_usage_status_cache() at line 1506: credential update for any provider flushes every home's worker pool unconditionally.
  • Fix: replace the single-worker-per-home slot with a small pool (N=2), add a short bounded wait before falling back to cold spawn, and narrow the async flush to only the active home when a specific provider's credential changes.

What Changed

  • �pi/providers.py: changed _account_usage_worker_pool from dict[str, Worker] to dict[str, list[Worker]] with _ACCOUNT_USAGE_WORKERS_PER_HOME = 2; _get_account_usage_probe_worker() tries all pool slots non-blocking then waits briefly on slot 0 before returning None; _close_account_usage_probe_workers_async() accepts an optional provider_id and scopes flush to the affected home when set; invalidate_account_usage_status_cache() passes the provider id through

Why It Matters

Same-home contention under rapid Settings-panel polls no longer degrades to repeated cold subprocess spawns, reducing tail latency for quota probes; credential rotation for one provider stops evicting warm workers for unrelated providers.

Verification

  • pytest tests/test_issue3787_probe_pool.py -v --timeout=60
  • pytest tests/ -v --timeout=60

Risks / Follow-ups

  • Pool size 2 is conservative; a follow-up can tune _ACCOUNT_USAGE_WORKERS_PER_HOME or make it configurable via env var if profiling shows more contention
  • Scoped flush currently targets only the active home; multi-profile installs that share a single process but switch HERMES_HOME at runtime could benefit from a per-home credential-change signal in a follow-up

Model Used

Claude Opus 4.6 via Claude Code CLI

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown

Greptile Summary

This PR replaces the single-worker-per-home slot with a two-slot pool (_ACCOUNT_USAGE_WORKERS_PER_HOME = 2) and fixes a TOCTOU race by returning the worker with its lock already held, requiring the caller to release it via try/finally. Scoped credential invalidation now flushes only the active home's workers when a specific provider ID is given, rather than evicting all workers unconditionally.

  • The stale-detection condition was changed from proc is None or proc.poll() is not None to proc is not None and proc.poll() is not None to prevent replenished (never-started) workers from being immediately evicted. This is correct for the replenishment case, but it breaks test_account_usage_cleanup_removes_null_proc_worker: after worker.close() sets _proc = None, cleanup no longer recognises those workers as stale, and the assertion assert str(tmp_path) not in providers._account_usage_worker_pool at line 1509 will fail.
  • The worker list scan in _get_account_usage_probe_worker runs outside the pool lock; this is safe but worth documenting, as a concurrent _close_account_usage_probe_workers can silently push the caller to the cold-spawn fallback.

Confidence Score: 4/5

Safe to merge once the broken test in test_provider_quota_status.py is reconciled with the changed stale-detection logic.

The stale-detection condition was intentionally narrowed so freshly-created replenished workers are not immediately evicted. That fix is correct, but test_account_usage_cleanup_removes_null_proc_worker calls worker.close() and then expects the pool key to be absent after cleanup — an expectation that no longer holds. The assertion at line 1509 will fail in CI.

tests/test_provider_quota_status.py — the null-proc cleanup test needs its assertion and setup updated to account for the changed stale-detection logic.

Important Files Changed

Filename Overview
api/providers.py Pool expanded to list[Worker], lock-held handoff implemented, scoped flush added; stale-detection logic for null-proc workers is inconsistent with an existing test assertion.
tests/test_issue3787_probe_pool.py New test suite covering pool sizing, non-blocking acquire, saturation fallback, scoped invalidation, concurrent no-double-claim, cleanup replenishment.
tests/test_provider_quota_status.py Adapted to list-of-workers pool shape; test_account_usage_cleanup_removes_null_proc_worker retains an assertion that will fail under the new stale-detection logic.

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant G as _get_account_usage_probe_worker
    participant P as worker_pool dict
    participant W0 as Worker[0]
    participant W1 as Worker[1]
    participant F as cold fallback

    C->>G: call(home)
    G->>P: acquire pool_lock, get workers list
    alt pool missing
        G->>P: create 2 new Workers, store list
    end
    G->>P: release pool_lock
    G->>W0: "_lock.acquire(blocking=False)"
    alt W0 free
        W0-->>G: acquired
        G-->>C: return W0 (lock held)
        C->>W0: _fetch_locked()
        W0-->>C: snapshot
        C->>W0: _lock.release()
    else W0 busy
        G->>W1: "_lock.acquire(blocking=False)"
        alt W1 free
            W1-->>G: acquired
            G-->>C: return W1 (lock held)
            C->>W1: _fetch_locked()
            W1-->>C: snapshot
            C->>W1: _lock.release()
        else pool saturated
            G-->>C: return None
            C->>F: _fetch_account_usage_once_for_home()
            F-->>C: snapshot (cold subprocess)
        end
    end
Loading

Reviews (5): Last reviewed commit: "fix(providers): treat proc=None as idle,..." | Re-trigger Greptile

Comment thread api/providers.py
Comment thread api/providers.py Outdated
Comment thread tests/test_issue3787_probe_pool.py Outdated
@nesquena-hermes

Copy link
Copy Markdown
Collaborator

Read the full api/providers.py diff against origin/master, the _AccountUsageProbeWorker class (api/providers.py:1225-1318), and the new tests/test_issue3787_probe_pool.py. The pool/cleanup refactor and the scoped-flush plumbing are well-structured, but there's a check-then-act race in the selector that I think undercuts the stated tail-latency goal.

The selector reserves nothing — it acquires, releases, then returns

_get_account_usage_probe_worker picks a worker that looks free, but releases the lock before returning it:

for w in workers:
    if w._lock.acquire(blocking=False):
        w._lock.release()
        return w
if workers[0]._lock.acquire(timeout=_ACCOUNT_USAGE_WORKER_WAIT_SECONDS):
    workers[0]._lock.release()
    return workers[0]
return None

The caller then re-acquires in fetch() (api/providers.py:1259-1264):

def fetch(self, provider, *, api_key=None):
    if not self._lock.acquire(blocking=False):
        return _fetch_account_usage_once_for_home(provider, self.home, api_key=api_key)

Between the release() in the selector and the acquire(blocking=False) in fetch(), nothing holds the worker. Two probes for the same home racing through _get_account_usage_probe_worker can both observe workers[0] free, both release, and both return workers[0]; the second one's fetch() then finds the lock taken and falls back to _fetch_account_usage_once_for_home — a cold subprocess spawn. That's exactly the O(N) cold-spawn path #3787 set out to remove. The N=2 pool reduces the probability but doesn't close the window, because selection and use aren't atomic.

Suggested shape: hand back a held worker

Have the selector acquire and keep the lock, returning the locked worker (or None), and have the caller release it. Roughly:

for w in workers:
    if w._lock.acquire(blocking=False):
        return w                      # caller releases
if workers[0]._lock.acquire(timeout=_ACCOUNT_USAGE_WORKER_WAIT_SECONDS):
    return workers[0]
return None

and in _agent_fetch_account_usage_for_home, call a lock-free worker._fetch_locked(...) inside a try/finally: worker._lock.release(). Since _lock is an RLock (api/providers.py:1229), _fetch_locked's own self.last_used/_ensure_process calls remain reentrant-safe. That guarantees the two concurrent probes actually get the two distinct workers, which is the whole point.

Test gap

test_nonblocking_acquire_returns_idle_worker holds workers[0]._lock from a background thread for the entire assertion, so it validates the selector in isolation but never exercises the selector → fetch() round trip where the race lives. A test that spawns 2+ real probes concurrently (mocking the subprocess) and asserts zero cold-spawn fallbacks would actually pin the contention behavior. As written, the suite passes even with the TOCTOU window open.

Minor: scoped-flush vs cache asymmetry

In invalidate_account_usage_status_cache (api/providers.py:1537), the status cache is cleared for the provider across all homes (key[0] == normalized), but _close_account_usage_probe_workers_async(provider_id=...) now flushes workers only for the active home. Your Risks section already calls this out for multi-profile installs; worth a one-line code comment at the flush site so the next reader knows the all-homes cache clear vs active-home worker flush asymmetry is intentional, not an oversight.

The cleanup refactor (per-worker stale collection + remaining_workers rebuild) looks correct — busy workers are skipped via acquire(blocking=False) and survive into remaining_workers, and identity comparison on w == sw is fine since _AccountUsageProbeWorker has no __eq__.

Comment thread api/providers.py
Comment thread api/providers.py Outdated
@nesquena-hermes

Copy link
Copy Markdown
Collaborator

Followed up on the three commits pushed after my last review (90eab281, f4f2e0ce, a611440c). Read the new _get_account_usage_probe_worker / _agent_fetch_account_usage_for_home / _cleanup_account_usage_probe_workers in api/providers.py at HEAD and the adapted tests/test_provider_quota_status.py.

The TOCTOU handoff is fixed — thank you

90eab281 does exactly what was needed. The selector now keeps the lock and the caller releases it:

def _get_account_usage_probe_worker(home: Path) -> "_AccountUsageProbeWorker | None":
    ...
    for w in workers:
        if w._lock.acquire(blocking=False):
            return w          # caller releases
    return None

and _agent_fetch_account_usage_for_home (api/providers.py:1573-1585) uses worker._fetch_locked(...) inside try/finally: worker._lock.release(). Two concurrent same-home probes now get the two distinct workers instead of both racing onto workers[0] and bouncing the loser to a cold spawn. That closes the window.

But a611440c regressed CI — shard 1 is red on 3.11/3.12/3.13

CI shows test (3.11, 1), test (3.12, 1), test (3.13, 1) all FAILED while shards 0 and 2 pass on every version. Three versions, one shard, deterministic — that's a single assertion failing, not a flake, and it's test_account_usage_cleanup_removes_null_proc_worker (greptile called this exact risk in its initial summary).

The last commit flipped the staleness predicate (api/providers.py:1459):

is_dead = proc is not None and proc.poll() is not None

That's correct for replenished workers (proc=None, never started → keep). But the test closes every worker and expects cleanup to evict the home:

for worker in workers:
    worker.close()                 # sets self._proc = None
providers._cleanup_account_usage_probe_workers()
assert str(tmp_path) not in providers._account_usage_worker_pool   # fails now

After close() (api/providers.py:1235-1238) _proc is None, so is_dead is False, and last_used was just touched by the _fetch_locked call, so cutoff - last_used >= idle_seconds is also False. Neither stale condition fires → the home key stays → assertion fails.

The root conflict: proc is None is now overloaded. It means both "freshly replenished, keep" and "explicitly closed, evict", and the new predicate can't tell them apart.

Suggested fix — make "closed" explicit instead of inferring it from proc=None

Add a _closed flag that close() sets and _ensure_process() clears, then treat it as dead:

def close(self) -> None:
    with self._lock:
        proc = self._proc
        self._proc = None
        self._closed = True        # explicit
    self._close_process(proc)
# in _cleanup_account_usage_probe_workers (~line 1459)
is_dead = getattr(worker, "_closed", False) or (proc is not None and proc.poll() is not None)

_ensure_process already nulls/relaunches _proc; just set self._closed = False there when a new Popen succeeds so a relaunched worker isn't mistaken for closed. A replenished worker that was never started has _closed = False and _proc = None, so it correctly survives; a close()d worker is correctly evicted, and the existing test passes unchanged with assert ... >= 1.

Once shard 1 is green this looks ready. The cleanup/replenish rebuild (remaining_workers + top-up to N at :1463-1471) and the held-worker handoff are both sound.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants