perf(providers): per-home probe-worker pool + scoped credential invalidation (#3787) by rodboev · Pull Request #3940 · nesquena/hermes-webui

rodboev · 2026-06-10T16:30:37Z

Thinking Path

Reviewed _AccountUsageProbeWorker.fetch() at �pi/providers.py:1259: non-blocking _lock.acquire(blocking=False) immediately falls back to a cold subprocess spawn when the single worker is busy, producing O(N) cold spawns under N concurrent probes for the same home.
Reviewed _close_account_usage_probe_workers_async() at line 1468 and invalidate_account_usage_status_cache() at line 1506: credential update for any provider flushes every home's worker pool unconditionally.
Fix: replace the single-worker-per-home slot with a small pool (N=2), add a short bounded wait before falling back to cold spawn, and narrow the async flush to only the active home when a specific provider's credential changes.

What Changed

�pi/providers.py: changed _account_usage_worker_pool from dict[str, Worker] to dict[str, list[Worker]] with _ACCOUNT_USAGE_WORKERS_PER_HOME = 2; _get_account_usage_probe_worker() tries all pool slots non-blocking then waits briefly on slot 0 before returning None; _close_account_usage_probe_workers_async() accepts an optional provider_id and scopes flush to the affected home when set; invalidate_account_usage_status_cache() passes the provider id through

Why It Matters

Same-home contention under rapid Settings-panel polls no longer degrades to repeated cold subprocess spawns, reducing tail latency for quota probes; credential rotation for one provider stops evicting warm workers for unrelated providers.

Verification

pytest tests/test_issue3787_probe_pool.py -v --timeout=60
pytest tests/ -v --timeout=60

Risks / Follow-ups

Pool size 2 is conservative; a follow-up can tune _ACCOUNT_USAGE_WORKERS_PER_HOME or make it configurable via env var if profiling shows more contention
Scoped flush currently targets only the active home; multi-profile installs that share a single process but switch HERMES_HOME at runtime could benefit from a per-home credential-change signal in a follow-up

Model Used

Claude Opus 4.6 via Claude Code CLI

…idation (nesquena#3787)

greptile-apps · 2026-06-10T16:36:37Z

Greptile Summary

This PR replaces the single-worker-per-home slot with a two-slot pool (_ACCOUNT_USAGE_WORKERS_PER_HOME = 2) and fixes a TOCTOU race by returning the worker with its lock already held, requiring the caller to release it via try/finally. Scoped credential invalidation now flushes only the active home's workers when a specific provider ID is given, rather than evicting all workers unconditionally.

The stale-detection condition was changed from proc is None or proc.poll() is not None to proc is not None and proc.poll() is not None to prevent replenished (never-started) workers from being immediately evicted. This is correct for the replenishment case, but it breaks test_account_usage_cleanup_removes_null_proc_worker: after worker.close() sets _proc = None, cleanup no longer recognises those workers as stale, and the assertion assert str(tmp_path) not in providers._account_usage_worker_pool at line 1509 will fail.
The worker list scan in _get_account_usage_probe_worker runs outside the pool lock; this is safe but worth documenting, as a concurrent _close_account_usage_probe_workers can silently push the caller to the cold-spawn fallback.

Confidence Score: 4/5

Safe to merge once the broken test in test_provider_quota_status.py is reconciled with the changed stale-detection logic.

The stale-detection condition was intentionally narrowed so freshly-created replenished workers are not immediately evicted. That fix is correct, but test_account_usage_cleanup_removes_null_proc_worker calls worker.close() and then expects the pool key to be absent after cleanup — an expectation that no longer holds. The assertion at line 1509 will fail in CI.

tests/test_provider_quota_status.py — the null-proc cleanup test needs its assertion and setup updated to account for the changed stale-detection logic.

Important Files Changed

Filename	Overview
api/providers.py	Pool expanded to list[Worker], lock-held handoff implemented, scoped flush added; stale-detection logic for null-proc workers is inconsistent with an existing test assertion.
tests/test_issue3787_probe_pool.py	New test suite covering pool sizing, non-blocking acquire, saturation fallback, scoped invalidation, concurrent no-double-claim, cleanup replenishment.
tests/test_provider_quota_status.py	Adapted to list-of-workers pool shape; `test_account_usage_cleanup_removes_null_proc_worker` retains an assertion that will fail under the new stale-detection logic.

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant G as _get_account_usage_probe_worker
    participant P as worker_pool dict
    participant W0 as Worker[0]
    participant W1 as Worker[1]
    participant F as cold fallback

    C->>G: call(home)
    G->>P: acquire pool_lock, get workers list
    alt pool missing
        G->>P: create 2 new Workers, store list
    end
    G->>P: release pool_lock
    G->>W0: "_lock.acquire(blocking=False)"
    alt W0 free
        W0-->>G: acquired
        G-->>C: return W0 (lock held)
        C->>W0: _fetch_locked()
        W0-->>C: snapshot
        C->>W0: _lock.release()
    else W0 busy
        G->>W1: "_lock.acquire(blocking=False)"
        alt W1 free
            W1-->>G: acquired
            G-->>C: return W1 (lock held)
            C->>W1: _fetch_locked()
            W1-->>C: snapshot
            C->>W1: _lock.release()
        else pool saturated
            G-->>C: return None
            C->>F: _fetch_account_usage_once_for_home()
            F-->>C: snapshot (cold subprocess)
        end
    end

_{Reviews (5): Last reviewed commit: "fix(providers): treat proc=None as idle,..." | Re-trigger Greptile}

nesquena-hermes · 2026-06-10T20:51:50Z

Read the full api/providers.py diff against origin/master, the _AccountUsageProbeWorker class (api/providers.py:1225-1318), and the new tests/test_issue3787_probe_pool.py. The pool/cleanup refactor and the scoped-flush plumbing are well-structured, but there's a check-then-act race in the selector that I think undercuts the stated tail-latency goal.

The selector reserves nothing — it acquires, releases, then returns

_get_account_usage_probe_worker picks a worker that looks free, but releases the lock before returning it:

for w in workers:
    if w._lock.acquire(blocking=False):
        w._lock.release()
        return w
if workers[0]._lock.acquire(timeout=_ACCOUNT_USAGE_WORKER_WAIT_SECONDS):
    workers[0]._lock.release()
    return workers[0]
return None

The caller then re-acquires in fetch() (api/providers.py:1259-1264):

def fetch(self, provider, *, api_key=None):
    if not self._lock.acquire(blocking=False):
        return _fetch_account_usage_once_for_home(provider, self.home, api_key=api_key)

Between the release() in the selector and the acquire(blocking=False) in fetch(), nothing holds the worker. Two probes for the same home racing through _get_account_usage_probe_worker can both observe workers[0] free, both release, and both return workers[0]; the second one's fetch() then finds the lock taken and falls back to _fetch_account_usage_once_for_home — a cold subprocess spawn. That's exactly the O(N) cold-spawn path #3787 set out to remove. The N=2 pool reduces the probability but doesn't close the window, because selection and use aren't atomic.

Suggested shape: hand back a held worker

Have the selector acquire and keep the lock, returning the locked worker (or None), and have the caller release it. Roughly:

for w in workers:
    if w._lock.acquire(blocking=False):
        return w                      # caller releases
if workers[0]._lock.acquire(timeout=_ACCOUNT_USAGE_WORKER_WAIT_SECONDS):
    return workers[0]
return None

and in _agent_fetch_account_usage_for_home, call a lock-free worker._fetch_locked(...) inside a try/finally: worker._lock.release(). Since _lock is an RLock (api/providers.py:1229), _fetch_locked's own self.last_used/_ensure_process calls remain reentrant-safe. That guarantees the two concurrent probes actually get the two distinct workers, which is the whole point.

Test gap

test_nonblocking_acquire_returns_idle_worker holds workers[0]._lock from a background thread for the entire assertion, so it validates the selector in isolation but never exercises the selector → fetch() round trip where the race lives. A test that spawns 2+ real probes concurrently (mocking the subprocess) and asserts zero cold-spawn fallbacks would actually pin the contention behavior. As written, the suite passes even with the TOCTOU window open.

Minor: scoped-flush vs cache asymmetry

In invalidate_account_usage_status_cache (api/providers.py:1537), the status cache is cleared for the provider across all homes (key[0] == normalized), but _close_account_usage_probe_workers_async(provider_id=...) now flushes workers only for the active home. Your Risks section already calls this out for multi-profile installs; worth a one-line code comment at the flush site so the next reader knows the all-homes cache clear vs active-home worker flush asymmetry is intentional, not an oversight.

The cleanup refactor (per-worker stale collection + remaining_workers rebuild) looks correct — busy workers are skipped via acquire(blocking=False) and survive into remaining_workers, and identity comparison on w == sw is fine since _AccountUsageProbeWorker has no __eq__.

…rkers

nesquena-hermes · 2026-06-11T01:08:31Z

Followed up on the three commits pushed after my last review (90eab281, f4f2e0ce, a611440c). Read the new _get_account_usage_probe_worker / _agent_fetch_account_usage_for_home / _cleanup_account_usage_probe_workers in api/providers.py at HEAD and the adapted tests/test_provider_quota_status.py.

The TOCTOU handoff is fixed — thank you

90eab281 does exactly what was needed. The selector now keeps the lock and the caller releases it:

def _get_account_usage_probe_worker(home: Path) -> "_AccountUsageProbeWorker | None":
    ...
    for w in workers:
        if w._lock.acquire(blocking=False):
            return w          # caller releases
    return None

and _agent_fetch_account_usage_for_home (api/providers.py:1573-1585) uses worker._fetch_locked(...) inside try/finally: worker._lock.release(). Two concurrent same-home probes now get the two distinct workers instead of both racing onto workers[0] and bouncing the loser to a cold spawn. That closes the window.

But `a611440c` regressed CI — shard 1 is red on 3.11/3.12/3.13

CI shows test (3.11, 1), test (3.12, 1), test (3.13, 1) all FAILED while shards 0 and 2 pass on every version. Three versions, one shard, deterministic — that's a single assertion failing, not a flake, and it's test_account_usage_cleanup_removes_null_proc_worker (greptile called this exact risk in its initial summary).

The last commit flipped the staleness predicate (api/providers.py:1459):

is_dead = proc is not None and proc.poll() is not None

That's correct for replenished workers (proc=None, never started → keep). But the test closes every worker and expects cleanup to evict the home:

for worker in workers:
    worker.close()                 # sets self._proc = None
providers._cleanup_account_usage_probe_workers()
assert str(tmp_path) not in providers._account_usage_worker_pool   # fails now

After close() (api/providers.py:1235-1238) _proc is None, so is_dead is False, and last_used was just touched by the _fetch_locked call, so cutoff - last_used >= idle_seconds is also False. Neither stale condition fires → the home key stays → assertion fails.

The root conflict: proc is None is now overloaded. It means both "freshly replenished, keep" and "explicitly closed, evict", and the new predicate can't tell them apart.

Suggested fix — make "closed" explicit instead of inferring it from proc=None

Add a _closed flag that close() sets and _ensure_process() clears, then treat it as dead:

def close(self) -> None:
    with self._lock:
        proc = self._proc
        self._proc = None
        self._closed = True        # explicit
    self._close_process(proc)

# in _cleanup_account_usage_probe_workers (~line 1459)
is_dead = getattr(worker, "_closed", False) or (proc is not None and proc.poll() is not None)

_ensure_process already nulls/relaunches _proc; just set self._closed = False there when a new Popen succeeds so a relaunched worker isn't mistaken for closed. A replenished worker that was never started has _closed = False and _proc = None, so it correctly survives; a close()d worker is correctly evicted, and the existing test passes unchanged with assert ... >= 1.

Once shard 1 is green this looks ready. The cleanup/replenish rebuild (remaining_workers + top-up to N at :1463-1471) and the held-worker handoff are both sound.

perf(providers): per-home probe-worker pool + scoped credential inval…

a1546a6

…idation (nesquena#3787)

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread api/providers.py

Comment thread api/providers.py Outdated

Comment thread tests/test_issue3787_probe_pool.py Outdated

fix(tests): adapt existing probe pool tests to N=2 worker list structure

4688d0d

fix(providers): hold probe-worker lock across selector handoff

90eab28

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread api/providers.py

fix(providers): replenish pool after partial cleanup removes stale wo…

f4f2e0c

…rkers

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

Comment thread api/providers.py Outdated

fix(providers): treat proc=None as idle, not dead, in cleanup

a611440

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(providers): per-home probe-worker pool + scoped credential invalidation (#3787)#3940

perf(providers): per-home probe-worker pool + scoped credential invalidation (#3787)#3940
rodboev wants to merge 5 commits into
nesquena:masterfrom
rodboev:pr/probe-worker-pool-contention

rodboev commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nesquena-hermes commented Jun 10, 2026

Uh oh!

Uh oh!

Uh oh!

nesquena-hermes commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

rodboev commented Jun 10, 2026

Thinking Path

What Changed

Why It Matters

Verification

Risks / Follow-ups

Model Used

Uh oh!

greptile-apps Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nesquena-hermes commented Jun 10, 2026

The selector reserves nothing — it acquires, releases, then returns

Suggested shape: hand back a held worker

Test gap

Minor: scoped-flush vs cache asymmetry

Uh oh!

Uh oh!

Uh oh!

nesquena-hermes commented Jun 11, 2026

The TOCTOU handoff is fixed — thank you

But a611440c regressed CI — shard 1 is red on 3.11/3.12/3.13

Suggested fix — make "closed" explicit instead of inferring it from proc=None

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading

But `a611440c` regressed CI — shard 1 is red on 3.11/3.12/3.13