Skip to content

fix: sync OAuth tokens between credential pool and credentials file#4765

Closed
netdust wants to merge 2 commits intoNousResearch:mainfrom
netdust:fix/oauth-credential-pool-sync
Closed

fix: sync OAuth tokens between credential pool and credentials file#4765
netdust wants to merge 2 commits intoNousResearch:mainfrom
netdust:fix/oauth-credential-pool-sync

Conversation

@netdust
Copy link
Copy Markdown
Contributor

@netdust netdust commented Apr 3, 2026

Summary

  • Fix OAuth refresh token desync between the credential pool (~/.hermes/auth.json, ~/.hermes/profiles/*/auth.json) and Claude Code's credential file (~/.claude/.credentials.json)
  • OAuth refresh tokens are single-use — when one consumer refreshes, all others' refresh tokens become invalid, causing a 24-hour exhaustion lockout
  • Pool now syncs from the credentials file before applying exhaustion cooldowns, and writes back after successful refreshes

Problem

When Hermes runs with provider: anthropic using Claude Code OAuth credentials, multiple consumers share the same OAuth session:

  • The credential pool in each Hermes profile
  • The Claude Code CLI
  • The resolve_anthropic_token() fallback path

When any one of these refreshes the token, it invalidates the refresh token for all others. The cascade:

  1. Token expires overnight
  2. Pool tries to refresh → gets 400 (refresh token already consumed)
  3. Pool marks credential "exhausted" with 24-hour cooldown
  4. All heartbeats skip the credential for 24 hours
  5. Fallback to resolve_anthropic_token() works only while access token is valid
  6. Once access token expires → complete auth failure, requires manual claude /login

Fix

Three changes to credential_pool.py:

  1. _sync_anthropic_entry_from_credentials_file() — new method that checks if ~/.claude/.credentials.json has a different (newer) refresh token and syncs it into the pool entry, clearing exhaustion status

  2. _refresh_entry() — two additions:

    • On successful refresh of a claude_code entry, write the new tokens back to ~/.claude/.credentials.json
    • On refresh failure, check if the credentials file has a newer token and retry once before marking exhausted
  3. _available_entries() — before applying the 24-hour exhaustion cooldown on claude_code entries, sync from the credentials file first — so a manual re-login or external refresh immediately unblocks agents

Test plan

  • Verify token resolution still works normally (no regression)
  • Simulate stale refresh token in pool, valid token in credentials file → pool syncs and recovers
  • Simulate successful pool refresh → credentials file gets updated
  • Simulate exhausted pool entry + valid credentials file → entry unblocked without waiting 24h
  • Multi-profile scenario: one profile refreshes, others pick up the new token

🤖 Generated with Claude Code

OAuth refresh tokens are single-use. When multiple consumers share the
same Anthropic OAuth session (credential pool entries, Claude Code CLI,
multiple Hermes profiles), whichever refreshes first invalidates the
refresh token for all others. This causes a cascade:

1. Pool entry tries to refresh with a consumed refresh token → 400
2. Pool marks the credential as "exhausted" with a 24-hour cooldown
3. All subsequent heartbeats skip the credential entirely
4. The fallback to resolve_anthropic_token() only works while the
   access token in ~/.claude/.credentials.json hasn't expired
5. Once it expires, nothing can auto-recover without manual re-login

Fix:
- Add _sync_anthropic_entry_from_credentials_file() to detect when
  ~/.claude/.credentials.json has a newer refresh token and sync it
  into the pool entry, clearing exhaustion status
- After a successful pool refresh, write the new tokens back to
  ~/.claude/.credentials.json so other consumers stay in sync
- On refresh failure, check if the credentials file has a different
  (newer) refresh token and retry once before marking exhausted
- In _available_entries(), sync exhausted claude_code entries from
  the credentials file before applying the 24-hour cooldown, so a
  manual re-login or external refresh immediately unblocks agents

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@britrik
Copy link
Copy Markdown

britrik commented Apr 3, 2026

Code Review: OAuth Token Synchronization

Thanks for tackling this tricky OAuth token sync issue! The problem of single-use refresh tokens causing 24-hour lockouts is real and important to fix.

What's Done Well

  • The three-part solution (sync on availability check, sync on refresh failure, write-back on success) covers the main scenarios
  • Using debug-level logging for failures avoids flooding logs while still allowing troubleshooting
  • The retry logic with fallback to credentials file is a good resilience pattern

Suggestions & Questions

**1. Missing function **
The diff calls in two places but this function doesn't appear to exist in . Did you mean to use the existing instead?

2. Runtime imports vs module-level
The code uses runtime imports () inside methods. While this avoids potential circular imports, consider whether a module-level import with proper error handling would be cleaner. Runtime imports can mask missing dependencies until runtime.

3. Potential for unnecessary syncs
In , the check only verifies . If the credentials file has a different refresh token but the pool entry is still valid (not exhausted, access token not expired), this could cause unnecessary writes. Consider also checking if the entry is actually expired or needs refresh before syncing.

4. Race condition consideration
There's a potential race where two processes (e.g., Hermes + Claude Code CLI) both try to refresh simultaneously. One will succeed, the other gets a 400, then syncs from the file. This is handled, but the timing could be tight. Document this edge case in the PR description or code comments.

5. Test coverage
The test plan mentions manual testing scenarios. Consider adding unit tests for the sync methods to prevent regressions, especially the token comparison logic and the retry-on-failure path.

6. Security note
The diff uses for token masking in logs - ensure this is consistent across all logging statements for these code paths.

Minor Nit

  • could be renamed to something shorter like given it's a private method in a file with long names already.

Overall this looks like a solid fix for a real problem. The main concern is the missing function - please confirm that's intentional or needs to be added.

@britrik
Copy link
Copy Markdown

britrik commented Apr 3, 2026

Code Review: OAuth Token Synchronization

Thanks for tackling this tricky OAuth token sync issue! The problem of single-use refresh tokens causing 24-hour lockouts is real and important to fix.

What's Done Well

  • The three-part solution (sync on availability check, sync on refresh failure, write-back on success) covers the main scenarios
  • Using debug-level logging for failures avoids flooding logs while still allowing troubleshooting
  • The retry logic with fallback to credentials file is a good resilience pattern

Suggestions and Questions

**1. Missing function **
The diff calls in two places but this function does not appear to exist in . Did you mean to use the existing instead?

2. Runtime imports vs module-level
The code uses runtime imports inside methods (e.g., ). While this avoids potential circular imports, consider whether a module-level import with proper error handling would be cleaner. Runtime imports can mask missing dependencies until runtime.

3. Potential for unnecessary syncs
In , the check only verifies . If the credentials file has a different refresh token but the pool entry is still valid, this could cause unnecessary writes. Consider also checking if the entry is actually expired or needs refresh before syncing.

4. Race condition consideration
There is a potential race where two processes (e.g., Hermes + Claude Code CLI) both try to refresh simultaneously. This is handled, but timing could be tight. Consider documenting this edge case in code comments.

5. Test coverage
The test plan mentions manual testing scenarios. Consider adding unit tests for the sync methods to prevent regressions.

6. Security note
The diff uses masked tokens in logs () - ensure this is consistent across all logging statements for these code paths.

Overall this looks like a solid fix for a real problem. The main concern is the missing function - please confirm that is intentional or needs to be added.

@netdust
Copy link
Copy Markdown
Contributor Author

netdust commented Apr 3, 2026

Thanks for the thorough review! Addressing each point:

1. _write_claude_code_credentials — This function exists in agent/anthropic_adapter.py (line 398). The import is a runtime import (from agent.anthropic_adapter import _write_claude_code_credentials) which follows the same pattern already used in _refresh_entry at line 314 for refresh_anthropic_oauth_pure. Happy to add a code pointer comment if that helps clarity.

2. Runtime imports — Intentional, to avoid circular imports between credential_pool.py and anthropic_adapter.py. This is consistent with the existing codebase pattern (the current _refresh_entry already does the same thing). Moving to module-level would introduce a circular dependency.

3. Unnecessary syncs — The sync in _available_entries only triggers when entry.last_status == STATUS_EXHAUSTED (the condition on line 434), so it won't fire on healthy entries. The cost of one file read vs. a 24-hour lockout is a worthwhile tradeoff. That said, I can add a comment making this guard explicit if preferred.

4. Race condition — Good call. The current fix handles this gracefully: if two processes refresh simultaneously, one succeeds and writes the new tokens, the other fails and picks up the new tokens from the credentials file on its next attempt. I'll add a code comment documenting this.

5. Tests — Agreed, unit tests would be valuable. Happy to add them in a follow-up or in this PR if you'd prefer.

6. Security — Confirmed: we only log entry.id (a short hash like 34ee18), never tokens. The existing logger.debug statements in the codebase follow the same pattern.

Let me know if you'd like me to add the race condition comment and/or unit tests to this PR.

@britrik
Copy link
Copy Markdown

britrik commented Apr 3, 2026

Code Review: PR #4765

Summary

The PR implements OAuth token synchronization between the credential pool and Claude Code's credentials file (~/.claude/.credentials.json). This is a well-identified fix for a real problem: OAuth refresh tokens are single-use, so when one consumer refreshes, others' tokens become invalid.

Strengths

  1. Comprehensive approach: Three key changes cover the full sync lifecycle:

      • detects external token changes
      • writes back after successful refresh + retry on failure
      • syncs before applying exhaustion cooldown
  2. Good defensive coding: Uses try/except blocks with debug logging (lines 330-334, 377-381, 450-455), preventing failures from blocking token resolution.

  3. Smart retry logic: The refresh failure handler (lines 422-455) checks if credentials file has a newer token before marking exhausted - this directly addresses the 24-hour lockout problem.

Potential Issues

  1. Incomplete error handling in retry path (lines 450-455): The retry success path calls but silently swallows exceptions with bare . Consider at least debug logging here.

  2. Race condition potential: Between checking credentials file and using token, another process could refresh. The retry logic helps but doesn't fully eliminate this.

  3. Missing validation: The sync only checks but doesn't verify the access token is also different - what if only access token changed?

  4. No locking mechanism: Multiple pool instances could simultaneously attempt to refresh/write to the credentials file.

Security Considerations

  • Refresh tokens are handled in memory and written to a file - ensure file permissions are correct
  • The function preserves other fields in the file (good)
  • No sensitive data logged except at debug level (appropriate)

Suggestions

  1. Add logging when silently swallowing exceptions in retry path
  2. Consider checking access token expiry in addition to refresh token comparison
  3. The test plan looks solid - recommend adding a unit test for

Verdict

Approve - This is a well-thought-out fix that addresses the core problem. The code quality is good with proper error handling. Minor improvements suggested above but not blockers.

@netdust
Copy link
Copy Markdown
Contributor Author

netdust commented Apr 3, 2026

Thanks for the approval and the detailed review!

Addressing the suggestions:

1. Bare except: pass in retry path (lines 450-455) — Fair point. The _write_claude_code_credentials call there is a best-effort write-back after a successful retry refresh. The token is already resolved and returned — the write-back just keeps the file in sync for other consumers. A failure here doesn't affect the current process. That said, adding a debug log is easy and helps troubleshooting — happy to add it.

2. Race condition — Acknowledged. Full elimination would require file locking or an atomic swap, which is more complexity than warranted here. The current approach is self-healing: if two processes race, the loser fails, syncs the winner's token from the file on next attempt, and recovers. Worst case is one extra failed refresh attempt, not a 24-hour lockout.

3. Only checking refresh token, not access token — By design. The refresh token is the single-use component — if it changed, someone else refreshed and we need their tokens. If only the access token changed (unlikely without a refresh), the refresh token would still be valid and our own refresh would succeed normally.

4. No locking — Same reasoning as #2. File locking across processes (Hermes profiles, Claude Code CLI) adds complexity and failure modes (stale locks). The retry-and-sync pattern handles contention gracefully without locks.

I'll push the debug log addition for #1. The rest I'd leave as-is given the tradeoffs — let me know if you feel differently.

Address review feedback: replace bare `except: pass` with a debug
log when the post-retry write-back to ~/.claude/.credentials.json
fails. The write-back is best-effort (token is already resolved),
but logging helps troubleshooting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@teknium1
Copy link
Copy Markdown
Contributor

teknium1 commented Apr 4, 2026

Merged via PR #4981. Both commits cherry-picked onto current main with your authorship preserved. Thanks @netdust!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants