Skip to content

approver: fix self-heal cached-token + crypto-wipe bugs surfaced by PR #37 deploy#38

Merged
amiller merged 1 commit into
mainfrom
fix/self-heal-cached-token-token-update
May 9, 2026
Merged

approver: fix self-heal cached-token + crypto-wipe bugs surfaced by PR #37 deploy#38
amiller merged 1 commit into
mainfrom
fix/self-heal-cached-token-token-update

Conversation

@amiller
Copy link
Copy Markdown
Collaborator

@amiller amiller commented May 9, 2026

Summary

Two follow-on bugs from PR #37 surfaced by the first deploy on prod tonight. Both fired in the transition case (existing bot_crypto.db + no cached token files yet).

Bug 1: cached-token path doesn't update TOKEN

`_resolve_credentials` returns a cached token with `reminted=False`. main() previously only assigned `TOKEN = sr2_token` when `reminted=True`. So the cached path returned a working token but main left the global `TOKEN` pointing at the (stale) `os.environ["MATRIX_TOKEN"]`. Mautrix client constructed with the stale value → /sync failed with M_UNKNOWN_TOKEN.

Fix: assign `TOKEN = sr2_token` and `AUTH` unconditionally after self-heal returns. Same for `LOBBY_TOKEN` / `LOBBY_AUTH`.

Bug 2: crypto wipe missed on first PR-#37 boot

The first PR-#37 boot finds `/data/bot_crypto.db` present but no `/data/_device_id` yet (PR #37's persistence hasn't run). `_login_with_password` runs without device_id, gets a fresh device. `_device_changed(None, NEW) = False` → no wipe. Mautrix then tries to load crypto.db with pickle_key for the new device, which doesn't match the prior device's pickle → `OlmAccountError: BAD_ACCOUNT_KEY` → bot crashes at startup.

Fix: in main(), capture `cached_device_before` and `crypto_existed_before` BEFORE `_resolve_credentials` writes to disk. Wipe crypto when:

  • `device_id` rotated (existing `_device_changed` logic), OR
  • `reminted=True` AND no prior cached device AND crypto.db exists (the transition case).

Test plan

  • `python3 tests/self_heal_unit.py` — all 7 existing tests still pass.
  • `python3 tests/announce_unit.py` — still passes.
  • e2e CI on this branch.
  • Verify via prod deploy after merge — bot should restart cleanly without manual `rm /data/bot_crypto.db*` intervention.

Manual recovery executed before this fix

Tonight on prod (2026-05-09 ~19:50 UTC) the bot crash-looped on BAD_ACCOUNT_KEY after the first PR-#37 deploy. Workaround: `rm /data/bot_crypto.db* /data/_token /data/_device_id` then `docker restart` — bot did the full self-heal cycle from scratch and recovered. This PR removes that manual step.

🤖 Generated with Claude Code

Both surfaced today (2026-05-09) when the first deploy of PR #37 ran on a CVM with an existing bot_crypto.db but no /data/<bot>_token / _device_id files yet (the transition case).

Bug 1: TOKEN/AUTH only updated on reminted=True path
  When _resolve_credentials returns a cached token (env stale → cached works → reminted=False), main() skipped TOKEN = sr2_token. The mautrix client then constructed with the still-stale os.environ["MATRIX_TOKEN"] global, so /sync failed with M_UNKNOWN_TOKEN even though the cached token was valid. Fix: always wire TOKEN/AUTH from _resolve_credentials' return regardless of reminted flag.

Bug 2: crypto wipe missed when fresh /login + existing crypto.db + no cached device_id
  On the FIRST PR-#37 boot, /data has no <bot>_device_id (PR #37's persistence hasn't run yet), so cached_device_id=None. _login_with_password runs without device_id → fresh device. _device_changed(None, NEW)=False → no wipe. But bot_crypto.db is still on disk pickled for the *previous* device → mautrix raises BAD_ACCOUNT_KEY. Fix: also wipe when reminted=True AND no prior cached device AND crypto.db exists (the transition case). Capture both signals in main() before _resolve_credentials runs.

After this PR, the resolution sequence is fully self-healing across both cold-start and transition-from-pre-PR-#37 scenarios. Manual recovery procedure (rm /data/bot_crypto.db* + restart) is no longer required.
@amiller amiller merged commit 47df6c5 into main May 9, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant