approver: fix self-heal cached-token + crypto-wipe bugs surfaced by PR #37 deploy #38
Merged
Both surfaced today (2026-05-09) when the first deploy of PR #37 ran on a CVM with an existing bot_crypto.db but no /data/<bot>_token / _device_id files yet (the transition case).

Bug 1: TOKEN/AUTH only updated on reminted=True path

When _resolve_credentials returns a cached token (env stale → cached works → reminted=False), main() skipped TOKEN = sr2_token. The mautrix client was then constructed with the still-stale os.environ["MATRIX_TOKEN"] global, so /sync failed with M_UNKNOWN_TOKEN even though the cached token was valid.

Fix: always wire TOKEN/AUTH from _resolve_credentials' return regardless of reminted flag.

Bug 2: crypto wipe missed when fresh /login + existing crypto.db + no cached device_id

On the FIRST PR-#37 boot, /data has no <bot>_device_id (PR #37's persistence hasn't run yet), so cached_device_id=None. _login_with_password runs without device_id → fresh device. _device_changed(None, NEW)=False → no wipe. But bot_crypto.db is still on disk, pickled for the *previous* device → mautrix raises BAD_ACCOUNT_KEY.

Fix: also wipe when reminted=True AND no prior cached device AND crypto.db exists (the transition case). Capture both signals in main() before _resolve_credentials runs.

After this PR, the resolution sequence is fully self-healing across both cold-start and transition-from-pre-PR-#37 scenarios. Manual recovery procedure (rm /data/bot_crypto.db* + restart) is no longer required.
Summary
Two follow-on bugs from PR #37 surfaced by the first deploy on prod tonight. Both fired in the transition case (existing bot_crypto.db + no cached token files yet).
Bug 1: cached-token path doesn't update TOKEN
`_resolve_credentials` returns a cached token with `reminted=False`. main() previously assigned `TOKEN = sr2_token` only when `reminted=True`, so the cached path returned a working token but main() left the global `TOKEN` pointing at the (stale) `os.environ["MATRIX_TOKEN"]`. The mautrix client was constructed with the stale value → /sync failed with M_UNKNOWN_TOKEN.
Fix: assign `TOKEN = sr2_token` and `AUTH` unconditionally after self-heal returns. Same for `LOBBY_TOKEN` / `LOBBY_AUTH`.
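To make the difference concrete, here is a minimal sketch of the wiring before and after the fix. It is not the code from this PR: `Resolved`, `wire_before_fix`, `wire_after_fix`, and `_auth_header` are hypothetical names, and the shape of `AUTH` (a bearer-token header dict) is an assumption.

```python
from typing import Dict, NamedTuple, Tuple


class Resolved(NamedTuple):
    """Hypothetical stand-in for what _resolve_credentials returns."""
    token: str
    reminted: bool


def _auth_header(token: str) -> Dict[str, str]:
    # Assumption: AUTH is a bearer-token header dict.
    return {"Authorization": f"Bearer {token}"}


def wire_before_fix(env_token: str, resolved: Resolved) -> Tuple[str, Dict[str, str]]:
    # Old behaviour: adopt the resolved token only when it was freshly minted.
    token = resolved.token if resolved.reminted else env_token
    return token, _auth_header(token)


def wire_after_fix(env_token: str, resolved: Resolved) -> Tuple[str, Dict[str, str]]:
    # New behaviour: the resolved token always wins, cached or reminted.
    token = resolved.token
    return token, _auth_header(token)


# Transition case: env token is stale, cached token works, nothing was reminted.
cached = Resolved(token="cached-good", reminted=False)
print(wire_before_fix("env-stale", cached)[0])  # env-stale
print(wire_after_fix("env-stale", cached)[0])   # cached-good
```

The same unconditional assignment would apply to `LOBBY_TOKEN` / `LOBBY_AUTH`.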
Bug 2: crypto wipe missed on first PR-#37 boot
The first PR-#37 boot finds `/data/bot_crypto.db` present but no `/data/<bot>_device_id` yet (PR #37's persistence hasn't run). `_login_with_password` runs without device_id, gets a fresh device. `_device_changed(None, NEW) = False` → no wipe. Mautrix then tries to load crypto.db with pickle_key for the new device, which doesn't match the prior device's pickle → `OlmAccountError: BAD_ACCOUNT_KEY` → bot crashes at startup.
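Fix: in main(), capture `cached_device_before` and `crypto_existed_before` BEFORE `_resolve_credentials` writes to disk. Wipe crypto when:
- the device changed (`_device_changed` returns True), as before, or
- `reminted=True` AND there was no prior cached device AND crypto.db already existed (the transition case).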
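The wipe decision can be sketched as a pure function over the two signals captured before credential resolution. This is an illustration under assumptions, not the PR's code: `device_changed` and `should_wipe_crypto` are hypothetical helpers, and `device_changed` only mirrors the behaviour stated here (`_device_changed(None, NEW)` is False).

```python
from typing import Optional


def device_changed(cached: Optional[str], current: str) -> bool:
    # Mirrors the described behaviour: with no cached device there is
    # nothing to compare against, so this reports "no change".
    return cached is not None and cached != current


def should_wipe_crypto(
    cached_device_before: Optional[str],
    crypto_existed_before: bool,
    new_device_id: str,
    reminted: bool,
) -> bool:
    # Existing rule: a known device was replaced by a different one.
    if device_changed(cached_device_before, new_device_id):
        return True
    # Transition rule: fresh login, no cached device to compare against,
    # but a crypto store from some earlier device is already on disk.
    return reminted and cached_device_before is None and crypto_existed_before


# First PR-#37 boot: crypto.db on disk, no cached device_id, fresh login.
print(should_wipe_crypto(None, True, "NEW", True))    # True
# True cold start: nothing on disk, so nothing to wipe.
print(should_wipe_crypto(None, False, "NEW", True))   # False
# Cached token still valid, same device: keep the store.
print(should_wipe_crypto("DEV", True, "DEV", False))  # False
```

Both inputs have to be captured before resolution runs, because resolution itself persists the new device ID and would otherwise hide the fact that none was cached.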
Test plan
Manual recovery executed before this fix
Tonight on prod (2026-05-09 ~19:50 UTC) the bot crash-looped on BAD_ACCOUNT_KEY after the first PR-#37 deploy. Workaround: `rm /data/bot_crypto.db* /data/<bot>_token /data/<bot>_device_id` then `docker restart`. The bot then ran the full self-heal cycle from scratch and recovered. This PR removes that manual step.
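For reference, the crypto-store part of that manual cleanup could look roughly like this in Python. `wipe_crypto_store` is a hypothetical helper, not a function from this PR. It removes whatever the `bot_crypto.db*` glob matches and is demonstrated against a temporary directory rather than `/data`.

```python
import tempfile
from pathlib import Path
from typing import List


def wipe_crypto_store(data_dir: Path, stem: str = "bot_crypto.db") -> List[str]:
    """Delete every file matching `<stem>*` in data_dir; return the names removed."""
    removed = []
    for path in sorted(data_dir.glob(stem + "*")):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


# Demo in a throwaway directory (file names other than bot_crypto.db are made up).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name in ("bot_crypto.db", "bot_crypto.db-wal", "other.db"):
        (base / name).write_text("x")
    removed = wipe_crypto_store(base)
    remaining = sorted(p.name for p in base.iterdir())

print(removed)    # ['bot_crypto.db', 'bot_crypto.db-wal']
print(remaining)  # ['other.db']
```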
🤖 Generated with Claude Code