fix(credentials): retry transient Windows FS errors when replacing auth-profiles.json (TAURI-RUST-92J) #3355

Description

@oxoxDev

Source

Sentry: https://sentry.tinyhumans.ai/organizations/tinyhumans/issues/10246/
Short ID: TAURI-RUST-92J (project tauri-rust)
Events: 10,158 · Users affected: 1 · First seen: 2026-06-03 04:50 UTC · Last seen: 2026-06-04 10:31 UTC
Reproducing release: openhuman@0.56.0+e8968077aeb5
Platform: Windows 10.0.26200 (Windows 11 24H2) · x86_64

Symptom

Failed to replace auth profile store at C:\Users\<user>\.openhuman\users\<uid>\auth-profiles.json

Captured by report_error_or_expected from the JSON-RPC error path during openhuman.app_state_snapshot (domain rpc, operation invoke_method, elapsed_ms ≈ 3). Message-only Sentry event (no stack).

Where it fails

src/openhuman/credentials/profiles.rs:931-943 (function write_persisted_locked):

fs::write(&tmp_path, &json).with_context(|| {
    format!("Failed to write temporary auth profile file at {}", tmp_path.display())
})?;

fs::rename(&tmp_path, &self.path).with_context(|| {
    format!("Failed to replace auth profile store at {}", self.path.display())
})?;

Neither fs::write nor fs::rename is wrapped in crate::openhuman::util::retry_with_backoff, the helper that already handles this family of transient Windows FS errors. The classifier is_transient_fs_error already recognises ERROR_ACCESS_DENIED (5), ERROR_SHARING_VIOLATION (32), ERROR_LOCK_VIOLATION (33), ERROR_DELETE_PENDING (303), and ERROR_USER_MAPPED_FILE (1224); see src/openhuman/util.rs:615.

The same helper IS used for the sibling .lock create at profiles.rs:987 (Sentry OPENHUMAN-TAURI-H1 / H8 fix, PRs #2085 / #1641). The .json write and rename path was left out, so that fix was only partial.
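A minimal sketch of what routing both calls through a retry helper could look like. The real retry_with_backoff and is_transient_fs_error signatures are not shown in this issue, so the two functions below are hypothetical stand-ins written against std only. The sketch assumes that 6 is the maximum attempt count, that 100 is the base delay in milliseconds, and that the delay doubles between attempts. replace_file_with_retry is likewise an illustrative name, not an existing function.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the project's `is_transient_fs_error`.
/// Matches the Windows error codes listed in this issue. For simplicity it is
/// not gated on `cfg(windows)`, which the real classifier may be.
fn is_transient_fs_error(err: &io::Error) -> bool {
    matches!(err.raw_os_error(), Some(5 | 32 | 33 | 303 | 1224))
}

/// Hypothetical stand-in for `retry_with_backoff`; the real signature may differ.
/// Runs `op` up to `max_attempts` times and doubles the sleep after each
/// transient failure. Non-transient errors are returned immediately.
fn retry_with_backoff<T>(
    label: &str,
    max_attempts: u32,
    base_delay_ms: u64,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut delay = Duration::from_millis(base_delay_ms);
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < max_attempts && is_transient_fs_error(&err) => {
                eprintln!("{label}: transient error on attempt {attempt}: {err}");
                thread::sleep(delay);
                delay *= 2;
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}

/// Writes `json` to `tmp_path`, then renames it over `path`,
/// retrying each step independently.
fn replace_file_with_retry(path: &Path, tmp_path: &Path, json: &str) -> io::Result<()> {
    retry_with_backoff("auth-profiles tmp write", 6, 100, || fs::write(tmp_path, json))?;
    retry_with_backoff("auth-profiles rename", 6, 100, || fs::rename(tmp_path, path))
}
```

Under these assumed semantics, six attempts from a 100 ms base sleep for at most 3.1 s in total (100 + 200 + 400 + 800 + 1600 ms) before the last error is returned. Taking the operation as a closure also lets a test inject a fixed number of transient failures without touching the filesystem.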

Why the event count is 10k+ in about 30 hours

load_locked runs on every app_state_snapshot poll. When a profile is dropped (decrypt failure, unrecognized kind, or, before #3125, an OAuth profile missing access_token), load_locked calls write_persisted_locked at profiles.rs:744 to persist the purge. If the rename fails, the on-disk state is unchanged, so the next app_state_snapshot poll re-drops the same profile, re-attempts the same write, and re-fails. The result is a tight loop that lasts until the file handle is released, and on Windows AV / Search-Indexer / Defender can hold a file handle for many seconds.

The frontend health-check polls app_state_snapshot rapidly, so a single sustained AV hold amplifies into thousands of Sentry events.

Reproduces on

Bug shape

Windows transient FS-race on fs::rename. Same family as the lock-create races already retried in PR #1641 / #2085 / #2180. Generic classifier (is_transient_fs_error) already in place; the call site here just isn't routed through it.

Fix scope

  1. Route fs::write(&tmp_path, &json) and fs::rename(&tmp_path, &self.path) through retry_with_backoff("...", 6, 100, …), matching the parameters used by the .lock create at profiles.rs:987.
  2. Persisted-write amplification guard: when write_persisted_locked exhausts retries during a load_locked purge, log + tag the error path so subsequent rapid app_state_snapshot polls don't replay the same write-and-fail loop until the AV handle is released. Either short-cache a "purge already attempted this session" flag, or surface the rename failure once and return the in-memory purged state without persisting. Either route defuses the 10k-event-per-day amplification.
  3. Add a Rust regression test using the __TEST_TRANSIENT__ sentinel is_transient_fs_error already understands (src/openhuman/util.rs:618) to verify the rename path retries.
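The amplification guard in item 2 could be sketched as a small cooldown tracker. Everything here is hypothetical: PurgeWriteGuard, its fields, and the cooldown value are illustrative names, not existing code, and the persist step is passed in as a closure so the sketch stays independent of the real write_persisted_locked. The first failure in a window is returned so it can be reported once. Later calls inside the window skip the write, and the caller keeps serving the in-memory purged state.

```rust
use std::time::{Duration, Instant};

/// Hypothetical guard: remembers when a purge write last failed so that rapid
/// polls do not repeat the write until `cooldown` has elapsed.
struct PurgeWriteGuard {
    cooldown: Duration,
    last_failure: Option<Instant>,
}

impl PurgeWriteGuard {
    fn new(cooldown: Duration) -> Self {
        Self { cooldown, last_failure: None }
    }

    /// Returns true if a persist attempt is allowed at `now`.
    fn should_attempt(&self, now: Instant) -> bool {
        match self.last_failure {
            Some(failed_at) => now.duration_since(failed_at) >= self.cooldown,
            None => true,
        }
    }

    /// Runs `persist` unless a recent failure is still cooling down.
    /// Ok(true): persisted. Ok(false): skipped, nothing was attempted.
    /// Err(_): the attempt failed and a new cooldown window starts.
    fn persist_purge<E>(
        &mut self,
        now: Instant,
        persist: impl FnOnce() -> Result<(), E>,
    ) -> Result<bool, E> {
        if !self.should_attempt(now) {
            return Ok(false);
        }
        match persist() {
            Ok(()) => {
                self.last_failure = None;
                Ok(true)
            }
            Err(err) => {
                self.last_failure = Some(now);
                Err(err)
            }
        }
    }
}
```

Passing `now` explicitly keeps the guard deterministic under test. A time-based cooldown is only one of the two options named in item 2. A per-session "purge already attempted" flag would be the same shape with the Instant replaced by a bool.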

Sentry-Issue: TAURI-RUST-92J
