fix(inference): refund client escrow for VOTING inferences on timeout (quorum miss) #1275

Open
vitaly-andr wants to merge 1 commit into gonka-ai:main from vitaly-andr:fix/expire-voting-inference-refund

@vitaly-andr

Problem

expireInferences (x/inference/module/module.go) filtered timed-out inferences by Status == STARTED only. A failing MsgValidation moves an inference to VOTING; if its x/group proposals miss quorum within the voting window, the timeout cleanup silently skips it — the InferenceTimeout entry is removed, the inference stays VOTING forever, and the client's escrow is stranded in the inference module account. The quorum-miss trigger (network lag, validator restart, split vote) is a routine production condition, not an exotic edge case.

Fix

Add a VOTING branch: on timeout, mark the inference EXPIRED and refund the client via expireInferenceAndIssueRefund (default-to-refund on quorum miss — client made whole, executor unpaid, no slashing). The x/group proposals are left to be pruned by x/group EndBlock.
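
As a rough illustration of the branch described above, here is a minimal, self-contained sketch. It is not the code in module.go: the `Inference` and `Ledger` types, `expireAndRefund`, and `expireTimedOut` are hypothetical stand-ins that model escrow as plain integers, and it assumes the pre-existing STARTED path also ends in a refund.

```go
package main

import "fmt"

// Status is a simplified stand-in for the module's inference status enum.
type Status int

const (
	Started Status = iota
	Voting
	Finished
	Expired
)

// Inference is a hypothetical, trimmed-down record with only the fields
// this sketch needs.
type Inference struct {
	ID     string
	Status Status
	Client string
	Escrow int64
}

// Ledger models the inference module account and client balances as integers.
type Ledger struct {
	Module  int64
	Clients map[string]int64
}

// expireAndRefund plays the role of expireInferenceAndIssueRefund in this
// sketch: it returns the escrow to the client and marks the inference EXPIRED.
func expireAndRefund(inf *Inference, l *Ledger) {
	l.Module -= inf.Escrow
	l.Clients[inf.Client] += inf.Escrow
	inf.Escrow = 0
	inf.Status = Expired
}

// expireTimedOut handles the inferences whose timeout fired in this block
// and reports how many were refunded.
func expireTimedOut(timedOut []*Inference, l *Ledger) (refunded int) {
	for _, inf := range timedOut {
		switch inf.Status {
		case Started:
			// Assumed pre-existing behavior: expire and refund.
			expireAndRefund(inf, l)
			refunded++
		case Voting:
			// New branch: quorum was missed, so default to refunding the
			// client. The executor is not paid and nobody is slashed.
			expireAndRefund(inf, l)
			refunded++
		default:
			// Already terminal (e.g. Finished or Expired): nothing to do.
		}
	}
	return refunded
}

func main() {
	l := &Ledger{Module: 150, Clients: map[string]int64{"alice": 0, "bob": 0}}
	stuck := &Inference{ID: "inf-voting", Status: Voting, Client: "alice", Escrow: 100}
	done := &Inference{ID: "inf-done", Status: Finished, Client: "bob", Escrow: 0}

	n := expireTimedOut([]*Inference{stuck, done}, l)
	fmt.Println("refunded:", n, "module balance:", l.Module, "alice:", l.Clients["alice"])
}
```

Without the `Voting` case, `inf-voting` would fall through to `default` and its 100 units would stay in the module balance, which is the stranded-escrow condition this PR addresses.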

Test

TestExpireInferences_VotingInferenceRefundedOnTimeout drives the real expireInferences (via an export_test.go wrapper, since it is unexported) with one VOTING timeout — it fails on current main (no refund, inference stays VOTING) and passes with the fix.

Context

Discovered while working on #998 (maintenance windows); the reproducer was built with the inference-chain simulation/fuzz harness (#982), where a NoStuckVoting invariant flagged an inference still VOTING at epoch 10.
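
For readers unfamiliar with the harness, a check in the spirit of NoStuckVoting might look like the sketch below. This is an assumption-laden illustration, not the invariant from #982: the `ExpiresAt` field and the block-height criterion are hypothetical (the real invariant may reason in epochs), and the function only mirrors the `(message, broken)` return shape that Cosmos SDK invariants use rather than taking an `sdk.Context`.

```go
package main

import "fmt"

// Status is a simplified stand-in for the module's inference status enum.
type Status int

const (
	Started Status = iota
	Voting
	Expired
)

// Inference is a hypothetical record; ExpiresAt is the block height at
// which the timeout is assumed to fire.
type Inference struct {
	ID        string
	Status    Status
	ExpiresAt int64
}

// noStuckVoting reports a violation when any inference is still VOTING
// after its timeout height has passed.
func noStuckVoting(infs []Inference, height int64) (string, bool) {
	var stuck []string
	for _, inf := range infs {
		if inf.Status == Voting && height > inf.ExpiresAt {
			stuck = append(stuck, inf.ID)
		}
	}
	if len(stuck) > 0 {
		return fmt.Sprintf("%d inference(s) still VOTING past timeout: %v", len(stuck), stuck), true
	}
	return "", false
}

func main() {
	infs := []Inference{
		{ID: "inf-a", Status: Voting, ExpiresAt: 50},
		{ID: "inf-b", Status: Expired, ExpiresAt: 40},
	}
	msg, broken := noStuckVoting(infs, 120)
	fmt.Println(broken, msg)
}
```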

Closes #1265.

cc @patimen (#982 author)

expireInferences filtered timed-out inferences by Status == STARTED only.
A failing MsgValidation moves an inference to VOTING; if the resulting
x/group proposals miss quorum within the voting window, timeout cleanup
silently skipped it -- the InferenceTimeout entry was removed, the
inference stayed VOTING forever, and the client's escrow was stranded in
the inference module account.

Add a VOTING branch: on timeout, mark the inference EXPIRED and refund the
client via expireInferenceAndIssueRefund (default-to-refund on quorum
miss). The x/group proposals are pruned by x/group EndBlock.

Adds TestExpireInferences_VotingInferenceRefundedOnTimeout, which drives
the real expireInferences (via an export_test.go wrapper) and fails on
current main (no refund, inference stays VOTING), passing with the fix.

Surfaced by simulation/fuzz testing for inference-chain (gonka-ai#982).

Closes gonka-ai#1265.

Signed-off-by: Vitaly Andrianov <vitaly.andr@gmail.com>
vitaly-andr added a commit to vitaly-andr/gonka that referenced this pull request May 29, 2026
…zzing, secondary-op factories

Completes gonka-ai#982 Phase 3 (x/inference hardening) on top of the second-wave ops:

- Custom invariants via RegisterInvariants (bank-backs-positive-balance
  solvency, no-stuck-voting, effective-epoch-fresh,
  active-invalidations-ref-live) + unit tests.
- Store decoders for the simulation decode harness (Participant / Inference /
  KeySet / Uint64 / Int64 / legacy-params, hex fallback) + tests.
- Aggressive parameter-boundary fuzzing (MutateSimParams): economic params to
  Validate() boundaries, widened EpochParams, corner profiles. Slashing-
  activation levers held at defaults — active slashing drains the Tokens=1 sim
  substrate.
- Three real secondary-op factories (SubmitUnitOfComputePriceProposal,
  SubmitHardwareDiff, SubmitNewUnfundedParticipant) replacing NoOp stubs.
- Data-driven weight tuning; a model-weight transient-cache test helper.
- docs/simulation.md: Phase-3 status, fuzzing design, invariant list, sweep.

The full 500-block sim run is green only with the production fixes it
surfaced: gonka-ai#1275 (VOTING-timeout refund), gonka-ai#1276
(revalidation membership guard), and gonka-ai/cosmos-sdk#16
(markValidatorForDeletion). Verified combined: sim-full 500x200 seed=99 PASS,
app-hash f76df5cd….

The post-run bank-backs invariant holds in the canonical 500x200 run
(transient intra-epoch under-backing heals via debt-recovery by settlement);
the asymmetric refund-debit drift it surfaces under shorter/aggressive runs is
documented as a design question in gonka-ai#1273
(TestRefundInvalidation_LeavesAccountingDrift).

Signed-off-by: Vitaly Andrianov <vitaly.andr@gmail.com>

Development

Successfully merging this pull request may close these issues.

Stuck VOTING inferences orphan client escrow when x/group proposals miss quorum
