Benchmarking: required scenarios end-to-end with safety guard and CI gating #784

@dahlia

Note

Sub-issue of #744. Before reading further, read #744 in full, including all of its comments, where the benchmarking tool's design is worked out in detail. This issue is one slice of that design and assumes the decisions recorded in those comments.

This is step 3 of 5. It depends on #783 (and on #782). This is where the #744 acceptance criteria are first met end-to-end.

Scope

inbox and webfinger scenarios

  • inbox (completing it from #783, "Benchmarking: fedify bench engine, scenario format, and JSON schema hosting", if it is only partial there): takes a recipient (a handle like acct:alice@host or an actor URI), not a path, since Fedify has no default paths. The inbox URL is discovered the way a real peer does it: WebFinger gives the actor URI, and the actor document gives inbox and endpoints.sharedInbox. The inbox mode is shared (the realistic default), personal, or an explicit URL. An embedObject flag distinguishes the pure inbox path from inbox-plus-dereference. Discovery is one-time setup and is excluded from the timed window.
  • webfinger: handle resolution over configurable handle sets, the discovery primitive the other scenarios reuse.
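
The discovery steps above can be sketched roughly as follows. This is only an illustration, not the engine's actual code: the type and function names (WebFingerResponse, ActorDocument, InboxMode, webFingerUrl, actorUriFrom, selectInbox) are hypothetical, the network fetches are left out, and falling back to the personal inbox when an actor has no endpoints.sharedInbox is an assumption that this issue does not settle.

```typescript
// Hypothetical shapes: only the fields this sketch reads.
interface WebFingerLink { rel: string; type?: string; href?: string }
interface WebFingerResponse { subject?: string; links?: WebFingerLink[] }
interface ActorDocument {
  id: string;
  inbox: string;
  endpoints?: { sharedInbox?: string };
}
type InboxMode = "shared" | "personal" | { url: string };

// Build the WebFinger query URL for a handle such as acct:alice@host.
function webFingerUrl(handle: string): URL {
  const resource = handle.startsWith("acct:") ? handle : `acct:${handle}`;
  const host = resource.slice(resource.lastIndexOf("@") + 1);
  const url = new URL(`https://${host}/.well-known/webfinger`);
  url.searchParams.set("resource", resource);
  return url;
}

// Pick the actor URI: the rel="self" link with an ActivityPub media type.
function actorUriFrom(jrd: WebFingerResponse): string | undefined {
  const link = (jrd.links ?? []).find((l) =>
    l.rel === "self" && l.href != null &&
    (l.type === "application/activity+json" ||
      l.type?.startsWith("application/ld+json"))
  );
  return link?.href;
}

// Choose the inbox URL to load according to the scenario's inbox mode.
function selectInbox(actor: ActorDocument, mode: InboxMode = "shared"): string {
  if (typeof mode === "object") return mode.url; // explicit URL wins
  if (mode === "personal") return actor.inbox;
  // Assumption: fall back to the personal inbox if no shared inbox exists.
  return actor.endpoints?.sharedInbox ?? actor.inbox;
}
```

In a real run, the two documents would be fetched once during setup and the selected URL reused for every timed request, so that discovery stays outside the timed window.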

Client-side safety guard

  • Targets are classified from the resolved address: loopback (127.0.0.0/8, ::1, localhost) and private (RFC 1918, link-local, .local) versus public.
  • At startup the tool probes GET /.well-known/fedify/bench/stats to detect whether the target advertises benchmarkMode, which is the operator's “not production” assertion.
  • Two tiers: Safe (target is loopback/private, or advertises benchmarkMode) runs with no friction; Caution (a public target without benchmarkMode) is refused unless --allow-unsafe-target is given.
  • --allow-unsafe-target is honored only together with an explicit --target. In CI or any non-TTY context the tool never prompts; the flag is mandatory there. A TTY may offer an interactive confirmation instead.
  • Scenario effect classes (read/write/deliver/fault) drive the warning text. --dry-run resolves discovery and reports the planned load without sending. On a public target, rate and duration must be set explicitly (no aggressive defaults).
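
A minimal sketch of how the guard's decision could be structured, under stated assumptions. All names here (AddressClass, classifyHost, GuardInput, decideGuard) are hypothetical. The classifier only inspects a literal host or an already-resolved address; DNS resolution and the stats probe are assumed to happen elsewhere. Treating an --allow-unsafe-target given without --target as if the flag were absent is one reading of the rule above, not a settled decision.

```typescript
type AddressClass = "loopback" | "private" | "public";

// Classify a literal host or an already-resolved IP address.
function classifyHost(host: string): AddressClass {
  const h = host.toLowerCase().replace(/^\[|\]$/g, ""); // strip IPv6 brackets
  if (h === "localhost" || h === "::1") return "loopback";
  if (h.endsWith(".local")) return "private";
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(h);
  if (m) {
    const a = Number(m[1]);
    const b = Number(m[2]);
    if (a === 127) return "loopback"; // 127.0.0.0/8
    if (a === 10) return "private"; // 10.0.0.0/8
    if (a === 172 && b >= 16 && b <= 31) return "private"; // 172.16.0.0/12
    if (a === 192 && b === 168) return "private"; // 192.168.0.0/16
    if (a === 169 && b === 254) return "private"; // IPv4 link-local
    return "public";
  }
  if (/^fe[89ab][0-9a-f]:/.test(h)) return "private"; // fe80::/10 link-local
  return "public";
}

interface GuardInput {
  addressClass: AddressClass;
  benchmarkMode: boolean; // result of the stats probe
  allowUnsafeTarget: boolean; // --allow-unsafe-target was given
  explicitTarget: boolean; // --target was given
  isTty: boolean;
}
type GuardDecision = "run" | "prompt" | "refuse";

function decideGuard(input: GuardInput): GuardDecision {
  // Safe tier: loopback/private target, or target advertises benchmarkMode.
  if (input.addressClass !== "public" || input.benchmarkMode) return "run";
  // Caution tier: the flag counts only together with an explicit target.
  if (input.allowUnsafeTarget && input.explicitTarget) return "run";
  // Never prompt outside a TTY (for example in CI).
  return input.isTty ? "prompt" : "refuse";
}
```

Keeping the decision a pure function of its inputs makes the CI behavior easy to test: with isTty set to false, the only way a public target without benchmarkMode runs is the flag plus an explicit target.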

expect gating

  • A scenario's expect thresholds are evaluated and the process exits non-zero on failure. Each entry carries a severity (warn or fail, default fail). The metric vocabulary and the per-type definition of success (for example which status codes count as success for inbox) are pinned alongside the schema.
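
As a rough illustration of the gating rule, the sketch below evaluates a list of thresholds against measured metrics and derives an exit code. The Expectation shape, the comparison operators, and the metric names in the usage note are hypothetical, since the real vocabulary is pinned alongside the schema. Counting a missing metric as a violation is likewise an assumption.

```typescript
type Severity = "warn" | "fail";
type Op = "<" | "<=" | ">" | ">=";

// Hypothetical shape of one expect entry.
interface Expectation {
  metric: string;
  op: Op;
  value: number;
  severity?: Severity; // defaults to "fail"
}
interface Violation {
  metric: string;
  actual: number | undefined;
  severity: Severity;
}

function compare(actual: number, op: Op, value: number): boolean {
  switch (op) {
    case "<": return actual < value;
    case "<=": return actual <= value;
    case ">": return actual > value;
    case ">=": return actual >= value;
  }
}

// Return every expectation that the measured metrics do not satisfy.
function evaluateExpect(
  expectations: Expectation[],
  metrics: Record<string, number>,
): Violation[] {
  const violations: Violation[] = [];
  for (const e of expectations) {
    const actual: number | undefined = metrics[e.metric];
    // Assumption: a metric that was never measured counts as a violation.
    const ok = actual !== undefined && compare(actual, e.op, e.value);
    if (!ok) {
      violations.push({ metric: e.metric, actual, severity: e.severity ?? "fail" });
    }
  }
  return violations;
}

// Only fail-severity violations make the process exit non-zero.
function exitCodeFor(violations: Violation[]): number {
  return violations.some((v) => v.severity === "fail") ? 1 : 0;
}
```

For example, with a hypothetical successRate of 0.97 measured against a threshold of >= 0.99, the result is a violation: at severity warn it would only be reported, while at the default severity the exit code becomes 1.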

Fixture app

  • An app under test/bench/ (in-memory KV, in-process queue, benchmarkMode, with the recipients the inbox scenario targets) that doubles as the local test server for the scenario tests.

Dependencies

Depends on #783 (the engine and scenario format) and #782 (the benchmarkMode target and the stats probe).

Acceptance criteria

  • A documented command runs against a local Fedify app and yields latency, throughput, success rate, and error summaries.
  • inbox (shared, signed) and webfinger work end-to-end, with inbox discovery via WebFinger.
  • JSON output is suitable for CI comparison; the process exits non-zero when an expect entry with fail severity is violated.
  • The guard refuses a public non-benchmarkMode target without --allow-unsafe-target, which is mandatory (not a prompt) in CI.
  • --dry-run resolves discovery and reports planned load without sending.
  • The test/bench/ fixture is used by the scenario tests.

Documentation

Add usage, safety guidance, and CI examples to docs/manual/benchmarking.md, and link it from docs/manual/deploy.md.

Metadata

Priority

High

Effort

Medium
