From d3e574e83d173618221aeb7c1c6c6f2ff634989e Mon Sep 17 00:00:00 2001
From: Dejan K <sidejan@gmail.com>
Date: Thu, 16 Oct 2025 15:09:09 +0200
Subject: [PATCH 1/4] docs(plumber): add documentation and agent notes for
 plumber service components

---
 plumber/doc/block/AGENTS.md                   | 25 ++++++++
 plumber/doc/block/DOCUMENTATION.md            | 38 +++++++++++
 plumber/doc/definition_validator/AGENTS.md    | 15 +++++
 .../doc/definition_validator/DOCUMENTATION.md | 39 ++++++++++++
 plumber/doc/gofer_client/AGENTS.md            | 23 +++++++
 plumber/doc/gofer_client/DOCUMENTATION.md     | 36 +++++++++++
 plumber/doc/job_matrix/AGENTS.md              | 23 +++++++
 plumber/doc/job_matrix/DOCUMENTATION.md       | 38 +++++++++++
 plumber/doc/looper/AGENTS.md                  | 30 +++++++++
 plumber/doc/looper/DOCUMENTATION.md           | 39 ++++++++++++
 plumber/doc/ppl/AGENTS.md                     | 47 ++++++++++++++
 plumber/doc/ppl/DOCUMENTATION.md              | 63 +++++++++++++++++++
 plumber/doc/repo_proxy_ref/AGENTS.md          | 23 +++++++
 plumber/doc/repo_proxy_ref/DOCUMENTATION.md   | 33 ++++++++++
 .../doc/task_api_referent/DOCUMENTATION.md    | 33 ++++++++++
 15 files changed, 505 insertions(+)
 create mode 100644 plumber/doc/block/AGENTS.md
 create mode 100644 plumber/doc/block/DOCUMENTATION.md
 create mode 100644 plumber/doc/definition_validator/AGENTS.md
 create mode 100644 plumber/doc/definition_validator/DOCUMENTATION.md
 create mode 100644 plumber/doc/gofer_client/AGENTS.md
 create mode 100644 plumber/doc/gofer_client/DOCUMENTATION.md
 create mode 100644 plumber/doc/job_matrix/AGENTS.md
 create mode 100644 plumber/doc/job_matrix/DOCUMENTATION.md
 create mode 100644 plumber/doc/looper/AGENTS.md
 create mode 100644 plumber/doc/looper/DOCUMENTATION.md
 create mode 100644 plumber/doc/ppl/AGENTS.md
 create mode 100644 plumber/doc/ppl/DOCUMENTATION.md
 create mode 100644 plumber/doc/repo_proxy_ref/AGENTS.md
 create mode 100644 plumber/doc/repo_proxy_ref/DOCUMENTATION.md
 create mode 100644 plumber/doc/task_api_referent/DOCUMENTATION.md

diff --git a/plumber/doc/block/AGENTS.md b/plumber/doc/block/AGENTS.md
new file mode 100644
index 000000000..09c43c012
--- /dev/null
+++ b/plumber/doc/block/AGENTS.md
@@ -0,0 +1,25 @@
+# Block Agent Notes
+
+## Quick Map
+- Supervision root: `Block.Application` starts `Block.EctoRepo`, `Block.Sup.STM`, and `Block.Tasks.TaskEventsConsumer`.
+- State machines live in `block/lib/block/{blocks,tasks}/stm_handler/`; each module wraps a Looper worker that polls and advances rows.
+- Persistence: PostgreSQL schema under `block/priv/ecto_repo/migrations`; repo module is `Block.EctoRepo`.
+- RabbitMQ: consumer binds to `task_state_exchange` (routing key `finished`).
+
+## Daily Commands
+- Setup (deps + DB): `cd block && mix setup`.
+- Run migrations only: `cd block && mix ecto.migrate`.
+- Tests: `cd block && MIX_ENV=test mix test` (DB wiped automatically).
+- Console: `cd block && iex -S mix` (ensure `RABBITMQ_URL` and database env vars set).
+
+## Debug Pointers
+- Task stuck RUNNING? Inspect `Block.Tasks.STMHandler.RunningState` logic and confirm RabbitMQ event arrived. Use `rabbitmqadmin get queue=task_state_exchange.finished` for inspection.
+- Blocks not spawning tasks? Check the `Block.CodeRepo` command reader – invalid YAML results propagate from `definition_validator`.
+- STOPPING never finishes? Verify callbacks `:compile_task_done_notification_callback` / `:after_ppl_task_done_notification_callback` in config; missing modules will raise `apply/3` errors.
+- Database drift? Compare schema with latest migrations and rerun `mix ecto.migrate` (test env uses sandbox DB `block_test`).
+
+## Env Vars
+- `RABBITMQ_URL` – required for AMQP consumer/publishers.
+- `COMPILE_TASK_DONE_NOTIFICATION_CALLBACK`, `AFTER_PPL_TASK_DONE_NOTIFICATION_CALLBACK` – module/function tuples written as `{Module, :function}`; the defaults log warnings when unset.
+
+Keep this close when triaging block execution or termination flows.
diff --git a/plumber/doc/block/DOCUMENTATION.md b/plumber/doc/block/DOCUMENTATION.md
new file mode 100644
index 000000000..16309fc0f
--- /dev/null
+++ b/plumber/doc/block/DOCUMENTATION.md
@@ -0,0 +1,38 @@
+# Block Service
+
+## Overview
+Block manages the execution lifecycle of pipeline blocks and their Zebra tasks. It persists block state, reacts to events emitted by the build system, coordinates stop/cancel transitions and publishes follow-up events through AMQP. Although it can run standalone, it is usually started under the main `ppl` application.
+
+## Responsibilities
+- Accept block execution requests produced by `ppl` and materialise them as `block_requests` and `block_builds` rows.
+- Drive block state machines (`INITIALIZING`, `WAITING`, `RUNNING`, `STOPPING`, `DONE`) using Looper-based orchestrators under `Block.Sup.STM`.
+- Monitor Zebra task lifecycle via `Block.Tasks.TaskEventsConsumer` (RabbitMQ exchange `task_state_exchange`, routing key `finished`) and advance corresponding block/task records.
+- Coordinate compilation/after-pipeline callbacks through configurable hooks (`:compile_task_done_notification_callback` and `:after_ppl_task_done_notification_callback`).
+
+## Architecture
+- **Supervision tree**: `Block.Application` boots `Block.EctoRepo`, the STM supervisor (`Block.Sup.STM`) and the RabbitMQ consumer.
+- **State machines**: Implemented in `block/lib/block/blocks/stm_handler/*` and `block/lib/block/tasks/stm_handler/*`; each handler is a Looper worker that periodically picks pending records.
+- **Persistence**: `block/priv/ecto_repo/migrations` define tables for requests, builds, sub-pipelines, and task metadata. `Block.EctoRepo` wraps PostgreSQL via `ecto_sql`.
+- **External dependencies**: communicates with Zebra/Gofer through task IDs, validates commands with `definition_validator`, and emits notifications via AMQP and Watchman metrics.
+
+## Data Flow Highlights
+1. `ppl` schedules a block → `block_requests` + `block_builds` rows are created.
+2. STM loopers transition blocks from `waiting` to `running`, provisioning tasks via Zebra.
+3. Zebra marks task finished → RabbitMQ message consumed → STM handlers move block/task to `done`, determine result/reason and trigger callbacks.
+4. Termination requests push blocks into `stopping`, which instructs Zebra to cancel outstanding tasks; the completion reason is persisted before the event is published.
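+
+A minimal sketch of the transitions above as a pure lookup table. The module name and the event names are hypothetical, and the single stop path shown (out of `running`) is an assumption; the real handlers are Looper workers that poll database rows rather than a static map.
+
+```elixir
+defmodule BlockStateSketch do
+  # Hypothetical transition table. State names follow the Responsibilities
+  # list; event names are invented for illustration only.
+  @transitions %{
+    {:initializing, :validated} => :waiting,
+    {:waiting, :scheduled} => :running,
+    {:running, :task_finished} => :done,
+    {:running, :terminate_requested} => :stopping,
+    {:stopping, :task_finished} => :done
+  }
+
+  # Returns {:ok, next_state} for a known transition, otherwise an error tuple.
+  def next_state(state, event) do
+    case Map.fetch(@transitions, {state, event}) do
+      {:ok, next} -> {:ok, next}
+      :error -> {:error, {:invalid_transition, state, event}}
+    end
+  end
+end
+```
+
+Under these assumptions, `BlockStateSketch.next_state(:running, :task_finished)` returns `{:ok, :done}`, while an unknown pair such as `(:done, :scheduled)` returns an `:invalid_transition` error tuple instead of raising.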
+
+## Configuration
+- `RABBITMQ_URL` – connection string used by `Block.Tasks.TaskEventsConsumer` and Looper AMQP publishers.
+- `COMPILE_TASK_DONE_NOTIFICATION_CALLBACK`, `AFTER_PPL_TASK_DONE_NOTIFICATION_CALLBACK` – optional module/function tuples configured in `config/*.exs` for cross-service signalling.
+- Database credentials configured in `config/{dev,test,prod}.exs` under `Block.EctoRepo`.
+
+## Operations
+- Install deps & run migrations: `cd block && mix setup`.
+- Start locally: `cd block && iex -S mix` (ensure Postgres & RabbitMQ are reachable).
+- Run tests: `cd block && MIX_ENV=test mix test` (DB is managed by `mix test` fixtures).
+- Lint: `cd block && mix credo`.
+
+## Observability
+- Metrics: most STM operations wrap `Util.Metrics.benchmark` (look for Watchman entries prefixed with `Block.*`).
+- Logging: LogTee provides structured logs; search by `block_id`/`task_id` for correlation.
+- RabbitMQ dead-letter queues should be monitored when state transitions stall (stuck messages indicate decoding issues).
diff --git a/plumber/doc/definition_validator/AGENTS.md b/plumber/doc/definition_validator/AGENTS.md
new file mode 100644
index 000000000..abbc98f48
--- /dev/null
+++ b/plumber/doc/definition_validator/AGENTS.md
@@ -0,0 +1,15 @@
+# Definition Validator Agent Notes
+
+## Core Pieces
+- Entry point: `DefinitionValidator.validate_yaml_string/1`.
+- Parsers: `YamlStringParser` (YAML -> map), `YamlMapValidator` (schema via Jesse), `PplBlocksDependencies` (DAG checks), `PromotionsValidator` (promotion semantics).
+- Schema source: `spec/` dependency contains JSON schema files per YAML version.
+
+## Handy Commands
+- Install deps: `cd definition_validator && mix setup`.
+- Run suite: `mix test` (uses fixture YAML under `test/fixtures`).
+- Watch mode: `mix test.watch` while editing schemas or validators.
+- Linting: `mix credo`.
+
+## Debug Tips
+- Capture returned error tuples to surface line/column: `DefinitionValidator.validate_yaml_string(File.read!(
diff --git a/plumber/doc/definition_validator/DOCUMENTATION.md b/plumber/doc/definition_validator/DOCUMENTATION.md
new file mode 100644
index 000000000..8eb5d64e9
--- /dev/null
+++ b/plumber/doc/definition_validator/DOCUMENTATION.md
@@ -0,0 +1,39 @@
+# Definition Validator
+
+## Overview
+Definition Validator validates Semaphore pipeline YAML prior to scheduling. It parses YAML into Elixir maps, verifies schema compliance against the JSON schemas in `spec/`, and enforces additional semantic rules (block dependency graph, promotions). The app is embedded by `ppl` and `block`, but can run stand-alone for linting.
+
+## Responsibilities
+- Parse raw YAML (`DefinitionValidator.YamlStringParser`) and provide meaningful error locations.
+- Validate YAML structures against JSON schema via `YamlMapValidator` (uses Jesse and the `spec/` schemas bundled in the repo).
+- Check higher-level rules not covered by schema (e.g. block dependency DAG is acyclic, promotions configuration sound) through dedicated validators (`PplBlocksDependencies`, `PromotionsValidator`).
+- Expose a single API `DefinitionValidator.validate_yaml_string/1` returning either `{:ok, definition_map}` or `{:error, {:malformed, details}}` ready for UI display.
+
+## Architecture
+- `DefinitionValidator.Application` starts no persistent processes; the library is used synchronously.
+- Validators live in `definition_validator/lib/definition_validator/*` and are pure modules that transform or check maps.
+- Schema assets are maintained in the sibling `spec/` app; they are pulled in via Mix dependency and loaded on demand.
+- Error formatting (`pretty_print`) reorders error tuples to surface position and message first, easing consumer handling.
+
+## Typical Flow
+1. Call `DefinitionValidator.validate_yaml_string(yaml)`.
+2. YAML is decoded using `YamlElixir` and normalized.
+3. JSON schema validation runs; on failure errors are fed through `pretty_print` so UI layers receive structured tuples (`{:data_invalid, position, message, value, spec}`).
+4. Block dependency validator ensures dependencies resolve to existing blocks and the graph is acyclic.
+5. Promotions validator verifies promotion targets (switch definitions, required fields).
+6. Success returns `{:ok, definition_map}` which downstream services persist or forward.
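+
+Step 4 can be illustrated with a small, self-contained check. This is only a sketch: the module name, the `"name"`/`"dependencies"` keys, and the detail tuples inside `{:malformed, ...}` are assumptions, and it is not the actual `PplBlocksDependencies` implementation.
+
+```elixir
+defmodule DependencyCheckSketch do
+  # `blocks` is a list of maps; the "name" and "dependencies" keys are assumed.
+  def check(blocks) do
+    graph = Map.new(blocks, fn b -> {b["name"], Map.get(b, "dependencies", [])} end)
+
+    with :ok <- check_known(graph),
+         :ok <- check_acyclic(graph) do
+      :ok
+    end
+  end
+
+  # Every dependency must name an existing block.
+  defp check_known(graph) do
+    unknown =
+      for {name, deps} <- graph, dep <- deps, not Map.has_key?(graph, dep), do: {name, dep}
+
+    case unknown do
+      [] -> :ok
+      _ -> {:error, {:malformed, {:unknown_dependencies, Enum.sort(unknown)}}}
+    end
+  end
+
+  # Kahn-style check: repeatedly remove blocks whose dependencies are resolved.
+  defp check_acyclic(graph) when map_size(graph) == 0, do: :ok
+
+  defp check_acyclic(graph) do
+    resolved = for {name, []} <- graph, do: name
+
+    if resolved == [] do
+      # Nothing can be resolved: the remaining blocks sit on or behind a cycle.
+      {:error, {:malformed, {:dependency_cycle, graph |> Map.keys() |> Enum.sort()}}}
+    else
+      graph
+      |> Map.drop(resolved)
+      |> Map.new(fn {name, deps} -> {name, Enum.reject(deps, &(&1 in resolved))} end)
+      |> check_acyclic()
+    end
+  end
+end
+```
+
+Returning `{:error, {:malformed, details}}` mirrors the error shape described above, so a caller could treat this check like any other validation stage.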
+
+## Operations
+- Install deps: `cd definition_validator && mix setup`.
+- Run tests: `cd definition_validator && MIX_ENV=test mix test`.
+- Continuous validation for local work: `cd definition_validator && mix test.watch`.
+- Lint: `mix credo`.
+
+## Configuration
+- No runtime configuration is required; optional `spec` branch selection is achieved by changing the dependency revision.
+- The app reads `mix_env` from application env for log verbosity (set in `config/config.exs`).
+
+## Integration Notes
+- Consumers (ppl, block) treat any `{:error, {:malformed, ...}}` as hard failures and bubble them back to clients.
+- Ensure `spec/` is updated when YAML schema evolves; run the validator test suite to catch regressions.
+- All outputs are pure maps/tuples, making the library safe to call from IEx for debugging invalid YAML.
diff --git a/plumber/doc/gofer_client/AGENTS.md b/plumber/doc/gofer_client/AGENTS.md
new file mode 100644
index 000000000..dec866898
--- /dev/null
+++ b/plumber/doc/gofer_client/AGENTS.md
@@ -0,0 +1,23 @@
+# Gofer Client Agent Notes
+
+## Essentials
+- Public API: `create_switch/4`, `pipeline_done/3`, `verify_deployment_target_access/4`.
+- Transport: `GoferClient.GrpcClient` wraps `GRPC.Stub` with connection pooling.
+- Formatters: `RequestFormatter` builds protobuf structs, `ResponseParser` unwraps responses/errors.
+- Feature flag: `SKIP_PROMOTIONS=true` bypasses outbound calls (used in dev/tests).
+
+## Commands
+- Setup deps: `cd gofer_client && mix setup`.
+- Run tests: `mix test` (uses `grpc_mock` to fake Gofer).
+- Credo lint: `mix credo`.
+- Exercise manually: `SKIP_PROMOTIONS=false iex -S mix` then call helper functions with sample YAML maps.
+
+## Debug Tips
+- Verify host/port via `Application.get_env(:gofer_client, GoferClient.GrpcClient)`.
+- gRPC error tuples come back as `{:error, %GRPC.RPCError{}}`; inspect `error.status` and `error.message`.
+- Promotions hanging? Check `SKIP_PROMOTIONS`, ensure Gofer is reachable, and confirm TLS cert paths in `config/runtime.exs`.
+- Request formatting failures usually mean YAML map lacks promotion data; see `RequestFormatter.form_create_request/4`.
+
+## Integration
+- Library does not supervise retries—callers (`ppl`) must decide how to handle failure.
+- When adding new Gofer RPCs, follow the same pattern: format -> client -> parser, and extend tests with mocked responses.
diff --git a/plumber/doc/gofer_client/DOCUMENTATION.md b/plumber/doc/gofer_client/DOCUMENTATION.md
new file mode 100644
index 000000000..71a6e335d
--- /dev/null
+++ b/plumber/doc/gofer_client/DOCUMENTATION.md
@@ -0,0 +1,36 @@
+# Gofer Client
+
+## Overview
+Gofer Client is a thin gRPC wrapper around the Gofer promotions service. It provisions promotion switches, notifies Gofer when pipelines finish, and verifies deployment target access. The library is consumed by `ppl` during promotion flows.
+
+## Responsibilities
+- Format protobuf requests for Gofer RPCs (`RequestFormatter`).
+- Maintain gRPC channels to the Gofer service (`GrpcClient`).
+- Parse responses into simple Elixir tuples (`ResponseParser`).
+- Allow promotion flows to be bypassed locally via `SKIP_PROMOTIONS`.
+
+## Architecture
+- `GoferClient` exposes three public functions: `create_switch/4`, `pipeline_done/3`, and `verify_deployment_target_access/4`.
+- `GoferClient.Application` supervises the gRPC connection workers (host/port defined in `config/*.exs`).
+- Request/response formatting lives in dedicated modules so they can be unit tested without hitting Gofer.
+- Test support mocks Gofer via `grpc_mock` to keep CI hermetic.
+
+## Interaction Points
+1. **Switch creation** – serialises YAML definition, previous artefact IDs, and ref args into `InternalApi.Gofer.CreateSwitchRequest` before dispatching to Gofer.
+2. **Pipeline done notification** – informs Gofer when a promoted pipeline finishes so switches can advance or unlock.
+3. **Deployment target access** – checks if the triggerer is authorised to deploy to a guarded environment before scheduling promotions.
+
+## Configuration
+- `SKIP_PROMOTIONS` (`true`/`false`) – when true all public functions short-circuit to `{:ok, ""}`.
+- `GOFER_GRPC_HOST`, `GOFER_GRPC_PORT`, and TLS parameters – set in `config/{dev,test,prod}.exs`; defaults point to docker-compose services.
+- Timeout/retry settings live in `config/config.exs` under the `GoferClient.GrpcClient` key.
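+
+The `SKIP_PROMOTIONS` short-circuit can be pictured roughly as follows. The module and function names are hypothetical, and the zero-arity `call` function merely stands in for the real gRPC round trip.
+
+```elixir
+defmodule SkipPromotionsSketch do
+  # `call` is a hypothetical stand-in for the outbound Gofer request.
+  # `env` defaults to the SKIP_PROMOTIONS variable but can be passed explicitly.
+  def maybe_call(call, env \\ System.get_env("SKIP_PROMOTIONS")) when is_function(call, 0) do
+    if env == "true" do
+      # Bypass the outbound call and return the documented placeholder result.
+      {:ok, ""}
+    else
+      call.()
+    end
+  end
+end
+```
+
+Passing the flag value as an argument keeps the sketch easy to exercise in tests without mutating the process environment.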
+
+## Operations
+- Install deps: `cd gofer_client && mix setup` (alias pulls deps only).
+- Run tests: `mix test` (mocks gRPC calls by default).
+- Lint: `mix credo`.
+- Connect to a real Gofer instance by exporting the host/port env vars and ensuring network reachability.
+
+## Failure Modes
+- Network errors bubble up as `{:error, reason}`; callers in `ppl` decide whether to retry or skip promotions.
+- Gofer validation failures are returned as `{:error, {:gofer, status, message}}` by `ResponseParser` and should be surfaced to users.
diff --git a/plumber/doc/job_matrix/AGENTS.md b/plumber/doc/job_matrix/AGENTS.md
new file mode 100644
index 000000000..b40584f05
--- /dev/null
+++ b/plumber/doc/job_matrix/AGENTS.md
@@ -0,0 +1,23 @@
+# Job Matrix Agent Notes
+
+## Essentials
+- Validation entry: `JobMatrix.Validator.validate/1` (returns `{:ok, _}` or `{:error, {:malformed, message}}`).
+- Expansion entry: `JobMatrix.Handler.expand_job/1` (turns a job map with `matrix` key into a list of job variants).
+- Parallelism shorthand: `JobMatrix.ParallelismHandler.parallelize_jobs/1` converts `parallelism: N` into matrix/env vars.
+- Cartesian builder: `JobMatrix.Cartesian.product/3` and `JobMatrix.Transformer.to_env_vars_list/1` produce env-var combinations.
+
+## Quick Commands
+- Install deps: `cd job_matrix && mix deps.get` (or `mix setup`).
+- Run tests: `mix test` (covers validator, transformer, handler, parallelism).
+- Lint: `mix credo`.
+
+## Debug Tips
+- Capture `{:error, {:malformed, msg}}` to bubble up user-friendly errors; avoid raising.
+- When job names look odd, inspect `JobMatrix.Handler` name-generation logic (adds suffix describing env var values or index/count pairs).
+- Duplicate axis names trigger `Duplicate name` errors—ensure YAML block defines unique `env_var`/`software` keys.
+- Parallelism path injects `SEMAPHORE_JOB_INDEX` and `SEMAPHORE_JOB_COUNT` env vars; verify downstream relies on them before changing.
+
+## Integration Notes
+- Library is pure; no processes to supervise. Safe to use in tests and compile-time.
+- Update both validator and transformer when extending matrix syntax.
+- Consumers typically call validator first, then handler; keep that sequence to prevent `throw` propagation.
diff --git a/plumber/doc/job_matrix/DOCUMENTATION.md b/plumber/doc/job_matrix/DOCUMENTATION.md
new file mode 100644
index 000000000..983f6bc0a
--- /dev/null
+++ b/plumber/doc/job_matrix/DOCUMENTATION.md
@@ -0,0 +1,38 @@
+# Job Matrix Service
+
+## Overview
+`job_matrix` is a library application that expands a job's matrix/parallelism definition into concrete job variants. It converts YAML block definitions into explicit job maps with environment variables, validates matrix syntax, and is used by both `ppl` and `block` before persisting jobs or scheduling builds.
+
+## Responsibilities
+- Validate matrix definitions supplied in a job (`JobMatrix.Validator`).
+- Convert matrix axes to environment-variable combinations using cartesian products (`JobMatrix.Cartesian`, `JobMatrix.Transformer`).
+- Generate derived jobs with expanded env vars and unique names (`JobMatrix.Handler`).
+- Support the legacy `parallelism` shortcut for creating `SEMAPHORE_JOB_INDEX`/`SEMAPHORE_JOB_COUNT` environment variables (`JobMatrix.ParallelismHandler`).
+
+## Architecture
+- Pure functional modules; no supervision tree or runtime processes.
+- Entry points:
+  - `JobMatrix.Handler.expand_job/1` (called from block scheduler to expand `matrix` definitions).
+  - `JobMatrix.ParallelismHandler.parallelize_jobs/1` (mutates blocks by turning `parallelism` into a 1D matrix).
+  - `JobMatrix.Validator.validate/1` (ensures schema correctness; reused by `definition_validator`).
+- The cartesian builder works on `%{"env_var" => "FOO", "values" => [...]}` and `%{"software" => "BAR", "versions" => [...]}` axes.
+
+## Data Flow
+1. Definition validator ensures `matrix` fields are well shaped.
+2. `JobMatrix.Handler` receives a job map, validates the matrix, obtains env var combinations, and clones the job per combination.
+3. Each generated job receives concatenated env vars, with name suffixes reflecting the matrix values or counts.
+4. For `parallelism: N`, handler generates `matrix = SEMAPHORE_JOB_INDEX 1..N` and injects `SEMAPHORE_JOB_COUNT`.
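+
+The expansion in steps 2 to 4 can be sketched as a cartesian product over the axes. This is an illustration under assumptions: the module name and the `%{"name" => ..., "value" => ...}` output shape are hypothetical and are not taken from `JobMatrix.Cartesian` or `JobMatrix.Transformer`.
+
+```elixir
+defmodule MatrixSketch do
+  # Builds every combination of axis values as a list of env-var maps.
+  def combinations(axes) do
+    Enum.reduce(axes, [[]], fn axis, acc ->
+      {name, values} = axis_to_pair(axis)
+
+      for combo <- acc, value <- values do
+        combo ++ [%{"name" => name, "value" => to_string(value)}]
+      end
+    end)
+  end
+
+  # Treats `parallelism: n` as a one-axis matrix plus a constant job count.
+  def parallelism(n) when is_integer(n) and n > 0 do
+    count = %{"name" => "SEMAPHORE_JOB_COUNT", "value" => to_string(n)}
+
+    [%{"env_var" => "SEMAPHORE_JOB_INDEX", "values" => Enum.to_list(1..n)}]
+    |> combinations()
+    |> Enum.map(&(&1 ++ [count]))
+  end
+
+  # The two axis shapes described in the Architecture section.
+  defp axis_to_pair(%{"env_var" => name, "values" => values}), do: {name, values}
+  defp axis_to_pair(%{"software" => name, "versions" => versions}), do: {name, versions}
+end
+```
+
+For example, an `env_var` axis with two values combined with a `software` axis with one version yields two combinations, one per job variant.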
+
+## Error Handling
+- Validation throws `{:malformed, message}` tuples when matrix structure is invalid (non-list, missing keys, duplicate axis names, empty value lists).
+- Handler catches thrown errors and propagates them upward; callers translate results into gRPC/HTTP errors.
+
+## Operations
+- Install deps: `cd job_matrix && mix deps.get` (or `mix setup`).
+- Run tests: `mix test` (covering transformer, validator, parallelism).
+- Used purely as a dependency; no application start required beyond compile.
+
+## Integration Notes
+- The library returns either `{:ok, jobs}` or `{:error, reason}`; consumers must handle tuples, not exceptions.
+- Ensure new YAML syntax updates keep validator/transformer modules in sync.
+- When extending axis syntax, add cases in both `Validator` and `Transformer` and update tests.
diff --git a/plumber/doc/looper/AGENTS.md b/plumber/doc/looper/AGENTS.md
new file mode 100644
index 000000000..ebdd70e61
--- /dev/null
+++ b/plumber/doc/looper/AGENTS.md
@@ -0,0 +1,30 @@
+# Looper Agent Notes
+
+## Core Modules
+- `Looper.STM` – macro for state machine workers; takes `repo`, `schema`, `allowed_states`, `publisher_cb`, `task_supervisor`, `cooling_time_sec`.
+- `Looper.Periodic` – macro for defining recurring jobs with jitter/backoff.
+- `Looper.StateResidency` / `Looper.StateWatch` – track how long records stay in each state.
+- `Looper.Util`, `Looper.CommonQuery`, `Looper.Ctx` – helper modules used by generated code.
+
+## Typical Usage
+1. Define a module using `use Looper.STM, id: :pipeline_initializing, repo: ..., schema: ...`.
+2. Implement callbacks (`scheduling_handler/1`, `terminate_request_handler/2`, etc.).
+3. Provide a `publisher_cb` for RabbitMQ if you expect events.
+4. Start the module under your supervision tree (see `Ppl.Sup.STM`).
+
+## Commands
+- Compile/test only: `cd looper && mix test`.
+- Static analysis: `mix credo`.
+- Docs (helpful for consumers): `mix docs` (generates moduledoc HTML).
+
+## Debug Tips
+- Looper wraps handler calls in `Wormhole.capture`; check Wormhole logs for retries.
+- Cooling time set too high blocks scheduling; adjust `cooling_time_sec` in the args map.
+- `publisher_cb: :skip` disables event emission—useful for tests.
+- All identifiers ending with `_id` are auto-extracted for publisher payloads; ensure structs expose them.
+- Looper catches thrown `{:error, reason}` tuples and logs via LogTee; review structured logs when loops halt.
+
+## Extending
+- When adding new arguments to macros, ensure backwards compatibility (default options).
+- Update consuming services (`ppl`, `block`) once new APIs land.
+- Provide tests in `looper/test/` demonstrating the new behaviour.
diff --git a/plumber/doc/looper/DOCUMENTATION.md b/plumber/doc/looper/DOCUMENTATION.md
new file mode 100644
index 000000000..08742ca1f
--- /dev/null
+++ b/plumber/doc/looper/DOCUMENTATION.md
@@ -0,0 +1,39 @@
+# Looper
+
+## Overview
+Looper is a shared library that provides reusable building blocks for long-running workers inside plumber services. It standardises state-machine loopers (STM), periodic jobs, state residency tracking, and publisher integrations. `ppl` and `block` rely on Looper macros to implement their background schedulers with consistent logging, metrics, and retry semantics.
+
+## Capabilities
+- **STM (`Looper.STM`)** – macro that generates GenServer-based schedulers which:
+  - Poll Ecto schemas using user-supplied queries.
+  - Enforce cooling-off periods between runs.
+  - Dispatch handler callbacks for scheduling and termination logic.
+  - Publish state transitions via RabbitMQ using the configured callback.
+- **Periodic (`Looper.Periodic`)** – DSL for recurring jobs with jitter, backoff, and metrics.
+- **State watch/residency (`Looper.StateWatch`, `Looper.StateResidency`)** – helpers that record how long records stay in given states; used for SLA monitoring.
+- **Common utilities** – context struct builders (`Looper.Ctx`), query helpers (`Looper.CommonQuery`), benchmarking/logging wrappers (`Looper.Util`).
+
+## How It Fits In
+- `ppl/lib/ppl/ppls/stm_handler/*` use `use Looper.STM` to implement pipeline state machines.
+- `block/lib/block/blocks/stm_handler/*` and task handlers rely on the same macros to control block execution.
+- RabbitMQ publishing hooks plug into Looper’s `publisher_cb` argument to send events after each state transition.
+- Metrics are emitted via `Util.Metrics` hooks that Looper invokes automatically when provided.
+
+## Extending Looper
+- Implement new behaviour by adding macros in `lib/looper/` and exposing them via documented APIs.
+- Ensure new features remain composable—Looper modules are consumed at compile time by other apps.
+- Provide sensible defaults via optional arguments in macros to reduce boilerplate for consumers.
+
+## Operations
+- Looper is a pure library; no runtime processes beyond what consuming apps spin up.
+- Install deps / compile: `cd looper && mix deps.get` (or `mix compile`).
+- Run tests: `mix test` (validates helper modules and macros with mocked repos).
+- Document macros via inline moduledocs to assist downstream developers.
+
+## Design Notes
+- Looper macros expect arguments such as `repo`, `schema`, `initial_query`, `allowed_states`, `publisher_cb`, and `task_supervisor`.
+- `Looper.STM` wraps handler callbacks in `Wormhole` for retry and exception capturing.
+- Publishing is performed only when state changes; Looper extracts identifiers ending in `_id` to include in events.
+- Periodic jobs use `:timer.send_after/2` under the hood; ensure handlers are idempotent.
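+
+The `_id` extraction mentioned above might look roughly like this. It is a sketch under assumptions: the module and function names are hypothetical, and Looper's actual implementation may differ.
+
+```elixir
+defmodule IdExtractSketch do
+  # Keeps only the fields whose key ends in "_id", e.g. for an event payload.
+  def extract_ids(record) when is_map(record) do
+    record
+    |> to_plain_map()
+    |> Enum.filter(fn {key, _value} -> key |> to_string() |> String.ends_with?("_id") end)
+    |> Map.new()
+  end
+
+  # Structs are not enumerable, so convert them to plain maps first.
+  defp to_plain_map(%_{} = struct), do: Map.from_struct(struct)
+  defp to_plain_map(map), do: map
+end
+```
+
+Given `%{ppl_id: "p1", wf_id: "w1", state: "done"}`, the sketch returns `%{ppl_id: "p1", wf_id: "w1"}`.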
+
+Keep this document updated when Looper macros gain new capabilities or their contracts change so downstream services know how to adapt.
diff --git a/plumber/doc/ppl/AGENTS.md b/plumber/doc/ppl/AGENTS.md
new file mode 100644
index 000000000..71749ec6a
--- /dev/null
+++ b/plumber/doc/ppl/AGENTS.md
@@ -0,0 +1,47 @@
+# Plumber Agent Notes
+
+This file is a fast lane when you need to patch or extend the Pipelines (plumber) service.
+
+## Mental Model
+- `ppl/` holds the gRPC boundary and state machines; handlers live under `ppl/lib/ppl/grpc/` and call into context modules under `ppl/lib/ppl/`.
+- YAML validation, block execution, matrix expansion and Gofer integration are separate OTP apps within the repo (see `definition_validator/`, `block/`, `job_matrix/`, `gofer_client/`). They are started as dependencies of `ppl` via `mix.exs` and supervised from `Ppl.Application`.
+- Protobufs live in `proto/` (`InternalApi.Plumber.*` and `InternalApi.PlumberWF.*`). When protos change run `mix deps.get` + `mix compile` inside `proto/` and `ppl/` to regenerate modules.
+- Persistent data: PostgreSQL via `Ppl.EctoRepo` and `Block.EctoRepo`; migrations live under respective `priv/repo/migrations/` directories.
+
+## Workflow Cheat Sheet
+- **Local setup**: `cd ppl && mix setup` (installs deps, creates DBs, runs migrations). If DB credentials change, edit `ppl/config/*.exs` and `block/config/*.exs` together.
+- **Run tests**: `cd ppl && MIX_ENV=test mix test`. Add `MIX_ENV=test mix ecto.reset` when fixtures drift.
+- **Start gRPC server locally**: `cd ppl && iex -S mix`. Look for `Plumber.Endpoint` in the supervision tree; it exposes both Pipeline and Workflow services on the configured port (for the default, see `config/config.exs`).
+- **Lint**: `cd ppl && mix credo`.
+- **Dialyzer**: `cd ppl && mix dialyzer` (takes a while, usually cached in `_build`).
+
+## Observability Hooks
+- Pipeline/Block state changes publish to RabbitMQ exchanges (`pipeline_state_exchange`, `pipeline_block_state_exchange`, `after_pipeline_state_exchange`). Check publisher modules under `ppl/lib/ppl/publishers/` when debugging missing events.
+- Triggerers & request tokens are logged via `LogTee`; grep the request token in central logs to correlate gRPC call and DB row.
+- Cachex caches YAML, queue stats, etc. Purge via `Cachex.clear/1` while in IEx if stale data blocks you.
+
+## Common Code Paths
+- Scheduling: `ppl/lib/ppl/pipeline/scheduler.ex` (entry point from gRPC) → `definition_validator` → DB insert + event publish.
+- Termination & stopping: `ppl/lib/ppl/ppls/stm_handler/` (state machine handlers) coordinate with `block/` to stop running jobs.
+- Listing: `ppl/lib/ppl/pipelines/query/*.ex` contain Ecto queries; keyset pagination uses the `paginator` library.
+- Workflow surface: see `ppl/lib/ppl/workflows/`.
+
+## Gotchas
+- Every public response includes `ResponseStatus` / `InternalApi.Status`; handlers must set it even on success. Tests usually assert for `code == :OK`.
+- `Schedule`, `ScheduleExtension`, `PartialRebuild`, `Reschedule` all rely on unique `request_token`. Do not skip the idempotency check; see `ppl/lib/ppl/idempotency/`.
+- The proto still lists `BlockService.BuildFinished` but the RPC is intentionally disabled—status comes from AMQP. Avoid resurrecting it unless spec changes.
+- Pagination mix: offset (`List*`), keyset (`ListKeyset`, `ListGroupedKS`, `ListLatestWorkflows`). Ensure you return both tokens even when empty.
+- If you touch migrations remember to run them for both repos (`mix ecto.migrate -r Ppl.EctoRepo -r Block.EctoRepo`).
+
+## Useful Queries
+- Describe pipeline: `grpcurl -plaintext -proto proto/plumber.pipeline.proto -d '{"ppl_id":"..."}' localhost:50051 InternalApi.Plumber.PipelineService.Describe`
+- Terminate pipeline: same service, `Terminate` RPC. Always include `requester_id`.
+- Workflow schedule: `grpcurl ... InternalApi.PlumberWF.WorkflowService.Schedule` (requires repo metadata; easiest to capture from logs/tests).
+
+## When Things Break
+1. **Proto mismatch** – regenerate modules (`mix deps.compile proto`). Make sure `proto` app version matches.
+2. **DB issues** – inspect `ppl/priv/repo/migrations/` for expected schemas; confirm config in `config/dev.exs`.
+3. **State stuck in STOPPING** – review SM handlers in `ppl/lib/ppl/ppls/stm_handler/`, check AMQP event delivery, and ensure block worker acked the stop.
+4. **List endpoints slow** – check indices in migrations; pagination queries rely on composite indexes (branch, created_at, etc.).
+
+Keep this file close to the code; update alongside major refactors or schema/proto changes.
diff --git a/plumber/doc/ppl/DOCUMENTATION.md b/plumber/doc/ppl/DOCUMENTATION.md
new file mode 100644
index 000000000..b12cd2b0a
--- /dev/null
+++ b/plumber/doc/ppl/DOCUMENTATION.md
@@ -0,0 +1,63 @@
+# Pipelines Service (ppl)
+
+## Overview
+`ppl` is the entry-point application of the plumber stack. It exposes the gRPC APIs defined in `InternalApi.Plumber.*` and `InternalApi.PlumberWF.*`, persists pipeline/workflow state, and coordinates subordinate services such as Block, Definition Validator, Gofer Client, and Job Matrix. The service accepts scheduling requests, drives pipeline state machines, emits AMQP events, and services listing/describe calls used by UI and automation clients.
+
+## Responsibilities
+- Handle gRPC traffic for pipeline (`PipelineService`, `Admin`) and workflow (`WorkflowService`) APIs via handlers under `ppl/lib/ppl/grpc/`.
+- Persist pipelines, workflows, block/job snapshots, and auxiliary data in PostgreSQL (`Ppl.EctoRepo`).
+- Orchestrate pipeline state transitions through STM workers started under `Ppl.Sup.STM` (Looper-driven).
+- Publish pipeline/block/after-pipeline events to RabbitMQ exchanges for downstream consumers.
+- Integrate with sibling services: Block for block execution, Definition Validator for YAML checks, Gofer for promotions, Zebra/Task API through clients.
+
+## Architecture
+- **Supervision tree**: `Ppl.Application` boots cache processes (`Ppl.Cache`), Ecto repo, Looper supervisors (`Ppl.Sup.STM`), RabbitMQ consumers (e.g. `Ppl.OrgEventsConsumer`), and the gRPC server (`GRPC.Server.Supervisor` with `Ppl.Grpc.Server`, `Plumber.WorkflowAPI.Server`, `Ppl.Admin.Server`, `Ppl.Grpc.HealthCheck`).
+- **State machines**: Located in `ppl/lib/ppl/ppls/stm_handler/` (pipeline-level handlers) and other contexts; they poll for rows needing transitions (scheduling, stopping, cleanup).
+- **gRPC layer**: Request/response modules under `ppl/lib/ppl/grpc/` translate protobuf messages to domain commands. Separate modules exist per surface (pipeline, workflow, admin, health).
+- **Persistence**: Migrations live in `ppl/priv/repo/migrations`. Tables mirror proto structures (pipelines with state/result fields, workflows, queues, requesters, artefacts, etc.).
+- **Caching**: `Ppl.Cache` (Cachex) stores frequently accessed data such as YAML payloads or queue lookups.
+- **AMQP publishing**: See `ppl/lib/ppl/publishers/` for event emitters targeting `pipeline_state_exchange`, `pipeline_block_state_exchange`, and `after_pipeline_state_exchange`.
+
+## External Interfaces
+- **gRPC APIs**: Implements the surfaces defined in `proto/plumber.pipeline.proto` and `proto/plumber_w_f.workflow.proto` (schedule, describe, list, terminate, partial rebuild, run now, delete, admin terminate all, get yaml, etc.).
+- **RabbitMQ**: Consumes organisation events (`Ppl.OrgEventsConsumer`) and publishes pipeline/block state events.
+- **AMQP tasks**: Collaborates with Block, which handles the actual block execution; Ppl updates Block via the database and AMQP triggers.
+- **Gofer**: Uses `GoferClient` to create switches and notify promotions.
+- **Task API**: Interacts with Zebra via Task API clients under `ppl/lib/ppl/task_api_client/` when managing tasks directly.
+
+## Typical Flow
+1. **Schedule**: `WorkflowService.Schedule` or `PipelineService.Schedule` receives a request → YAML validated (`DefinitionValidator`) → records created in Postgres → initial pipeline enters STM queue.
+2. **Execution**: The STM handler moves the pipeline through `pending`, `queuing`, and `running`, and coordinates with Block to run blocks. An event is published on each state change.
+3. **Inspection**: Clients call `Describe`/`DescribeMany`/`List*` handlers which query Ecto using modules in `ppl/lib/ppl/pipeline/query/` and `ppl/lib/ppl/workflow/workflow_queries.ex`.
+4. **Termination**: `Terminate` or `TerminateAll` sets the termination intent and pushes the pipeline to `stopping`; Block handles per-block cancellation, and a final `DONE` event is emitted.
+5. **Promotions / Partial rebuild**: `ScheduleExtension` and `PartialRebuild` endpoints reuse existing pipeline metadata, call Gofer when needed, and ensure idempotency via request tokens.
+
+## Configuration
+Key environment variables (see `config/runtime.exs`):
+- Database: `DATABASE_URL` (or specific `PPL_DATABASE_*` vars), pool size, SSL settings.
+- AMQP: `RABBITMQ_URL` for publishers and consumers.
+- Rate limiting: `IN_FLIGHT_DESCRIBE_LIMIT`, `IN_FLIGHT_LIST_LIMIT` used by `Ppl.Grpc.InFlightCounter`.
+- Promotions: `SKIP_PROMOTIONS`, Gofer host/port.
+- Telemetry/logging: Watchman, LogTee configuration, Sentry (if enabled).
+
+## Operations
+- Set up everything: `cd ppl && mix setup` (deps + migrations for both `Ppl.EctoRepo` and `Block.EctoRepo`).
+- Run migrations: `mix ecto.migrate -r Ppl.EctoRepo -r Block.EctoRepo`.
+- Tests: `mix test` (spawns gRPC mocks, uses sandbox DB).
+- Lint: `mix credo`; Dialyzer: `mix dialyzer`.
+- Start locally: `iex -S mix` (ensures gRPC server listens on configured port, default 50051).
+
+## Observability
+- Metrics: Watchman metrics emitted via `Util.Metrics` wrappers (search prefixes `Ppl.*`).
+- Logging: LogTee structures logs with tags (`ppl_id`, `wf_id`, `request_token`).
+- Events: RabbitMQ exchanges provide real-time state transitions; check for missing events when the UI seems stale.
+- Health: `Ppl.Grpc.HealthCheck` implements gRPC health checking (used by k8s).
+
+## Key Code Hotspots
+- Pipeline queries: `ppl/lib/ppl/workflow/workflow_queries.ex`, `ppl/lib/ppl/pipeline/query/`.
+- STM handlers: `ppl/lib/ppl/ppls/stm_handler/*`.
+- gRPC servers: `ppl/lib/ppl/grpc/server.ex`, `plumber/workflow_api/server.ex`, `ppl/admin/server.ex`.
+- Idempotency: `ppl/lib/ppl/idempotency` modules ensure `request_token` semantics.
+- Publishers: `ppl/lib/ppl/publishers/pipeline_event_publisher.ex` and related files.
+
+Keep this document in sync with proto changes and major STM refactors.
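The state progression in the Typical Flow section of `ppl/DOCUMENTATION.md` can be read as a small transition table. The following is a minimal, language-neutral sketch in Python, not the Elixir STM handlers themselves: the `Pipeline` class, `NEXT_STATE`, `step`, and `run_to_completion` are hypothetical names, the state list contains only the states named in the document (the real service may define more), and the in-memory `events` list stands in for AMQP publishing.

```python
from dataclasses import dataclass, field

# Forward path using only the states named in the Typical Flow section.
# Assumption: the real service may have additional intermediate states.
NEXT_STATE = {
    "pending": "queuing",
    "queuing": "running",
    "running": "done",
    "stopping": "done",
}


@dataclass
class Pipeline:
    ppl_id: str
    state: str = "pending"
    terminate_requested: bool = False
    # Stand-in for publishing to a RabbitMQ exchange (hypothetical).
    events: list = field(default_factory=list)


def step(ppl: Pipeline) -> bool:
    """Apply one transition; return False once the pipeline is done."""
    if ppl.state == "done":
        return False
    if ppl.terminate_requested and ppl.state != "stopping":
        # A termination intent diverts the pipeline to `stopping` first.
        new_state = "stopping"
    else:
        new_state = NEXT_STATE[ppl.state]
    ppl.state = new_state
    # One event per state change, mirroring "published on each change".
    ppl.events.append((ppl.ppl_id, new_state))
    return True


def run_to_completion(ppl: Pipeline) -> list:
    """Drive the pipeline until done and return the states it visited."""
    while step(ppl):
        pass
    return [state for _, state in ppl.events]
```

Calling `run_to_completion(Pipeline("p1"))` walks the happy path, while setting `terminate_requested = True` between calls to `step` routes the pipeline through `stopping` before `done`.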
diff --git a/plumber/doc/repo_proxy_ref/AGENTS.md b/plumber/doc/repo_proxy_ref/AGENTS.md
new file mode 100644
index 000000000..0b90f3a82
--- /dev/null
+++ b/plumber/doc/repo_proxy_ref/AGENTS.md
@@ -0,0 +1,23 @@
+# Repo Proxy Referent Agent Notes
+
+## Key Files
+- `lib/repo_proxy_ref/grpc/server.ex` – canned responses for `Describe` and `CreateBlank` calls.
+- `lib/repo_proxy_ref/grpc/health_check.ex` – gRPC health endpoint.
+- `config/*.exs` – port, TLS, and logging configuration for the stub.
+
+## Commands
+- Set up deps: `cd repo_proxy_ref && mix deps.get` (or `mix setup`).
+- Run tests: `mix test`.
+- Start stub locally: `iex -S mix` (defaults to the port in config; override with `REPO_PROXY_REF_PORT`).
+- Smoke test: `grpcurl -plaintext -d '{"hook_id":"master"}' localhost:<port> InternalApi.RepoProxy.RepoProxyService/Describe` (grpcurl only parses flags such as `-d` when they precede the address).
+
+## Debug Tips
+- Scenario selection is driven by `hook_id` (describe) and `request_token` (create_blank). Inspect matches in `server.ex` when adding new fixtures.
+- Timeout simulations use `:timer.sleep/1`; adjust durations if tests get flaky.
+- Protobuf structs are built with `Util.Proto.deep_new!`; mismatched fields usually mean the proto dependency is outdated.
+- Search logs with `tag:repo_proxy_ref` (LogTee) to correlate requests during plumber runs.
+
+## Extending
+- Add new canned repo states by updating helper functions (`mock_repo/1`, `mock_hook/1`).
+- Keep commit SHA generation deterministic if tests assert on values.
+- Update tests under `test/` whenever you add or change scenarios.
diff --git a/plumber/doc/repo_proxy_ref/DOCUMENTATION.md b/plumber/doc/repo_proxy_ref/DOCUMENTATION.md
new file mode 100644
index 000000000..aea8ad566
--- /dev/null
+++ b/plumber/doc/repo_proxy_ref/DOCUMENTATION.md
@@ -0,0 +1,33 @@
+# Repo Proxy Referent
+
+## Overview
+`repo_proxy_ref` is a lightweight gRPC stub that mimics the external Repo Proxy service used during integration tests and local development. It serves deterministic responses for `Describe` and `CreateBlank` RPCs so plumber components can exercise promotion and scheduling flows without contacting the real repo-proxy.
+
+## Responsibilities
+- Implement `InternalApi.RepoProxy.RepoProxyService` with canned responses covering happy-path, timeout, and error scenarios.
+- Provide synthetic hook metadata (branch, PR, tag cases) for pipelines under test.
+- Generate commit SHAs and repo details expected by downstream services when they schedule pipelines or fetch YAML from repositories.
+
+## Architecture
+- `RepoProxyRef.Application` boots a gRPC server (`GRPC.Server.Supervisor`) exposing `RepoProxyRef.Grpc.Server` and `RepoProxyRef.Grpc.HealthCheck`.
+- `RepoProxyRef.Grpc.Server` matches incoming `hook_id` / `request_token` values to predefined behaviours:
+  - `hook_id: "timeout"` / `request_token: "timeout"` simulate slow dependencies.
+  - `hook_id: "bad_param"` returns `BAD_PARAM` codes.
+  - Standard IDs return OK payloads built with helper functions.
+- Responses are constructed via `Util.Proto.deep_new!/2`, ensuring they stay aligned with protobuf definitions.
+
+## Usage Patterns
+- Docker compose and test suites start the referent alongside plumber to isolate repo-proxy interactions.
+- When running plumber locally, exporting `REPO_PROXY_URL` to point at the referent gRPC endpoint allows schedule/describe flows to proceed without external dependencies.
+- The service also acts as a fixture provider for scenario-specific commits (`10_schedule_extension`, `14_free_topology_failing_block`, etc.).
+
+## Operations
+- Install deps & compile: `cd repo_proxy_ref && mix deps.get && mix compile` (or `mix setup`).
+- Run tests: `mix test` (validates gRPC handlers and health check).
+- Start locally: `iex -S mix` (listens on port configured in `config/config.exs`).
+- Health check: use `grpcurl -plaintext localhost:<port> grpc.health.v1.Health/Check`.
+
+## Extending
+- Add new canned scenarios by updating pattern matches in `RepoProxyRef.Grpc.Server` and adjusting tests under `test/`.
+- Keep protobuf dependency (`proto` app) up to date to avoid type mismatches.
+- Ensure new mock data mirrors real repo-proxy contracts (branch names, commit ranges, repository IDs).
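The scenario matching described in the Architecture section of `repo_proxy_ref/DOCUMENTATION.md` amounts to dispatching on the incoming `hook_id`. The sketch below illustrates that idea in Python under several assumptions: the response field names (`code`, `hook`, `branch_name`, `commit_sha`) are placeholders rather than the real protobuf fields, the SHA-1 derivation is just one way to keep commit SHAs deterministic, and what the real stub returns after its delay is not stated in the document (here it returns an OK payload).

```python
import hashlib
import time


def mock_commit_sha(hook_id: str) -> str:
    """Derive a stable 40-character SHA so tests can assert on it (assumption)."""
    return hashlib.sha1(hook_id.encode("utf-8")).hexdigest()


def mock_hook(hook_id: str) -> dict:
    """Build a canned hook payload; field names are hypothetical."""
    return {
        "hook_id": hook_id,
        "branch_name": "master",
        "commit_sha": mock_commit_sha(hook_id),
    }


def describe(hook_id: str, timeout_delay_s: float = 0.0) -> dict:
    """Match the hook_id against predefined behaviours."""
    if hook_id == "bad_param":
        return {"code": "BAD_PARAM", "hook": None}
    if hook_id == "timeout":
        # The Elixir stub uses :timer.sleep/1 to simulate a slow dependency.
        time.sleep(timeout_delay_s)
    return {"code": "OK", "hook": mock_hook(hook_id)}
```

Because the payload depends only on `hook_id`, repeated calls with the same ID return identical data, which is the property the referent relies on for reproducible tests.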
diff --git a/plumber/doc/task_api_referent/DOCUMENTATION.md b/plumber/doc/task_api_referent/DOCUMENTATION.md
new file mode 100644
index 000000000..a72988268
--- /dev/null
+++ b/plumber/doc/task_api_referent/DOCUMENTATION.md
@@ -0,0 +1,33 @@
+# Task API Referent
+
+## Overview
+`task_api_referent` is a mock implementation of Semaphore's Task API. It is used in integration tests to simulate Zebra behaviour when plumber schedules, monitors, or terminates tasks. The referent exposes the same gRPC surface that plumber expects, but persists data in memory for deterministic behaviour.
+
+## Responsibilities
+- Provide `schedule`, `describe_many`, and `terminate` operations compatible with `InternalApi.Task` messages.
+- Maintain an in-memory representation of tasks/jobs to support idempotency and termination flows during tests.
+- Log scheduling metadata for debugging (`LogTee` integration) and surface validation errors with descriptive messages.
+
+## Architecture
+- `TaskApiReferent.Application` starts the necessary supervision tree (in-memory stores, gRPC server).
+- Core logic lives in `TaskApiReferent.Actions`:
+  - Validates payloads via `TaskApiReferent.Validation`.
+  - Delegates persistence to `TaskApiReferent.Service`.
+  - Schedules asynchronous execution using `TaskApiReferent.Runner`.
+- `TaskApiReferent.Actions` raises `GRPC.RPCError` for not-found or invalid-parameter scenarios, matching the behaviour of the real service.
+
+## Usage Patterns
+- Plumber tests depend on the referent to simulate scheduling success, duplicates (via `request_token`), and termination callbacks.
+- Docker compose setups start this service so local plumber runs can exercise Task API without Zebra.
+- Validation helpers format nested job data for logging, aiding inspection when a schedule fails.
+
+## Operations
+- Setup: `cd task_api_referent && mix deps.get` (or `mix setup`).
+- Run tests: `mix test` (covers actions, validation, runner).
+- Start locally: `iex -S mix` (port configured in `config/config.exs`).
+- Smoke test: `grpcurl -plaintext -d '{"task_ids":["..."]}' localhost:<port> InternalApi.Task.TaskService/DescribeMany` (flags must precede the address).
+
+## Extending
+- When adding new Task API features, update `TaskApiReferent.Actions` and mirror behaviour in the referent's service layer.
+- Ensure validations stay in sync with real Task API contracts; adjust `Validation` module and tests concurrently.
+- Use descriptive LogTee messages to ease debugging of integration scenarios.
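The in-memory behaviour described for `task_api_referent` (idempotent scheduling via `request_token`, `describe_many`, `terminate`, and not-found errors) can be illustrated with a small store. This is a hedged Python sketch, not the Elixir implementation: `InMemoryTaskStore`, `NotFound`, and the `RUNNING` / `STOPPED` state names are hypothetical, and the real referent raises `GRPC.RPCError` rather than a local exception.

```python
import uuid


class NotFound(Exception):
    """Stand-in for the not-found RPC error raised by the real referent."""


class InMemoryTaskStore:
    def __init__(self):
        self._tasks = {}      # task_id -> task record
        self._by_token = {}   # request_token -> task_id

    def schedule(self, request_token: str, jobs: list) -> str:
        """Create a task once per request_token; repeats return the same id."""
        if request_token in self._by_token:
            return self._by_token[request_token]
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {"id": task_id, "state": "RUNNING", "jobs": list(jobs)}
        self._by_token[request_token] = task_id
        return task_id

    def describe_many(self, task_ids: list) -> list:
        """Return copies of the requested tasks, failing if any id is unknown."""
        missing = [t for t in task_ids if t not in self._tasks]
        if missing:
            raise NotFound(f"unknown task ids: {missing}")
        return [dict(self._tasks[t]) for t in task_ids]

    def terminate(self, task_id: str) -> None:
        """Mark a task as stopped; unknown ids raise NotFound."""
        if task_id not in self._tasks:
            raise NotFound(f"unknown task id: {task_id}")
        self._tasks[task_id]["state"] = "STOPPED"
```

Keying the lookup on `request_token` is what lets a retried schedule call return the original task instead of creating a duplicate.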

From c076addcaf65c0a3c00a545402b22b3a4cd8a18b Mon Sep 17 00:00:00 2001
From: Dejan K <sidejan@gmail.com>
Date: Thu, 16 Oct 2025 15:15:05 +0200
Subject: [PATCH 2/4] docs(plumber): add comprehensive agent notes and
 documentation for plumber stack components

---
 plumber/AGENTS.md        | 46 ++++++++++++++++++++++++++++++++
 plumber/DOCUMENTATION.md | 57 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+)
 create mode 100644 plumber/AGENTS.md
 create mode 100644 plumber/DOCUMENTATION.md

diff --git a/plumber/AGENTS.md b/plumber/AGENTS.md
new file mode 100644
index 000000000..6411ec04a
--- /dev/null
+++ b/plumber/AGENTS.md
@@ -0,0 +1,46 @@
+# Plumber Stack Agent Notes
+
+Use this file as the high-level triage map for the plumber stack. Each section links to the detailed agent notes that live under `doc/`.
+
+## Quick Map
+- `ppl/` – gRPC edge, pipeline/workflow state machines, and RabbitMQ publishers ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md)).
+- `block/` – block lifecycle orchestrator wired to Zebra task events ([doc/block/AGENTS.md](doc/block/AGENTS.md)).
+- `definition_validator/` – YAML parsing + schema/semantic validation before scheduling ([doc/definition_validator/AGENTS.md](doc/definition_validator/AGENTS.md)).
+- `job_matrix/` – pure library that expands matrix/parallelism definitions into concrete jobs ([doc/job_matrix/AGENTS.md](doc/job_matrix/AGENTS.md)).
+- `gofer_client/` – promotions gRPC client used during deploy flows ([doc/gofer_client/AGENTS.md](doc/gofer_client/AGENTS.md)).
+- `looper/` – shared STM/periodic worker macros powering `ppl` and `block` schedulers ([doc/looper/AGENTS.md](doc/looper/AGENTS.md)).
+- Support stubs: `repo_proxy_ref/` (mock repo-proxy) and `task_api_referent/` (mock Task API) keep local/dev runs hermetic ([doc/repo_proxy_ref/AGENTS.md](doc/repo_proxy_ref/AGENTS.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+
+## End-to-End Flow Scratchpad
+1. **Schedule request arrives** → `ppl` gRPC handlers validate YAML via `definition_validator`, expand jobs with `job_matrix`, persist pipeline + block rows, and kick STM workers ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md)).
+2. **Block execution** → `block` STM loopers provision Zebra tasks and watch RabbitMQ for task completion ([doc/block/AGENTS.md](doc/block/AGENTS.md)).
+3. **Task lifecycle** → In tests/local, `task_api_referent` simulates Zebra responses so blocks/pipelines advance predictably ([doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+4. **Promotions** → When promotions are enabled, `gofer_client` notifies Gofer and manages switches; `SKIP_PROMOTIONS` short-circuits locally ([doc/gofer_client/AGENTS.md](doc/gofer_client/AGENTS.md)).
+5. **Events** → `ppl` publishers push pipeline/block updates to AMQP exchanges for UI consumers.
+
+## Common Triage Paths
+- **Pipeline stuck in `SCHEDULING` / `RUNNING`** → Check `ppl` STM handlers and ensure dependent services (`definition_validator`, `block`, RabbitMQ) respond ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md)).
+- **Block stuck in `RUNNING` / `STOPPING`** → Inspect `block` STM handlers and incoming Zebra events; use RabbitMQ tooling if events seem missing ([doc/block/AGENTS.md](doc/block/AGENTS.md)).
+- **Matrix or YAML errors** → Re-run `DefinitionValidator.validate_yaml_string/1` locally to reproduce schema/semantic issues ([doc/definition_validator/AGENTS.md](doc/definition_validator/AGENTS.md)).
+- **Promotion failures** → Confirm `SKIP_PROMOTIONS` is set appropriately and inspect `GoferClient` gRPC error tuples ([doc/gofer_client/AGENTS.md](doc/gofer_client/AGENTS.md)).
+- **Mock data mismatches** → Update referents (`repo_proxy_ref`, `task_api_referent`) when integration tests need new scenarios ([doc/repo_proxy_ref/AGENTS.md](doc/repo_proxy_ref/AGENTS.md)).
+
+## Command Cheat Sheet
+- Bootstrap every app: `mix setup` inside `ppl/`, `block/`, `definition_validator/`, `job_matrix/`, `gofer_client/`, and `looper/`.
+- Run targeted tests where the failure originates (e.g. `cd ppl && MIX_ENV=test mix test`) before escalating ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)).
+- Use `mix credo` routinely on Elixir apps; Looper/library apps are pure so linting catches most regressions ([doc/looper/AGENTS.md](doc/looper/AGENTS.md)).
+- Mock services: start `repo_proxy_ref` and `task_api_referent` locally when exercising end-to-end flows ([doc/repo_proxy_ref/AGENTS.md](doc/repo_proxy_ref/AGENTS.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+
+## Observability + Tooling
+- Watchman metrics prefixed with `Ppl.*`, `Block.*`, or `Looper.*` highlight slow handlers (see service-specific notes).
+- LogTee tags (`ppl_id`, `block_id`, `task_id`, `request_token`) support cross-service tracing ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)).
+- RabbitMQ exchanges: `pipeline_state_exchange`, `pipeline_block_state_exchange`, `after_pipeline_state_exchange`, `task_state_exchange`—confirm bindings when events disappear ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)).
+
+## Reference Index
+- Pipelines edge + workflows: [doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md)
+- Block service internals: [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md)
+- YAML validation: [doc/definition_validator/DOCUMENTATION.md](doc/definition_validator/DOCUMENTATION.md)
+- Matrix expansion: [doc/job_matrix/DOCUMENTATION.md](doc/job_matrix/DOCUMENTATION.md)
+- Promotions client: [doc/gofer_client/DOCUMENTATION.md](doc/gofer_client/DOCUMENTATION.md)
+- Worker macros: [doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md)
+- Repo & Task referents: [doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)
diff --git a/plumber/DOCUMENTATION.md b/plumber/DOCUMENTATION.md
new file mode 100644
index 000000000..48c43addc
--- /dev/null
+++ b/plumber/DOCUMENTATION.md
@@ -0,0 +1,57 @@
+# Plumber Stack Documentation Hub
+
+This document stitches together the service-level docs under `doc/` so you have a single place to understand how the plumber stack fits together. Follow the links for deep dives.
+
+## System Overview
+- **Pipelines (`ppl/`)** – primary gRPC surface and orchestrator for pipeline/workflow state machines. It persists pipeline data, publishes AMQP events, and coordinates subordinate apps ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md)).
+- **Block (`block/`)** – manages block lifecycle and Zebra task orchestration, reacting to RabbitMQ events to advance block/task state ([doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md)).
+- **Definition Validator** – validates pipeline YAML (schema + semantic rules) before anything is persisted ([doc/definition_validator/DOCUMENTATION.md](doc/definition_validator/DOCUMENTATION.md)).
+- **Job Matrix** – expands `matrix` / `parallelism` definitions into concrete job variants for downstream schedulers ([doc/job_matrix/DOCUMENTATION.md](doc/job_matrix/DOCUMENTATION.md)).
+- **Gofer Client** – gRPC client for promotion workflows; wraps request formatting, transport, and response parsing ([doc/gofer_client/DOCUMENTATION.md](doc/gofer_client/DOCUMENTATION.md)).
+- **Looper** – shared macros/utilities that generate STM and periodic workers used by `ppl` and `block` ([doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md)).
+- **Referents** – `repo_proxy_ref` (repo metadata) and `task_api_referent` (Zebra stand-in) supply deterministic fixtures for tests/local runs ([doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+
+## Core Pipeline Lifecycle
+1. **Ingress** – gRPC handlers in `ppl` accept schedule/terminate/list calls, convert protobufs into domain commands, and run YAML through `definition_validator` ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/definition_validator/DOCUMENTATION.md](doc/definition_validator/DOCUMENTATION.md)).
+2. **Job Expansion** – `job_matrix` (and `parallelism` helpers) expand job definitions before pipeline/block rows are inserted ([doc/job_matrix/DOCUMENTATION.md](doc/job_matrix/DOCUMENTATION.md)).
+3. **State Persistence** – `ppl` writes pipeline/build metadata via `Ppl.EctoRepo` and triggers Looper STM workers (`Ppl.Sup.STM`) ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md)).
+4. **Block Execution** – STM workers in `block` create and monitor Zebra tasks, consuming RabbitMQ events (`task_state_exchange`) to move blocks forward ([doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md)).
+5. **Completion & Notifications** – `ppl` publishers emit pipeline/block/after-pipeline events over RabbitMQ for UI subscribers, and `gofer_client` notifies Gofer when promotions are involved ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/gofer_client/DOCUMENTATION.md](doc/gofer_client/DOCUMENTATION.md)).
+6. **Testing & Referents** – During local/integration runs, referent services respond to repo/task RPCs so flows complete without external dependencies ([doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+
+## Data Stores & Messaging
+- **PostgreSQL** – `Ppl.EctoRepo` and `Block.EctoRepo` house pipeline/block/task state; migrations live alongside each app ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md)).
+- **RabbitMQ** – primary event bus (`pipeline_state_exchange`, `pipeline_block_state_exchange`, `after_pipeline_state_exchange`, `task_state_exchange`) for cross-service coordination ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md)).
+- **Watchman / LogTee** – metrics and structured logging used by STM workers and gRPC surfaces for observability ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md), [doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md)).
+
+## External Integrations
+- **Zebra Task API** – accessed via internal clients; mimic behaviour with `task_api_referent` in non-prod environments ([doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+- **Repo Proxy** – pipeline scheduling pulls repo metadata from repo-proxy (or the referent stub) before reading YAML ([doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md)).
+- **Gofer** – promotions go through Gofer via `gofer_client`; guard with `SKIP_PROMOTIONS` for dev/test ([doc/gofer_client/DOCUMENTATION.md](doc/gofer_client/DOCUMENTATION.md)).
+
+## Local Development & Operations
+- Run `mix setup` inside `ppl/`, `block/`, `definition_validator/`, `job_matrix/`, `gofer_client/`, and `looper/` to install deps and prepare databases ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)).
+- Launch the stack by starting `repo_proxy_ref` and `task_api_referent` (if the external services are unavailable), then start `ppl` via `iex -S mix` ([doc/repo_proxy_ref/AGENTS.md](doc/repo_proxy_ref/AGENTS.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md), [doc/ppl/AGENTS.md](doc/ppl/AGENTS.md)).
+- Migrations often affect both repos; use `mix ecto.migrate -r Ppl.EctoRepo -r Block.EctoRepo` to keep schemas in sync ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md)).
+- Looper-based workers leverage `cooling_time_sec` and Wormhole retries; adjust configs or inspect metrics when loops stall ([doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md)).
+
+## Testing & QA
+- Each app has its own `mix test` suite; run the failing service’s tests first (e.g. `cd block && MIX_ENV=test mix test`) ([doc/block/AGENTS.md](doc/block/AGENTS.md)).
+- `definition_validator` includes fixture-based tests (`mix test.watch` is handy while editing schemas) ([doc/definition_validator/DOCUMENTATION.md](doc/definition_validator/DOCUMENTATION.md)).
+- Library apps (`job_matrix`, `gofer_client`, `looper`) are pure and quick to test—use them to pin down regressions before integrating ([doc/job_matrix/DOCUMENTATION.md](doc/job_matrix/DOCUMENTATION.md), [doc/gofer_client/DOCUMENTATION.md](doc/gofer_client/DOCUMENTATION.md), [doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md)).
+- Referents have their own suites to lock in canned scenarios; update tests when extending mock behaviours ([doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)).
+
+## Observability Checklist
+- Metrics prefixes: `Ppl.*`, `Block.*`, `Looper.*` (Watchman).
+- Log correlation keys: `ppl_id`, `wf_id`, `block_id`, `task_id`, `request_token` (LogTee).
+- Messages in RabbitMQ dead-letter queues (DLQs) hint at decode/state issues; inspect them when STM workers stall.
+- gRPC health endpoints exposed via each service’s `HealthCheck` module support Kubernetes probes ([doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md), [doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md)).
+
+## Reference Links
+- Pipelines edge & workflows: [doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md), [doc/ppl/AGENTS.md](doc/ppl/AGENTS.md)
+- Block lifecycle: [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)
+- YAML validation: [doc/definition_validator/DOCUMENTATION.md](doc/definition_validator/DOCUMENTATION.md), [doc/definition_validator/AGENTS.md](doc/definition_validator/AGENTS.md)
+- Matrix expansion: [doc/job_matrix/DOCUMENTATION.md](doc/job_matrix/DOCUMENTATION.md), [doc/job_matrix/AGENTS.md](doc/job_matrix/AGENTS.md)
+- Promotions: [doc/gofer_client/DOCUMENTATION.md](doc/gofer_client/DOCUMENTATION.md), [doc/gofer_client/AGENTS.md](doc/gofer_client/AGENTS.md)
+- Worker macros: [doc/looper/DOCUMENTATION.md](doc/looper/DOCUMENTATION.md), [doc/looper/AGENTS.md](doc/looper/AGENTS.md)
+- Referents: [doc/repo_proxy_ref/DOCUMENTATION.md](doc/repo_proxy_ref/DOCUMENTATION.md), [doc/task_api_referent/DOCUMENTATION.md](doc/task_api_referent/DOCUMENTATION.md)
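The Job Expansion step in the Core Pipeline Lifecycle says that `matrix` / `parallelism` definitions are expanded into concrete job variants. As a rough illustration only, and not the `job_matrix` library's actual API, the Python sketch below assumes a matrix is a list of `{"env_var", "values"}` axes and expands it as a Cartesian product; the function names, the job-name format, and the `JOB_INDEX` / `JOB_COUNT` variable names are all hypothetical.

```python
from itertools import product


def expand_matrix(job_name: str, matrix: list) -> list:
    """Expand axes into one job per combination of values (Cartesian product)."""
    names = [axis["env_var"] for axis in matrix]
    jobs = []
    for combo in product(*(axis["values"] for axis in matrix)):
        env = dict(zip(names, combo))
        suffix = ", ".join(f"{key}={value}" for key, value in env.items())
        jobs.append({"name": f"{job_name} - {suffix}", "env": env})
    return jobs


def expand_parallelism(job_name: str, count: int) -> list:
    """Expand a parallelism count into `count` indexed copies of the job."""
    return [
        {
            "name": f"{job_name} - {index}/{count}",
            # Hypothetical variable names used only for this sketch.
            "env": {"JOB_INDEX": str(index), "JOB_COUNT": str(count)},
        }
        for index in range(1, count + 1)
    ]
```

With two axes of two values each, `expand_matrix` yields four jobs; the number of variants is the product of the axis sizes, which is why large matrices grow quickly.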

From c299a6ca6cf64fcb85e121ff65989f8601fb9044 Mon Sep 17 00:00:00 2001
From: Dejan K <sidejan@gmail.com>
Date: Thu, 16 Oct 2025 15:18:22 +0200
Subject: [PATCH 3/4] docs(plumber): add guard rails for destructive operations
 in agent notes

---
 plumber/AGENTS.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/plumber/AGENTS.md b/plumber/AGENTS.md
index 6411ec04a..069a3472c 100644
--- a/plumber/AGENTS.md
+++ b/plumber/AGENTS.md
@@ -36,6 +36,14 @@ Use this file as the high-level triage map for the plumber stack. Each section l
 - LogTee tags (`ppl_id`, `block_id`, `task_id`, `request_token`) support cross-service tracing ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)).
 - RabbitMQ exchanges: `pipeline_state_exchange`, `pipeline_block_state_exchange`, `after_pipeline_state_exchange`, `task_state_exchange`—confirm bindings when events disappear ([doc/ppl/AGENTS.md](doc/ppl/AGENTS.md), [doc/block/AGENTS.md](doc/block/AGENTS.md)).
 
+## Guard Rails (Destructive Ops)
+- Never run destructive git commands (`git reset --hard`, `git checkout --`, `git restore` on others' work, etc.) without explicit written approval in the task thread.
+- Do not delete or revert files you did not author; coordinate with involved agents first. Moving/renaming is OK after agreement.
+- Treat `.env` and environment files as read-only—only the user may edit them.
+- Before deleting a file to silence lint/type failures, stop and confirm with the user; adjacent work may be in progress.
+- Keep commits scoped to files you changed; list paths explicitly during `git commit`.
+- When rebasing, avoid editor prompts (`GIT_EDITOR=:` / `--no-edit`) and never amend commits unless the user requests it.
+
 ## Reference Index
 - Pipelines edge + workflows: [doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md)
 - Block service internals: [doc/block/DOCUMENTATION.md](doc/block/DOCUMENTATION.md)

From a426d57dd038dd89b41c9966f8854ec1c8b7165f Mon Sep 17 00:00:00 2001
From: Dejan K <sidejan@gmail.com>
Date: Thu, 16 Oct 2025 17:00:38 +0200
Subject: [PATCH 4/4] docs(plumber): update AGENTS.md with guidance on
 documenting findings after task completion

---
 plumber/AGENTS.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/plumber/AGENTS.md b/plumber/AGENTS.md
index 069a3472c..b580004fe 100644
--- a/plumber/AGENTS.md
+++ b/plumber/AGENTS.md
@@ -43,6 +43,7 @@ Use this file as the high-level triage map for the plumber stack. Each section l
 - Before deleting a file to silence lint/type failures, stop and confirm with the user; adjacent work may be in progress.
 - Keep commits scoped to files you changed; list paths explicitly during `git commit`.
 - When rebasing, avoid editor prompts (`GIT_EDITOR=:` / `--no-edit`) and never amend commits unless the user requests it.
+- After finishing a task, fold any new findings into the relevant `AGENTS.md` or `DOCUMENTATION.md` files—fix mistakes, add context, and preserve useful knowledge while keeping existing valuable guidance intact.
 
 ## Reference Index
 - Pipelines edge + workflows: [doc/ppl/DOCUMENTATION.md](doc/ppl/DOCUMENTATION.md)