Releases: AytuncYildizli/reprompter

v12.1.0 — Codex CLI runtime contract + factual corrections

18 Apr 15:49
21d63c4

Headline

Option D (Codex CLI runtime) is now a first-class Phase 3 path with a full runtime contract — not the five-bullet prose stub it had been. Native subagents via the [agents] config block (D1) and shell-level parallelism via codex exec backgrounding (D2) are both documented with runnable commands, verified against Codex 0.121.0 source.

A SKILL.md frontmatter description that silently exceeded Codex's 1024-character load limit (skipping the entire skill in Codex CLI) was trimmed to 960 characters without losing any skill-selection trigger.

Docs-only release. No runtime code changes.

Codex CLI as a documented runtime (PR #43)

New file: references/runtime/codex-runtime.md — runtime contract paralleling the implicit ones used by Options A/B. Covers D1 vs D2 picker logic, prerequisites, invocation, artifact contract, concurrency caps, status-line patterns, retries, and known gotchas (#11435, #14866, #15177 — all cross-linked).

SKILL.md Option D rewritten:

  • D1 Native subagents: [features] multi_agent = true + [agents] max_threads = 6 + ~/.codex/agents/<name>.toml role definitions + prompt-driven spawn. Includes a working orchestrator example that fans out one rpt_audit_explorer per audit dimension and synthesizes the result.
  • D2 Shell-level codex exec — runnable bash block with --ephemeral, --sandbox workspace-write, --output-last-message, artifact verification, FS-polling status line, hang recovery, and a FIFO semaphore with failure propagation.
  • Picker table (D1 vs D2 vs "neither, use Option B for cross-agent messaging").

SKILL.md Settings now has a Codex CLI subsection documenting ~/.codex/config.toml with [features], [agents], and skill-defined [reprompter] keys. Claude Code is clarified as optional when Codex is the target runtime.

Compatibility claim rewritten from hedged "parallel sessions if available" to naming the actual mechanism. README compatibility table aligned — removed the asterisk on Codex parallel and added a clarifier pointing to Option D.

Factual corrections (PR #44, 8 commits over 7 bot-review rounds)

Every correction verified against openai/codex rust-v0.121.0 source, codex exec --help, and the current status of each cited GitHub issue as of 2026-04-18.

  • --full-auto semantics in codex exec. Source: codex-rs/exec/src/cli.rs:50–52 defines it as "Convenience alias for low-friction sandboxed automatic execution (--sandbox workspace-write)". lib.rs:263 only selects the sandbox when full_auto is true; lib.rs:374–376 sets approval policy unconditionally to AskForApproval::Never for headless mode. Docs now recommend --sandbox workspace-write for readability and explain that both options work.
  • --sandbox read-only artifact-write bug. D2 workers write their own /tmp/rpt-*.md artifacts; read-only breaks this contract. Reverted; read-only is now documented only for pure-analysis workers using --output-last-message as the artifact path.
  • report_agent_job_result scope. Registered only for spawn_agents_on_csv batch workers, not ordinary prompt-spawned subagents. Removed from the D1 custom-agent template.
  • [agents] max_threads semantics. reserve_spawn_slot returns AgentLimitReached when the cap is reached — normal spawn_agent calls past the cap fail, they do not queue. Replaced the "queues 2 and runs 6" line with the correct failure mode and pointed readers at spawn_agents_on_csv for true fan-out.
  • Issue #11435 framing. Closed as not-reproducible after exec was reimplemented on the app server. --ephemeral reframed from "required to avoid corruption" to a historical motivation for the flag.
  • Issue #15177 fix claim. Still open with no linked fix. Removed the "Fixed in 0.122.0-alpha" claim; documented the actual current-state workaround.
  • codex exec approval default. Hardcoded Never in headless mode. The approval_policy key in config.toml applies to the interactive TUI only.
  • features.multi_agent default. Default-enabled in 0.121.0+. Docs no longer imply users must set this explicitly.
  • Native subagents ship date. "Shipped 2026-03-16" → "multi_agent feature flag stabilized in 0.115.0 on 2026-03-16" (matches rust-v0.115.0 release: #14622 Stabilize multi-agent feature flag).
  • Bash portability. POSIX-compatible artifact-count loop using [ -e "$f" ] and case (not Bash-only [[ ]]), zero-match safe under set -euo pipefail. Runs under dash too.
  • FIFO semaphore failure propagation. Explicit PID collection, per-PID wait, status aggregation, fd close after the loop, and exit "$status" so downstream synthesis does not run on missing artifacts. trap 'echo >&9' EXIT guarantees the semaphore token is returned even on non-zero worker exit.
  • Picker-table drift. Added the missing Cross-agent messaging required mid-run → use Option B row to the lower SKILL.md picker table.
  • macOS CPU-count. Added sysctl -n hw.ncpu alongside nproc.
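
The failure-propagation behaviour described for the FIFO semaphore (bounded concurrency, token always returned, aggregate exit status) can be illustrated outside the shell. The following is a Node.js analog only, not the bash block that ships in SKILL.md; runBounded and its result shape are hypothetical names introduced for this sketch.

```javascript
// Node.js analog of a bounded fan-out with failure propagation.
// NOT the shipped bash semaphore; every name here is hypothetical.
async function runBounded(tasks, limit) {
  let tokens = limit; // free slots, like tokens sitting in the FIFO
  const waiters = []; // resolvers for workers waiting on a slot

  const acquire = () => {
    if (tokens > 0) {
      tokens -= 1;
      return Promise.resolve();
    }
    return new Promise((resolve) => waiters.push(resolve));
  };
  const release = () => {
    const next = waiters.shift();
    if (next) next(); // hand the token straight to a waiting worker
    else tokens += 1;
  };

  const results = await Promise.all(
    tasks.map(async (task, index) => {
      await acquire();
      try {
        await task();
        return { index, ok: true };
      } catch (error) {
        return { index, ok: false, error: String(error) };
      } finally {
        release(); // token is returned even when the worker fails
      }
    })
  );

  // Aggregate status: any failed worker fails the whole run.
  const status = results.some((r) => !r.ok) ? 1 : 0;
  return { status, results };
}

// Usage: four fake workers, at most two at a time, one of them failing.
let running = 0;
let peak = 0;
const worker = (fail) => async () => {
  running += 1;
  peak = Math.max(peak, running);
  await new Promise((r) => setTimeout(r, 10));
  running -= 1;
  if (fail) throw new Error("worker failed");
};

const demo = runBounded(
  [worker(false), worker(true), worker(false), worker(false)],
  2
).then(({ status, results }) => {
  console.log({ status, peak, failed: results.filter((r) => !r.ok).length });
  if (status !== 0) console.log("skipping synthesis: at least one worker failed");
  return { status, peak };
});
```

The finally block plays the role the notes assign to the EXIT trap, and the non-zero status is what a caller would check before running synthesis.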

SKILL.md description under Codex load limit (PR #45)

Codex 0.121.0 enforces a 1024-character limit on the SKILL.md description field via validate_len(&description, MAX_DESCRIPTION_LEN, "description"). The description was 1217 characters, so Codex silently skipped the skill with:

Skipped loading 1 skill(s) due to invalid SKILL.md files.
~/.codex/skills/reprompter/SKILL.md: invalid description: exceeds maximum length of 1024 characters

Claude Code did not enforce the limit, so the bug was Codex-only and easy to miss.

Trimmed the description to 960 characters (64-character safety margin). Every Single / Repromptverse / Reverse-mode trigger keyword preserved; only verbose phrasing and redundant aliases removed.
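
A pre-release check for this class of bug could look like the sketch below. It assumes the description is a single-line description: key inside the YAML frontmatter and that length is counted in Unicode code points; neither assumption is confirmed by these notes, and checkDescription is a hypothetical helper, not part of the repo.

```javascript
// Sketch: check a SKILL.md frontmatter description against a length cap.
// Assumptions: single-line `description:` key; length counted in code points.
const MAX_DESCRIPTION_LEN = 1024; // limit quoted in the release notes

function checkDescription(skillMd, max = MAX_DESCRIPTION_LEN) {
  // Frontmatter is the block between the first two `---` lines.
  const fm = /^---\r?\n([\s\S]*?)\r?\n---/.exec(skillMd);
  if (!fm) return { ok: false, reason: "no frontmatter" };

  const line = fm[1].split(/\r?\n/).find((l) => l.startsWith("description:"));
  if (!line) return { ok: false, reason: "no description key" };

  // Strip the key and any matching surrounding quotes.
  const value = line
    .slice("description:".length)
    .trim()
    .replace(/^(["'])(.*)\1$/, "$2");
  const length = [...value].length; // code points, not UTF-16 units
  return { ok: length <= max, length, margin: max - length };
}

// Usage with synthetic descriptions of the two lengths mentioned above.
const sample = (n) =>
  `---\nname: reprompter\ndescription: ${"x".repeat(n)}\n---\n# body\n`;
console.log(checkDescription(sample(1217))); // over the cap
console.log(checkDescription(sample(960))); // under the cap
```

Wiring a check like this into CI would surface the problem on runtimes that do not enforce the limit themselves.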

Review notes

PR #44 went through 7 rounds of automated Codex bot review plus a source-level cross-check at the rust-v0.121.0 tag. Each round replaced a broader, sloppier claim with a narrower, more accurate one; the final wording is grounded in cited source lines rather than memory-from-spec.

Lesson captured in the commit messages: source-verify contested claims before prose lands in a docs-only PR.

What's next (deliberately out of scope)

  • TESTING.md scenarios for D1 (native subagent fan-out) and D2 (shell-level codex exec fan-out).
  • Codex-specific install one-liner in README (alongside the existing Claude Code curl | tar recipe).

Both fit better as small follow-up PRs so the review surface stays focused.

Contributors

Thanks to @dorukardahan for the full Codex CLI runtime write-up, factual corrections, and release prep across PRs #43, #44, #45, #46.


Full diff: v12.0.0...v12.1.0

v12.0.0 — Closed-loop Flywheel

17 Apr 09:15
3053354

Headline

Reprompter is no longer an open-loop prompt rewriter. Every generated prompt emits testable success criteria, every run can be recorded and scored, every outcome feeds a local flywheel that biases future generations toward historical winners, and npm run flywheel:ab proves whether the bias is actually helping. All data local. No telemetry.

This release also recovers Repromptverse under Opus 4.7 (which enforces tool schemas strictly where 4.6 was lenient), ships a tool-drift linter as long-term regression insurance, and hardens the Repromptverse runtime selection path.

The closed loop, end-to-end

User rough prompt
  → Mode 1 / Mode 2 / Mode 3 interview
  → [opt-in] flywheel:query consults past outcomes
  → [if confidence ≥ medium] template + patterns biased
  → generated prompt with <success_criteria schema_version="1"> block
  → user runs it downstream
  → outcome-record.js stamps the run (with applied_recommendation if biased)
  → evaluate-outcome.js scores against criteria
  → flywheel:ingest bridges into NDJSON store
  → strategy-learner aggregates by recipe
  → flywheel:ab compares bias-on vs bias-off effectiveness

Major additions

Closed-loop flywheel (v2 + v3 rollout)

  • v1 outcome-record schema at .reprompter/outcomes/*.json (structured success_criteria, verification_results, score, optional role + applied_recommendation)
  • scripts/outcome-record.js — write records (with collision-safe filenames, role attribution, optional applied_recommendation stamping)
  • scripts/evaluate-outcome.js — score records against criteria (rule/regex, rule/predicate, llm_judge via user --judge-cmd, manual)
  • scripts/outcome-collector.js ingest bridge — idempotent, deterministic sort, role-domain routing, applied_recommendation preservation
  • scripts/strategy-learner.js::getRecommendation — read-only query API
  • scripts/strategy-learner.js::buildAbReport — bias-on vs bias-off effectiveness delta with low-sample warnings
  • Bias injection: REPROMPTER_FLYWHEEL_BIAS=0|1 env flag (default off). Mode 1 step 5 + Mode 2 Phase 2 consult the flywheel when set.
  • <success_criteria> emission across all three modes
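
To make the rule/regex verifier type concrete, here is a minimal sketch of scoring an output against regex criteria. The criterion shape ({ id, type, pattern, flags }), the status names, and the pass-fraction score are assumptions for illustration; the shipped evaluate-outcome.js and the v1 schema may differ.

```javascript
// Sketch: score an output against regex-type success criteria.
// The criterion shape and status names are hypothetical.
function scoreAgainstCriteria(output, criteria) {
  const results = criteria.map((c) => {
    if (c.type !== "rule/regex") {
      return { id: c.id, status: "skipped" }; // other verifier types not modeled
    }
    let re;
    try {
      re = new RegExp(c.pattern, c.flags || "");
    } catch (err) {
      // Validate the pattern instead of crashing on a malformed criterion.
      return { id: c.id, status: "invalid", detail: err.message };
    }
    return { id: c.id, status: re.test(output) ? "pass" : "fail" };
  });

  const scored = results.filter((r) => r.status === "pass" || r.status === "fail");
  const passed = scored.filter((r) => r.status === "pass").length;
  // Score is the pass fraction over the criteria that could be evaluated.
  const score = scored.length ? passed / scored.length : null;
  return { score, results };
}

// Usage
const report = scoreAgainstCriteria("All 12 tests pass. Coverage: 91%", [
  { id: "tests-green", type: "rule/regex", pattern: "tests pass" },
  { id: "coverage", type: "rule/regex", pattern: "coverage: (9\\d|100)%", flags: "i" },
  { id: "lint-clean", type: "rule/regex", pattern: "lint clean" },
  { id: "malformed", type: "rule/regex", pattern: "(" },
  { id: "human-check", type: "manual" },
]);
console.log(report.score, report.results.map((r) => r.status));
```

Treating invalid patterns as their own status, rather than as failures, keeps a typo in a criterion from silently dragging a score down.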

Infrastructure + opus-4.7 compatibility

  • Tool-drift linter (scripts/validate-tool-refs.js) — catches every obsolete tool shape we've shipped a fix for. Multi-line regex support.
  • Auto-pick runtime (Repromptverse Phase 3) — detects capability and picks Options A–E automatically
  • Tool-schema guard — canonical signatures + pitfall list captured from the 4.6→4.7 drift

Opus 4.7 recovery

  • Task(subagent_type=...) → Agent(...)
  • SendMessage(type=, recipient=) → SendMessage(to=, message=)
  • Broadcast shutdown → per-agent
  • Plus a codex review round addressing filename collision, shell quoting, regex validation, idempotent ingest, deterministic sort, agent-identity routing, partial-promptShape wildcards, filter-before-limit, Mode-3 checklist

New npm scripts

  • npm run validate:tool-refs
  • npm run flywheel:query
  • npm run flywheel:ingest
  • npm run flywheel:ab

New env flag

  • REPROMPTER_FLYWHEEL_BIAS=0|1 (default off — read-path consultation; complements existing REPROMPTER_FLYWHEEL=0|1 which controls outcome writing)

Tests

205 tests pass (was 169). outcome-collector: 30 → 43. strategy-learner: 24 → 36. Plus new --self-test modes on outcome-record.js and evaluate-outcome.js.

What's deliberately still ahead

  • Default-on flip of REPROMPTER_FLYWHEEL_BIAS. Waiting for flywheel:ab to show a consistent positive delta_mean_effectiveness across multiple task types with ≥5 samples per group.
  • Per-role bias queries for Repromptverse teams once role-stamped records accumulate.
  • Visualizations / dashboards on top of flywheel:ab output.
  • Community / telemetry pooling — the loop stays local-first.
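
The default-on criterion above can be read as a simple comparison of two groups. The sketch below assumes each record has a numeric score and that a record counts as bias-on when it carries applied_recommendation; abReport is a hypothetical stand-in, not the actual buildAbReport implementation.

```javascript
// Sketch of a bias-on vs bias-off comparison.
// Assumptions: numeric `score` per record; `applied_recommendation` marks bias-on.
const MIN_SAMPLES = 5; // mirrors the "at least 5 samples per group" bar

const mean = (xs) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : null;

function abReport(records, minSamples = MIN_SAMPLES) {
  const on = records.filter((r) => r.applied_recommendation).map((r) => r.score);
  const off = records.filter((r) => !r.applied_recommendation).map((r) => r.score);
  const meanOn = mean(on);
  const meanOff = mean(off);

  const warnings = [];
  if (on.length < minSamples) {
    warnings.push(`bias-on group has only ${on.length} sample(s)`);
  }
  if (off.length < minSamples) {
    warnings.push(`bias-off group has only ${off.length} sample(s)`);
  }

  return {
    n_on: on.length,
    n_off: off.length,
    delta_mean_effectiveness:
      meanOn === null || meanOff === null ? null : meanOn - meanOff,
    warnings,
  };
}

// Usage: five biased runs, two unbiased runs.
const ab = abReport([
  { score: 1, applied_recommendation: { template: "t1" } },
  { score: 0.5, applied_recommendation: { template: "t1" } },
  { score: 0.75, applied_recommendation: { template: "t1" } },
  { score: 0.75, applied_recommendation: { template: "t1" } },
  { score: 0.75, applied_recommendation: { template: "t1" } },
  { score: 0.5 },
  { score: 0.5 },
]);
console.log(ab);
```

A positive delta with a low-sample warning attached, as in this usage, is exactly the case the notes say is not yet enough to flip the default.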

Full detail

See CHANGELOG.md for the complete v12.0.0 entry, including the PR-by-PR breakdown of the 20 PRs (#23 through #42) that shipped this release.

Credit

Thanks to codex for two rounds of review that caught P1/P2 issues before they shipped.

v10.0.0 — Repromptmania

19 Mar 18:07

Repromptmania

Agents now ask before they act, and show what they found.

Dimension Interview

Repromptverse Phase 1 scores your raw prompt on 4 dimensions (Clarity, Specificity, Constraints, Decomposition). Low-scoring dimensions become targeted questions (0-4 max). No more vague prompts spawning expensive agents.

Agent Cards

  • Plan Cards (Phase 1): see every agent's role, scope, exclusions, and output path before execution
  • Status Line (Phase 3): compact emoji-based polling during execution
  • Result Cards (Phase 4): per-agent score, finding count, and key insight before synthesis

User Confirmation Gate

Team plan shown before execution. You approve, adjust, or cancel before any agent runs.

Details

  • 42 test scenarios, 9 anti-patterns
  • All 141 unit tests + 4 benchmarks pass
  • No runtime code changes — behavioral spec only
  • Full changelog: CHANGELOG.md

v9.2.2 — Production Polish

15 Mar 15:04

Final polish pass for production readiness.

  • Version strings aligned across all files
  • CHANGELOG cleaned (no more semantic-release duplicates)
  • Template selection bias: the flywheel now recommends the historically best template, since template choice is the most impactful decision
  • REPROMPTER_FLYWHEEL_MAX_OUTCOMES env var for configurable ledger size (default 500)
  • 125 tests + 188 benchmarks, 0 failures

Full changelog: v9.2.1...v9.2.2

v9.2.1 — 7 Flywheel Gaps Fixed

15 Mar 14:44

All 7 critical flywheel gaps resolved

Fixed by a 3-agent parallel team (RuntimeEngineer, OutcomeEngineer, DocsEngineer) using reprompter's own Repromptverse mode.

Fixes

| # | Gap | Resolution |
|---|-----|------------|
| 1 | flywheelPreferredTier dead code | capability-policy.js now reads tier, applies +2 score boost to matching models |
| 2 | postCorrectionEdits phantom | collectGitSignals() counts recent file edits via git log |
| 3 | .reprompter/ not in gitignore | Added to .gitignore |
| 4 | Pattern merge incomplete | getPatternById() helper + full pattern object sync after bias |
| 5 | Ledger unbounded | trimOutcomes(500) with atomic write, auto-trim on every write |
| 6 | No E2E integration test | flywheel-e2e.test.js: 5 tests covering full cycle |
| 7 | SKILL.md no user guidance | Flywheel user guidance subsection (when/how to surface) |
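
The ledger fix in row 5 combines a size cap with an atomic replace. A minimal sketch of that pattern follows, assuming one JSON record per line with the newest appended last; this trimOutcomes is an illustrative reimplementation, not the repo's code, and the atomicity claim holds for a same-filesystem rename on POSIX systems.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Sketch: keep only the newest `max` NDJSON lines, replacing the file atomically.
// Assumes one JSON record per line, newest appended last.
function trimOutcomes(file, max = 500) {
  const lines = fs
    .readFileSync(file, "utf8")
    .split("\n")
    .filter((l) => l.trim() !== "");
  if (lines.length <= max) return lines.length; // nothing to trim

  const kept = lines.slice(-max);
  // Write a temp file next to the ledger, then rename it over the original,
  // so readers never observe a half-written ledger.
  const tmp = `${file}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, kept.join("\n") + "\n");
  fs.renameSync(tmp, file);
  return kept.length;
}

// Usage with a throwaway ledger in the OS temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "flywheel-"));
const ledger = path.join(dir, "outcomes.ndjson");
for (let i = 0; i < 8; i++) {
  fs.appendFileSync(ledger, JSON.stringify({ id: i }) + "\n");
}
const keptCount = trimOutcomes(ledger, 5);
const ids = fs
  .readFileSync(ledger, "utf8")
  .trim()
  .split("\n")
  .map((l) => JSON.parse(l).id);
console.log(keptCount, ids); // 5 [ 3, 4, 5, 6, 7 ]
```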

Test results

  • 124 unit tests + 188 benchmark fixtures — 100% pass
  • 5 new E2E tests for full flywheel cycle
  • Zero regressions

Full changelog: v9.2.0...v9.2.1

v9.2.0 — Version Alignment

14 Mar 23:48

Version alignment release. Cleaned up semantic-release auto-generated changelog duplicates and aligned version strings across all files (package.json, SKILL.md, README.md).

No functional changes from v9.1.0.

Full changelog: v9.1.0...v9.2.0

v9.1.0 — Closed-Loop Flywheel

14 Mar 23:42

The loop is closed.

v9.0 introduced the Prompt Flywheel. v9.1 closes the loop: historical outcomes now automatically change future execution behavior.

  • bestRecipeForDomain() — domain-only lookup before decisions
  • applyFlywheelBias() — confidence-gated pattern merge + tier override
  • 8 new unit tests (118 total)
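
As an illustration of what "confidence-gated" means here, the sketch below applies a recommendation only when its confidence clears a threshold. The level names, the recommendation shape, and the applyBias helper are assumptions made for this example, not the signature of applyFlywheelBias().

```javascript
// Sketch of a confidence gate. Level names and object shapes are assumed.
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

function applyBias(plan, recommendation, minConfidence = "medium") {
  const rank = CONFIDENCE_RANK[recommendation?.confidence];
  if (rank === undefined || rank < CONFIDENCE_RANK[minConfidence]) {
    return { ...plan, biased: false }; // below the gate: plan is left as-is
  }
  return {
    ...plan,
    // Merge patterns without duplicates, keeping the plan's own order first.
    patterns: [...new Set([...plan.patterns, ...(recommendation.patterns || [])])],
    // Override the tier only when the recommendation names one.
    tier: recommendation.tier ?? plan.tier,
    biased: true,
  };
}

// Usage
const plan = { template: "bugfix", patterns: ["a"], tier: "standard" };
const weak = applyBias(plan, { confidence: "low", patterns: ["b"], tier: "high" });
const strong = applyBias(plan, { confidence: "high", patterns: ["a", "b"], tier: "high" });
console.log(weak, strong);
```

Returning a new object rather than mutating the plan makes it easy to record whether a given run was biased.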

Full changelog: v9.0.0...v9.1.0

v9.0.0 — Prompt Flywheel

14 Mar 23:19

Prompt Flywheel — closed-loop outcome learning

The prompt engineer that gets smarter every time you use it.

New

  • Recipe fingerprinting — deterministic SHA-256 hash of prompt strategy vectors (template + patterns + tier + domain + layers + quality bucket)
  • Passive outcome collection — captures artifact scores, retry counts, execution time at finalize_run. All data stored locally in .reprompter/flywheel/outcomes.ndjson
  • Adaptive strategy learning — queries outcome ledger for similar past tasks, scores recipe groups with time-decay weighting (7-day half-life), recommends best-performing strategy with confidence levels
  • Runtime integration — flywheel hooks at plan_ready and finalize_run in repromptverse-runtime.js
  • Feature flag: REPROMPTER_FLYWHEEL=0|1 (enabled by default)
  • 3 new telemetry stages: fingerprint_recipe, collect_outcome, learn_strategy
  • Flywheel benchmark harness — 13 fixtures (fingerprint 4, effectiveness 6, strategy 3)
  • 48 new unit tests — recipe-fingerprint (14), outcome-collector (19), strategy-learner (15)
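
Two ideas from this list lend themselves to a short sketch: a deterministic fingerprint of the strategy vector, and time-decay weighting with a 7-day half-life (weight = 0.5^(ageDays / 7)). The field names follow the list above, but the canonicalization (fixed field order, sorted arrays) and the helper names are assumptions; the shipped recipe-fingerprint and strategy-learner code may work differently.

```javascript
const crypto = require("node:crypto");

// Sketch: deterministic fingerprint of a strategy vector.
// Canonicalization (fixed field order, sorted arrays) is an assumption.
function fingerprintRecipe(recipe) {
  const canonical = JSON.stringify([
    recipe.template,
    [...(recipe.patterns || [])].sort(),
    recipe.tier,
    recipe.domain,
    [...(recipe.layers || [])].sort(),
    recipe.qualityBucket,
  ]);
  return crypto.createHash("sha256").update(canonical).digest("hex");
}

// Exponential time decay: the weight halves every `halfLifeDays`.
function decayWeight(ageDays, halfLifeDays = 7) {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Decay-weighted mean score for one recipe group.
function weightedScore(outcomes, halfLifeDays = 7) {
  let num = 0;
  let den = 0;
  for (const o of outcomes) {
    const w = decayWeight(o.ageDays, halfLifeDays);
    num += w * o.score;
    den += w;
  }
  return den > 0 ? num / den : null;
}

// Usage: pattern order does not change the fingerprint; tier does.
const base = { template: "bugfix", tier: "high", domain: "api", layers: ["x"], qualityBucket: "good" };
const fpA = fingerprintRecipe({ ...base, patterns: ["b", "a"] });
const fpB = fingerprintRecipe({ ...base, patterns: ["a", "b"] });
const fpC = fingerprintRecipe({ ...base, patterns: ["a", "b"], tier: "low" });
console.log(fpA === fpB, fpA === fpC);
console.log(decayWeight(7), decayWeight(14)); // 0.5 0.25
console.log(weightedScore([{ ageDays: 0, score: 1 }, { ageDays: 7, score: 0 }]));
```

Sorting the arrays before hashing is what makes two logically identical recipes land in the same group regardless of how their patterns were listed.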

Privacy

All flywheel data is stored locally. Nothing is transmitted anywhere.

Test results

  • 110 unit tests — 100% pass
  • 188 benchmark fixtures — 100% pass
  • Zero regressions on v8.3 tests and benchmarks

Full changelog: v8.3.0...v9.0.0

v8.3.0

14 Mar 23:12

8.3.0 (2026-03-14)

Features

  • milestone 1 telemetry and observability pipeline (45d05cb)
  • milestone 2 real-world benchmarks and routing calibration (540fd9a)
  • release v8.3.0 runtime optimization stack (d9420fb)
  • release v9.0.0 prompt flywheel engine (a87c9f1)

v8.2.0

24 Feb 13:48

Added

  • Deterministic intent router: scripts/intent-router.js with explicit profile triggers + weighted keyword routing
  • Router unit tests: scripts/intent-router.test.js (8 passing tests)
  • Benchmark harness: scripts/run-swarm-benchmark.js + fixture set under benchmarks/fixtures/
  • Benchmark reports — generated markdown/json artifacts for pre-release checks
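
A rough sketch of how explicit triggers and weighted keyword routing can be combined deterministically is shown below. The profile names, trigger strings, weights, and the routeIntent function are all hypothetical and chosen for illustration; they are not taken from scripts/intent-router.js.

```javascript
// Sketch: explicit triggers first, then weighted keywords.
// Profiles, triggers, and weights are hypothetical.
const PROFILES = [
  { name: "audit", triggers: ["/audit"], keywords: { security: 3, audit: 3, review: 2 } },
  { name: "bugfix", triggers: ["/fix"], keywords: { bug: 3, crash: 2, fix: 2 } },
  { name: "docs", triggers: ["/docs"], keywords: { readme: 3, document: 2, docs: 2 } },
];

function routeIntent(prompt, profiles = PROFILES) {
  const text = prompt.toLowerCase();

  // 1. An explicit trigger wins outright.
  for (const p of profiles) {
    if (p.triggers.some((t) => text.includes(t))) {
      return { profile: p.name, reason: "trigger", score: null };
    }
  }

  // 2. Otherwise sum the weights of keywords that appear as whole words.
  const words = new Set(text.split(/[^a-z0-9]+/).filter(Boolean));
  const scored = profiles.map((p) => ({
    profile: p.name,
    score: Object.entries(p.keywords).reduce(
      (sum, [keyword, weight]) => sum + (words.has(keyword) ? weight : 0),
      0
    ),
  }));

  // 3. Highest score wins; ties break on name so the result is deterministic.
  scored.sort((a, b) => b.score - a.score || a.profile.localeCompare(b.profile));
  const best = scored[0];
  return best.score > 0
    ? { profile: best.profile, reason: "keywords", score: best.score }
    : { profile: null, reason: "no-match", score: 0 };
}

// Usage
console.log(routeIntent("/fix the README"));
console.log(routeIntent("Review the crash bug in auth"));
console.log(routeIntent("fix docs"));
console.log(routeIntent("hello"));
```

The explicit tie-break is the part that makes the router deterministic: the same prompt always maps to the same profile regardless of the order in which profiles are declared.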

Changed

  • Codex/Claude operational parity hardened with runnable npm run check pipeline (templates + router tests + benchmark)
  • Packaging scope tightened — benchmark artifacts and router test file excluded from skill zip
  • Version alignment across docs and skill metadata to v8.2.0