Releases: AytuncYildizli/reprompter
v12.1.0 — Codex CLI runtime contract + factual corrections
Headline
Option D (Codex CLI runtime) is now a first-class Phase 3 path with a full runtime contract — not the five-bullet prose stub it had been. Native subagents via the [agents] config block (D1) and shell-level parallelism via codex exec backgrounding (D2) are both documented with runnable commands, verified against Codex 0.121.0 source.
A SKILL.md frontmatter description that silently exceeded Codex's 1024-character load limit (skipping the entire skill in Codex CLI) was trimmed to 960 characters without losing any skill-selection trigger.
Docs-only release. No runtime code changes.
Codex CLI as a documented runtime (PR #43)
New file: references/runtime/codex-runtime.md — runtime contract paralleling the implicit ones used by Options A/B. Covers D1 vs D2 picker logic, prerequisites, invocation, artifact contract, concurrency caps, status-line patterns, retries, and known gotchas (#11435, #14866, #15177 — all cross-linked).
SKILL.md Option D rewritten:
- D1 Native subagents: [features] multi_agent = true + [agents] max_threads = 6 + ~/.codex/agents/<name>.toml role definitions + prompt-driven spawn. Includes a working orchestrator example that fans out one rpt_audit_explorer per audit dimension and synthesizes the result.
- D2 Shell-level codex exec: runnable bash block with --ephemeral, --sandbox workspace-write, --output-last-message, artifact verification, FS-polling status line, hang recovery, and a FIFO semaphore with failure propagation.
- Picker table (D1 vs D2 vs "neither, use Option B for cross-agent messaging").
SKILL.md Settings now has a Codex CLI subsection documenting ~/.codex/config.toml with [features], [agents], and skill-defined [reprompter] keys. Claude Code is clarified as optional when Codex is the target runtime.
Compatibility claim rewritten from hedged "parallel sessions if available" to naming the actual mechanism. README compatibility table aligned — removed the asterisk on Codex parallel and added a clarifier pointing to Option D.
Factual corrections (PR #44, 8 commits over 7 bot-review rounds)
Every correction verified against openai/codex rust-v0.121.0 source, codex exec --help, and the current status of each cited GitHub issue as of 2026-04-18.
- --full-auto semantics in codex exec. Source: codex-rs/exec/src/cli.rs:50–52 defines it as "Convenience alias for low-friction sandboxed automatic execution (--sandbox workspace-write)". lib.rs:263 only selects the sandbox when full_auto is true; lib.rs:374–376 sets approval policy unconditionally to AskForApproval::Never for headless mode. Docs now recommend --sandbox workspace-write for readability and explain that both options work.
- --sandbox read-only artifact-write bug. D2 workers write their own /tmp/rpt-*.md artifacts; read-only breaks this contract. Reverted; read-only is now documented only for pure-analysis workers using --output-last-message as the artifact path.
- report_agent_job_result scope. Registered only for spawn_agents_on_csv batch workers, not ordinary prompt-spawned subagents. Removed from the D1 custom-agent template.
- [agents] max_threads semantics. reserve_spawn_slot returns AgentLimitReached when the cap is reached: normal spawn_agent calls past the cap fail, they do not queue. Replaced the "queues 2 and runs 6" line with the correct failure mode and pointed readers at spawn_agents_on_csv for true fan-out.
- Issue #11435 framing. Closed as not-reproducible after exec was reimplemented on the app server. --ephemeral reframed from "required to avoid corruption" to a historical motivation for the flag.
- Issue #15177 fix claim. Still open with no linked fix. Removed the "Fixed in 0.122.0-alpha" claim; documented the actual current-state workaround.
- codex exec approval default. Hardcoded Never in headless mode. The approval_policy key in config.toml applies to the interactive TUI only.
- features.multi_agent default. Default-enabled in 0.121.0+. Docs no longer imply users must set this explicitly.
- Native subagents ship date. "Shipped 2026-03-16" → "multi_agent feature flag stabilized in 0.115.0 on 2026-03-16" (matches rust-v0.115.0 release: #14622 Stabilize multi-agent feature flag).
- Bash portability. POSIX-compatible artifact-count loop using [ -e "$f" ] and case (not Bash-only [[ ]]), zero-match safe under set -euo pipefail. Runs under dash too.
- FIFO semaphore failure propagation. Explicit PID collection, per-PID wait, status aggregation, fd close after the loop, and exit "$status" so downstream synthesis does not run on missing artifacts. trap 'echo >&9' EXIT guarantees the semaphore token is returned even on non-zero worker exit.
- Picker-table drift. Added the missing "Cross-agent messaging required mid-run → use Option B" row to the lower SKILL.md picker table.
- macOS CPU-count. Added sysctl -n hw.ncpu alongside nproc.
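The [agents] max_threads correction above distinguishes "fail at the cap" from "queue at the cap". As a rough illustration only, the toy model below shows the fail-at-cap behaviour in JavaScript. It is not Codex's Rust implementation: the SlotPool class and its methods are hypothetical names, and only the AgentLimitReached label is borrowed from the notes.

```javascript
// Toy model of fail-at-cap slot reservation (hypothetical; not Codex source).
class SlotPool {
  constructor(maxThreads) {
    this.maxThreads = maxThreads;
    this.active = 0;
  }

  // Reserve a slot or throw immediately. Nothing is queued.
  reserve() {
    if (this.active >= this.maxThreads) {
      throw new Error("AgentLimitReached");
    }
    this.active += 1;
    let released = false;
    // The returned function frees the slot; calling it twice is a no-op.
    return () => {
      if (!released) {
        released = true;
        this.active -= 1;
      }
    };
  }
}
```

Under this model, a caller that wants eight workers with a cap of six has to handle the two failures itself (retry after a release, or batch the work), which is the situation the notes point to spawn_agents_on_csv for.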
SKILL.md description under Codex load limit (PR #45)
Codex 0.121.0 enforces a 1024-character limit on the SKILL.md description field via validate_len(&description, MAX_DESCRIPTION_LEN, "description"). The description was 1217 characters, so Codex silently skipped the skill with:
Skipped loading 1 skill(s) due to invalid SKILL.md files.
~/.codex/skills/reprompter/SKILL.md: invalid description: exceeds maximum length of 1024 characters
Claude Code did not enforce the limit, so the bug was Codex-only and easy to miss.
Trimmed the description to 960 characters (64-character safety margin). Every Single / Repromptverse / Reverse-mode trigger keyword preserved; only verbose phrasing and redundant aliases removed.
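A pre-commit check along the following lines could catch this class of regression before release. This is a minimal sketch, not part of reprompter: readDescription and checkDescription are hypothetical helpers, the frontmatter parsing is deliberately naive (single-line description only), and counting Unicode code points is an assumption about how Codex measures length.

```javascript
// Limit value taken from the release notes; the counting unit is an assumption.
const MAX_DESCRIPTION_LEN = 1024;

// Naive reader: handles only a single-line `description: ...` entry
// inside a leading `---` frontmatter block.
function readDescription(skillMd) {
  const frontmatter = skillMd.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!frontmatter) return null;
  const line = frontmatter[1].match(/^description:[ \t]*(.*)$/m);
  return line ? line[1].trim() : null;
}

function checkDescription(skillMd, limit = MAX_DESCRIPTION_LEN) {
  const description = readDescription(skillMd);
  if (description === null) {
    return { ok: false, length: 0, reason: "no description found" };
  }
  const length = [...description].length; // code points, not UTF-16 units
  return length <= limit
    ? { ok: true, length }
    : { ok: false, length, reason: `exceeds maximum length of ${limit} characters` };
}
```

A description written as a YAML block scalar or across several lines would need a real YAML parser instead of the regex used here.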
Review notes
PR #44 went through 7 rounds of automated Codex bot review plus a source-level cross-check at the rust-v0.121.0 tag. Each round replaced a broader, sloppier claim with a narrower, more accurate one; the final wording is grounded in cited source lines rather than memory-from-spec.
Lesson captured in the commit messages: source-verify contested claims before prose lands in a docs-only PR.
What's next (deliberately out of scope)
- TESTING.md scenarios for D1 (native subagent fan-out) and D2 (shell-level codex exec fan-out).
- Codex-specific install one-liner in README (alongside the existing Claude Code curl | tar recipe).
Both fit better as small follow-up PRs so the review surface stays focused.
Contributors
Thanks to @dorukardahan for the full Codex CLI runtime write-up, factual corrections, and release prep across PRs #43, #44, #45, #46.
Full diff: v12.0.0...v12.1.0
v12.0.0 — Closed-loop Flywheel
Headline
Reprompter is no longer an open-loop prompt rewriter. Every generated prompt emits testable success criteria, every run can be recorded and scored, every outcome feeds a local flywheel that biases future generations toward historical winners, and npm run flywheel:ab proves whether the bias is actually helping. All data local. No telemetry.
This release also recovers Repromptverse under opus 4.7 (which enforces tool schemas strictly where 4.6 was lenient), ships a tool-drift linter as long-term regression insurance, and hardens the Repromptverse runtime selection path.
The closed loop, end-to-end
User rough prompt
→ Mode 1 / Mode 2 / Mode 3 interview
→ [opt-in] flywheel:query consults past outcomes
→ [if confidence ≥ medium] template + patterns biased
→ generated prompt with <success_criteria schema_version="1"> block
→ user runs it downstream
→ outcome-record.js stamps the run (with applied_recommendation if biased)
→ evaluate-outcome.js scores against criteria
→ flywheel:ingest bridges into NDJSON store
→ strategy-learner aggregates by recipe
→ flywheel:ab compares bias-on vs bias-off effectiveness
Major additions
Closed-loop flywheel (v2 + v3 rollout)
- v1 outcome-record schema at .reprompter/outcomes/*.json (structured success_criteria, verification_results, score, optional role + applied_recommendation)
- scripts/outcome-record.js: write records (with collision-safe filenames, role attribution, optional applied_recommendation stamping)
- scripts/evaluate-outcome.js: score records against criteria (rule/regex, rule/predicate, llm_judge via user --judge-cmd, manual)
- scripts/outcome-collector.js ingest bridge: idempotent, deterministic sort, role-domain routing, applied_recommendation preservation
- scripts/strategy-learner.js::getRecommendation: read-only query API
- scripts/strategy-learner.js::buildAbReport: bias-on vs bias-off effectiveness delta with low-sample warnings
- Bias injection: REPROMPTER_FLYWHEEL_BIAS=0|1 env flag (default off). Mode 1 step 5 + Mode 2 Phase 2 consult the flywheel when set.
- <success_criteria> emission across all three modes
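To make the rule/regex scoring path concrete, here is a minimal sketch of how regex criteria might be checked against a run's output. It is an illustration under assumptions: only the field names success_criteria, verification_results, and score come from the notes above. The criterion shape ({ id, pattern, flags }), the scoreRegexCriteria helper, and the "fraction passed" scoring rule are all hypothetical and may differ from evaluate-outcome.js.

```javascript
// Hypothetical scorer: each criterion is { id, pattern, flags? }.
function scoreRegexCriteria(outputText, successCriteria) {
  const verification_results = successCriteria.map((criterion) => {
    let passed;
    try {
      passed = new RegExp(criterion.pattern, criterion.flags || "").test(outputText);
    } catch {
      passed = false; // an invalid pattern counts as a failed check in this sketch
    }
    return { id: criterion.id, passed };
  });
  const passedCount = verification_results.filter((r) => r.passed).length;
  // Assumed scoring rule: fraction of criteria that passed.
  const score = successCriteria.length === 0 ? null : passedCount / successCriteria.length;
  return { verification_results, score };
}

// Example record fragment using the field names from the notes.
const success_criteria = [
  { id: "has-describe", pattern: "\\bdescribe\\(" },
  { id: "has-expect", pattern: "\\bexpect\\(" },
];
const record = {
  success_criteria,
  ...scoreRegexCriteria("describe('sum', () => { it('adds', () => {}); });", success_criteria),
};
```

In this example one of the two criteria matches, so under the assumed rule record.score is 0.5.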
Infrastructure + opus-4.7 compatibility
- Tool-drift linter (scripts/validate-tool-refs.js): catches every obsolete tool shape we've shipped a fix for. Multi-line regex support.
- Auto-pick runtime (Repromptverse Phase 3): detects capability and picks Options A–E automatically
- Tool-schema guard: canonical signatures + pitfall list captured from the 4.6→4.7 drift
Opus 4.7 recovery
- Task(subagent_type=...) → Agent(...)
- SendMessage(type=, recipient=) → SendMessage(to=, message=)
- Broadcast shutdown → per-agent
- Plus a codex review round addressing filename collision, shell quoting, regex validation, idempotent ingest, deterministic sort, agent-identity routing, partial-promptShape wildcards, filter-before-limit, Mode-3 checklist
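The two obsolete call shapes listed above lend themselves to a small regex-based scan. The sketch below shows the general idea of a tool-drift check with multi-line matching. It is not the code in scripts/validate-tool-refs.js: the rule names, the findToolDrift function, and the exact patterns are assumptions made for illustration.

```javascript
// Obsolete shapes taken from the recovery list; rule names are hypothetical.
const OBSOLETE_SHAPES = [
  { name: "task-subagent-type", pattern: /\bTask\(\s*subagent_type\s*=/g },
  // [^)] also matches newlines, so a call split across lines is still caught
  // while the match stays inside one pair of parentheses.
  { name: "sendmessage-type-recipient", pattern: /\bSendMessage\(\s*type\s*=[^)]*?recipient\s*=/g },
];

// Returns one hit per obsolete call, with the 1-based line where it starts.
function findToolDrift(text) {
  const hits = [];
  for (const { name, pattern } of OBSOLETE_SHAPES) {
    for (const match of text.matchAll(pattern)) {
      const line = text.slice(0, match.index).split("\n").length;
      hits.push({ rule: name, line });
    }
  }
  return hits;
}
```

A linter built this way exits non-zero when findToolDrift returns any hits, which makes it usable as a CI gate.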
New npm scripts
- npm run validate:tool-refs
- npm run flywheel:query
- npm run flywheel:ingest
- npm run flywheel:ab
New env flag
- REPROMPTER_FLYWHEEL_BIAS=0|1 (default off: read-path consultation; complements the existing REPROMPTER_FLYWHEEL=0|1, which controls outcome writing)
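The split between the write-path flag and the read-path flag can be sketched as follows. This is an assumption-laden illustration, not reprompter's code: readFlag and readFlywheelFlags are hypothetical helpers, the treatment of values other than "0" and "1" is a guess, and the write-path default of "on" is taken from the v9.0.0 notes rather than confirmed for v12.

```javascript
// Assumption: "1" enables, "0" disables, anything else falls back to the default.
function readFlag(env, name, defaultValue) {
  const raw = env[name];
  if (raw === "1") return true;
  if (raw === "0") return false;
  return defaultValue;
}

function readFlywheelFlags(env = process.env) {
  return {
    // Write path: whether outcomes are recorded (assumed default on, per v9.0.0).
    recordOutcomes: readFlag(env, "REPROMPTER_FLYWHEEL", true),
    // Read path: whether past outcomes bias generation (default off in v12.0.0).
    applyBias: readFlag(env, "REPROMPTER_FLYWHEEL_BIAS", false),
  };
}
```

Keeping the two flags independent means a user can collect outcomes for a while with bias off, then opt in once enough data exists.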
Tests
205 tests pass (was 169). outcome-collector: 30 → 43. strategy-learner: 24 → 36. Plus new --self-test modes on outcome-record.js and evaluate-outcome.js.
What's deliberately still ahead
- Default-on flip of REPROMPTER_FLYWHEEL_BIAS. Waiting for flywheel:ab to show a consistent positive delta_mean_effectiveness across multiple task types with ≥5 samples per group.
- Per-role bias queries for Repromptverse teams once role-stamped records accumulate.
- Visualizations / dashboards on top of flywheel:ab output.
- Community / telemetry pooling: the loop stays local-first.
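The default-on criterion above (a positive delta_mean_effectiveness with at least 5 samples per group) can be sketched as a small comparison function. This is only an illustration of the arithmetic: the input shape ({ biased, effectiveness }), the abDelta name, and the warning text are hypothetical, and the actual buildAbReport output may be structured differently.

```javascript
const MIN_SAMPLES = 5; // threshold named in the notes above

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical input: [{ biased: boolean, effectiveness: number }, ...]
function abDelta(records, minSamples = MIN_SAMPLES) {
  const on = records.filter((r) => r.biased).map((r) => r.effectiveness);
  const off = records.filter((r) => !r.biased).map((r) => r.effectiveness);
  const warnings = [];
  if (on.length < minSamples) warnings.push(`bias-on group has ${on.length} samples (< ${minSamples})`);
  if (off.length < minSamples) warnings.push(`bias-off group has ${off.length} samples (< ${minSamples})`);
  // The delta is undefined when either group is empty.
  const delta = on.length > 0 && off.length > 0 ? mean(on) - mean(off) : null;
  return { n_on: on.length, n_off: off.length, delta_mean_effectiveness: delta, warnings };
}
```

A decision rule built on this would require a positive delta and an empty warnings list for each task type before treating the bias as helpful.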
Full detail
See CHANGELOG.md for the complete v12.0.0 entry, including the PR-by-PR breakdown of the 20 PRs (#23 through #42) that shipped this release.
Credit
Thanks to codex for two rounds of review that caught P1/P2 issues before they shipped.
v10.0.0 — Repromptmania
Repromptmania
Agents now ask before they act, and show what they found.
Dimension Interview
Repromptverse Phase 1 scores your raw prompt on 4 dimensions (Clarity, Specificity, Constraints, Decomposition). Low-scoring dimensions become targeted questions (0-4 max). No more vague prompts spawning expensive agents.
Agent Cards
- Plan Cards (Phase 1): see every agent's role, scope, exclusions, and output path before execution
- Status Line (Phase 3): compact emoji-based polling during execution
- Result Cards (Phase 4): per-agent score, finding count, and key insight before synthesis
User Confirmation Gate
Team plan shown before execution. You approve, adjust, or cancel before any agent runs.
Details
- 42 test scenarios, 9 anti-patterns
- All 141 unit tests + 4 benchmarks pass
- No runtime code changes — behavioral spec only
- Full changelog: CHANGELOG.md
v9.2.2 — Production Polish
Final polish pass for production readiness.
- Version strings aligned across all files
- CHANGELOG cleaned (no more semantic-release duplicates)
- Template selection bias: flywheel now recommends historically best template, the most impactful decision
- REPROMPTER_FLYWHEEL_MAX_OUTCOMES env var for configurable ledger size (default 500)
- 125 tests + 188 benchmarks, 0 failures
Full changelog: v9.2.1...v9.2.2
v9.2.1 — 7 Flywheel Gaps Fixed
All 7 critical flywheel gaps resolved
Fixed by a 3-agent parallel team (RuntimeEngineer, OutcomeEngineer, DocsEngineer) using reprompter's own Repromptverse mode.
Fixes
| # | Gap | Resolution |
|---|---|---|
| 1 | flywheelPreferredTier dead code | capability-policy.js now reads tier, applies +2 score boost to matching models |
| 2 | postCorrectionEdits phantom | collectGitSignals() counts recent file edits via git log |
| 3 | .reprompter/ not in gitignore | Added to .gitignore |
| 4 | Pattern merge incomplete | getPatternById() helper + full pattern object sync after bias |
| 5 | Ledger unbounded | trimOutcomes(500) with atomic write, auto-trim on every write |
| 6 | No E2E integration test | flywheel-e2e.test.js — 5 tests covering full cycle |
| 7 | SKILL.md no user guidance | Flywheel user guidance subsection (when/how to surface) |
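Row 5 combines two ideas: capping the ledger and replacing it atomically. A minimal sketch of that combination follows, assuming an append-only NDJSON file where the newest records are last. The trimLedger function and the temp-file naming are hypothetical; the repo's trimOutcomes may work differently.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Keep only the newest `max` NDJSON lines, then swap the file in via
// write-to-temp + rename so readers never observe a half-written ledger.
function trimLedger(ledgerPath, max = 500) {
  const lines = fs
    .readFileSync(ledgerPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "");
  if (lines.length <= max) return lines.length; // nothing to trim

  const kept = lines.slice(lines.length - max);
  const tmpPath = path.join(
    path.dirname(ledgerPath),
    `.${path.basename(ledgerPath)}.${process.pid}.tmp`
  );
  fs.writeFileSync(tmpPath, kept.join("\n") + "\n");
  // rename is atomic on POSIX when both paths are on the same filesystem.
  fs.renameSync(tmpPath, ledgerPath);
  return kept.length;
}
```

Writing the temp file next to the ledger (rather than in the system temp directory) keeps both paths on one filesystem, which the atomic rename depends on.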
Test results
- 124 unit tests + 188 benchmark fixtures — 100% pass
- 5 new E2E tests for full flywheel cycle
- Zero regressions
Full changelog: v9.2.0...v9.2.1
v9.2.0 — Version Alignment
Version alignment release. Cleaned up semantic-release auto-generated changelog duplicates and aligned version strings across all files (package.json, SKILL.md, README.md).
No functional changes from v9.1.0.
Full changelog: v9.1.0...v9.2.0
v9.1.0 — Closed-Loop Flywheel
The loop is closed.
v9.0 introduced the Prompt Flywheel. v9.1 closes the loop: historical outcomes now automatically change future execution behavior.
- bestRecipeForDomain(): domain-only lookup before decisions
- applyFlywheelBias(): confidence-gated pattern merge + tier override
- 8 new unit tests (118 total)
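The phrase "confidence-gated pattern merge + tier override" can be read as the following sketch. It is a guess at the shape of the logic, not the body of applyFlywheelBias(): the plan and recommendation objects, the low/medium/high ladder, and the "medium" threshold (borrowed from the v12.0.0 notes) are all assumptions.

```javascript
// Assumed confidence ladder; only "medium" is named in the release notes.
const CONFIDENCE_RANK = { low: 0, medium: 1, high: 2 };

// Hypothetical shapes:
//   plan           = { patterns: string[], tier: string }
//   recommendation = { confidence: string, patterns?: string[], tier?: string }
function applyBiasSketch(plan, recommendation, minConfidence = "medium") {
  const rank = CONFIDENCE_RANK[recommendation?.confidence];
  // Gate: leave the plan untouched unless confidence reaches the threshold.
  if (rank === undefined || rank < CONFIDENCE_RANK[minConfidence]) {
    return { ...plan, biased: false };
  }
  // Merge: union of existing and recommended patterns, without duplicates.
  const patterns = [...new Set([...plan.patterns, ...(recommendation.patterns || [])])];
  // Override: the recommended tier wins when one is present.
  return { ...plan, patterns, tier: recommendation.tier || plan.tier, biased: true };
}
```

Returning a new object instead of mutating the plan keeps the unbiased plan available for comparison.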
Full changelog: v9.0.0...v9.1.0
v9.0.0 — Prompt Flywheel
Prompt Flywheel — closed-loop outcome learning
The prompt engineer that gets smarter every time you use it.
New
- Recipe fingerprinting: deterministic SHA-256 hash of prompt strategy vectors (template + patterns + tier + domain + layers + quality bucket)
- Passive outcome collection: captures artifact scores, retry counts, execution time at finalize_run. All data stored locally in .reprompter/flywheel/outcomes.ndjson
- Adaptive strategy learning: queries outcome ledger for similar past tasks, scores recipe groups with time-decay weighting (7-day half-life), recommends best-performing strategy with confidence levels
- Runtime integration: flywheel hooks at plan_ready and finalize_run in repromptverse-runtime.js
- Feature flag: REPROMPTER_FLYWHEEL=0|1 (enabled by default)
- 3 new telemetry stages: fingerprint_recipe, collect_outcome, learn_strategy
- Flywheel benchmark harness: 13 fixtures (fingerprint 4, effectiveness 6, strategy 3)
- 48 new unit tests: recipe-fingerprint (14), outcome-collector (19), strategy-learner (15)
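Two of the mechanisms above are easy to show in a few lines: a deterministic SHA-256 fingerprint of a strategy vector, and time-decay weighting with a 7-day half-life. The sketch below is illustrative only. The six fingerprint inputs and the half-life come from the notes; the canonicalization scheme, the array types, the outcome shape ({ score, timestampMs }), and all function names are assumptions.

```javascript
const crypto = require("node:crypto");

// Deterministic fingerprint: canonicalize the strategy vector, then hash it.
// Sorting the arrays makes the hash independent of input order (an assumption
// about what "deterministic" should mean here).
function fingerprintRecipe({ template, patterns, tier, domain, layers, qualityBucket }) {
  const canonical = JSON.stringify([
    template,
    [...patterns].sort(),
    tier,
    domain,
    [...layers].sort(),
    qualityBucket,
  ]);
  return crypto.createHash("sha256").update(canonical).digest("hex");
}

const HALF_LIFE_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Exponential decay: weight halves every `halfLifeDays`.
function decayWeight(ageDays, halfLifeDays = HALF_LIFE_DAYS) {
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Decay-weighted mean score for one recipe group.
function weightedScore(outcomes, nowMs) {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const outcome of outcomes) {
    const ageDays = Math.max(0, (nowMs - outcome.timestampMs) / MS_PER_DAY);
    const weight = decayWeight(ageDays);
    weightedSum += weight * outcome.score;
    totalWeight += weight;
  }
  return totalWeight === 0 ? null : weightedSum / totalWeight;
}
```

With this weighting, an outcome from a week ago counts half as much as one from today, and one from two weeks ago counts a quarter as much.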
Privacy
All flywheel data is stored locally. Nothing is transmitted anywhere.
Test results
- 110 unit tests — 100% pass
- 188 benchmark fixtures — 100% pass
- Zero regressions on v8.3 tests and benchmarks
Full changelog: v8.3.0...v9.0.0
v8.3.0
v8.2.0
Added
- Deterministic intent router: scripts/intent-router.js with explicit profile triggers + weighted keyword routing
- Router unit tests: scripts/intent-router.test.js (8 passing tests)
- Benchmark harness: scripts/run-swarm-benchmark.js + fixture set under benchmarks/fixtures/
- Benchmark reports: generated markdown/json artifacts for pre-release checks
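"Explicit profile triggers + weighted keyword routing" suggests a two-stage router, which might look roughly like the sketch below. Nothing here is taken from scripts/intent-router.js: the profile names, triggers, keywords, weights, and the routeIntent function are invented for illustration.

```javascript
// Hypothetical profiles, triggers, and weights for illustration only.
const PROFILES = [
  { name: "audit", triggers: ["/audit"], keywords: { audit: 3, security: 2, review: 1 } },
  { name: "refactor", triggers: ["/refactor"], keywords: { refactor: 3, cleanup: 2, rename: 1 } },
];

function routeIntent(prompt, profiles = PROFILES) {
  const text = prompt.toLowerCase();

  // Stage 1: an explicit trigger wins outright.
  for (const profile of profiles) {
    if (profile.triggers.some((trigger) => text.includes(trigger))) {
      return { profile: profile.name, via: "trigger", score: null };
    }
  }

  // Stage 2: sum keyword weights per profile. Strict ">" means ties resolve
  // to the earlier profile, which keeps routing deterministic.
  const words = new Set(text.split(/[^a-z0-9]+/).filter(Boolean));
  let best = { profile: null, via: "keywords", score: 0 };
  for (const profile of profiles) {
    let score = 0;
    for (const [keyword, weight] of Object.entries(profile.keywords)) {
      if (words.has(keyword)) score += weight;
    }
    if (score > best.score) best = { profile: profile.name, via: "keywords", score };
  }
  return best;
}
```

Because there is no randomness and no model call, the same prompt always routes to the same profile, which is what makes the router unit-testable.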
Changed
- Codex/Claude operational parity hardened with runnable npm run check pipeline (templates + router tests + benchmark)
- Packaging scope tightened: benchmark artifacts and router test file excluded from skill zip
- Version alignment across docs and skill metadata to v8.2.0