Merged

19 commits
1f87e0e
Add forecast command for projecting token usage and costs
Copilot May 10, 2026
f0ea10b
Fix incorrect comment in extractConcurrencyLimit
Copilot May 10, 2026
28d6d1c
feat: wire --repo flag through workflow discovery in forecast command
Copilot May 10, 2026
c09b8e6
feat: integrate episode analysis into forecast command
Copilot May 10, 2026
cc65d5b
feat: add Monte Carlo simulation to forecast command (Poisson + boots…
Copilot May 10, 2026
7891bc1
fix: address code review feedback on Monte Carlo forecast implementation
Copilot May 10, 2026
4c8c65e
feat: remove cost forecasts, focus forecast output on effective token…
Copilot May 10, 2026
d0f18a3
fix: use exact float mean in meanStdDevInt to avoid variance bias; fi…
Copilot May 10, 2026
daf9e73
feat: mark forecast command as experimental
Copilot May 10, 2026
16006d6
docs: add W3C-style forecast command specification (sidebar order 1355)
Copilot May 10, 2026
3dd1fcc
feat: limit forecast --days to max 30 (remove 90-day option)
Copilot May 10, 2026
ad45f74
docs(adr): add draft ADR-31377 for forecast Monte Carlo projection
github-actions[bot] May 10, 2026
022be18
fix: address all reviewer comments on forecast command
Copilot May 11, 2026
3a52f21
fix: apply code review feedback on forecast tests and comments
Copilot May 11, 2026
52eff75
feat: upgrade Monte Carlo to Gamma–Poisson compound model with IsReli…
Copilot May 11, 2026
30c4aeb
refine: address code review feedback on gammaSample doc, test coverag…
Copilot May 11, 2026
ff07ec3
Merge remote-tracking branch 'origin/main' into copilot/add-forecast-…
Copilot May 11, 2026
28f4717
feat: implement --eval backtesting mode for forecast command
Copilot May 11, 2026
e853c1e
fix: address code review issues in --eval backtesting implementation
Copilot May 11, 2026
3 changes: 3 additions & 0 deletions cmd/gh-aw/main.go
@@ -768,6 +768,7 @@ Use "` + string(constants.CLIExtensionPrefix) + ` help all" to show help for all
lintCmd := cli.NewLintCommand()
domainsCmd := cli.NewDomainsCommand()
experimentsCmd := cli.NewExperimentsCommand()
forecastCmd := cli.NewForecastCommand()

// Assign commands to groups
// Setup Commands
@@ -802,6 +803,7 @@ Use "` + string(constants.CLIExtensionPrefix) + ` help all" to show help for all
healthCmd.GroupID = "analysis"
checksCmd.GroupID = "analysis"
experimentsCmd.GroupID = "analysis"
forecastCmd.GroupID = "analysis"

// Utilities
mcpServerCmd.GroupID = "utilities"
@@ -844,6 +846,7 @@ Use "` + string(constants.CLIExtensionPrefix) + ` help all" to show help for all
rootCmd.AddCommand(projectCmd)
rootCmd.AddCommand(domainsCmd)
rootCmd.AddCommand(experimentsCmd)
rootCmd.AddCommand(forecastCmd)

// Fix help flag descriptions for all subcommands to be consistent with the
// root command ("Show help for gh aw" vs the Cobra default "help for [cmd]").
97 changes: 97 additions & 0 deletions docs/adr/31377-monte-carlo-projection-for-forecast-command.md
@@ -0,0 +1,97 @@
# ADR-31377: Monte Carlo Projection for `gh aw forecast` Command

**Date**: 2026-05-10
**Status**: Draft
**Deciders**: Unknown (PR authored by `app/copilot-swe-agent`; human deciders TBD)

---

## Part 1 — Narrative (Human-Friendly)

### Context

Users of `gh-aw` want to project the future cost and yield of their agentic workflows before scheduling them at higher cadence or rolling them out organization-wide. Historical run data is highly variable: per-run effective token usage can vary by an order of magnitude depending on agent decisions, runs per period follow a counting process, and not every run succeeds. A naive point estimate (e.g. `avg(tokens) × avg(runs/period)`) hides this uncertainty and tends to understate tail risk. The command must also integrate with existing analysis infrastructure (episode classification, A/B experiment variant tracking, JSON output for agent consumers) and remain useful on small samples (≤30 days of history).

### Decision

We will introduce a new **experimental** `gh aw forecast` CLI command that projects per-workflow effective token usage using **Monte Carlo simulation** (10 000 trials) rather than a single point estimate. Each trial composes three independent sources of uncertainty — Poisson-distributed run counts, bootstrap-resampled per-run effective tokens, and Bernoulli-distributed success — and the aggregated trials yield P10/P50/P90 confidence intervals. The command lives in `pkg/cli/forecast*.go`, reuses the existing `buildEpisodeData` engine from `logs_episode.go` for episode analysis, supports remote repositories via `--repo`, and is gated as experimental (stderr warning + `(experimental)` short description) because the interface and statistical assumptions may change.

### Alternatives Considered

#### Alternative 1: Point estimates from historical averages

Compute `mean(effective_tokens) × mean(runs_per_period) × success_rate` and report a single projected number per workflow. Simple, deterministic, and cheap. Rejected because it hides variance, gives users no way to reason about tail risk (which is the operationally interesting question for cost budgeting), and makes side-by-side comparisons across workflows misleading when their variance profiles differ.

#### Alternative 2: Closed-form analytical distribution (e.g. compound Poisson)

Model run count as Poisson(λ), fit per-run tokens to a parametric distribution (lognormal, gamma), and derive percentiles analytically. More elegant and faster than simulation. Rejected because the historical token distribution is typically multi-modal (different agent paths produce qualitatively different cost profiles) and ill-suited to a single parametric family; bootstrap resampling preserves the empirical shape without forcing a fit. A closed form also makes per-variant A/B splits and success-rate composition awkward.
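For context, the compound-Poisson moments this alternative would rely on are standard: with run count \(N \sim \mathrm{Poisson}(\lambda)\) and i.i.d. per-run tokens \(X_i\), the period total \(S = \sum_{i=1}^{N} X_i\) satisfies

```latex
\mathbb{E}[S] = \lambda \,\mathbb{E}[X], \qquad
\operatorname{Var}(S) = \lambda \,\mathbb{E}[X^2]
```

Percentiles, however, have no comparable distribution-free form, which is where the parametric assumption (and its mismatch with multi-modal token data) enters.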

#### Alternative 3: Reuse the existing `audit` command and add a `--forecast` flag

Extend the audit command instead of creating a new top-level command. Rejected because forecasting has a different mental model from auditing (forward projection vs. retrospective analysis), a different input shape (workflow IDs vs. run IDs), and different output structure (per-period projections vs. per-run metrics). Bundling them would muddy both commands' interfaces.

### Consequences

#### Positive
- Users get P10/P50/P90 intervals, exposing tail risk that point estimates would hide.
- Bootstrap resampling preserves the empirical token distribution without imposing a parametric model.
- JSON output (`monte_carlo` field) gives downstream agents structured access to the full distribution summary.
- Reuse of `buildEpisodeData` avoids duplicating episode-classification logic and keeps semantics consistent with `logs`/`audit`.
- Experimental gating lets us iterate on the statistical model (e.g. switching distributions, adjusting trial count) without a stability commitment.

#### Negative
- Monte Carlo introduces nondeterminism in output — two consecutive runs on the same data produce slightly different P50/P10/P90 values unless a seed is pinned. This complicates regression testing and snapshot comparisons.
- 10 000 trials × N workflows × bootstrap sampling adds CPU cost; the Poisson sampler has two regimes (Knuth exact for λ ≤ 15, Normal approximation otherwise) to stay within ~10 ms/workflow, but this adds complexity vs. a closed-form approach.
- Episode counts for orchestrator-style workflows are a lower-bound estimate because `AwContext` (dispatch/workflow_call) lineage is unavailable without artifact downloads, which the command intentionally skips for speed.
- Remote-repo mode (`--repo`) degrades frontmatter metadata to empty since Markdown source is local-only, creating a subtle behavior split between local and remote forecasts.
- Adds three new files in `pkg/cli/` (forecast_command.go, forecast.go, forecast_montecarlo.go) plus tests, increasing maintenance surface in an already large package.

#### Neutral
- The `--days` flag is capped at 30, which is a deliberate sampling-window choice; longer windows would require pagination changes in `gh run list`.
- The W3C-style specification at `docs/src/content/docs/reference/forecast-specification.md` (sidebar order 1355) commits us to keeping spec and implementation in sync while the command is experimental.
- Trial count (10 000) is currently hardcoded; making it configurable is a future option but not part of this decision.

---

## Part 2 — Normative Specification (RFC 2119)

> The key words **MUST**, **MUST NOT**, **REQUIRED**, **SHALL**, **SHALL NOT**, **SHOULD**, **SHOULD NOT**, **RECOMMENDED**, **MAY**, and **OPTIONAL** in this section are to be interpreted as described in [RFC 2119](https://www.rfc-editor.org/rfc/rfc2119).

### Projection Algorithm

1. The `forecast` command **MUST** project per-workflow effective token usage using Monte Carlo simulation, not a single point estimate.
2. The simulation **MUST** run at least 10 000 independent trials per workflow per forecast invocation.
3. Each trial **MUST** compose three independent random variables: run count drawn from a Poisson process, per-run effective tokens drawn by bootstrap resampling of historical observations, and per-run success drawn as a Bernoulli with the historical success rate.
4. The Poisson sampler **MUST** use Knuth's exact algorithm when λ ≤ 15 and **MUST** use a Normal approximation when λ > 15.
5. The command **MUST** report P10, P50, and P90 effective-token percentiles in both the console table and JSON output.
6. The command **MUST NOT** emit only a point estimate without accompanying P10/P90 bounds.

### Command Interface

1. The command **MUST** be registered in the `analysis` command group as `gh aw forecast`.
2. The command **MUST** be marked experimental: its Cobra short description **MUST** include the literal substring `(experimental)`, and it **MUST** print an experimental warning to stderr at runtime.
3. The `--days` flag **MUST** accept only the values `7` and `30`; values outside this set **MUST** be rejected with a clear error.
4. The `--json` flag **MUST** emit the full `ForecastResult` struct including a `monte_carlo` object with `mean_projected_effective_tokens`, `std_dev_effective_tokens`, and P10/P50/P90 fields.
5. The command **MAY** accept multiple workflow IDs as positional arguments; when omitted, it **MUST** forecast all agentic workflows discoverable in the target repository.
6. When `--repo owner/repo` is supplied, workflow discovery **MUST** use the GitHub API (`fetchGitHubWorkflows`) and **MUST NOT** read local `.lock.yml` files for that invocation.
7. Workflow ID matching against remote repositories **MUST** be case-insensitive against both display names and file-path basenames.

### Episode Analysis

1. Episode grouping **MUST** reuse `buildEpisodeData` and `classifyEpisode` from `logs_episode.go`; it **MUST NOT** reimplement episode classification.
2. Because no artifacts are downloaded, episode linkage **MUST** rely only on GitHub Actions API fields (`event`, `headSha`, `headBranch`) and **MUST** gracefully degrade when `AwContext` is unavailable.
3. The console output **SHOULD** display an episode breakdown table only when `runs/episode > 1` (i.e. orchestrator-style workflows).
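Artifact-free episode linkage per item 2 amounts to bucketing runs by the three API fields named above. The sketch below is a simplification under that assumption; the `Run` type and key format are hypothetical, and the real `buildEpisodeData` logic is richer.

```go
package main

import "fmt"

// Run holds the GitHub Actions API fields available without
// downloading artifacts.
type Run struct {
	ID         int64
	Event      string // e.g. "schedule", "workflow_dispatch"
	HeadSHA    string
	HeadBranch string
}

// groupEpisodes buckets runs sharing (event, headSha, headBranch).
// Without AwContext lineage this is a lower-bound linkage: dispatched
// children on a different branch or SHA land in separate episodes.
func groupEpisodes(runs []Run) map[string][]Run {
	episodes := make(map[string][]Run)
	for _, r := range runs {
		key := r.Event + "|" + r.HeadSHA + "|" + r.HeadBranch
		episodes[key] = append(episodes[key], r)
	}
	return episodes
}

func main() {
	runs := []Run{
		{ID: 1, Event: "schedule", HeadSHA: "abc", HeadBranch: "main"},
		{ID: 2, Event: "workflow_dispatch", HeadSHA: "abc", HeadBranch: "main"},
		{ID: 3, Event: "schedule", HeadSHA: "abc", HeadBranch: "main"},
	}
	eps := groupEpisodes(runs)
	fmt.Printf("%d runs grouped into %d episodes\n", len(runs), len(eps))
}
```

Under this framing, the SHOULD in item 3 reduces to checking whether any bucket holds more than one run.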

### Frontmatter and Variants

1. When forecasting local workflows, the command **MUST** surface active trigger types and concurrency configuration from each workflow's Markdown frontmatter.
2. When forecasting via `--repo`, frontmatter-derived fields **MAY** be empty without causing the forecast to fail.
3. When a workflow defines A/B experiment variants, run counts and fractions **MUST** be reported per variant in both console and JSON output.

### Conformance

An implementation is considered conformant with this ADR if it satisfies all **MUST** and **MUST NOT** requirements above. Failure to meet any **MUST** or **MUST NOT** requirement constitutes non-conformance.

---

*This is a DRAFT ADR generated by the [Design Decision Gate](https://github.com/github/gh-aw/actions/runs/25642964043) workflow. The PR author must review, complete, and finalize this document before the PR can merge.*