feat: add gh aw forecast command for projecting workflow effective token usage (experimental)#31377

Merged
pelikhan merged 19 commits into main from copilot/add-forecast-cli-command on May 11, 2026
Conversation

Contributor

Copilot AI commented May 10, 2026

New Feature

Adds gh aw forecast — a new experimental analysis command that samples recent GitHub Actions run history and projects effective token usage and yield forward on a per-week or per-month basis using a Gamma–Poisson Monte Carlo simulation.

What does this feature do?

  • Samples run history — fetches completed runs over a configurable window (7 or 30 days, up to N runs per workflow) and derives per-run averages: effective tokens, duration, success rate. Duration is computed from started_at/updated_at timestamps; effective tokens are enriched from locally-cached run_summary.json files written by gh aw logs (no artifact re-download required)
  • Gamma–Poisson Monte Carlo projection — runs 10 000 simulation trials per workflow modelling three independent sources of uncertainty: run count (Gamma–Poisson compound process), per-run token usage (bootstrap resampling of historical observations), and per-run success (Bernoulli draws). Produces P10/P50/P90 effective-token confidence intervals rather than a single point estimate. Results are sorted by Monte Carlo P50. Low-data workflows (fewer than 5 sampled runs) are flagged with IsReliable = false and marked in the console table
  • Backtesting with --eval — shifts the entire training window back by one projection period so the forecast can be compared against actual runs in the most recent period. Computes accuracy metrics (P50ErrorAbs, P50ErrorPct, InCI) and renders a Backtesting evaluation table showing Actual Runs, Actual ET, Forecast P50, Error (abs), Error %, and whether the actual result fell within the 80% CI
  • Reads frontmatter — surfaces active trigger types (schedule, pull_request, etc.) from each workflow's Markdown source
  • Splits by experiment variant — when a workflow defines A/B experiments, run counts and fractions are broken out per variant
  • Multi-workflow comparison — pass multiple workflow IDs as positional args, or omit to forecast all agentic workflows side-by-side; per-workflow failures emit a warning and continue rather than aborting the entire forecast
  • Progress spinner — long API sampling phase shows animated feedback via console.NewSpinner
  • JSON output — --json emits the full ForecastResult struct for agent consumption, including the monte_carlo field with ET mean, stddev, P10/P50/P90 percentile fields, is_reliable, and (in eval mode) the full evaluation object per workflow
  • Remote repository support — --repo owner/repo forecasts workflows in any accessible repository; workflows are discovered via the GitHub API and run history is fetched with gh run list --repo
  • Episode analysis — groups runs into logical task episodes using the existing buildEpisodeData engine; surfaces per-episode token usage and episodes-per-period, and prints an episode breakdown table when orchestrator-style workflows are detected (runs/episode > 1)
  • Marked experimental — the command short description includes (experimental) and a warning is printed to stderr at runtime so users know the interface may change
  • W3C-style specification — a formal specification document is published at docs/src/content/docs/reference/forecast-specification.md (sidebar order 1355, adjacent to the MCP Gateway and Effective Tokens specs), covering command interface, workflow discovery, the Monte Carlo algorithm, episode analysis, JSON schema, error handling, and compliance test cases
gh aw forecast                              # all workflows, monthly
gh aw forecast ci-doctor daily-planner      # compare two workflows
gh aw forecast --period week --days 7       # weekly projections over 7-day window
gh aw forecast --json                       # machine-readable output
gh aw forecast --repo owner/repo            # forecast workflows in another repository
gh aw forecast --repo owner/repo ci-doctor  # forecast a specific workflow in another repository
gh aw forecast --eval                       # backtest: compare last period's forecast against actuals

Implementation details

| File | Role |
| --- | --- |
| pkg/cli/forecast_command.go | Cobra command + flag definitions (experimental label, --eval flag, corrected help text) |
| pkg/cli/forecast.go | Sampling, aggregation, projection, episode analysis, eval backtesting, rendering, experimental warning |
| pkg/cli/forecast_montecarlo.go | Monte Carlo simulation engine (Gamma sampler, Poisson sampler, bootstrap, percentile helpers) |
| pkg/cli/forecast_montecarlo_test.go | 19 unit tests for simulation components |
| pkg/cli/forecast_test.go | Unit tests for format helpers, flag validation, and duration enrichment |
| cmd/gh-aw/main.go | Registers forecast in the analysis command group |
| docs/src/content/docs/reference/forecast-specification.md | W3C-style formal specification (Experimental Draft, sidebar order 1355) |

Projection is driven by a Gamma–Poisson compound Monte Carlo simulation (the Negative Binomial model standard in actuarial science and industrial reliability). For each trial, the arrival rate λ is drawn from its Bayesian posterior Gamma(n + 0.5, scale = λ̂/n) — where n is the observed run count and 0.5 is the Jeffreys non-informative prior shape — then the run count is drawn from Poisson(λ_trial). Per-run effective tokens are sampled via bootstrap resampling of historical observations, and each run independently succeeds with the historical success rate (Bernoulli). This compound model naturally produces wider confidence intervals for small samples and converges to the classical Poisson estimate as n grows. The gammaSample function uses the Marsaglia-Tsang squeeze method. Aggregating 10 000 trials yields P10/P50/P90 effective-token estimates.

Backtesting (--eval) date window:

[anchor - days ... anchor]  training window (same --days of history, shifted back one period)
[anchor ... now]            validation window (the period being evaluated)
anchor = now - periodDays

The ForecastEvaluation struct records training_start_date, training_end_date, validation_end_date, actual_runs, actual_effective_tokens, p50_error_abs, p50_error_pct, and in_ci (whether actual ET fell within the P10–P90 interval). Runs with missing timestamps are excluded from validation-window counting to avoid undefined bias.

The --days flag accepts only 7 or 30. The console table columns are: Workflow, Sampled Runs, Success Rate, Yield/Period (throughput: success_rate × runs_per_period), Avg ET, Proj. ET (P50), 80% CI (P10–P90), and Triggers. ET values are formatted as K/M abbreviations. Workflows with fewer than 5 sampled runs are marked * in the table with a footnote warning. The --json output includes the full monte_carlo summary with mean_projected_effective_tokens, std_dev_effective_tokens, all three ET percentile fields, and is_reliable. The yield JSON field represents the throughput rate (success_rate × observed_runs_per_period), distinct from success_rate.
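The K/M abbreviation rule can be sketched like this (the helper name and exact rounding are assumptions; the real formatter lives in pkg/cli):

```go
package main

import "fmt"

// formatET abbreviates effective-token counts for the console table:
// millions as "M", thousands as "K", small values verbatim.
// Illustrative only; rounding to one decimal place is an assumption.
func formatET(tokens float64) string {
	switch {
	case tokens >= 1e6:
		return fmt.Sprintf("%.1fM", tokens/1e6)
	case tokens >= 1e3:
		return fmt.Sprintf("%.1fK", tokens/1e3)
	default:
		return fmt.Sprintf("%.0f", tokens)
	}
}

func main() {
	for _, v := range []float64{950, 12_500, 9_300_000} {
		fmt.Println(formatET(v))
	}
}
```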

When --repo is set, workflow discovery uses fetchGitHubWorkflows (GitHub API) instead of local .lock.yml files. Provided workflow IDs are matched case-insensitively against remote workflow display names and file-path basenames. Frontmatter metadata degrades gracefully to empty for remote repos since Markdown source files are local-only.
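The case-insensitive matching described above might look like the following sketch. The actual helper is matchRemoteWorkflowName; this signature and the display-name/basename precedence are assumptions:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// matchRemoteWorkflow reports whether a user-supplied workflow ID matches
// either the remote workflow's display name or its file-path basename,
// case-insensitively. displayName and path come from the GitHub workflows API.
func matchRemoteWorkflow(id, displayName, path string) bool {
	want := strings.ToLower(id)
	base := strings.ToLower(filepath.Base(path))
	base = strings.TrimSuffix(base, filepath.Ext(base))
	return want == strings.ToLower(displayName) || want == base
}

func main() {
	fmt.Println(matchRemoteWorkflow("ci-doctor", "CI Doctor", ".github/workflows/ci-doctor.yml"))
}
```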

Episode analysis reuses the existing buildEpisodeData + classifyEpisode engine from logs_episode.go. Because no artifact downloads occur during forecasting, only GitHub Actions API fields (event, headSha, headBranch) are used for linkage — the resulting episode count is therefore a lower-bound estimate for orchestrator-style workflows. The ForecastEpisodeSummary struct exposes sampled_episodes, runs_per_episode, avg_effective_tokens_per_episode, and observed_episodes_per_period in JSON output and as a console table.
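One plausible reading of that API-field-only linkage is grouping runs that share event + head SHA + branch into a single episode. This is a heavily simplified sketch, not the real buildEpisodeData engine, and the grouping key is an assumption:

```go
package main

import "fmt"

// apiRun holds the only linkage fields available without artifact
// downloads, per the description above.
type apiRun struct {
	Event      string
	HeadSHA    string
	HeadBranch string
}

// groupEpisodes folds runs sharing event + head SHA + branch into one
// episode and reports the episode count and runs-per-episode ratio.
// Because richer lineage (aw_info.json) is unavailable, this count is a
// lower bound for orchestrator-style workflows.
func groupEpisodes(runs []apiRun) (episodes int, runsPerEpisode float64) {
	seen := map[string]int{}
	for _, r := range runs {
		seen[r.Event+"|"+r.HeadSHA+"|"+r.HeadBranch]++
	}
	if len(seen) == 0 {
		return 0, 0
	}
	return len(seen), float64(len(runs)) / float64(len(seen))
}

func main() {
	runs := []apiRun{
		{"push", "abc123", "main"},
		{"push", "abc123", "main"}, // same episode as above
		{"schedule", "def456", "main"},
	}
	n, rpe := groupEpisodes(runs)
	fmt.Printf("episodes=%d runs/episode=%.1f\n", n, rpe)
}
```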

Copilot AI and others added 2 commits May 10, 2026 21:01
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title "Add forecast command for projecting workflow token usage and costs" to "feat: add gh aw forecast command for projecting workflow token usage and costs" May 10, 2026
Copilot AI requested a review from pelikhan May 10, 2026 21:04
@github-actions
Contributor

Hey @Copilot 👋 — the gh aw forecast command is a genuinely interesting addition, and the PR description is thorough and well-structured. That said, there are a couple of things that need to be addressed before this can move forward:

  • Process violation — per the CONTRIBUTING.md, non-core team members should not open PRs directly. The correct path is to open a detailed issue with an agentic plan, then a core team member picks it up and implements it using agents. The Copilot bot has CONTRIBUTOR (not MEMBER/OWNER) association, which places it outside the inner-circle process.
  • No tests — the three new files (pkg/cli/forecast.go, pkg/cli/forecast_command.go, cmd/gh-aw/main.go) add 665 lines with zero test coverage. Sampling logic, projection arithmetic, and rendering helpers all warrant unit tests.

If you'd like a hand drafting the issue plan or adding tests, assign this prompt to your coding agent:

Close this PR. Instead, open a GitHub issue on github/gh-aw titled "Plan: add `gh aw forecast` command for projecting workflow token usage and costs".

In the issue body, write a detailed agentic plan that describes:
1. The motivation and user story for the forecast command.
2. The proposed file structure (forecast_command.go, forecast.go, main.go registration).
3. The full implementation plan, including sampling logic, projection formula, frontmatter reading, and JSON output.
4. A testing plan covering unit tests for: sampling aggregation, projection calculations, per-variant splitting, and JSON rendering.

Do NOT include code — only the plan. The core team will implement it using agents.

Generated by Contribution Check · ● 9.3M

@pelikhan
Collaborator

@mnkiefer feel free to prompt further

@pelikhan
Collaborator

@copilot add --repo support

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 10, 2026

@copilot add --repo support

Done in feat: wire --repo flag through workflow discovery in forecast command.

The --repo flag was already wired into the GitHub API calls for fetching run history. The missing piece was workflow discovery: resolveForecastWorkflows was still reading local .lock.yml files even when --repo is set.

Changes made in forecast.go:

  • When --repo is provided, resolveForecastWorkflows now calls fetchGitHubWorkflows to list workflows from the remote repo via the API, rather than reading local .lock.yml files.
  • When no workflow IDs are specified, all workflows from the remote repo are forecasted.
  • When workflow IDs are specified, they are matched case-insensitively against remote workflow display names and file-path basenames (matchRemoteWorkflowName).

Frontmatter metadata (triggers, concurrency) degrades gracefully to empty for remote repos since the Markdown source files are local-only.

@pelikhan
Collaborator

@copilot integrate episode analysis

@pelikhan
Collaborator

@copilot use a best-in-class predictive technique based on operational research and statistics: Monte Carlo

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 10, 2026

@copilot integrate episode analysis

Done in feat: integrate episode analysis into forecast command.

The existing buildEpisodeData + classifyEpisode engine from logs_episode.go now runs on the completed runs sampled during forecasting. Since no artifact downloads happen, episode linkage uses the fields available in the GitHub Actions run list (event, headSha, headBranch). Dispatch and workflow_call lineage that require aw_info.json are not available, so the episode count is a lower-bound estimate for orchestrator workflows — this is noted in the code.

New additions:

  • ForecastEpisodeSummary — sampled_episodes, runs_per_episode, avg_effective_tokens_per_episode, observed_episodes_per_period, projected_cost_per_episode
  • EpisodeAnalysis *ForecastEpisodeSummary field on ForecastWorkflowResult (JSON episode_analysis, omitted when nil)
  • workflowRunToRunData — converts WorkflowRun → RunData using API-available fields
  • buildForecastEpisodeSummary — calls the episode engine and aggregates per-episode metrics
  • Console: when any workflow shows runs_per_episode > 1 (multi-run episodes detected), an "Episode analysis" table is printed after the main forecast table

Copilot AI and others added 2 commits May 10, 2026 22:05
…trap + Bernoulli CI)

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 10, 2026

@copilot use a best-in-class predictive technique based on operational research and statistics: Monte Carlo

Done in feat: add Monte Carlo simulation to forecast command (Poisson + bootstrap + Bernoulli CI).

The forecast now runs a 10 000-trial Monte Carlo simulation per workflow modelling three independent sources of uncertainty:

| Source | Model |
| --- | --- |
| Run count per period | Poisson(λ = observed runs/period) |
| Per-run token usage | Bootstrap resampling of historical ET observations |
| Per-run success | Bernoulli(historical success rate) |

New output fields (monte_carlo in --json):

  • mean_projected_cost_usd, std_dev_cost_usd
  • p10_projected_cost_usd, p50_projected_cost_usd, p90_projected_cost_usd
  • p10/p50/p90_projected_effective_tokens

Console table now shows P50 as the primary estimate plus a new "80% CI (P10–P90)" column, replacing the single point estimate.

The Poisson sampler uses Knuth's exact algorithm for λ ≤ 15 and a Normal approximation for larger λ to keep 10k-trial simulation time well under 10ms. 14 unit tests cover the sampler, percentile helpers, ordering invariants, and order-of-magnitude accuracy.

@pelikhan
Collaborator

@copilot focus on effective token forecast, remove cost forecasts

… projections

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
@pelikhan
Collaborator

@copilot review all comments

@pelikhan
Collaborator

@copilot as an expert in industrial and market forecasting, review the algorithm used here

Copilot AI and others added 2 commits May 11, 2026 00:14
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 11, 2026

@copilot review all comments

All reviewer comments addressed across two commits:

Code fixes (forecast.go):

  • ET observations now loaded from cached run_summary.json files (written by gh aw logs) — fixes zero-ET forecasts without downloading artifacts
  • Duration computed from UpdatedAt − StartedAt (gh run list never sets the Duration field)
  • Sort order uses Monte Carlo P50 when available, falling back to point estimate
  • formatForecastPercent now takes a hasData bool — 0% is shown when all runs failed, N/A only when no runs were sampled
  • Per-workflow failures emit a warning and continue; remaining workflows are still forecasted

Help text (forecast_command.go):

  • Removed "cost" and inaccurate trigger-frequency/concurrency claims; description now accurately describes what the command does

Spec (forecast-specification.md):

  • --days permitted values updated to {7, 30} (removed 90)
  • Yield redefined as throughput rate (success_rate × observed_runs_per_period)
  • ET retrieval section updated to describe cached run summaries instead of artifact downloads
  • Duration formula corrected to updated_at − started_at
  • Monte Carlo Bernoulli section now references success_rate not yield
  • Console table columns updated to match actual implementation (Triggers column added, K/M formatting documented)

Tests (forecast_test.go):

  • Added //go:build !integration tag, testify assertions, and 12 tests covering format helpers, validation, and Duration enrichment

Copilot AI and others added 2 commits May 11, 2026 00:27
…able flag

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
…e, and footer text

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 11, 2026

@copilot as an expert in industrial and market forecasting, review the algorithm used here

Done in feat: upgrade Monte Carlo to Gamma–Poisson compound model with IsReliable flag.

Algorithm review findings and changes:

The original model treated the Poisson arrival rate λ as a known constant equal to the observed rate. This is the primary weakness in industrial/operational forecasting: with a short history window (e.g., 7 days, 5 runs), the rate estimate itself carries significant uncertainty, causing the CI to be artificially narrow.

Key improvement — Gamma–Poisson compound model (Negative Binomial):
Instead of numRuns ~ Poisson(λ_observed) per trial, the arrival rate is now drawn from its Bayesian posterior each trial:

λ_trial ~ Gamma(n + 0.5, scale = observedRunsPerPeriod / n)
numRuns ~ Poisson(λ_trial)

where n is the observed run count and 0.5 is the Jeffreys non-informative prior shape. This Gamma–Poisson compound is equivalent to the Negative Binomial — the standard model for overdispersed count data in actuarial science and industrial reliability. It naturally widens the CI for sparse samples and converges to the classical Poisson for large n, with no function signature changes required.

A new TestRunMonteCarloGammaPoissonWiderCI test directly verifies that 3-run histories produce wider CIs than 100-run histories at the same observed rate.

Additional improvements:

  • IsReliable flag on ForecastMonteCarloSummary (set when n ≥ 5); low-data workflows are marked * in the console table with an explanatory footnote
  • gammaSample (Marsaglia-Tsang squeeze method) as the new sampling primitive
  • Footer updated to explain what the model does in plain language

@pelikhan
Collaborator

@copilot merge main and recompile

…cli-command

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 11, 2026

@copilot merge main and recompile

Done. Merged origin/main (b0dbcb7 — rate-limit rename + firewall bump) into the branch. go build ./... and all forecast tests pass cleanly after the merge.

@pelikhan
Collaborator

@copilot add a flag --eval that does forecasting on past events (start forecast from a past date to be able to compare to measured data) and evaluates the quality of the forecast.

Copilot AI and others added 2 commits May 11, 2026 01:28
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>