Summary
Analysis Period: Last 30 days (2026-05-17 → 2026-06-05)
Total Tasks Analyzed: 999 copilot-authored PRs
Clusters Identified: 8 (KMeans, silhouette 0.0419)
Overall Success Rate: 75.4% merged
Avg Iterations: 4.38 commits/PR
Eight stable themes emerged. The two largest — shared-helper refactors and safe-output/schema work — account for 59% of all tasks. Success skews high overall, but two clusters drag below the mean: Codex/AWF generated config & defaults (66%) and the Copilot SDK-driver work (68%).
Key Findings
Refactor + safe-output work dominates. Clusters C6 and C2 together are 586 of 999 PRs and merge near the mean (~75–76%) — the reliable bread-and-butter of the agent fleet.
Generated/Codex config (C7) is the weak spot. At 66% merge with the largest blast radius (avg 78 files changed, ~1054/780 add/del), large auto-generated diffs are the least likely to land.
Sous-chef tasks (C4) succeed but grind. Highest merge rate (85%) yet by far the most iterations (9.09 commits/PR) — they get there, but slowly.
Small, well-scoped fixes win. "Fix failing Actions job" (C1) merges at 81% in only 3.11 commits — the tightest scope, the cleanest outcome.
Trend: success dipped this period. Overall merge rate 75.4% vs 80.3% (2026-06-02) and 78.8% earlier — worth watching whether the SDK/Codex clusters are pulling the average down.
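The headline figures above can be re-derived with a few lines of arithmetic. The sketch below uses only numbers quoted in the summary and key findings; the function and variable names are illustrative, and clusters whose merge rate is only given approximately (C6, C2) are left out.

```python
# Figures quoted in the summary and key findings.
TOTAL_PRS = 999
REFACTOR_AND_SAFE_OUTPUT_PRS = 586   # C6 + C2 combined
OVERALL_MERGE_RATE = 75.4            # percent, this period
PREVIOUS_MERGE_RATE = 80.3           # percent, run dated 2026-06-02

# Only clusters whose merge rate is stated exactly in the findings.
CLUSTER_MERGE_RATE = {"C7": 66.0, "C0": 68.0, "C4": 85.0, "C1": 81.0}


def share_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


def below_overall(rates: dict[str, float], overall: float) -> list[str]:
    """Return cluster ids under the overall rate, lowest rate first."""
    return sorted((c for c, r in rates.items() if r < overall), key=rates.get)


dominant_share = round(share_percent(REFACTOR_AND_SAFE_OUTPUT_PRS, TOTAL_PRS))
trend_delta = round(OVERALL_MERGE_RATE - PREVIOUS_MERGE_RATE, 1)
lagging = below_overall(CLUSTER_MERGE_RATE, OVERALL_MERGE_RATE)

print(f"C6+C2 share of tasks: {dominant_share}%")          # 59%
print(f"Change vs previous run: {trend_delta} points")     # -4.9 points
print(f"Clusters below the overall rate: {lagging}")       # ['C7', 'C0']
```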
Recommendations
Tighten Codex/AWF-generated config tasks (C7). The lowest merge rate pairs with the biggest diffs. Split large generated changes into reviewable chunks, or add a pre-merge diff-size guard so reviewers aren't handed 78-file PRs.
Investigate Copilot SDK-driver failures (C0). 68% merge over 5.9 commits suggests the harness/SDK mode is still flaky — a good candidate for a focused reliability pass.
Cap iteration churn on sous-chef tasks (C4). They land but average 9 commits; clearer up-front task specs or a turn budget could cut the back-and-forth.
Keep leaning on tightly-scoped fix prompts (C1). Cheapest and among the most reliable — the pattern to replicate when phrasing new tasks.
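One way the pre-merge diff-size guard suggested for C7 could look is sketched below. The thresholds (`max_files=40`, `max_changed_lines=1000`) and all names are assumptions chosen for illustration, not values from the report; how the diff stats are obtained (for example from a CI step) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffStats:
    files_changed: int
    additions: int
    deletions: int


def diff_size_violations(
    stats: DiffStats,
    max_files: int = 40,            # hypothetical limit
    max_changed_lines: int = 1000,  # hypothetical limit
) -> list[str]:
    """Return human-readable reasons a diff is too large; empty if it passes."""
    violations = []
    if stats.files_changed > max_files:
        violations.append(
            f"{stats.files_changed} files changed (limit {max_files})"
        )
    changed = stats.additions + stats.deletions
    if changed > max_changed_lines:
        violations.append(f"{changed} lines changed (limit {max_changed_lines})")
    return violations


# The C7 averages quoted in the findings: 78 files, ~1054 added / 780 deleted.
c7_average = DiffStats(files_changed=78, additions=1054, deletions=780)
small_fix = DiffStats(files_changed=3, additions=40, deletions=12)

for name, stats in [("C7 average", c7_average), ("small fix", small_fix)]:
    problems = diff_size_violations(stats)
    print(name, "->", "split before review: " + "; ".join(problems) if problems else "ok")
```

Returning the list of reasons rather than a boolean lets the same check feed either a hard CI failure or a softer "please split this PR" comment.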
Methodology & limitations
TF-IDF (1–2 grams, domain stop-words removed) over title+body of 999 cleaned PR descriptions; KMeans with k chosen by silhouette across k=4–8.
Firewall/warning boilerplate, code blocks, and URLs stripped before vectorizing.
Iterations use commit count as a proxy: the gh-aw workflow-run logs (true turn counts and cost) were not fetched this run, so turn and cost metrics are approximated by commits.
Silhouette is low (0.04), which is expected for short, overlapping technical text; clusters are directional, not hard partitions.
History persisted to cache for trend tracking across runs.
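The pipeline described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under assumptions, not the report's actual code: the cleaning regexes, the stop-word list, the `n_init`/`random_state` settings, and the toy corpus are all made up for the example, and the firewall/warning boilerplate filter is omitted.

```python
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Assumed cleaning rules: drop fenced code blocks and URLs.
CODE_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)
URL = re.compile(r"https?://\S+")


def clean(text: str) -> str:
    """Strip code blocks and URLs, then collapse whitespace."""
    text = CODE_BLOCK.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(text.split())


def cluster(descriptions, k_values=range(4, 9), stop_words=None):
    """TF-IDF (1-2 grams) + KMeans; keep the k with the best silhouette."""
    docs = [clean(d) for d in descriptions]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words=stop_words)
    matrix = vectorizer.fit_transform(docs)
    best = None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(matrix)
        score = silhouette_score(matrix, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Toy corpus standing in for the 999 title+body texts.
corpus = [
    "Refactor shared helper for path errors",
    "Extract shared error helper into package",
    "Validate safe output schema fields",
    "Add schema validation for safe outputs",
    "Fix failing Actions job in lint workflow",
    "Fix failing Actions job for unit tests",
    "Add model alias and multiplier entry",
    "Update model multiplier table for alias",
    "Update bundled SDK driver harness mode",
    "Recompile lockfiles after SDK driver bump",
    "Regenerate default config for generated workflows",
    "Tune prompt skill for experiment workflow",
]

best_k, best_score, labels = cluster(corpus, stop_words=["pr", "copilot"])
print(f"k={best_k}, silhouette={best_score:.3f}")
```

On short, overlapping text like this the silhouette score is typically low, which matches the caveat above that clusters are directional rather than hard partitions.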
Success Rate by Cluster
Detailed cluster breakdown
C6: Shared helpers & error/path refactors
C2: Safe-outputs & schema validation
… recreate-ref: true to prevent branch-exists push failure #32769, Prevent chaos create-pull-request fallback when branch already exists #32770, deps: bump github.com/charmbracelet/x/exp/golden to v0.0.0-20260525135217-abeec2b8bf0b #35188, [WIP] Add deprecated: true flag to rate-limit and inline-sub-agents #34776
C3: Prompts, skills & experiments
C7: Codex/AWF generated config & defaults
C4: Sous-chef multi-agent (GPT-mini) tasks
… add_comment targets #35371, Fix AWF tool-cache mounting so Daily News Copilot can start in chroot #36900, Add opusplan builtin alias to Claude model routing #34263
C5: Model alias/multiplier plumbing
… raptor-mini alias coverage and missing GPT-5 search multipliers #33177, Add /sitemap.xml alias and strengthen README GEO/brand signals #32862
C0: Copilot SDK driver/harness mode
… otlp-env-vars skill for OpenTelemetry SDK env var configuration #32827, Update bundled Copilot SDK to 1.0.0 and recompile lockfiles #36495, chore(deps): update github.com/modelcontextprotocol/go-sdk v1.6.0 → v1.6.1 #35201
C1: Fixing failing GitHub Actions jobs
Representative data table (2 highest-iteration PRs per cluster)
… github-app.missing-key ignore mode and guard App token…
… || expressions in prompt markdown never sub…
… create-check-run safe output type for multi-agent PR a…
… gh aw init to create the Agentic Workflows custom a…
… on.pull_request_reviewer: slash_command synthetic trig…
References: §27010769683