# 🧪 Experiment Campaign: daily-fact

- **Workflow file:** `.github/workflows/daily-fact.md`
- **Selected dimension:** `reasoning_depth`
- **Triggered by:** `ab-testing-advisor` on 2026-05-10
## Background

The `daily-fact` workflow posts a daily poetic verse about the gh-aw project to discussion #4750, mining recent PRs, issues, and releases for inspiration. It currently uses a single-pass approach: the agent picks one interesting event and writes a verse immediately. Testing whether a multi-candidate deliberation step (evaluating several options before committing to one) produces more engaging and novel output is a high-leverage experiment, because output quality is the sole success criterion for this workflow.
## Hypothesis

- **Null hypothesis (H0):** Introducing a multi-candidate deliberation step does not improve perceived novelty or engagement of the daily verse compared to the single-pass baseline.
- **Alternative hypothesis (H1):** Multi-candidate reasoning produces measurably more novel and engaging verses (higher discussion reaction rate, fewer near-duplicate topics compared to recent posts).
## Experiment Configuration

Add the following `experiments:` block to the workflow frontmatter:

```yaml
experiments:
  reasoning_depth:
    variants: [single_pass, multi_candidate]
    description: "Tests whether deliberating over multiple candidate facts before writing improves verse novelty and engagement."
    hypothesis: "H0: no change in discussion engagement rate. H1: multi_candidate produces more novel verses with higher reaction counts (expected +20% reactions)."
    metric: discussion_reaction_count
    secondary_metrics: [output_length_chars, run_duration_ms]
    guardrail_metrics:
      - name: empty_output_rate
        direction: min
        threshold: "<0.05"
      - name: run_success_rate
        direction: min
        threshold: ">=0.95"
    min_samples: 30
    weight: [50, 50]
    start_date: "2026-05-11"
    issue: 0  # update to this issue's number after it is created
```
Variant descriptions:

- **`single_pass`**: Current behavior: the agent finds one interesting recent event and immediately writes the verse (no explicit candidate comparison).
- **`multi_candidate`**: The agent identifies 3 distinct candidate facts, briefly scores each on novelty and poetic potential, then writes a verse for the highest-scoring candidate.
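The scoring step that the `multi_candidate` variant asks for can be pictured as a small score-and-select loop. This is illustrative only: in the workflow the agent performs this deliberation in prose, and the candidate facts and scores below are made up.

```python
# Hypothetical sketch of the multi_candidate selection logic.
# In the real workflow this reasoning happens in the agent's prompt, not in code.

def pick_candidate(candidates):
    """Score each candidate on novelty + poetic potential (1-5 each)
    and return the highest-scoring one."""
    best, best_score = None, -1
    for c in candidates:
        score = c["novelty"] + c["poetic_potential"]
        if score > best_score:
            best, best_score = c, score
    return best

candidates = [
    {"fact": "PR adding cache-based experiments", "novelty": 4, "poetic_potential": 3},
    {"fact": "release shipped last week", "novelty": 2, "poetic_potential": 4},
    {"fact": "first-time contributor merged a fix", "novelty": 5, "poetic_potential": 5},
]
print(pick_candidate(candidates)["fact"])
# → "first-time contributor merged a fix" (highest combined score, 10)
```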
## Workflow Changes Required

In the workflow body, the Data Sources and Guidelines sections are replaced by a variant-conditional block.

**Before** (existing prompt excerpt):

```markdown
## Data Sources

Mine recent activity from the repository to find interesting facts. Focus on:

1. **Recent PRs** (merged in the last 1-2 weeks)
...

## Guidelines

- **Check memory first**: Skip any PR, issue, or release that already appears in the palace results from Step 0
- **Favor recent updates** but include variety - pick something interesting, not just the most recent
...
```
**After** (with experiment conditional):

```markdown
## Data Sources

Mine recent activity from the repository to find interesting facts. Focus on:

1. **Recent PRs** (merged in the last 1-2 weeks)
...

## Guidelines

- **Check memory first**: Skip any PR, issue, or release that already appears in the palace results from Step 0
{{#if experiments.reasoning_depth "multi_candidate"}}
- **Multi-candidate deliberation**: Before writing, identify exactly **3 distinct candidate facts** (one PR, one issue or release, one contributor or pattern). For each candidate write one sentence on why it is novel today. Then score each candidate 1–5 on: (a) novelty vs palace memory, (b) intrinsic poetic potential. Select the highest-scoring candidate and write the verse for that one only.
{{else}}
- **Favor recent updates** but include variety - pick something interesting, not just the most recent
{{/if}}
...
```
## Success Metrics

| Metric | Type | Target |
|---|---|---|
| `discussion_reaction_count` (👍/❤️/🎉 on posted comment) | Primary | +20% lift vs baseline |
| `output_length_chars` | Secondary | Within ±15% of baseline median |
| `run_duration_ms` | Secondary | No more than +30s overhead |
| `empty_output_rate` | Guardrail | Must stay < 5% |
| `run_success_rate` | Guardrail | Must stay ≥ 95% |
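A minimal sketch of how the two guardrails could be checked over a batch of run records (the record fields `output_chars` and `succeeded` are assumptions made for illustration, not the actual gh-aw telemetry schema):

```python
# Sketch: evaluating the guardrail metrics over a batch of runs.
# Record fields are hypothetical, not the real gh-aw schema.

def guardrails_ok(runs):
    empty_output_rate = sum(r["output_chars"] == 0 for r in runs) / len(runs)
    run_success_rate = sum(r["succeeded"] for r in runs) / len(runs)
    # Thresholds from the experiment configuration above.
    return empty_output_rate < 0.05 and run_success_rate >= 0.95

runs = [{"output_chars": 420, "succeeded": True}] * 19 \
     + [{"output_chars": 0, "succeeded": False}]
print(guardrails_ok(runs))
# → False: 1/20 runs empty (5%, not strictly below the 5% threshold)
```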
## Statistical Design

- Variants: `single_pass` (control), `multi_candidate` (treatment)
- Assignment: Round-robin via the `gh-aw` experiments runtime (cache-based)
- Minimum runs per variant: 30 (power = 0.80, α = 0.05, expected effect size = 20% lift in reactions)
- Expected experiment duration: ~12 weeks (workflow runs weekdays only, ~5 runs/week split round-robin → ~2.5 runs per variant/week → 30 per variant in ~12 weeks; dedup skips and holiday gaps may extend this slightly)
- Analysis approach: Mann-Whitney U test on discussion reaction counts per variant (non-parametric; counts are likely non-normal and low-volume)
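Once 30+ samples per variant are collected, the planned analysis maps directly onto `scipy.stats.mannwhitneyu`; the reaction counts below are invented for illustration:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-verse reaction counts, 30 runs per variant.
single_pass     = [0, 1, 0, 2, 1, 0, 1, 3, 0, 1] * 3
multi_candidate = [1, 2, 0, 3, 2, 1, 2, 4, 1, 2] * 3

# Two-sided non-parametric test; reject H0 at alpha = 0.05 if p < 0.05.
stat, p = mannwhitneyu(single_pass, multi_candidate, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```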
## Implementation Steps

1. Add the `experiments:` section to the frontmatter of `.github/workflows/daily-fact.md`.
2. Update `experiments.reasoning_depth.issue` after this issue is created.
3. Add the `{{#if experiments.reasoning_depth "multi_candidate"}}...{{else}}...{{/if}}` conditional to the workflow body.
4. Run `gh aw compile daily-fact` to regenerate `daily-fact.lock.yml`.
5. Verify that variant assignments are tracked in `/tmp/gh-aw/experiments/state.json`.

Infrastructure note: Fields `analysis_type`, `tags`, and `notify` are all fully implemented in both `compiler_experiments.go` and `pick_experiment.cjs`. No infrastructure sub-issue is needed.
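The cache-based round-robin assignment can be pictured as a counter persisted between runs. This is a sketch of the idea only: the real logic lives in `pick_experiment.cjs`, and the state-file format here is an assumption.

```python
import json
import os

STATE_FILE = "/tmp/gh-aw/experiments/state.json"  # path from the design above
VARIANTS = ["single_pass", "multi_candidate"]

def pick_variant(state_file=STATE_FILE):
    """Alternate variants across runs by persisting a run counter (hypothetical format)."""
    try:
        with open(state_file) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {"runs": 0}
    variant = VARIANTS[state["runs"] % len(VARIANTS)]
    state["runs"] += 1
    os.makedirs(os.path.dirname(state_file), exist_ok=True)
    with open(state_file, "w") as f:
        json.dump(state, f)
    return variant
```

With a fresh state file, successive calls alternate `single_pass`, `multi_candidate`, `single_pass`, and so on, which is what gives each variant ~2.5 of the ~5 weekday runs per week.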
## References

- `.github/workflows/daily-fact.md`

*Generated by Daily A/B Testing Advisor*