[ab-advisor] Experiment campaign for daily-fact: A/B test reasoning_depth #31324

🧪 Experiment Campaign: daily-fact

Workflow file: .github/workflows/daily-fact.md
Selected dimension: reasoning_depth
Triggered by: ab-testing-advisor on 2026-05-10


Background

The daily-fact workflow posts a daily poetic verse about the gh-aw project to discussion #4750, mining recent PRs, issues, and releases for inspiration. It currently uses a single-pass approach: the agent picks one interesting recent event and writes a verse immediately. Testing whether a multi-candidate deliberation step (evaluating several options before committing to one) produces more engaging and novel output is a high-leverage experiment, because output quality is the sole success criterion for this workflow.

Hypothesis

Null hypothesis (H0): Introducing a multi-candidate deliberation step does not improve perceived novelty or engagement of the daily verse compared to the single-pass baseline.

Alternative hypothesis (H1): Multi-candidate reasoning produces measurably more novel and engaging verses (higher discussion reaction rate, fewer near-duplicate topics compared to recent posts).

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  reasoning_depth:
    variants: [single_pass, multi_candidate]
    description: "Tests whether deliberating over multiple candidate facts before writing improves verse novelty and engagement."
    hypothesis: "H0: no change in discussion engagement rate. H1: multi_candidate produces more novel verses with higher reaction counts (expected +20% reactions)."
    metric: discussion_reaction_count
    secondary_metrics: [output_length_chars, run_duration_ms]
    guardrail_metrics:
      - name: empty_output_rate
        direction: min
        threshold: "<0.05"
      - name: run_success_rate
        direction: min
        threshold: ">=0.95"
    min_samples: 30
    weight: [50, 50]
    start_date: "2026-05-11"
    issue: 0

Variant descriptions:

  • single_pass: Current behavior — agent finds one interesting recent event and immediately writes the verse (no explicit candidate comparison).
  • multi_candidate: Agent identifies 3 distinct candidate facts, briefly scores each on novelty and poetic potential, then writes a verse for the highest-scoring candidate.

Workflow Changes Required

In the workflow body, only the Guidelines section changes: its second bullet is wrapped in a variant-conditional block, while the Data Sources section stays as-is:

Before (existing prompt excerpt):

## Data Sources

Mine recent activity from the repository to find interesting facts. Focus on:

1. **Recent PRs** (merged in the last 1-2 weeks)
   ...

## Guidelines

- **Check memory first**: Skip any PR, issue, or release that already appears in the palace results from Step 0
- **Favor recent updates** but include variety - pick something interesting, not just the most recent
...

After (with experiment conditional):

## Data Sources

Mine recent activity from the repository to find interesting facts. Focus on:

1. **Recent PRs** (merged in the last 1-2 weeks)
   ...

## Guidelines

- **Check memory first**: Skip any PR, issue, or release that already appears in the palace results from Step 0
{{#if experiments.reasoning_depth "multi_candidate"}}
- **Multi-candidate deliberation**: Before writing, identify exactly **3 distinct candidate facts** (one PR, one issue or release, one contributor or pattern). For each candidate write one sentence on why it is novel today. Then score each candidate 1–5 on: (a) novelty vs palace memory, (b) intrinsic poetic potential. Select the highest-scoring candidate and write the verse for that one only.
{{else}}
- **Favor recent updates** but include variety - pick something interesting, not just the most recent
{{/if}}
...

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| discussion_reaction_count (👍/❤️/🎉 on the posted comment) | Primary | +20% lift vs baseline |
| output_length_chars | Secondary | Within ±15% of baseline median |
| run_duration_ms | Secondary | No more than +30s overhead |
| empty_output_rate | Guardrail | Must stay < 5% |
| run_success_rate | Guardrail | Must stay ≥ 95% |
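
The secondary and guardrail targets above are plain thresholds rather than test statistics, so they can be checked with simple arithmetic once per-variant aggregates exist. The sketch below is illustrative only: the input dictionaries, field names, and sample values are placeholders, not an existing gh-aw report format.

from statistics import median

# Placeholder per-variant aggregates; the schema and numbers are invented
# for illustration, not read from any gh-aw artifact.
baseline = {"reactions": [2, 0, 1, 3, 1], "length_chars": [640, 710, 580],
            "duration_ms": [42000, 39000, 45000], "empty": 0, "failed": 0, "runs": 30}
treatment = {"reactions": [3, 1, 4, 2, 5], "length_chars": [690, 760, 600],
             "duration_ms": [61000, 58000, 66000], "empty": 1, "failed": 1, "runs": 30}

def mean(xs):
    return sum(xs) / len(xs)

checks = {
    # Primary: +20% lift in mean reaction count over baseline.
    "reaction_lift_>=20%": mean(treatment["reactions"]) >= 1.20 * mean(baseline["reactions"]),
    # Secondary: median output length within +/-15% of the baseline median.
    "length_within_15%": abs(median(treatment["length_chars"]) - median(baseline["length_chars"]))
                         <= 0.15 * median(baseline["length_chars"]),
    # Secondary: no more than +30s average runtime overhead.
    "duration_overhead_<=30s": mean(treatment["duration_ms"]) - mean(baseline["duration_ms"]) <= 30_000,
    # Guardrails on the treatment cohort: empty-output rate < 5%, success rate >= 95%.
    "empty_output_rate_<5%": treatment["empty"] / treatment["runs"] < 0.05,
    "run_success_rate_>=95%": 1 - treatment["failed"] / treatment["runs"] >= 0.95,
}

for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")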

Statistical Design

  • Variants: single_pass (control), multi_candidate (treatment)
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 30 (power = 0.80, α = 0.05, expected effect size = 20% lift in reactions)
  • Expected experiment duration: ~12 weeks (workflow runs weekdays only, ~5 runs/week → ~2.5 runs per variant/week → 30 per variant in about 12 weeks; dedup skips and holiday gaps may stretch this slightly)
  • Analysis approach: Mann-Whitney U test on discussion reaction counts per variant (non-parametric; counts are likely non-normal and low-volume); a sketch of this comparison follows below
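
As a concrete illustration of the planned comparison, the sketch below runs SciPy's Mann-Whitney U test on two placeholder cohorts. The reaction counts shown are invented for the example; in practice each list holds the discussion_reaction_count of every run in that variant (30 or more per cohort).

from scipy.stats import mannwhitneyu

# Placeholder cohorts; replace with the real per-run reaction counts once
# the experiment has accrued at least 30 runs per variant.
single_pass_reactions = [1, 0, 2, 1, 3, 0, 1, 2]       # control
multi_candidate_reactions = [2, 3, 1, 4, 2, 5, 3, 2]   # treatment

# One-sided alternative: H1 claims the treatment distribution is shifted upward.
stat, p_value = mannwhitneyu(multi_candidate_reactions, single_pass_reactions,
                             alternative="greater")
print(f"U = {stat}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0 at alpha = 0.05")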

Implementation Steps

  • Add experiments: section to frontmatter of .github/workflows/daily-fact.md
  • Update issue number in experiments.reasoning_depth.issue after this issue is created
  • Add conditional blocks to the workflow prompt body using {{#if experiments.reasoning_depth "multi_candidate"}}...{{else}}...{{/if}}
  • Run gh aw compile daily-fact to regenerate daily-fact.lock.yml
  • Monitor experiment artifact uploaded per run to /tmp/gh-aw/experiments/state.json
  • After 30 runs per variant, compare reaction counts on the Daily facts discussion (#4750) comments between the two cohorts using Mann-Whitney U (see the collection sketch after this list)
  • Document findings and promote winning variant
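
Collecting the per-cohort reaction counts is the one step not covered by the workflow itself. Below is a minimal sketch, assuming the comments are read from discussion #4750 through the GitHub GraphQL API (via gh api graphql) and that each run's assigned variant can be recovered offline from its uploaded state.json artifact. The OWNER/REPO values and the date-to-variant mapping are placeholders, not established gh-aw conventions.

import json
import subprocess

# GraphQL query for the daily-fact comments on discussion #4750.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    discussion(number: 4750) {
      comments(last: 100) {
        nodes { createdAt reactions { totalCount } }
      }
    }
  }
}
"""

result = subprocess.run(
    ["gh", "api", "graphql",
     "-f", f"query={QUERY}",
     "-F", "owner=OWNER", "-F", "name=REPO"],   # placeholder repository coordinates
    capture_output=True, text=True, check=True)
nodes = json.loads(result.stdout)["data"]["repository"]["discussion"]["comments"]["nodes"]

# Hypothetical date -> variant mapping, reconstructed from the per-run
# /tmp/gh-aw/experiments/state.json artifacts downloaded from each workflow run.
variant_by_date = {"2026-05-11": "single_pass", "2026-05-12": "multi_candidate"}

cohorts = {"single_pass": [], "multi_candidate": []}
for node in nodes:
    variant = variant_by_date.get(node["createdAt"][:10])
    if variant:
        cohorts[variant].append(node["reactions"]["totalCount"])

print(cohorts)  # feed these two lists into the Mann-Whitney U comparison above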

Infrastructure note: Fields analysis_type, tags, and notify are all fully implemented in both compiler_experiments.go and pick_experiment.cjs. No infrastructure sub-issue is needed.
