[ab-advisor] Experiment campaign for daily-fact: A/B test reasoning_depth #31324

🧪 Experiment Campaign: daily-fact

Workflow file: .github/workflows/daily-fact.md
Selected dimension: reasoning_depth
Triggered by: ab-testing-advisor on 2026-05-10


Background

The daily-fact workflow posts a daily poetic verse about the gh-aw project to discussion #4750, mining recent PRs, issues, and releases for inspiration. It currently uses a single-pass approach: the agent picks one interesting recent event and writes a verse immediately. Testing whether a multi-candidate deliberation step (evaluating several options before committing to one) produces more engaging and novel output is a high-leverage experiment, because output quality is the sole success criterion for this workflow.

Hypothesis

Null hypothesis (H0): Introducing a multi-candidate deliberation step does not improve perceived novelty or engagement of the daily verse compared to the single-pass baseline.

Alternative hypothesis (H1): Multi-candidate reasoning produces measurably more novel and engaging verses (higher discussion reaction rate, fewer near-duplicate topics compared to recent posts).

Experiment Configuration

Add the following experiments: block to the workflow frontmatter:

experiments:
  reasoning_depth:
    variants: [single_pass, multi_candidate]
    description: "Tests whether deliberating over multiple candidate facts before writing improves verse novelty and engagement."
    hypothesis: "H0: no change in discussion engagement rate. H1: multi_candidate produces more novel verses with higher reaction counts (expected +20% reactions)."
    metric: discussion_reaction_count
    secondary_metrics: [output_length_chars, run_duration_ms]
    guardrail_metrics:
      - name: empty_output_rate
        direction: min
        threshold: "<0.05"
      - name: run_success_rate
        direction: min
        threshold: ">=0.95"
    min_samples: 30
    weight: [50, 50]
    start_date: "2026-05-11"
    issue: 0

Variant descriptions:

  • single_pass: Current behavior — agent finds one interesting recent event and immediately writes the verse (no explicit candidate comparison).
  • multi_candidate: Agent identifies 3 distinct candidate facts, briefly scores each on novelty and poetic potential, then writes a verse for the highest-scoring candidate.

Workflow Changes Required

In the workflow body, only the Guidelines section changes: its second bullet is wrapped in a variant-conditional block, while the Data Sources section stays as-is:

Before (existing prompt excerpt):

## Data Sources

Mine recent activity from the repository to find interesting facts. Focus on:

1. **Recent PRs** (merged in the last 1-2 weeks)
   ...

## Guidelines

- **Check memory first**: Skip any PR, issue, or release that already appears in the palace results from Step 0
- **Favor recent updates** but include variety - pick something interesting, not just the most recent
...

After (with experiment conditional):

## Data Sources

Mine recent activity from the repository to find interesting facts. Focus on:

1. **Recent PRs** (merged in the last 1-2 weeks)
   ...

## Guidelines

- **Check memory first**: Skip any PR, issue, or release that already appears in the palace results from Step 0
{{#if experiments.reasoning_depth "multi_candidate"}}
- **Multi-candidate deliberation**: Before writing, identify exactly **3 distinct candidate facts** (one PR, one issue or release, one contributor or pattern). For each candidate write one sentence on why it is novel today. Then score each candidate 1–5 on: (a) novelty vs palace memory, (b) intrinsic poetic potential. Select the highest-scoring candidate and write the verse for that one only.
{{else}}
- **Favor recent updates** but include variety - pick something interesting, not just the most recent
{{/if}}
...

Success Metrics

| Metric | Type | Target |
| --- | --- | --- |
| discussion_reaction_count (👍/❤️/🎉 on the posted comment) | Primary | +20% lift vs baseline |
| output_length_chars | Secondary | Within ±15% of baseline median |
| run_duration_ms | Secondary | No more than +30s overhead |
| empty_output_rate | Guardrail | Must stay < 5% |
| run_success_rate | Guardrail | Must stay ≥ 95% |
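
The secondary and guardrail targets above are plain thresholds rather than test statistics, so they can be checked with simple arithmetic once per-variant aggregates exist. The sketch below is illustrative only: the input dictionaries, field names, and sample values are placeholders, not an existing gh-aw report format.

from statistics import median

# Placeholder per-variant aggregates; the schema and numbers are invented
# for illustration, not read from any gh-aw artifact.
baseline = {"reactions": [2, 0, 1, 3, 1], "length_chars": [640, 710, 580],
            "duration_ms": [42000, 39000, 45000], "empty": 0, "failed": 0, "runs": 30}
treatment = {"reactions": [3, 1, 4, 2, 5], "length_chars": [690, 760, 600],
             "duration_ms": [61000, 58000, 66000], "empty": 1, "failed": 1, "runs": 30}

def mean(xs):
    return sum(xs) / len(xs)

checks = {
    # Primary: +20% lift in mean reaction count over baseline.
    "reaction_lift_>=20%": mean(treatment["reactions"]) >= 1.20 * mean(baseline["reactions"]),
    # Secondary: median output length within +/-15% of the baseline median.
    "length_within_15%": abs(median(treatment["length_chars"]) - median(baseline["length_chars"]))
                         <= 0.15 * median(baseline["length_chars"]),
    # Secondary: no more than +30s average runtime overhead.
    "duration_overhead_<=30s": mean(treatment["duration_ms"]) - mean(baseline["duration_ms"]) <= 30_000,
    # Guardrails on the treatment cohort: empty-output rate < 5%, success rate >= 95%.
    "empty_output_rate_<5%": treatment["empty"] / treatment["runs"] < 0.05,
    "run_success_rate_>=95%": 1 - treatment["failed"] / treatment["runs"] >= 0.95,
}

for name, ok in checks.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")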

Statistical Design

  • Variants: single_pass (control), multi_candidate (treatment)
  • Assignment: Round-robin via gh-aw experiments runtime (cache-based)
  • Minimum runs per variant: 30 (power = 0.80, α = 0.05, expected effect size = 20% lift in reactions)
  • Expected experiment duration: ~12 weeks (workflow runs weekdays only, ~5 runs/week → ~2.5 runs per variant/week → 30 per variant in about 12 weeks; dedup skips and holiday gaps may stretch this slightly)
  • Analysis approach: Mann-Whitney U test on discussion reaction counts per variant (non-parametric; counts are likely non-normal and low-volume); a sketch of this comparison follows below
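
As a concrete illustration of the planned comparison, the sketch below runs SciPy's Mann-Whitney U test on two placeholder cohorts. The reaction counts shown are invented for the example; in practice each list holds the discussion_reaction_count of every run in that variant (30 or more per cohort).

from scipy.stats import mannwhitneyu

# Placeholder cohorts; replace with the real per-run reaction counts once
# the experiment has accrued at least 30 runs per variant.
single_pass_reactions = [1, 0, 2, 1, 3, 0, 1, 2]       # control
multi_candidate_reactions = [2, 3, 1, 4, 2, 5, 3, 2]   # treatment

# One-sided alternative: H1 claims the treatment distribution is shifted upward.
stat, p_value = mannwhitneyu(multi_candidate_reactions, single_pass_reactions,
                             alternative="greater")
print(f"U = {stat}, p = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0 at alpha = 0.05")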

Implementation Steps

  • Add experiments: section to frontmatter of .github/workflows/daily-fact.md
  • Update issue number in experiments.reasoning_depth.issue after this issue is created
  • Add conditional blocks to the workflow prompt body using {{#if experiments.reasoning_depth "multi_candidate"}}...{{else}}...{{/if}}
  • Run gh aw compile daily-fact to regenerate daily-fact.lock.yml
  • Monitor experiment artifact uploaded per run to /tmp/gh-aw/experiments/state.json
  • After 30 runs per variant, compare reaction counts on the Daily facts discussion (#4750) comments between the two cohorts using Mann-Whitney U (see the collection sketch after this list)
  • Document findings and promote winning variant
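
Collecting the per-cohort reaction counts is the one step not covered by the workflow itself. Below is a minimal sketch, assuming the comments are read from discussion #4750 through the GitHub GraphQL API (via gh api graphql) and that each run's assigned variant can be recovered offline from its uploaded state.json artifact. The OWNER/REPO values and the date-to-variant mapping are placeholders, not established gh-aw conventions.

import json
import subprocess

# GraphQL query for the daily-fact comments on discussion #4750.
QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    discussion(number: 4750) {
      comments(last: 100) {
        nodes { createdAt reactions { totalCount } }
      }
    }
  }
}
"""

result = subprocess.run(
    ["gh", "api", "graphql",
     "-f", f"query={QUERY}",
     "-F", "owner=OWNER", "-F", "name=REPO"],   # placeholder repository coordinates
    capture_output=True, text=True, check=True)
nodes = json.loads(result.stdout)["data"]["repository"]["discussion"]["comments"]["nodes"]

# Hypothetical date -> variant mapping, reconstructed from the per-run
# /tmp/gh-aw/experiments/state.json artifacts downloaded from each workflow run.
variant_by_date = {"2026-05-11": "single_pass", "2026-05-12": "multi_candidate"}

cohorts = {"single_pass": [], "multi_candidate": []}
for node in nodes:
    variant = variant_by_date.get(node["createdAt"][:10])
    if variant:
        cohorts[variant].append(node["reactions"]["totalCount"])

print(cohorts)  # feed these two lists into the Mann-Whitney U comparison above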

Infrastructure note: Fields analysis_type, tags, and notify are all fully implemented in both compiler_experiments.go and pick_experiment.cjs. No infrastructure sub-issue is needed.
