A GAN-inspired multi-agent harness framework for orchestrating AI agents that build software better together than any single agent can alone.
```bash
# Install the skill
mkdir -p .github/copilot/skills/harnessa
curl -o .github/copilot/skills/harnessa/SKILL.md \
  https://raw.githubusercontent.com/ridermw/harnessa/main/.github/copilot/skills/harnessa/SKILL.md

# Run the trio on any task
copilot -p '/harnessa Fix the authentication bug' --allow-all
```

See INSTALL.md for full installation options.
Harnessa is an open-source framework built on research from Anthropic's harness design work. It implements a three-agent architecture — Planner, Generator, Evaluator — where adversarial tension between the builder and the critic drives output quality far beyond what a solo agent achieves.
Think of it like a GAN for software: the Generator builds, the Evaluator tears it apart, and the feedback loop drives both toward better outcomes. The Planner ensures they're building the right thing in the first place.
A single AI agent building software hits two walls:
- It loses coherence as context grows — and some models prematurely wrap up work ("context anxiety")
- It can't judge its own work — agents reliably praise mediocre output, even when bugs are obvious
Separating generation from evaluation breaks through both. The evidence from Anthropic's experiments:
| Approach | Duration | Cost | Result |
|---|---|---|---|
| Solo agent | 20 min | $9 | Core feature broken |
| 3-agent harness | 6 hr | $200 | 16-feature app, working core, polished UI |
Not incrementally better — categorically different.
```
┌─────────────────────────────────────────────────────┐
│                    Orchestrator                     │
│                                                     │
│  ┌──────────┐    ┌──────────┐    ┌──────────────┐   │
│  │ Planner  │───▶│Generator │◀──▶│  Evaluator   │   │
│  │          │    │          │    │              │   │
│  │ Expands  │    │ Builds   │    │ Tests live   │   │
│  │ prompt   │    │ features │    │ app, grades  │   │
│  │ into     │    │ in       │    │ against      │   │
│  │ full     │    │ sprints  │    │ criteria,    │   │
│  │ spec     │    │          │    │ files bugs   │   │
│  └──────────┘    └──────────┘    └──────────────┘   │
│                                                     │
│  ┌─────────────────────────────────────────────┐    │
│  │              Telemetry Layer                │    │
│  │   Timing · Cost · Scores · Bugs · Trends    │    │
│  └─────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────┘
```
Planner — Turns one-to-four-sentence prompts into ambitious product specs. Focuses on what, not how.
Generator — Implements the spec in sprints. Negotiates "sprint contracts" with the Evaluator before coding. Uses git for checkpointing.
Evaluator — Interacts with the live running application (via Playwright). Grades against criteria with hard thresholds. Skeptical by default. Files specific, actionable bugs.
Agents communicate through files on disk — every decision, score, and bug report is written down, creating a full audit trail.
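A minimal sketch of that file-based protocol, assuming a JSON-per-bug layout — the filenames and fields here are illustrative, not Harnessa's actual schema:

```python
import json
from pathlib import Path

def file_bug(workdir: Path, bug_id: int, report: dict) -> Path:
    """Evaluator side: persist a bug report where the Generator can find it."""
    path = workdir / "bugs" / f"bug-{bug_id:03d}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2))
    return path

def read_open_bugs(workdir: Path) -> list[dict]:
    """Generator side: load every filed bug, in order, at sprint start."""
    bug_dir = workdir / "bugs"
    if not bug_dir.exists():
        return []
    return [json.loads(p.read_text()) for p in sorted(bug_dir.glob("bug-*.json"))]
```

Because every exchange is a file, the audit trail comes for free: the run directory itself is the record.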
Before each sprint, the Generator and Evaluator negotiate what "done" looks like. This bridges the gap between high-level spec and testable implementation without over-specifying too early.
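One way to picture a sprint contract — this shape is a hypothetical sketch, not the framework's real data model:

```python
from dataclasses import dataclass

@dataclass
class SprintContract:
    """A negotiated definition of 'done' for one sprint (illustrative only)."""
    sprint: int
    goals: list[str]              # what the Generator commits to build
    acceptance: dict[str, float]  # criterion name -> minimum passing score

    def is_met(self, scores: dict[str, float]) -> bool:
        """Evaluator checks every negotiated criterion against its floor."""
        return all(scores.get(name, 0.0) >= floor
                   for name, floor in self.acceptance.items())
```

Writing the contract down before coding gives the Evaluator something concrete to grade against, without the Planner having to pin down implementation details up front.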
Subjective quality becomes gradable through concrete criteria. The framework ships with defaults for frontend and full-stack work, and supports custom criteria per project type. Criteria wording directly steers the Generator's output — they're not just measurement, they're guidance.
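A grading sketch under assumed semantics — weighted 0-10 scores with hard floors, where any sub-threshold criterion fails the run (the criteria names and numbers below are invented defaults, not Harnessa's shipped set):

```python
# Hypothetical frontend criteria: weight shapes the overall score,
# floor is a hard threshold that any single criterion can fail on.
FRONTEND_CRITERIA = {
    "functionality":  {"weight": 0.5, "floor": 6.0},
    "visual_polish":  {"weight": 0.3, "floor": 4.0},
    "responsiveness": {"weight": 0.2, "floor": 4.0},
}

def grade(scores: dict[str, float], criteria=FRONTEND_CRITERIA) -> tuple[float, str]:
    """Return (weighted overall score, PASS/FAIL verdict)."""
    overall = sum(scores[name] * spec["weight"] for name, spec in criteria.items())
    passed = all(scores[name] >= spec["floor"] for name, spec in criteria.items())
    return overall, "PASS" if passed else "FAIL"
```

Because the Generator reads these criteria too, tightening a criterion's wording or floor steers what gets built, not just how it is scored.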
The Evaluator's feedback flows back to the Generator as input for the next iteration. The Generator decides: refine the current direction, or pivot entirely. 5-15 iterations per run, with scores trending upward before plateauing.
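The control flow of that loop can be sketched as follows — `generate` and `evaluate` stand in for real agent calls, and the plateau stopping rule is an assumption, not the framework's documented policy:

```python
def run_harness(generate, evaluate, max_iters: int = 15, plateau: int = 3) -> list[float]:
    """Iterate generator/evaluator until quality plateaus or budget runs out."""
    scores, best, stale = [], float("-inf"), 0
    feedback = None
    for _ in range(max_iters):
        artifact = generate(feedback)         # Generator consumes the last critique
        score, feedback = evaluate(artifact)  # Evaluator grades and files bugs
        scores.append(score)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1                        # no improvement this iteration
        if stale >= plateau:                  # stop once scores stop climbing
            break
    return scores
```

The 5-15 iteration range reported above corresponds to `max_iters` bounding the run and the plateau check ending it early.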
Every run produces structured telemetry: timing, cost, scores, bugs, and quality trends. Claims about improvement are backed by data.
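A telemetry record might look like the following — field names here are assumptions for illustration, not the framework's actual schema:

```python
import json
import time

def telemetry_record(run_id: str, started: float, cost_usd: float,
                     scores: list[float], bugs_filed: int) -> str:
    """Serialize one run's telemetry as a JSON line (hypothetical fields)."""
    record = {
        "run_id": run_id,
        "duration_s": round(time.time() - started, 1),
        "cost_usd": cost_usd,
        "scores": scores,  # per-iteration Evaluator scores
        "score_trend": scores[-1] - scores[0] if scores else 0.0,
        "bugs_filed": bugs_filed,
    }
    return json.dumps(record)
```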
Full data: RESULTS.md — 10 benchmark runs across 5 tasks in Python, TypeScript, and Go.
| Metric | Solo | Trio | Δ |
|---|---|---|---|
| Verdicts | 3 PASS, 2 FAIL | 4 PASS, 1 FAIL | +1 PASS |
| Mean functionality score | 4.8 | 7.6 | +2.8 |
| Benchmarks won | — | 3 of 5 | — |
| Duration multiplier | — | ~1.8x | — |
Headline: Solo FAIL → Trio PASS on the fullstack benchmark — the categorical difference the article predicted.
Article claims validated: 5 of 9 confirmed, 2 partially confirmed, 2 inconclusive. All 3 Harnessa-specific hypotheses evaluated (2 confirmed, 1 inconclusive). See Section 6.1.
The showcase/ directory contains a full-stack AI Code Review Dashboard built by the trio pattern itself. 32 files: Express + React + Vite + Tailwind + sql.js.
```bash
cd showcase && npm install && npm run dev
```

See showcase/BUILD_LOG.md for the full build narrative (Planner→Generator→Evaluator phases).
The repo now includes a polished web presentation for The Adversarial Architecture in website/ — built as a GitHub Pages-friendly keynote experience rather than a slide export.
```bash
cd website && npm install && npm run dev
```

Static production build:

```bash
cd website && npm run build
```

| Document | Purpose |
|---|---|
| PROJECT_SPEC.md | Complete project specification — the "bible" for this repo |
| RESULTS.md | Experimental results — solo vs trio across 5 benchmarks |
| INSTALL.md | Installation guide with verification, troubleshooting, uninstall |
| website/PLAN.md | Presentation site plan and scene architecture |
| showcase/BUILD_LOG.md | How the trio built the showcase app end-to-end |
| docs/ARTICLE_REFERENCE.md | Full text of the Anthropic article that inspired this project |
| docs/ARCHITECTURE.md | Technical architecture deep-dive |
| CONTRIBUTING.md | How to contribute |
✅ V1 complete — Framework built (21 source files, 213 tests), 5 benchmarks run in both solo and trio modes, experimental results documented. The trio pattern shows measurable quality improvement on medium-complexity tasks, with the strongest signal on fullstack work where solo agents fail.
- Harness Design for Long-Running Apps — The Anthropic article this project is based on
- Effective Harnesses for Long-Running Agents — Earlier 2-agent harness work
- Building Effective Agents — Foundational agent design principles
- Context Engineering for AI Agents — Context window management
- Claude Agent SDK — Orchestration layer
- Generative Adversarial Networks — The ML paradigm inspiring the architecture