Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,10 @@ npm install -g cue-ai

---

<p align="center">
<img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/demo.gif" alt="cue picking a cuecard and launching the agent" width="820">
</p>

## Why this exists

If you've been using AI coding agents for a while, you've probably collected a pile of skills, MCP servers, and custom instructions. Maybe hundreds. Here's the problem:
Expand All @@ -57,6 +61,10 @@ cue fixes this by scoping everything per directory. Your Medusa shop loads the M

### Before vs after — in numbers

<p align="center">
<img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/isolation-comparison.svg" alt="Everything-loaded vs a scoped cuecard — always-on context compared" width="820">
</p>

| Loadout | Always-on context | Cost / 100 msgs (Sonnet input) |
|---|---|---|
| Everything loaded (`full` profile) | ~81k tokens | ~$24 |
Expand Down Expand Up @@ -93,6 +101,10 @@ One cuecard per project. Your agent reads the right one the moment you launch it

No daemon, no background process. cue intercepts the *call* to your agent, resolves the directory's cuecard, materializes it once, then hands off to the real binary:

<p align="center">
<img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/architecture.svg" alt="cue resolve to materialize to exec flow" width="820">
</p>

```
you type `claude`
Expand Down Expand Up @@ -249,6 +261,10 @@ cue failures --propose # let Claude draft profile improvements from failur

`cue --help` shows the full ~50-subcommand surface; the set above covers a typical week.

<p align="center">
<img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/optimizer-dashboard.svg" alt="cue optimizer dashboard — skills, MCPs, CLIs, and usage per profile" width="820">
</p>

---

## API
Expand Down
16 changes: 16 additions & 0 deletions evals/agent-eval/.env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# cue-agent-eval — credentials
#
# Pick ONE agent API key. The experiments default to `agent: 'claude-code'`
# (direct Anthropic), so ANTHROPIC_API_KEY is the simplest path.
ANTHROPIC_API_KEY=sk-ant-...

# Or, to route through the Vercel AI Gateway instead, set this and change the
# experiments to `agent: 'vercel-ai-gateway/claude-code'`. This key also enables
# the optional failure classifier (infra/timeout vs model failures).
# AI_GATEWAY_API_KEY=your-ai-gateway-api-key

# Sandbox: the experiments use `sandbox: 'docker'`, so NO Vercel token is needed.
# (Docker must be installed and running.) To use the Vercel sandbox instead,
# set one of these and drop `sandbox: 'docker'` from the experiment configs.
# VERCEL_TOKEN=your-vercel-token
# VERCEL_OIDC_TOKEN=your-oidc-token
7 changes: 7 additions & 0 deletions evals/agent-eval/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
node_modules/
dist/
.env
.env.local
results/
*.log
.DS_Store
72 changes: 72 additions & 0 deletions evals/agent-eval/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# cue-agent-eval

A/B test cue **profiles** with [`@vercel/agent-eval`](https://github.com/vercel-labs/agent-eval).
Same task, same model, run in an isolated sandbox — the **only variable is the
cue profile's `CLAUDE.md`** (persona + rules + skill-routing). The output is a
**pass rate** per profile, so "gstack is better than core for X" stops being a
vibe and becomes a number.

```
experiments/core.ts baseline
experiments/gstack.ts 58-role full profile
experiments/improver.ts goal-with-a-check loop
evals/create-button/ one starter task (PROMPT.md + EVAL.ts)
lib/with-profile.ts injects a profile's CLAUDE.md into the sandbox
```

## How a profile becomes the variable

`lib/with-profile.ts` adds a `setup()` that runs **on the host**:

1. `cue launch <profile> --rematerialize` — materializes the profile's full
runtime (persona + rules + skill-routing + `_always` fragments → `CLAUDE.md`)
**without exec'ing claude**, and prints `{ runtimeDir }`.
2. Reads that `CLAUDE.md` and writes it into the sandbox project root, so the
in-sandbox Claude Code runs under that profile's instructions.

**Not injected** (by design): MCP servers, hooks, and the headroom proxy env.
They need keys/network and would fail to start in a throwaway sandbox. The
measured variable is the profile's *instructions*, not its infra. Skill **bodies**
aren't injected yet either (the `CLAUDE.md` already carries each profile's skill
list + routing) — see the extension note at the bottom of `lib/with-profile.ts`.

## Prerequisites

- `@vercel/agent-eval` — installed globally (`agent-eval --help`).
- **Docker** — the experiments use `sandbox: 'docker'` (no Vercel token needed).
- **An agent API key** — `ANTHROPIC_API_KEY` (direct, the default) **or**
`AI_GATEWAY_API_KEY` (gateway). Without one the sandbox can't authenticate, so
runs fail. `--dry` needs no key.
- `cue` on `PATH` (this repo's CLI) — `with-profile.ts` shells out to it.
Run the eval from a **plain shell**, not from inside a cue-launched session:
`CUE_LAUNCHING=1` trips cue's recursion guard and `--rematerialize` would fail.

## Run

```bash
cd evals/agent-eval
cp .env.example .env # add ANTHROPIC_API_KEY (or AI_GATEWAY_API_KEY)
npm install # only needed to typecheck / run EVAL.ts locally

npx @vercel/agent-eval --dry # preview all 3 profiles — no API calls, no cost
npx @vercel/agent-eval # run all 3 (core, gstack, improver)
npx @vercel/agent-eval core # run just one
npx @vercel/agent-eval --smoke # one run per experiment — verify keys/sandbox
```

Results land in `results/<experiment>/<timestamp>/`. Compare `passRate` in each
`summary.json` across the three profiles.

## Extend

- **More tasks:** add `evals/<task>/` with `PROMPT.md` + `EVAL.ts` (+ `package.json`,
`src/`). They run against every experiment automatically.
- **More profiles:** copy an experiment file and change `withProfile('<name>')`.
- **All profiles:** map `withProfile` over `cue list` output from a generator
script (heavier; most configs you'd never run — start with the 3 here).

## Cost note

Each run spins up a sandbox, installs Claude Code, and runs the agent. `runs: 3`
× 3 profiles × tasks = real API tokens. Start with `--dry`, then `--smoke`, then
a full run on a small task set.
23 changes: 23 additions & 0 deletions evals/agent-eval/evals/create-button/EVAL.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
import { readFileSync, existsSync } from 'fs';
import { execSync } from 'child_process';
import { test, expect } from 'vitest';

test('Button component file exists', () => {
expect(existsSync('src/Button.tsx')).toBe(true);
});

test('Button accepts label and onClick props', () => {
const src = readFileSync('src/Button.tsx', 'utf-8');
expect(src).toContain('label');
expect(src).toContain('onClick');
});

test('index re-exports Button', () => {
const idx = readFileSync('src/index.ts', 'utf-8');
expect(idx).toMatch(/Button/);
});

test('project builds', () => {
// Throws (and fails the test) if the build fails.
execSync('npm run build', { stdio: 'pipe' });
});
8 changes: 8 additions & 0 deletions evals/agent-eval/evals/create-button/PROMPT.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
Create a reusable Button component.

Requirements:
- Export a `Button` component from `src/Button.tsx`
- It accepts two props: `label` (string) and `onClick` (() => void)
- It renders a `<button>` that displays `label` and calls `onClick` when clicked
- Update `src/index.ts` to re-export `Button`
- The project must build with `npm run build`
16 changes: 16 additions & 0 deletions evals/agent-eval/evals/create-button/package.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
{
"name": "create-button",
"type": "module",
"scripts": {
"build": "tsc"
},
"dependencies": {
"react": "^18.0.0"
},
"devDependencies": {
"@types/node": "^22.0.0",
"@types/react": "^18.0.0",
"typescript": "^5.0.0",
"vitest": "^2.1.0"
}
}
3 changes: 3 additions & 0 deletions evals/agent-eval/evals/create-button/src/index.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
// Starter code. The agent should create ./Button and re-export it here.
// TODO: export { Button } from './Button';
export {};
12 changes: 12 additions & 0 deletions evals/agent-eval/evals/create-button/tsconfig.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
{
"compilerOptions": {
"target": "ES2020",
"module": "ESNext",
"moduleResolution": "bundler",
"jsx": "react-jsx",
"strict": true,
"outDir": "dist",
"skipLibCheck": true
},
"include": ["src"]
}
20 changes: 20 additions & 0 deletions evals/agent-eval/experiments/core.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
import type { ExperimentConfig } from '@vercel/agent-eval';
import { withProfile } from '../lib/with-profile.js';

// A/B variable = the cue PROFILE. Same task, same model across all three
// experiments — only the injected CLAUDE.md (persona + rules + skill-routing)
// changes. Compare pass rates: core vs gstack vs improver.
const config: ExperimentConfig = {
agent: 'claude-code', // direct Anthropic; needs ANTHROPIC_API_KEY.
// Swap to 'vercel-ai-gateway/claude-code' + AI_GATEWAY_API_KEY for the gateway.
// model omitted → the claude CLI's native default, so the comparison is about
// the profile, not the model.
runs: 3, // a pass RATE is the signal, not a single run
earlyExit: false, // run all N even after a success
sandbox: 'docker', // Docker is available locally; no Vercel token needed
scripts: ['build'],
timeout: 600,
...withProfile('core'),
};

export default config;
16 changes: 16 additions & 0 deletions evals/agent-eval/experiments/gstack.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
import type { ExperimentConfig } from '@vercel/agent-eval';
import { withProfile } from '../lib/with-profile.js';

// Treatment B: the full gstack 58-role profile. Compare its pass rate on the
// same task against core (baseline) and improver.
const config: ExperimentConfig = {
agent: 'claude-code',
runs: 3,
earlyExit: false,
sandbox: 'docker',
scripts: ['build'],
timeout: 600,
...withProfile('gstack'),
};

export default config;
16 changes: 16 additions & 0 deletions evals/agent-eval/experiments/improver.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
import type { ExperimentConfig } from '@vercel/agent-eval';
import { withProfile } from '../lib/with-profile.js';

// Treatment C: the improver profile (goal-with-a-check loop). Compare its pass
// rate on the same task against core (baseline) and gstack.
const config: ExperimentConfig = {
agent: 'claude-code',
runs: 3,
earlyExit: false,
sandbox: 'docker',
scripts: ['build'],
timeout: 600,
...withProfile('improver'),
};

export default config;
109 changes: 109 additions & 0 deletions evals/agent-eval/lib/with-profile.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
import { execFileSync } from 'node:child_process';
import { existsSync, readFileSync } from 'node:fs';
import { join } from 'node:path';
import type { ExperimentConfig, Sandbox } from '@vercel/agent-eval';

/**
* Make a cue PROFILE the measured variable.
*
* `setup` runs on the HOST (where `cue` lives). It self-primes the profile's
* runtime with:
*
* cue launch <profile> --rematerialize
*
* which materializes the full runtime — persona + rules + skill-routing +
* `_always` fragments, rendered into CLAUDE.md — WITHOUT exec'ing claude, and
* prints `{ runtimeDir }` as JSON (see src/commands/launch.ts: materializeRuntime
* runs before the --rematerialize gate). We read that CLAUDE.md and write it into
* the sandbox so the in-sandbox Claude Code runs under the profile's instructions.
*
* Deliberately NOT injected: MCP servers, hooks, and the headroom proxy env.
* Those need keys / network and would fail to start in the throwaway sandbox.
* The measured variable here is the profile's INSTRUCTIONS, not its infra — which
* is exactly what differs most between core / gstack / improver.
*
* Skill BODIES are not injected yet (the CLAUDE.md already carries each profile's
* skill list + routing). Injecting the materialized SKILL.md files is the obvious
* next extension — see the note at the bottom of this file.
*/
export function withProfile(profile: string): Pick<ExperimentConfig, 'setup'> {
return {
setup: async (sandbox: Sandbox) => {
const claudeMd = renderProfileClaudeMd(profile);
await sandbox.writeFiles({ 'CLAUDE.md': claudeMd });
},
};
}

/** Materialize a cue profile on the host and return its rendered CLAUDE.md. */
export function renderProfileClaudeMd(profile: string): string {
let out: string;
try {
// The agent positional MUST be "claude" (selects agentKind claude-code);
// the profile is forced regardless of cwd via `--cue-profile <name>`.
// A bare `cue launch <profile>` leaves agent null → "missing agent" exit 1.
// stderr is piped (not inherited) so cue's rebuild diagnostics don't bleed
// into the eval harness output; it surfaces in the error on failure.
out = execFileSync(
'cue',
['launch', 'claude', '--cue-profile', profile, '--rematerialize'],
{ encoding: 'utf8', stdio: ['ignore', 'pipe', 'pipe'] },
);
} catch (err) {
const stderr = (err as { stderr?: string }).stderr ?? '';
throw new Error(
`withProfile("${profile}"): \`cue launch claude --cue-profile ${profile} --rematerialize\` failed. ` +
`Is the \`cue\` CLI on PATH and "${profile}" a valid profile? ` +
`Run from a PLAIN shell — inside a cue session CUE_LAUNCHING=1 trips the recursion guard. ` +
`${stderr.trim() ? `cue stderr: ${stderr.trim()} — ` : ''}` +
`Original error: ${err instanceof Error ? err.message : String(err)}`,
);
}

// --rematerialize prints pretty JSON to stdout FOLLOWED BY a status line
// (e.g. "✅ Rematerialized."), so JSON.parse(out) would throw. Slice out the
// object. The payload (profile/agent/runtimeDir/rebuilt/hash) is all scalars,
// so first "{" .. last "}" is the whole object.
const start = out.indexOf('{');
const end = out.lastIndexOf('}');
if (start === -1 || end === -1) {
throw new Error(
`withProfile("${profile}"): no JSON in \`cue --rematerialize\` output:\n${out}`,
);
}
let runtimeDir: string;
try {
runtimeDir = (JSON.parse(out.slice(start, end + 1)) as { runtimeDir?: string })
.runtimeDir as string;
} catch (err) {
throw new Error(
`withProfile("${profile}"): could not parse runtimeDir from cue output. ` +
`Original error: ${err instanceof Error ? err.message : String(err)}`,
);
}
if (!runtimeDir) {
throw new Error(`withProfile("${profile}"): cue did not report a runtimeDir.`);
}

// CLAUDE.md sits at runtimeDir/CLAUDE.md or runtimeDir/claude/CLAUDE.md
// depending on how runtimeDir is reported. Check both.
const claudeMdPath = [
join(runtimeDir, 'CLAUDE.md'),
join(runtimeDir, 'claude', 'CLAUDE.md'),
].find(existsSync);

if (!claudeMdPath) {
throw new Error(
`withProfile("${profile}"): no CLAUDE.md found under ${runtimeDir}.`,
);
}
return readFileSync(claudeMdPath, 'utf8');
}

// Extension point — inject skill bodies as well as the CLAUDE.md routing:
// 1. From the same runtimeDir, read `skills/**/SKILL.md` (materialize symlinks
// them; deref with realpathSync before reading).
// 2. Write each to `.claude/skills/<id>/SKILL.md` in the sandbox alongside
// CLAUDE.md so Claude Code auto-discovers them.
// Left out of the first cut to keep the measured variable clean and the scaffold
// small; the CLAUDE.md already differs sharply across profiles.
18 changes: 18 additions & 0 deletions evals/agent-eval/package.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
{
"name": "cue-agent-eval",
"version": "0.0.1",
"private": true,
"type": "module",
"description": "A/B test cue PROFILES with @vercel/agent-eval — same task, same model, the profile's CLAUDE.md is the only variable.",
"scripts": {
"dry": "agent-eval --dry",
"smoke": "agent-eval --smoke",
"eval": "agent-eval"
},
"devDependencies": {
"@vercel/agent-eval": "^1.2.0",
"@types/node": "^22.0.0",
"typescript": "^5.6.0",
"vitest": "^2.1.0"
}
}
Loading
Loading