opencue · NagyVikt · Jun 30, 2026 · Jun 29, 2026 · Jun 29, 2026 · Jun 29, 2026
diff --git a/README.md b/README.md
@@ -42,6 +42,10 @@ npm install -g cue-ai
 
 ---
 
+<p align="center">
+  <img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/demo.gif" alt="cue picking a cuecard and launching the agent" width="820">
+</p>
+
 ## Why this exists
 
 If you've been using AI coding agents for a while, you've probably collected a pile of skills, MCP servers, and custom instructions. Maybe hundreds. Here's the problem:
@@ -57,6 +61,10 @@ cue fixes this by scoping everything per directory. Your Medusa shop loads the M
 
 ### Before vs after — in numbers
 
+<p align="center">
+  <img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/isolation-comparison.svg" alt="Everything-loaded vs a scoped cuecard — always-on context compared" width="820">
+</p>
+
 | Loadout | Always-on context | Cost / 100 msgs (Sonnet input) |
 |---|---|---|
 | Everything loaded (`full` profile) | ~81k tokens | ~$24 |
@@ -93,6 +101,10 @@ One cuecard per project. Your agent reads the right one the moment you launch it
 
 No daemon, no background process. cue intercepts the *call* to your agent, resolves the directory's cuecard, materializes it once, then hands off to the real binary:
 
+<p align="center">
+  <img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/architecture.svg" alt="cue resolve to materialize to exec flow" width="820">
+</p>
+
 ```
 you type `claude`
        │
@@ -249,6 +261,10 @@ cue failures --propose       # let Claude draft profile improvements from failur
 
 `cue --help` shows the full ~50-subcommand surface; the set above covers a typical week.
 
+<p align="center">
+  <img src="https://raw.githubusercontent.com/opencue/cuecards/main/docs/assets/optimizer-dashboard.svg" alt="cue optimizer dashboard — skills, MCPs, CLIs, and usage per profile" width="820">
+</p>
+
 ---
 
 ## API

diff --git a/evals/agent-eval/.env.example b/evals/agent-eval/.env.example
@@ -0,0 +1,16 @@
+# cue-agent-eval — credentials
+#
+# Pick ONE agent API key. The experiments default to `agent: 'claude-code'`
+# (direct Anthropic), so ANTHROPIC_API_KEY is the simplest path.
+ANTHROPIC_API_KEY=sk-ant-...
+
+# Or, to route through the Vercel AI Gateway instead, set this and change the
+# experiments to `agent: 'vercel-ai-gateway/claude-code'`. This key also enables
+# the optional failure classifier (infra/timeout vs model failures).
+# AI_GATEWAY_API_KEY=your-ai-gateway-api-key
+
+# Sandbox: the experiments use `sandbox: 'docker'`, so NO Vercel token is needed.
+# (Docker must be installed and running.) To use the Vercel sandbox instead,
+# set one of these and drop `sandbox: 'docker'` from the experiment configs.
+# VERCEL_TOKEN=your-vercel-token
+# VERCEL_OIDC_TOKEN=your-oidc-token
diff --git a/evals/agent-eval/.gitignore b/evals/agent-eval/.gitignore
@@ -0,0 +1,7 @@
+node_modules/
+dist/
+.env
+.env.local
+results/
+*.log
+.DS_Store
diff --git a/evals/agent-eval/README.md b/evals/agent-eval/README.md
@@ -0,0 +1,72 @@
+# cue-agent-eval
+
+A/B test cue **profiles** with [`@vercel/agent-eval`](https://github.com/vercel-labs/agent-eval).
+Same task, same model, run in an isolated sandbox — the **only variable is the
+cue profile's `CLAUDE.md`** (persona + rules + skill-routing). The output is a
+**pass rate** per profile, so "gstack is better than core for X" stops being a
+vibe and becomes a number.
+
+```
+experiments/core.ts       baseline
+experiments/gstack.ts     58-role full profile
+experiments/improver.ts   goal-with-a-check loop
+evals/create-button/      one starter task (PROMPT.md + EVAL.ts)
+lib/with-profile.ts       injects a profile's CLAUDE.md into the sandbox
+```
+
+## How a profile becomes the variable
+
+`lib/with-profile.ts` adds a `setup()` that runs **on the host**:
+
+1. `cue launch <profile> --rematerialize` — materializes the profile's full
+   runtime (persona + rules + skill-routing + `_always` fragments → `CLAUDE.md`)
+   **without exec'ing claude**, and prints `{ runtimeDir }`.
+2. Reads that `CLAUDE.md` and writes it into the sandbox project root, so the
+   in-sandbox Claude Code runs under that profile's instructions.
+
+**Not injected** (by design): MCP servers, hooks, and the headroom proxy env.
+They need keys/network and would fail to start in a throwaway sandbox. The
+measured variable is the profile's *instructions*, not its infra. Skill **bodies**
+aren't injected yet either (the `CLAUDE.md` already carries each profile's skill
+list + routing) — see the extension note at the bottom of `lib/with-profile.ts`.
+
+## Prerequisites
+
+- `@vercel/agent-eval` — installed globally (`agent-eval --help`).
+- **Docker** — the experiments use `sandbox: 'docker'` (no Vercel token needed).
+- **An agent API key** — `ANTHROPIC_API_KEY` (direct, the default) **or**
+  `AI_GATEWAY_API_KEY` (gateway). Without one the sandbox can't authenticate, so
+  runs fail. `--dry` needs no key.
+- `cue` on `PATH` (this repo's CLI) — `with-profile.ts` shells out to it.
+  Run the eval from a **plain shell**, not from inside a cue-launched session:
+  `CUE_LAUNCHING=1` trips cue's recursion guard and `--rematerialize` would fail.
+
+## Run
+
+```bash
+cd evals/agent-eval
+cp .env.example .env          # add ANTHROPIC_API_KEY (or AI_GATEWAY_API_KEY)
+npm install                   # only needed to typecheck / run EVAL.ts locally
+
+npx @vercel/agent-eval --dry  # preview all 3 profiles — no API calls, no cost
+npx @vercel/agent-eval        # run all 3 (core, gstack, improver)
+npx @vercel/agent-eval core   # run just one
+npx @vercel/agent-eval --smoke  # one run per experiment — verify keys/sandbox
+```
+
+Results land in `results/<experiment>/<timestamp>/`. Compare `passRate` in each
+`summary.json` across the three profiles.
+
+## Extend
+
+- **More tasks:** add `evals/<task>/` with `PROMPT.md` + `EVAL.ts` (+ `package.json`,
+  `src/`). They run against every experiment automatically.
+- **More profiles:** copy an experiment file and change `withProfile('<name>')`.
+- **All profiles:** map `withProfile` over `cue list` output from a generator
+  script (heavier; most configs you'd never run — start with the 3 here).
+
+## Cost note
+
+Each run spins up a sandbox, installs Claude Code, and runs the agent. `runs: 3`
+× 3 profiles × tasks = real API tokens. Start with `--dry`, then `--smoke`, then
+a full run on a small task set.
diff --git a/evals/agent-eval/evals/create-button/EVAL.ts b/evals/agent-eval/evals/create-button/EVAL.ts
@@ -0,0 +1,23 @@
+import { readFileSync, existsSync } from 'fs';
+import { execSync } from 'child_process';
+import { test, expect } from 'vitest';
+
+test('Button component file exists', () => {
+  expect(existsSync('src/Button.tsx')).toBe(true);
+});
+
+test('Button accepts label and onClick props', () => {
+  const src = readFileSync('src/Button.tsx', 'utf-8');
+  expect(src).toContain('label');
+  expect(src).toContain('onClick');
+});
+
+test('index re-exports Button', () => {
+  const idx = readFileSync('src/index.ts', 'utf-8');
+  expect(idx).toMatch(/Button/);
+});
+
+test('project builds', () => {
+  // Throws (and fails the test) if the build fails.
+  execSync('npm run build', { stdio: 'pipe' });
+});
diff --git a/evals/agent-eval/evals/create-button/PROMPT.md b/evals/agent-eval/evals/create-button/PROMPT.md
@@ -0,0 +1,8 @@
+Create a reusable Button component.
+
+Requirements:
+- Export a `Button` component from `src/Button.tsx`
+- It accepts two props: `label` (string) and `onClick` (() => void)
+- It renders a `<button>` that displays `label` and calls `onClick` when clicked
+- Update `src/index.ts` to re-export `Button`
+- The project must build with `npm run build`
diff --git a/evals/agent-eval/evals/create-button/package.json b/evals/agent-eval/evals/create-button/package.json
@@ -0,0 +1,16 @@
+{
+  "name": "create-button",
+  "type": "module",
+  "scripts": {
+    "build": "tsc"
+  },
+  "dependencies": {
+    "react": "^18.0.0"
+  },
+  "devDependencies": {
+    "@types/node": "^22.0.0",
+    "@types/react": "^18.0.0",
+    "typescript": "^5.0.0",
+    "vitest": "^2.1.0"
+  }
+}
diff --git a/evals/agent-eval/evals/create-button/src/index.ts b/evals/agent-eval/evals/create-button/src/index.ts
@@ -0,0 +1,3 @@
+// Starter code. The agent should create ./Button and re-export it here.
+// TODO: export { Button } from './Button';
+export {};
diff --git a/evals/agent-eval/evals/create-button/tsconfig.json b/evals/agent-eval/evals/create-button/tsconfig.json
@@ -0,0 +1,12 @@
+{
+  "compilerOptions": {
+    "target": "ES2020",
+    "module": "ESNext",
+    "moduleResolution": "bundler",
+    "jsx": "react-jsx",
+    "strict": true,
+    "outDir": "dist",
+    "skipLibCheck": true
+  },
+  "include": ["src"]
+}
diff --git a/evals/agent-eval/experiments/core.ts b/evals/agent-eval/experiments/core.ts
@@ -0,0 +1,20 @@
+import type { ExperimentConfig } from '@vercel/agent-eval';
+import { withProfile } from '../lib/with-profile.js';
+
+// A/B variable = the cue PROFILE. Same task, same model across all three
+// experiments — only the injected CLAUDE.md (persona + rules + skill-routing)
+// changes. Compare pass rates: core vs gstack vs improver.
+const config: ExperimentConfig = {
+  agent: 'claude-code', // direct Anthropic; needs ANTHROPIC_API_KEY.
+  //                       Swap to 'vercel-ai-gateway/claude-code' + AI_GATEWAY_API_KEY for the gateway.
+  // model omitted → the claude CLI's native default, so the comparison is about
+  // the profile, not the model.
+  runs: 3, // a pass RATE is the signal, not a single run
+  earlyExit: false, // run all N even after a success
+  sandbox: 'docker', // Docker is available locally; no Vercel token needed
+  scripts: ['build'],
+  timeout: 600,
+  ...withProfile('core'),
+};
+
+export default config;
diff --git a/evals/agent-eval/experiments/gstack.ts b/evals/agent-eval/experiments/gstack.ts
@@ -0,0 +1,16 @@
+import type { ExperimentConfig } from '@vercel/agent-eval';
+import { withProfile } from '../lib/with-profile.js';
+
+// Treatment B: the full gstack 58-role profile. Compare its pass rate on the
+// same task against core (baseline) and improver.
+const config: ExperimentConfig = {
+  agent: 'claude-code',
+  runs: 3,
+  earlyExit: false,
+  sandbox: 'docker',
+  scripts: ['build'],
+  timeout: 600,
+  ...withProfile('gstack'),
+};
+
+export default config;
diff --git a/evals/agent-eval/experiments/improver.ts b/evals/agent-eval/experiments/improver.ts
@@ -0,0 +1,16 @@
+import type { ExperimentConfig } from '@vercel/agent-eval';
+import { withProfile } from '../lib/with-profile.js';
+
+// Treatment C: the improver profile (goal-with-a-check loop). Compare its pass
+// rate on the same task against core (baseline) and gstack.
+const config: ExperimentConfig = {
+  agent: 'claude-code',
+  runs: 3,
+  earlyExit: false,
+  sandbox: 'docker',
+  scripts: ['build'],
+  timeout: 600,
+  ...withProfile('improver'),
+};
+
+export default config;
diff --git a/evals/agent-eval/lib/with-profile.ts b/evals/agent-eval/lib/with-profile.ts
@@ -0,0 +1,109 @@
+import { execFileSync } from 'node:child_process';
+import { existsSync, readFileSync } from 'node:fs';
+import { join } from 'node:path';
+import type { ExperimentConfig, Sandbox } from '@vercel/agent-eval';
+
+/**
+ * Make a cue PROFILE the measured variable.
+ *
+ * `setup` runs on the HOST (where `cue` lives). It self-primes the profile's
+ * runtime with:
+ *
+ *     cue launch <profile> --rematerialize
+ *
+ * which materializes the full runtime — persona + rules + skill-routing +
+ * `_always` fragments, rendered into CLAUDE.md — WITHOUT exec'ing claude, and
+ * prints `{ runtimeDir }` as JSON (see src/commands/launch.ts: materializeRuntime
+ * runs before the --rematerialize gate). We read that CLAUDE.md and write it into
+ * the sandbox so the in-sandbox Claude Code runs under the profile's instructions.
+ *
+ * Deliberately NOT injected: MCP servers, hooks, and the headroom proxy env.
+ * Those need keys / network and would fail to start in the throwaway sandbox.
+ * The measured variable here is the profile's INSTRUCTIONS, not its infra — which
+ * is exactly what differs most between core / gstack / improver.
+ *
+ * Skill BODIES are not injected yet (the CLAUDE.md already carries each profile's
+ * skill list + routing). Injecting the materialized SKILL.md files is the obvious
+ * next extension — see the note at the bottom of this file.
+ */
+export function withProfile(profile: string): Pick<ExperimentConfig, 'setup'> {
+  return {
+    setup: async (sandbox: Sandbox) => {
+      const claudeMd = renderProfileClaudeMd(profile);
+      await sandbox.writeFiles({ 'CLAUDE.md': claudeMd });
+    },
+  };
+}
+
+/** Materialize a cue profile on the host and return its rendered CLAUDE.md. */
+export function renderProfileClaudeMd(profile: string): string {
+  let out: string;
+  try {
+    // The agent positional MUST be "claude" (selects agentKind claude-code);
+    // the profile is forced regardless of cwd via `--cue-profile <name>`.
+    // A bare `cue launch <profile>` leaves agent null → "missing agent" exit 1.
+    // stderr is piped (not inherited) so cue's rebuild diagnostics don't bleed
+    // into the eval harness output; it surfaces in the error on failure.
+    out = execFileSync(
+      'cue',
+      ['launch', 'claude', '--cue-profile', profile, '--rematerialize'],
+      { encoding: 'utf8', stdio: ['ignore', 'pipe', 'pipe'] },
+    );
+  } catch (err) {
+    const stderr = (err as { stderr?: string }).stderr ?? '';
+    throw new Error(
+      `withProfile("${profile}"): \`cue launch claude --cue-profile ${profile} --rematerialize\` failed. ` +
+        `Is the \`cue\` CLI on PATH and "${profile}" a valid profile? ` +
+        `Run from a PLAIN shell — inside a cue session CUE_LAUNCHING=1 trips the recursion guard. ` +
+        `${stderr.trim() ? `cue stderr: ${stderr.trim()} — ` : ''}` +
+        `Original error: ${err instanceof Error ? err.message : String(err)}`,
+    );
+  }
+
+  // --rematerialize prints pretty JSON to stdout FOLLOWED BY a status line
+  // (e.g. "✅ Rematerialized."), so JSON.parse(out) would throw. Slice out the
+  // object. The payload (profile/agent/runtimeDir/rebuilt/hash) is all scalars,
+  // so first "{" .. last "}" is the whole object.
+  const start = out.indexOf('{');
+  const end = out.lastIndexOf('}');
+  if (start === -1 || end === -1) {
+    throw new Error(
+      `withProfile("${profile}"): no JSON in \`cue --rematerialize\` output:\n${out}`,
+    );
+  }
+  let runtimeDir: string;
+  try {
+    runtimeDir = (JSON.parse(out.slice(start, end + 1)) as { runtimeDir?: string })
+      .runtimeDir as string;
+  } catch (err) {
+    throw new Error(
+      `withProfile("${profile}"): could not parse runtimeDir from cue output. ` +
+        `Original error: ${err instanceof Error ? err.message : String(err)}`,
+    );
+  }
+  if (!runtimeDir) {
+    throw new Error(`withProfile("${profile}"): cue did not report a runtimeDir.`);
+  }
+
+  // CLAUDE.md sits at runtimeDir/CLAUDE.md or runtimeDir/claude/CLAUDE.md
+  // depending on how runtimeDir is reported. Check both.
+  const claudeMdPath = [
+    join(runtimeDir, 'CLAUDE.md'),
+    join(runtimeDir, 'claude', 'CLAUDE.md'),
+  ].find(existsSync);
+
+  if (!claudeMdPath) {
+    throw new Error(
+      `withProfile("${profile}"): no CLAUDE.md found under ${runtimeDir}.`,
+    );
+  }
+  return readFileSync(claudeMdPath, 'utf8');
+}
+
+// Extension point — inject skill bodies as well as the CLAUDE.md routing:
+//   1. From the same runtimeDir, read `skills/**/SKILL.md` (materialize symlinks
+//      them; deref with realpathSync before reading).
+//   2. Write each to `.claude/skills/<id>/SKILL.md` in the sandbox alongside
+//      CLAUDE.md so Claude Code auto-discovers them.
+// Left out of the first cut to keep the measured variable clean and the scaffold
+// small; the CLAUDE.md already differs sharply across profiles.
diff --git a/evals/agent-eval/package.json b/evals/agent-eval/package.json
@@ -0,0 +1,18 @@
+{
+  "name": "cue-agent-eval",
+  "version": "0.0.1",
+  "private": true,
+  "type": "module",
+  "description": "A/B test cue PROFILES with @vercel/agent-eval — same task, same model, the profile's CLAUDE.md is the only variable.",
+  "scripts": {
+    "dry": "agent-eval --dry",
+    "smoke": "agent-eval --smoke",
+    "eval": "agent-eval"
+  },
+  "devDependencies": {
+    "@vercel/agent-eval": "^1.2.0",
+    "@types/node": "^22.0.0",
+    "typescript": "^5.6.0",
+    "vitest": "^2.1.0"
+  }
+}