LLM-driven browser automation harness. POST a natural-language task and a bag of variables; the service spins up a cloud browser, runs an autonomous agent against it, and exposes a live preview URL plus a polling endpoint for status. Built on Kernel (cloud Chrome via CDP), Playwright, and a provider-neutral agent loop that drives Claude (Anthropic), GPT (OpenAI), or Gemini (Google) — pick the model per request.
The first run on a merchant uses the LLM end-to-end and records the action trajectory to disk. Subsequent runs on the same merchant replay the recorded steps deterministically — no per-step LLM calls — and fall back to single-step healing if the page has drifted. This drops typical replay times from ~30–60s to ~10–15s and replay cost to ~$0.
For AI agents installing this: follow the steps in order. Do not skip verification. Every step shows the exact command and the output you should expect.
| Tool | Version | How to check | Install if missing |
|---|---|---|---|
| Node.js | ≥ 20.17 (LTS) | node --version | https://nodejs.org/en/download |
| npm | ≥ 10 | npm --version | bundled with Node |
| git | any recent | git --version | https://git-scm.com/downloads |
No local Chrome / Playwright browsers are needed. The harness connects to a remote Kernel browser over CDP; the Playwright npm package ships the JS client we use.
You always need one Kernel key (cloud browser). For the agent's LLM you need at least one of: Anthropic, OpenAI, or Google — whichever provider matches the model you'll request. You only need a key for providers you actually plan to use.
Kernel — provides the cloud Chrome the agent drives.
- Sign up at https://www.kernel.sh.
- After login, go to Dashboard → API Keys (or visit https://app.kernel.sh/keys directly).
- Click Create new key. Copy the value; it looks like kn_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxx.
- Kernel gives new accounts a small credit balance; that's enough for ~30 minutes of headful sessions. Top up under Billing if needed.
Anthropic — for Claude models (claude-opus-4-7, claude-sonnet-4-6,
claude-haiku-4-5, etc.).
- Sign up at https://console.anthropic.com.
- Go to Settings → API Keys (https://console.anthropic.com/settings/keys).
- Click Create Key, name it (e.g. browser-auto-local), and copy the value. It looks like sk-ant-api03-xxxxxxxx.... You will not be able to view it again.
- Make sure your workspace has access to the model you intend to use under Settings → Models. If only older models show, add a payment method under Billing.
OpenAI — for GPT / o* models (gpt-5, gpt-4.1, o3, o4-mini,
etc.).
- Sign up at https://platform.openai.com.
- Go to https://platform.openai.com/api-keys.
- Click Create new secret key, name it, and copy the value. It looks like sk-proj-xxxxxxxx....
- Add a billing method at https://platform.openai.com/settings/organization/billing.
Google — for Gemini models (gemini-2.5-pro, gemini-2.5-flash, etc.).
- Go to https://aistudio.google.com.
- Click Get API key → Create API key.
- Copy the value. It looks like AIzaSyXXXXXXXX....
- The free tier covers light usage; for production, link a billing project in Google Cloud and enable the Generative Language API.
git clone <this-repo-url> browser-auto
cd browser-auto
npm install

Expected output: ends with something like added N packages, found 0 vulnerabilities. Warnings about npm version mismatches are safe to ignore.
The file must live at the repo root (./.env, NOT src/.env). Create it
by copying the example:
cp .env.example .env

Then edit .env. Set KERNEL_API_KEY and at least one provider key
(only the ones you'll use):
KERNEL_API_KEY=kn_live_paste-your-kernel-key-here
# Set the keys for providers you intend to use. Unused lines are fine left blank.
ANTHROPIC_API_KEY=sk-ant-api03-paste-your-anthropic-key-here
OPENAI_API_KEY=sk-proj-paste-your-openai-key-here
GOOGLE_API_KEY=AIzaSy-paste-your-google-key-here
# Default model used when a request doesn't specify one.
DEFAULT_MODEL=claude-opus-4-7
PORT=3000
HOST=0.0.0.0
LOG_LEVEL=info
RECIPE_DIR=./recipes
SCREENSHOT_DIR=./screenshots

DEFAULT_MODEL, PORT, HOST, LOG_LEVEL, RECIPE_DIR, and SCREENSHOT_DIR
are optional; the defaults shown above are fine for local development.
Security: .env is in .gitignore. Do not commit it. Do not paste the
keys into a chat or any tracked file. If you need to share them between
developers, use a secrets manager (1Password, Doppler, Vault) — not git.
Run the smoke test. It spins a Kernel browser, attaches Playwright, asks the agent to read example.com, and tears down — typically 8–25 seconds and a couple of US cents.
npm run smoke # default model (claude-opus-4-7)
MODEL=gpt-4.1 npm run smoke # OpenAI
MODEL=gemini-3.1-pro-preview-customtools npm run smoke # Google

Expected output (truncated):
┃ LIVE VIEW → https://proxy.<region>.onkernel.com:8443/browser/live/...
┃ session → <session-id>
[..] INFO: kernel session ready
[..] INFO: ➜ navigate({"url":"https://example.com"})
[..] INFO: ➜ done({"success":true,"summary":"Example Domain: This domain is for use ..."})
[..] INFO: smoke ok
[..] INFO: kernel session deleted
Exit code 0 means setup is good. If you see:
- KERNEL_API_KEY not set → .env is missing or the key line is malformed. Confirm that cat .env | grep KERNEL_API_KEY shows your key.
- ANTHROPIC_API_KEY not set / OPENAI_API_KEY not set / GOOGLE_API_KEY not set → the LLM provider for the model you're using needs its key set. The smoke test defaults to Claude; set MODEL=gpt-4.1 or MODEL=gemini-2.5-pro to switch.
- 401 or 403 from Kernel → the key is wrong or expired; regenerate it at https://app.kernel.sh/keys.
- not_found_error: model: <id> → your workspace doesn't have access to that model yet; add a payment method or pick another model. Switch with MODEL=<id> for the smoke test.
- Cannot infer provider from model id "<id>" → the model id doesn't match a known prefix (claude-*, gpt-*, o[134]*, gemini-*). Use a recognised prefix or the explicit provider/model form (e.g. anthropic/my-internal-alias).
- ECONNREFUSED / hang → outbound network is blocked. The harness needs to reach *.onkernel.com, api.anthropic.com, api.openai.com, and/or generativelanguage.googleapis.com, depending on which provider you use.
npm run start

Expected output:
[..] INFO: browser-auto API listening
port: 3000
host: "0.0.0.0"
Leave this running. From another terminal, sanity-check:
curl -s http://localhost:3000/health
# → {"ok":true}

For watch-mode development (auto-restart on file changes):
npm run dev

curl -s -X POST http://localhost:3000/tasks \
-H "Content-Type: application/json" \
-d '{
"task": "Buy one pair of shoelaces from allbirds.com as a guest. Stop on the order confirmation page.",
"merchant_url": "https://www.allbirds.com/",
"recipes": { "host": "allbirds.com" },
"variables": {
"buyerEmail": { "value": "test@example.com", "description": "guest checkout email" },
"firstName": { "value": "Test", "description": "first name" },
"lastName": { "value": "User", "description": "last name" },
"addressLine1": { "value": "555 4th St.", "description": "street address" },
"city": { "value": "San Francisco", "description": "city" },
"stateCode": { "value": "CA", "description": "state 2-letter code" },
"postalCode": { "value": "94107", "description": "ZIP" },
"phone": { "value": "4155550123", "description": "phone" },
"cardNumber": { "value": "4242424242424242", "description": "Visa test card" },
"cardExpiry": { "value": "12/27", "description": "card MM/YY" },
"cardCvv": { "value": "123", "description": "CVV" },
"cardholderName": { "value": "Test User", "description": "name on card" }
}
}'

Response (202 Accepted):
{
"task_id": "task_a1b2c3d4e5f6",
"status": "running",
"preview_url": "https://proxy.iad-...onkernel.com:8443/browser/live/abc",
"kernel_session_id": "z9...",
"poll_url": "/tasks/task_a1b2c3d4e5f6"
}

Open preview_url in a browser to watch the agent work in real time. Poll
the task to check status (recommended every 2–5s):
curl -s http://localhost:3000/tasks/task_a1b2c3d4e5f6 | jq

When status becomes succeeded, failed, or needs_human, the run is
terminal; stop polling.
Start a new agent run. The server returns 202 Accepted once the Kernel
browser session is created (typically <2s); the agent then runs in the
background.
| Field | Type | Required | Notes |
|---|---|---|---|
| task | string | yes | Natural-language goal for the agent. |
| variables | object | yes | Map of name → { value, description }. See §3. |
| merchant_url | string | no | If set, the page is navigated here before the agent starts. Saves the agent one navigation step. |
| recipes | object | no | { host: string, flow_key?: string }. Opts into record/replay (§4). |
| max_steps | number | no | Agent step budget. Default 60. |
| model | string | no | Model id. Provider is inferred from the prefix (see Supported models below). Default = server's DEFAULT_MODEL. |
| proxy | object | no | { country?: string (ISO-2), state?: string, type?: "isp" \| "residential" \| "datacenter" }. Default: { country: "US", type: "isp" }. |
| headless | boolean | no | Default false so Kernel returns a preview_url. Set true for cheaper headless runs without preview. |
Response 202:
{
"task_id": "task_...",
"status": "running",
"preview_url": "https://...",
"kernel_session_id": "...",
"poll_url": "/tasks/task_..."
}

Response 400 on validation errors ({ error: "invalid_input", message: "..." }).
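A client can catch most 400s before sending anything. The sketch below mirrors the request fields from the table above as a TypeScript type and applies only the rules this README states (a task string, plus variables whose names match ^[a-zA-Z0-9_]+$ and that carry a description). The names TaskRequest and validateTaskRequest are illustrative assumptions, not part of the harness, and the server's own validation may be stricter.

```typescript
// Hypothetical client-side mirror of the POST /tasks body; not the server's schema.
interface TaskRequest {
  task: string;
  variables: Record<string, { value: string; description: string }>;
  merchant_url?: string;
  recipes?: { host: string; flow_key?: string };
  max_steps?: number;
  model?: string;
  proxy?: { country?: string; state?: string; type?: "isp" | "residential" | "datacenter" };
  headless?: boolean;
}

const VARIABLE_NAME = /^[a-zA-Z0-9_]+$/;

// Returns a list of problems; an empty list means the body passes these checks.
function validateTaskRequest(body: TaskRequest): string[] {
  const problems: string[] = [];
  if (body.task.trim() === "") {
    problems.push("task must be a non-empty string");
  }
  for (const [name, variable] of Object.entries(body.variables)) {
    if (!VARIABLE_NAME.test(name)) {
      problems.push(`variable name "${name}" must match ${VARIABLE_NAME}`);
    }
    if (variable.description.trim() === "") {
      problems.push(`variable "${name}" needs a description`);
    }
  }
  return problems;
}
```

Running this before the POST gives a faster feedback loop than waiting for an invalid_input response.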
Returns the full task record. Poll this for live progress — you don't need
to open the Kernel preview URL to see what the agent is doing. The response
includes a latest_step field that shows exactly what just happened, in
human-readable text. This is the same per-step commentary that the live
preview would show.
{
"task_id": "task_a1b2c3d4e5f6",
"status": "running", // queued | running | succeeded | failed | needs_human
"preview_url": "https://...",
"kernel_session_id": "...",
"step_count": 14,
"latest_step": { // freshest action — useful for live status
"step": 14,
"tool": "fill_card",
"args": { "field": "cvc", "value": "%cardCvv%" },
"result": "filled cvc via visible input[autocomplete=\"cc-csc\"]#0"
},
"steps": [
{ "step": 1, "tool": "navigate", "args": {...}, "result": "navigated to ..." },
{ "step": 2, "tool": "click", "args": {...}, "result": "clicked button \"Add to Cart\"" },
...
],
"result": { // populated when status is terminal
"success": true,
"summary": "Order placed",
"order_id": "AB12345",
"total": "$22.38",
"final_url": "https://www.allbirds.com/checkouts/.../thank-you",
"final_title": "Order confirmation — Allbirds"
},
"error": null,
"created_at": "2026-05-17T13:00:00.000Z",
"updated_at": "2026-05-17T13:02:14.391Z"
}

Recommended polling: every 2–5 seconds. Display latest_step.result to your
user as the live status line. result is null until the run terminates.
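The polling advice above can be sketched as a small loop. This is a minimal illustration, assuming the response shape shown in this section; pollUntilTerminal and the injected getTask function are hypothetical names, chosen so the loop can run against a fake in tests as well as against the real endpoint.

```typescript
type TaskStatus = "queued" | "running" | "succeeded" | "failed" | "needs_human";

interface TaskSnapshot {
  status: TaskStatus;
  latest_step?: { step: number; tool: string; result: string } | null;
}

// The three terminal statuses named in this README.
function isTerminal(status: TaskStatus): boolean {
  return status === "succeeded" || status === "failed" || status === "needs_human";
}

// Polls until the task is terminal, reporting each new latest_step.result once.
async function pollUntilTerminal(
  getTask: () => Promise<TaskSnapshot>,
  onStep: (line: string) => void,
  intervalMs = 3000, // inside the recommended 2-5 second window
): Promise<TaskSnapshot> {
  let lastSeen = 0;
  for (;;) {
    const task = await getTask();
    const latest = task.latest_step;
    if (latest && latest.step > lastSeen) {
      lastSeen = latest.step;
      onStep(latest.result);
    }
    if (isTerminal(task.status)) return task;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Against a running server, getTask could be something like () => fetch("http://localhost:3000" + poll_url).then((r) => r.json()), with onStep set to console.log.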
If you'd rather not poll, subscribe to a live event stream. One persistent
HTTP connection delivers each step the moment it happens — same data as the
polling response's latest_step, but pushed.
curl -N http://localhost:3000/tasks/task_a1b2c3d4e5f6/events

Stream output (one event per \n\n block):
event: status
data: {"type":"status","status":"running","at":"2026-05-17T13:00:01.123Z"}
event: preview_ready
data: {"type":"preview_ready","preview_url":"https://proxy.iad-...","at":"2026-05-17T13:00:01.234Z"}
event: step
data: {"type":"step","step":{"step":1,"tool":"navigate","args":{"url":"..."},"result":"navigated to ..."},"at":"2026-05-17T13:00:02.000Z"}
event: step
data: {"type":"step","step":{"step":2,"tool":"click","args":{"index":29,...},"result":"clicked button \"Add to Cart\""},"at":"2026-05-17T13:00:04.500Z"}
...
event: end
data: {"type":"end","status":"succeeded","at":"2026-05-17T13:02:14.391Z"}
The server replays the full event history when you connect, then streams new
events live, then closes the connection on a terminal status. Heartbeats
(: heartbeat\n\n comment lines) are sent every 15s to keep proxies happy.
Event types: status (status_change), preview_ready (when the Kernel
session is up and preview_url is available), step (each agent action),
end (terminal — connection will close).
Client patterns:
- Node: EventSource from the eventsource package, or fetch(...) with a streaming response body.
- Browser: new EventSource("/tasks/<id>/events").
- Shell: curl -N (the -N flag disables buffering).
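If you read the stream with fetch(...) rather than EventSource, you have to split the body into events yourself. The sketch below is one way to do that, assuming \n line endings and JSON data payloads as in the sample stream above; parseSseEvents is a hypothetical helper, not something the harness exports.

```typescript
interface SseEvent {
  event: string;
  data: unknown;
}

// Splits buffered SSE text into complete events. Blocks are separated by a
// blank line; lines starting with ":" are comments (the heartbeats) and are
// skipped. Any trailing partial block is returned in `rest` so the caller can
// prepend it to the next chunk.
function parseSseEvents(buffer: string): { events: SseEvent[]; rest: string } {
  const blocks = buffer.split("\n\n");
  const rest = blocks.pop() ?? "";
  const events: SseEvent[] = [];
  for (const block of blocks) {
    let event = "message"; // SSE default when no "event:" line is present
    const dataLines: string[] = [];
    for (const line of block.split("\n")) {
      if (line.startsWith(":")) continue;
      if (line.startsWith("event:")) event = line.slice("event:".length).trim();
      else if (line.startsWith("data:")) dataLines.push(line.slice("data:".length).trim());
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join("\n")) });
    }
  }
  return { events, rest };
}
```

Feed each decoded chunk through the function as rest + chunk, and stop reading once an event named end arrives.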
Returns { "ok": true }. No auth.
The model field on POST /tasks accepts any id that maps to one of the
three providers via prefix inference, or an explicit provider/model form.
| Provider | Model id pattern | Examples |
|---|---|---|
| Anthropic | claude-* or anthropic/<id> | claude-opus-4-7, claude-sonnet-4-6, claude-haiku-4-5 |
| OpenAI | gpt-*, o\d, o\d-*, or openai/<id> | gpt-5, gpt-4.1, o3, o4-mini |
| Google | gemini-* or google/<id> | gemini-3.1-pro-preview-customtools, gemini-2.5-pro |
The provider's API key must be set in .env (see §1.2). If you submit a
task with a model whose provider key is missing, the task fails fast with
a runner_crash error and a clear message.
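The prefix rules in the table can be expressed in a few lines. This is a sketch of the idea only: the real logic lives in parseModelId in src/agent/llm/index.ts and may differ in detail, and inferProvider is a hypothetical name used here for illustration.

```typescript
type Provider = "anthropic" | "openai" | "google";

const PROVIDERS: readonly Provider[] = ["anthropic", "openai", "google"];

// Resolves a model id to a provider, preferring the explicit provider/model form.
function inferProvider(modelId: string): { provider: Provider; model: string } {
  const slash = modelId.indexOf("/");
  if (slash > 0) {
    const prefix = modelId.slice(0, slash);
    const explicit = PROVIDERS.find((p) => p === prefix);
    if (explicit) return { provider: explicit, model: modelId.slice(slash + 1) };
  }
  if (modelId.startsWith("claude-")) return { provider: "anthropic", model: modelId };
  // gpt-*, o\d, and o\d-* from the table above.
  if (modelId.startsWith("gpt-") || /^o\d(-.*)?$/.test(modelId)) {
    return { provider: "openai", model: modelId };
  }
  if (modelId.startsWith("gemini-")) return { provider: "google", model: modelId };
  throw new Error(`Cannot infer provider from model id "${modelId}"`);
}
```

The explicit form is what makes private aliases usable: anthropic/my-internal-alias resolves to Anthropic even though the alias itself matches no prefix.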
This agent does DOM-based browsing (snapshot interactive elements + function-calling tools). Pick the model variant tuned for that pattern:
| Provider | Recommended model id | Why |
|---|---|---|
| Anthropic | claude-opus-4-7 | Smartest currently available; works well with our tool surface. ✓ verified end-to-end. |
| OpenAI | gpt-4.1 or o4-mini | gpt-4.1 for accuracy, o4-mini for cost. Either uses standard function calling. |
| Google | gemini-3.1-pro-preview-customtools | Gemini 3.1 has several preview variants; the -customtools one is tuned for custom function-calling workflows (which is what our agent uses). ✓ verified end-to-end. |
Two Gemini variants worth knowing about:
- gemini-3.1-pro-preview: general-purpose Gemini 3.1 Pro. It works, but the customtools variant is better when (as here) the workflow is custom function calls rather than free-form text.
- gemini-2.5-computer-use-preview-10-2025: not for this harness. That's Google's Computer-Use Agent (CUA) model, which takes screenshots and emits coordinates. Our agent is DOM-based, not coordinate-based; using a CUA model here will give odd results.
# Anthropic (default)
curl -X POST http://localhost:3000/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "task": "...",
    "variables": {},
    "model": "claude-opus-4-7"
  }'
# OpenAI
curl -X POST http://localhost:3000/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "task": "...",
    "variables": {},
    "model": "gpt-4.1"
  }'
# Google
curl -X POST http://localhost:3000/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "task": "...",
    "variables": {},
    "model": "gemini-3.1-pro-preview-customtools"
  }'

Behaviour is identical across providers: each adapter converts our internal tool/message format to its own native shape and forces a single function call per turn. Recipes recorded by one provider replay on another: the recipe stores tool calls in a provider-neutral format, so a trajectory recorded with Claude will replay correctly when the next run uses Gemini or GPT.
Personal and payment data is passed as variables. In your task prompt
and in tool arguments the agent references variables by name, never by
literal value. The harness substitutes the real value at execution time
inside the Playwright call, so the substituted string never enters the
model's context window or transcript, whichever provider is in use.
In the task prompt:
"Fill the card details using the variables provided."
In the tool calls the model emits:
{ "tool": "fill_card", "args": { "field": "number", "value": "%cardNumber%" } }

The runtime expands %cardNumber% to the actual digits just before calling
page.fill(). Logs and the steps array preserve the %name% placeholder.
Variable name rules: must match ^[a-zA-Z0-9_]+$. Each variable must
include a description — that's what the agent reads to decide which
variable goes where.
This is not a PCI substitute. The values still pass over HTTPS to the server and through Playwright into the browser. The substitution only keeps them out of the LLM transcript. If the API key for this service leaks, an attacker can submit tasks that exfiltrate the variables they POST. Treat the service the same as you would any payment-handling endpoint.
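The substitution step described in this section can be pictured as follows. This is a minimal sketch under the stated rules (tokens of the form %name%, names matching ^[a-zA-Z0-9_]+$); substituteVariables is a hypothetical name, and the harness's actual implementation may handle nesting or missing variables differently.

```typescript
type Variables = Record<string, { value: string; description: string }>;

const TOKEN = /%([a-zA-Z0-9_]+)%/g;

// Returns a copy of the tool arguments with %name% tokens expanded in every
// string value. The input object is left untouched, so it can still be
// logged with placeholders intact.
function substituteVariables(
  args: Record<string, unknown>,
  variables: Variables,
): Record<string, unknown> {
  const expanded: Record<string, unknown> = {};
  for (const [key, raw] of Object.entries(args)) {
    if (typeof raw !== "string") {
      expanded[key] = raw;
      continue;
    }
    expanded[key] = raw.replace(TOKEN, (_token: string, name: string) => {
      // Own-property check so names like "constructor" cannot resolve by accident.
      if (!Object.prototype.hasOwnProperty.call(variables, name)) {
        throw new Error(`unknown variable %${name}%`);
      }
      return variables[name].value;
    });
  }
  return expanded;
}
```

Only the returned copy should reach Playwright; the original arguments, still holding %cardNumber%, are what belong in logs and in the steps array.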
The agent records a successful trajectory once and replays it from then on.
This is opt-in per request via the recipes field:
"recipes": { "host": "allbirds.com" }

When set, the harness:
- Loads ./recipes/<host>/<flow_key>.json if present.
- Loads ./recipes/<host>/playbook.md if present (hints prepended to the agent's system prompt, which speeds up the first run too).
- Replay mode (recipe trust == "validated"): walks the recorded steps; re-snapshots the page before each step and resolves the target element by tag + name. On a miss, issues a single-step heal LLM call.
- Record mode (no recipe, or trust == "draft", or trust == "healing"): runs the full LLM loop and records every successful tool call.
Trust transitions:
| State | After successful run | After failed run |
|---|---|---|
| draft (success_count < 3) | success_count++; if ≥3 → validated | stays draft |
| validated | success_count++; stays validated | → healing |
| healing | → validated | stays healing |
If too many steps (default >3) drift in one replay, the recipe is auto-dropped and the next run re-records.
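The trust table reads naturally as a small state machine. The sketch below encodes it as a pure function; RecipeState and nextRecipeState are hypothetical names (the real types live in src/recipes/types.ts), and leaving success_count unchanged on a healing → validated transition is an assumption, since the table does not say.

```typescript
type Trust = "draft" | "validated" | "healing";

interface RecipeState {
  trust: Trust;
  success_count: number;
}

const VALIDATION_THRESHOLD = 3; // a draft is promoted once success_count reaches 3

// Applies one run's outcome to a recipe's trust state, following the table above.
function nextRecipeState(state: RecipeState, runSucceeded: boolean): RecipeState {
  switch (state.trust) {
    case "draft": {
      if (!runSucceeded) return state; // stays draft
      const success_count = state.success_count + 1;
      return {
        trust: success_count >= VALIDATION_THRESHOLD ? "validated" : "draft",
        success_count,
      };
    }
    case "validated":
      return runSucceeded
        ? { trust: "validated", success_count: state.success_count + 1 }
        : { trust: "healing", success_count: state.success_count };
    case "healing":
      // Assumption: the counter is left alone here; the table is silent on it.
      return runSucceeded
        ? { trust: "validated", success_count: state.success_count }
        : state;
  }
}
```

Note that the auto-drop on excessive drift is a separate rule and is not modelled here.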
File layout:
recipes/<host>/default.json ← runtime state, gitignored
recipes/<host>/playbook.md ← hand-tuned hints, committed
Example playbook lives at recipes/allbirds.com/playbook.md.
There are two ways to watch a running task. Pick whichever fits your UX:
| Method | Use when | Visual? |
|---|---|---|
| Polling GET /tasks/:id | You want to render the agent's status in your own UI as plain text. The latest_step.result field is a human-readable description of every action. | No (text only) |
| SSE GET /tasks/:id/events | Same content as polling but pushed instead of pulled: lower latency, lower request cost. Best for dashboards. | No (text only) |
| Kernel preview_url | You want a visual stream of the actual browser. Useful for demos, first runs on a new merchant, or debugging tricky UI state. | Yes (a live video of Chrome) |
The preview_url returned by POST /tasks is a Kernel-hosted live view of
the cloud Chrome. Open it in any browser to watch the agent visually. It's
only present when the session is headful (headless: false, the
default). It's valid for the lifetime of the Kernel session and stops
working when the task terminates.
You do not need to open the preview URL to know what's happening. The polling and SSE channels carry the full per-step commentary. Use the preview when you want pixels; use polling/SSE for everything else.
npm run dev # watch-mode server (tsx watch)
npm run smoke # tiny end-to-end smoke against example.com
npm run allbirds # standalone allbirds checkout script
npm run peek # screenshot whatever the KEEP_ALIVE session shows
npm run kernel:close # tear down the KEEP_ALIVE session
npm run diagnose:card # dump every iframe's inputs (debug payment forms)
npm run typecheck # tsc --noEmit

Picking the model for CLI scripts: smoke and allbirds honour the
MODEL env var. The server, by contrast, uses DEFAULT_MODEL from .env
for any request that doesn't pass model. They're separate so you can
keep the server on Claude as the default while spot-testing other
providers from the CLI:
MODEL=gpt-4.1 npm run smoke
MODEL=gemini-3.1-pro-preview-customtools npm run smoke
KEEP_ALIVE=1 MODEL=gpt-4.1 npm run allbirds

For interactive iteration on a single merchant: set KEEP_ALIVE=1 and run
scripts/allbirds.ts. The Kernel session is persisted to
.kernel-session.json and reattached on the next run — no fresh browser
spin-up, no lost cart/cookies state. Use npm run peek to inspect, and
npm run kernel:close when done.
| Var | Required | Default | Notes |
|---|---|---|---|
| KERNEL_API_KEY | yes | none | Kernel cloud Chrome account. |
| ANTHROPIC_API_KEY | conditional | none | Required when using claude-* models. |
| OPENAI_API_KEY | conditional | none | Required when using gpt-* / o*-* models. |
| GOOGLE_API_KEY | conditional | none | Required when using gemini-* models. GEMINI_API_KEY is accepted as a synonym. |
| DEFAULT_MODEL | no | claude-opus-4-7 | Used when a request doesn't pass model. |
| MODEL | no | claude-opus-4-7 | Used by npm run smoke / npm run allbirds standalone scripts. |
| PORT | no | 3000 | HTTP port. |
| HOST | no | 0.0.0.0 | Bind interface. |
| LOG_LEVEL | no | info | Pino level. |
| RECIPE_DIR | no | ./recipes | Where recipes + playbooks live. |
| SCREENSHOT_DIR | no | ./screenshots | Where per-run screenshots go. |
| KERNEL_PROXY_TYPE | no | residential | Default proxy type for standalone scripts. |
| KERNEL_HEADLESS | no | unset | Set to 1 to default to headless (no preview URL). |
| KEEP_ALIVE | no | unset | Set to 1 in local scripts to persist the Kernel session between runs. |
src/
├── agent/
│ ├── loop.ts ← runAgent: dispatches replay vs record, persists recipe
│ ├── replay.ts ← deterministic step execution + single-step heal
│ ├── playbook.ts ← loads recipes/<host>/playbook.md
│ ├── elements.ts ← cross-frame DOM snapshot with stable resolve hints
│ ├── tools.ts ← click / fill / fill_card / navigate / scroll / etc.
│ └── llm/
│ ├── types.ts ← provider-neutral types (LLMClient, ChatMessage, ToolDef)
│ ├── anthropic.ts ← Claude adapter
│ ├── openai.ts ← GPT / o-series adapter
│ ├── google.ts ← Gemini adapter
│ └── index.ts ← createLLMClient(modelId) factory
├── recipes/
│ ├── types.ts ← Recipe, RecipeStep, Trust transitions
│ └── store.ts ← DiskRecipeStore (JSON files)
├── kernel/session.ts ← Kernel lifecycle + KEEP_ALIVE
├── obs/logger.ts ← pino structured logging
└── server/
├── index.ts ← entrypoint
├── api.ts ← Fastify routes
├── store.ts ← in-memory TaskStore
└── runner.ts ← drives one task end-to-end
scripts/
├── allbirds.ts ← standalone demo (KEEP_ALIVE-friendly)
├── smoke.ts ← ~30-line agent smoke
├── peek.ts ← screenshot the live KEEP_ALIVE session
├── close.ts ← tear down KEEP_ALIVE session
└── diagnose-card.ts ← dump iframe inputs (payment-form debugging)
recipes/<host>/
├── default.json ← recorded action trajectory (runtime state, gitignored)
└── playbook.md ← hand-tuned merchant hints (tracked in git)
This is a v1. Before deploying:
- No authentication on the API. Put it behind a reverse proxy / auth layer of your choice.
- In-memory task store. Tasks are lost on restart. Swap for Redis or Postgres if you need persistence or multi-instance scale.
- No queue, no concurrency limit. Each POST /tasks immediately spins up a Kernel session. Add a queue + worker pool for fairness and to bound Kernel cost.
- Screenshots stay on the server's disk under ./screenshots/. If you need them remotely, layer an S3 uploader on top of the runner.
- 3DS / CAPTCHA: the agent is instructed to call done(success=false, failure_reason="three_ds_required" | "captcha_required") rather than attempt to solve them. Handle these statuses on your end.
- Geo / proxy: default is US-ISP. Set proxy.country / proxy.state per request for geo-specific runs. Residential rotates per TCP connection; prefer ISP for checkout-style flows where stickiness matters.
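For the concurrency point in the list above, even a simple in-process limiter bounds how many Kernel sessions run at once. The sketch below is one possible shape, not something the harness ships: ConcurrencyLimiter is a hypothetical class, and it does not give you persistence or cross-instance fairness the way a real queue would.

```typescript
// Runs at most `limit` jobs at a time; extra jobs wait in FIFO order.
class ConcurrencyLimiter {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  async run<T>(job: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Wait until release() hands this caller a slot directly.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await job();
    } finally {
      this.release();
    }
  }

  private release(): void {
    const next = this.waiting.shift();
    if (next) {
      next(); // the slot passes straight to the next waiter; active is unchanged
    } else {
      this.active--;
    }
  }
}
```

Wrapping the runner call in limiter.run(...) would keep POST /tasks responsive while capping simultaneous sessions. Handing the slot directly to the next waiter avoids a window in which a newly arrived job could jump the queue.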
For reference, here's what the agent loop does on every iteration in record mode:
1. Snapshot: enumerate visible interactive elements across the main page and all iframes; tag each with a unique data-agent-id.
2. Call the LLM: send the page state + history + variable legend + tool definitions through the active LLMClient adapter; force exactly one tool call per response. The same internal call shape works against any of the three providers; the adapter translates to provider-native format.
3. Substitute variables: replace %name% tokens with real values just before the Playwright call.
4. Execute: click / fill / navigate / scroll / etc. via Playwright.
5. Record (if recipes opt-in): append the tool call to the recipe with element resolve hints (tag + name + frame).
6. Loop until done(...) is called or max_steps is hit.
In replay mode, steps 1, 3, 4 are deterministic; the LLM call (step 2) is only made for steps that fail to resolve or execute — at most 3 per run.
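The record-mode iteration described above can be sketched with its dependencies injected. Everything named here (LoopDeps, runRecordLoop, the method names) is hypothetical and only illustrates the control flow; the actual loop is runAgent in src/agent/loop.ts and will differ in detail.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Hypothetical dependency bundle; the real harness wires these to Playwright
// and to the active LLM adapter.
interface LoopDeps {
  snapshot(): Promise<string>; // step 1: describe the page
  chooseTool(page: string, history: ToolCall[]): Promise<ToolCall>; // step 2: one tool call
  execute(call: ToolCall): Promise<string>; // steps 3-4: substitute variables, then act
  record?(call: ToolCall): void; // step 5: only when recipes are enabled
}

async function runRecordLoop(
  deps: LoopDeps,
  maxSteps = 60, // matches the documented max_steps default
): Promise<{ finished: boolean; steps: number }> {
  const history: ToolCall[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    const page = await deps.snapshot();
    const call = await deps.chooseTool(page, history);
    if (call.tool === "done") return { finished: true, steps: step };
    await deps.execute(call); // a throw here means the call is never recorded
    deps.record?.(call);
    history.push(call);
  }
  return { finished: false, steps: maxSteps }; // step budget exhausted
}
```

Recording only after execute resolves matches the rule that recipes contain successful tool calls only.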
Each adapter lives in src/agent/llm/:
- anthropic.ts: @anthropic-ai/sdk. Tools forwarded as Anthropic's native shape; forced single-call via tool_choice: { type: "any", disable_parallel_tool_use: true }.
- openai.ts: openai SDK. Tools wrapped as function; forced single-call via tool_choice: "required" + parallel_tool_calls: false. Tool results split into role:"tool" messages per OpenAI's spec.
- google.ts: @google/genai SDK. Tools wrapped as functionDeclarations; forced single-call via toolConfig: { functionCallingConfig: { mode: "ANY" } }. Fabricates tool-use ids (Gemini doesn't emit them) and maps id→name so tool results can be matched by function name on the way back.
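The id fabrication mentioned for the Gemini adapter amounts to a small lookup table. The sketch below shows the idea only; ToolCallIdRegistry and the call_N id format are assumptions made for illustration, not the adapter's actual code.

```typescript
// Issues synthetic tool-call ids and remembers which function each one names,
// so a later tool result tagged with the id can be sent back by function name.
class ToolCallIdRegistry {
  private counter = 0;
  private readonly namesById = new Map<string, string>();

  // Called when the model returns a function call that carries no id.
  register(functionName: string): string {
    this.counter += 1;
    const id = `call_${this.counter}`;
    this.namesById.set(id, functionName);
    return id;
  }

  // Called when a tool result comes back tagged with a fabricated id.
  resolve(id: string): string {
    const name = this.namesById.get(id);
    if (name === undefined) throw new Error(`unknown tool call id ${id}`);
    return name;
  }
}
```

Keeping the mapping inside the adapter lets the rest of the loop treat every provider as if it emitted ids.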
Adding a fourth provider is small: implement the LLMClient interface from
src/agent/llm/types.ts and add a prefix branch in
src/agent/llm/index.ts:parseModelId.