All notable changes to Clawd Cursor will be documented in this file.
User reported Outlook launching repeatedly during a test. Root-cause diagnosis traced to three compounding failures: (1) `PlatformAdapter.openApp` spawned a new instance even when the app was already running, (2) the escalation ladder (router → blind → hybrid → vision) re-ran `open_app` at each rung because earlier rungs couldn't verify success through New Outlook's sparse WebView2 accessibility tree, (3) `clawdcursor stop` only killed the `start` process on port 3847, missing `serve` (different port / same port different process) and `mcp` (stdio, no port) entirely. A stale `serve` kept receiving MCP traffic after the user thought they'd stopped everything.

- `openApp` / `launchApp` idempotency (Windows + macOS + Linux). When the target app already has a visible window AND the caller didn't set `alwaysNewInstance: true` AND no `url` is passed, the adapter now focuses the existing window and returns its pid instead of spawning another instance. Match policy: case-insensitive exact processName → processName substring → title substring → UWP AppId tail. Closes the "N windows of Outlook stacking up" class of bug under any retry loop. `src/v2/platform/{windows,macos,linux}.ts`.
- Agent runaway guard: if the agent calls the same tool + identical args ≥ 3 times within the last 6 turns, the loop exits with `give_up` and a targeted message suggesting `detect_webview_apps` when the target is likely Electron/WebView2. Prevents the generalized "retry-loop-because-a11y-is-opaque" anti-pattern. `src/pipeline/agent/agent.ts`.
- `clawdcursor stop` now sweeps all modes. After the graceful `/stop` on port 3847, iterates every pidfile in `~/.clawdcursor/*.pid`, SIGTERMs any live pid, SIGKILLs after 500ms if still running, and unlinks the pidfile. Catches `mcp` (stdio-only), zombie `serve`, and any start/serve on a non-default port. `src/index.ts`.
- Stale-pidfile cleanup at startup was already correct via `claimPidFile` (checks `isProcessAlive(existingPid)` and overwrites when dead); no code change needed there, since the issue was exclusively `stop`.
- Tests: 429 / 430 pass (1 skipped, same as 0.8.2). No schema snapshot change: these are behavioral fixes, not catalog changes.
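
The runaway guard described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/pipeline/agent/agent.ts`: the `ToolCall` shape and the names `callKey` and `isRunaway` are hypothetical, and only the thresholds (3 repeats within 6 turns) come from the entry.

```typescript
// Hypothetical shape of one recorded tool call; the real agent types may differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

const WINDOW_TURNS = 6; // look-back window, from the changelog entry
const MAX_REPEATS = 3;  // identical calls that trip the guard

// Build a comparison key so identical tool + args map to the same string.
// Top-level arg keys are sorted; nested objects keep insertion order in this sketch.
function callKey(call: ToolCall): string {
  const entries = Object.keys(call.args)
    .sort()
    .map((k) => [k, call.args[k]]);
  return call.tool + ":" + JSON.stringify(entries);
}

// True when any single call appears MAX_REPEATS or more times
// among the last WINDOW_TURNS recorded calls.
function isRunaway(history: ToolCall[]): boolean {
  const counts = new Map<string, number>();
  for (const call of history.slice(-WINDOW_TURNS)) {
    const key = callKey(call);
    const n = (counts.get(key) ?? 0) + 1;
    if (n >= MAX_REPEATS) return true;
    counts.set(key, n);
  }
  return false;
}
```

A loop using this sketch would check `isRunaway(history)` after recording each call and exit with `give_up` when it returns true.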
First-time-user review surfaced six concrete pain points. This release fixes every one.
- Silent 401 mid-session (the session-killer). Previous versions compared the incoming Bearer token against an in-memory `SERVER_TOKEN` only. A second clawdcursor process (stale pidfile takeover, or a concurrent mode) rewrote the token FILE without updating the first server's in-memory copy, so clients reading the file silently lost auth. `/health` kept returning 200 so the failure was invisible. Fix: `requireAuth` now accepts EITHER the in-memory token OR the current on-disk token (mtime-cached, ~free). Drift is logged once with a recovery hint. `src/server.ts`.
- `focus_window` force-to-front on Windows. Previous implementation called `SetForegroundWindow` which the OS blocks when the caller isn't the current foreground process. New implementation uses the full sequence: `ShowWindow(SW_RESTORE)` → topmost-toggle → `AttachThreadInput` with the current foreground thread → `AllowSetForegroundWindow(ASFW_ANY)` → `BringWindowToTop` → `SetForegroundWindow`, with an Alt-key synthetic fallback. Raises any window through Windows' foreground lock. `scripts/ps-bridge.ps1`.
- Richer validation errors. REST `/execute` rejections now carry the full expected tool signature. A missing param returns `Missing required parameter "target". Expected smart_click(target: string, processId?: number).` so agents no longer have to roundtrip to `/docs`. `src/tool-server.ts`.
- Electron / WebView2 detection. New MCP tools `detect_webview_apps` and `relaunch_with_cdp` (also exposed via compact `system({"action":"detect_webview"})` / `system({"action":"relaunch_with_cdp"})`). Recognises olk (New Outlook), Teams, Discord, Slack, VS Code, GitHub Desktop, Notion, Obsidian, Spotify. When detected, probes ports 9222/9223/9229/8315 for a live CDP endpoint; if found, tells the agent to attach via `browser({"action":"connect"})`. If not, shows the exact relaunch command (e.g. `discord --remote-debugging-port=9222`) so CDP can be enabled and the sparse UIA tree bypassed entirely. `src/tools/electron_bridge.ts`.
- `drag_path` documentation clarity. Existing `mouse_drag_stepped` / compact `computer({"action":"drag_path","path":"[...]"})` now explicitly documented for freehand curve drawing (Paint, Figma, canvas apps). SKILL.md "Quick reference" covers when to use `drag_path` vs `drag`.
- SKILL.md pushes compact mode harder. Top of doc now carries a directive callout: "If you are an LLM reading this: YOU SHOULD BE USING COMPACT MODE." with MCP config + REST URL. Granular stays available but is explicitly labeled the power-user / larger-prompt option.
- SKILL.md web-app keyboard warning. Web-wrapped apps (Outlook, Teams, Gmail) treat `Escape` as "close dialog/modal", sometimes closing the compose window. Documented: do not use Escape to dismiss autocompletes in web apps; use arrow keys + Enter or click-away.
- Error-recovery table expanded with Electron-vs-true-canvas split, v0.8.2 auth recovery, v0.8.2 force-focus note, and the `drag_path` vs `drag` distinction.
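
The dual-token acceptance from the "Silent 401" fix above might look something like the sketch below. It is an illustration only: `TokenFile`, `createTokenCheck`, and the log text are hypothetical, file access is injected so the sketch stays dependency-free, and the real `requireAuth` in `src/server.ts` may be structured differently.

```typescript
// Hypothetical minimal view of the token file. In a real server these would
// be backed by file-system calls (stat for the mtime, read for the contents).
interface TokenFile {
  mtimeMs(): number; // last-modified time of the token file
  read(): string;    // current file contents
}

// Returns a checker that accepts either the in-memory token or the token
// currently on disk. The file is re-read only when its mtime changes.
function createTokenCheck(
  memoryToken: string,
  file: TokenFile,
  log: (msg: string) => void,
): (presented: string) => boolean {
  let cachedMtime = -1;
  let cachedDiskToken = "";
  let driftLogged = false;

  return (presented: string): boolean => {
    if (presented === memoryToken) return true;

    const mtime = file.mtimeMs();
    if (mtime !== cachedMtime) {
      cachedDiskToken = file.read().trim();
      cachedMtime = mtime;
    }
    if (cachedDiskToken !== "" && presented === cachedDiskToken) {
      if (!driftLogged) {
        // Log drift once, not on every request.
        log("token file differs from in-memory token; restart to resync");
        driftLogged = true;
      }
      return true;
    }
    return false;
  };
}
```

Plain `===` keeps the sketch short; a production check would normally use a constant-time comparison.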
- 429 / 430 passing (one skipped, same as 0.8.0).
- Schema snapshot regenerated → 74 granular tools (72 + 2 Electron bridge).
- Live smoke: token auth survives a second `clawdcursor serve`; `focus_window` raises Paint through a full-screen window; `detect_webview_apps` correctly flags Outlook / Teams / VS Code when any are open.
0.8.1-alpha.0 through -alpha.N shipped unified-pipeline + compact-MCP + Linux AT-SPI + Wayland routing on the feature branch. They roll into 0.8.2 as a single stable release. See the v0.8.1-alpha tag range in the git history for per-tranche detail; headline features:
- Unified blind/hybrid/vision agent: one loop, three modes. Replaces the v0.8.0 split `text-agent` + `vision-agent` with a single harness using native `tool_use` (Anthropic) / `tool_calls` (OpenAI) / prose-JSON fallback.
- Compact MCP surface: 6 compound tools (`computer`, `accessibility`, `window`, `system`, `browser`, `task`) that collapse the full capability into ~1,500 tokens of catalog. Anthropic-Computer-Use shape extended across the whole product. `clawdcursor mcp --compact` or `GET /tools?mode=compact`.
- PlatformAdapter widened: `mouseDown/Up`, `keyDown/Up`, `setWindowState`, `setWindowBounds`, `listDisplays`, `waitForElement`, widened `InvokeAction` (expand/collapse/toggle/select/get-value), richer `UiElement` state flags.
- Linux AT-SPI bridge: read-only first pass via `python3-gi` + `gir1.2-atspi-2.0`. Linux a11y methods (`getUiTree`, `findElements`, `getFocusedElement`, `waitForElement`) now return real data on boxes where the bridge dependencies are present. `invokeElement` still stubbed; tracked for a follow-up pass.
- Linux Wayland input routing: `ydotool` (mouse + keyboard) or `wtype` (keyboard fallback) detected at init. X11 path unchanged; Wayland no longer silently mis-fires through nut-js.
- Per-capability palettes + compound vision tools: text-agent turns now see a 6-10 tool scoped palette based on the subtask's capability (`app_launch` / `text_input` / `navigation` / `form_fill` / `spatial` / `file_ops` / `window_mgmt` / `general`). Vision-agent turns see 3 compound `mouse` / `keyboard` / `window` tools with action enums. ~12× fewer catalog tokens per turn.
- Pretty TTY logs with HH:MM:SS timestamps: layer-tagged (`[router]`, `[blind]`, `[vision]`, `[safety]`, etc.), no per-line repetition, `CLAWD_LOG=pretty` default on TTY.
- SKILL.md rewrite: reviewed by a Sonnet subagent against legacy v0.6.3/v0.7.14 tone, verified model-agnostic + OS-agnostic, restored "USE AS A FALLBACK" + "IMPORTANT — READ THIS BEFORE ANYTHING ELSE" directive callouts and Sensitive App Policy.
A ground-up reimagining of the internal pipeline. Opt in with clawdcursor start --v2. The legacy pipeline is unchanged and remains the default.
- `--v2` flag on `clawdcursor start`: activates the new 3-layer architecture (Router → VisionAgent → Verifier). No effect on MCP, `serve`, or legacy `start`.
- `src/v2/platform/`: platform abstraction. Single `PlatformAdapter` interface with `macos.ts`, `windows.ts`, `linux.ts` implementations. Replaces 142+ scattered `if (process.platform === 'darwin')` branches across 34 files. Business logic no longer sees `process.platform`. Adding a new OS = one file.
- `src/v2/verifier/`: `GroundTruthVerifier`. Six independent signals decide whether a task actually completed: pixel diff, window change, focus change, OCR delta, task-specific assertions (`send_email`, `navigate_url`, `open_app`, `type_text`, `search`, `compose_message`, `create_file`), and anti-patterns (error dialogs, "cannot send", "draft saved", invalid recipient, auth failed). Weighted voting with hard-fail rules on anti-patterns. Cannot be fooled by LLM self-reported "done".
- `src/v2/agent/`: `VisionAgent`, a single vision-first tool-use loop. 16 tools (`screenshot`, `read_screen`, `list_windows`, `click`, `drag`, `scroll`, `type`, `key`, `invoke_element`, `set_field_value`, `open_app`, `focus_window`, `read_clipboard`, `write_clipboard`, `wait`, `done`). 6-rule system prompt (down from 36). Model-agnostic via existing `callVisionLLM`.
- `src/v2/orchestrator.ts`: `PipelineV2` wires Router → VisionAgent → Verifier with before/after state capture.
- Hardened JSON parser: tolerates trailing braces, markdown code fences, and other common LLM malformations. Balanced-brace extraction as fallback.
- False positives: legacy pipeline reports `UNVERIFIED_SUCCESS` when the agent claims "done" but the screen didn't change. V2 verifier catches this class: in a live email-send test the agent said "Email sent" but a "Cannot send" dialog was on screen. V2 correctly rejected the claim. (Legacy still does what it does; this fix only applies when `--v2` is set.)
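
To make "weighted voting with hard-fail rules on anti-patterns" concrete, here is a small sketch of how such a verdict could be computed. The weights, the pass threshold, and every type and function name are assumptions chosen for illustration; the actual `GroundTruthVerifier` values are not stated in this changelog.

```typescript
// The five positive signals named in the changelog; anti-patterns are the sixth.
type Signal = "pixelDiff" | "windowChange" | "focusChange" | "ocrDelta" | "taskAssertion";

interface Evidence {
  signals: Record<Signal, boolean>;
  antiPatterns: string[]; // e.g. ["cannot send"]; any entry is a hard fail
}

interface Verdict {
  passed: boolean;
  score: number;
  reason: string;
}

// Illustrative integer weights (total 8) and threshold; not the project's values.
const WEIGHTS: Record<Signal, number> = {
  pixelDiff: 1,
  windowChange: 1,
  focusChange: 1,
  ocrDelta: 2,
  taskAssertion: 3,
};
const PASS_SCORE = 4;

function verify(evidence: Evidence): Verdict {
  // Hard-fail rule: an anti-pattern overrides every positive signal.
  if (evidence.antiPatterns.length > 0) {
    return {
      passed: false,
      score: 0,
      reason: "anti-pattern: " + evidence.antiPatterns.join(", "),
    };
  }
  let score = 0;
  for (const name of Object.keys(WEIGHTS) as Signal[]) {
    if (evidence.signals[name]) score += WEIGHTS[name];
  }
  const passed = score >= PASS_SCORE;
  return { passed, score, reason: passed ? "enough signals agree" : "too few signals" };
}
```

In the email-send case above, a "cannot send" anti-pattern would reject the claim even if every other signal fired.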
Smoke-tested on macOS with Anthropic Claude Haiku (text) + Sonnet (vision):
| Task | Time | Verdict |
|---|---|---|
| Open TextEdit and type | 30s | ✅ (4/6 signals) |
| Calculator: 47+53=100 | 65s | ✅ (5/6 signals, zero parse errors) |
| Safari → github.com | 45s | ✅ (6/6 signals) |
| Notes: create note | 182s | ✅ (6/6 signals) |
| Email send (failing server) | 86s | ❌ Correctly rejected — legacy would have reported success |
No legacy code modified. Windows, Linux, and MCP paths untouched. v2 code is entirely under src/v2/.
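
The balanced-brace fallback mentioned for the hardened JSON parser in the v2 notes can be sketched as below. This is an assumption-laden illustration, not the shipped parser: `extractFirstJsonObject` and `parseLenient` are hypothetical names, and the sketch only recovers the first top-level object.

```typescript
// Return the first balanced {...} span in `text`, or null if none closes.
// Braces inside JSON string literals are ignored.
function extractFirstJsonObject(text: string): string | null {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null;
}

// Try strict parsing first; fall back to balanced-brace extraction, which
// skips markdown fences or prose before the object and drops trailing junk.
function parseLenient(text: string): unknown {
  try {
    return JSON.parse(text);
  } catch {
    // fall through to extraction
  }
  const candidate = extractFirstJsonObject(text);
  if (candidate === null) throw new Error("no JSON object found");
  return JSON.parse(candidate);
}
```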
- macOS keystrokes silently dropped. Root cause: `CGEvent.post()` from the Swift helper is blocked by macOS TCC when the helper is spawned as a child of Node.js. `keyPress()` and `typeText()` on macOS now route through `osascript` + System Events (the Apple-sanctioned method). All keyboard shortcuts (Cmd+V, Cmd+N, Shift+Cmd+D, etc.) now work correctly.
- Single-char keys losing modifiers: `keycodeForCharacter()` lookup added to `ClawdCursorHelper`; modifiers are no longer discarded for Cmd+letter combos.
- `asDouble()` coercion: click/drag coordinates sent as integers (common from some LLMs) no longer fail with a type mismatch in the Swift helper.
- `keycodeForCharacter` fallback: now returns an error for unmapped characters instead of silently falling back to the 'v' keycode.
- Permission check inconsistency: `doctor`, `status`, and `readiness.ts` all now query the same canonical path: Host `/status` → `permission-check` binary → direct fallback. No more false "granted" reports.
- Screenshot capture CPU spin: replaced `CGWindowListCreateImage` (triggers ReplayKit CPU spin bug on macOS 14+) with a delegated `screenshot-helper` subprocess.
- A11y false positive: `isShellAvailable()` now tests actual window access (`p.windows.length`) instead of `processes.length`, which worked without Accessibility permission.
- Node.js v25 crash: `EINVAL` / `setTypeOfService` socket error from undici's internal QoS call is now caught and suppressed (non-fatal).
- Dock click zone: reduced from 60px to 30px on macOS (Dock is thinner than the Windows taskbar).
- Browser URL bar shortcut: `Cmd+L` used on macOS (was `Ctrl+L`, which does nothing in macOS browsers).
- `macMailEmailFlow`: deterministic email flow for macOS Mail.app (Cmd+N, Tab to subject/body, Cmd+Shift+D to send).
- `clawdcursor grant` command: triggers macOS system permission dialogs directly from the CLI.
- 115 Apple shortcuts: Mail, Safari, Notes, Messages, Terminal added to the shortcut database.
- `scripts/test-macos-fixes.sh`: one-shot E2E verification script (rebuild, binary check, permission consistency, screenshot capture, doctor cross-check).
- `--request-screen-recording` flag on `permission-check` binary: optional TCC dialog trigger for Screen Recording.
- `processPath` + `bundleId` in all permission check responses: aids TCC debugging.
- 30s TTL cache on A11y shell availability: permission grants mid-session are now detected without restart.
- macOS native binary verification in `scripts/verify-install.js`: warns on missing binaries at `npm install` time.
- `setup` script auto-builds native binaries on macOS (inside `npm run setup`).
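
The 30s TTL cache on A11y shell availability listed above could be implemented along these lines. This is a generic sketch, assuming a synchronous probe; `cachedWithTtl` and the injected clock are hypothetical and exist only to show why a mid-session permission grant is picked up within one TTL period.

```typescript
// Wrap an expensive probe so its result is reused until `ttlMs` has elapsed.
// `now` is injectable so the behaviour can be exercised without real waiting.
function cachedWithTtl<T>(
  probe: () => T,
  ttlMs: number,
  now: () => number = Date.now,
): () => T {
  let value: T | undefined;
  let fetchedAt = 0;
  let hasValue = false;

  return (): T => {
    const t = now();
    if (!hasValue || t - fetchedAt >= ttlMs) {
      value = probe(); // re-probe: a permission granted meanwhile shows up here
      fetchedAt = t;
      hasValue = true;
    }
    return value as T;
  };
}
```

For example, wrapping a (hypothetical) `probeShell` function as `cachedWithTtl(probeShell, 30000)` would re-check availability at most once every 30 seconds.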
- `build.sh`: marked executable in git, fails fast on missing binaries (was silently warning), better error guidance.
- Installer: verifies all 4 required binaries (not just `ClawdCursorHost`), uses `bash ./build.sh` for portability.
- `doctor.ts`: permission check unified via `native-helper` module; triggers system permission dialogs if denied.
- Email flow keyboard shortcuts are now platform-aware: `Ctrl+Enter` → `Shift+Cmd+D` on macOS, `Ctrl+H` → `Cmd+Option+F` for Find & Replace.
- `sharp` bumped `^0.33.0` → `^0.33.5`.
No Windows or Linux code paths affected. All macOS changes are gated behind IS_MAC / process.platform === 'darwin' / isMacOS().
- Permission check fragmentation: doctor, status, and readiness each used different permission APIs, producing contradictory results. All now route through `ClawdCursorHost /status` → `permission-check` binary → direct `AXIsProcessTrusted` fallback.
- Screenshot CPU spin: delegated `takeScreenshot()` to `screenshot-helper` subprocess, eliminating the ReplayKit CPU spike on macOS 14+.
- Installer binary verification: now checks all 4 required binaries (`ClawdCursorHost`, `clawdcursor-helper`, `screenshot-helper`, `permission-check`) instead of just `ClawdCursorHost`.
- `build.sh` silent failures: `swift build` errors now fail the build immediately with actionable guidance.
- `clawdcursor grant` command: triggers macOS system permission dialogs for Accessibility and Screen Recording.
- `processPath` + `bundleId` in permission check responses for TCC debugging.
- `--request-screen-recording` flag on `permission-check` binary.
- Bash pipeline bug: `set -o pipefail` added; build failures now properly detected (was silently passing due to pipeline exit status bug)
- Ad-hoc signing by default: build.sh now always signs the app (required for TCC on macOS 26+ Tahoe where unsigned binaries don't appear in privacy settings)
- Build error capture — uses temp file instead of pipe to properly capture exit status
- TCC permission check — runs permission-check after build to show current accessibility/screen recording status
- build.sh rewritten — cleaner structure, ad-hoc signing is default (not optional), signature verification added
- Codesign uses --deep — ensures all nested binaries are signed
- Installer shows TCC status — tells user exactly which permissions need to be granted and where
The core issue was that TCC (Transparency, Consent, and Control) on macOS binds permissions to the code signing identity. Without signing:
- On macOS 26+ (Tahoe), unsigned binaries don't appear in System Settings privacy panels at all
- Users saw "ClawdCursorHost binary not found" errors even though install appeared to succeed
Reference: mediar-ai/mcp-server-macos-use for TCC permission handling patterns.
- macOS installer now fails loudly if native host build fails — was silently swallowing build errors and claiming "optional fallback" that doesn't exist
- Added verification step — installer explicitly checks ClawdCursorHost binary exists before declaring success
- Show build output — Swift build errors are now visible instead of redirected to /dev/null
- Clear error messages — tells users exactly what went wrong and how to fix it (xcode-select --install, manual rebuild, etc.)
- macOS native host is now correctly marked as REQUIRED, not optional
- Installer exits with error code 1 if native build fails on macOS
- Installer shows next steps: after install, displays clear guidance: `clawdcursor doctor` → `clawdcursor start`
- Doctor shows run options: after passing all checks, shows both `start` (full agent) and `serve` (tools-only) modes
- Consent shows next step: after granting consent, directs users to `clawdcursor doctor`
- macOS permission messages — now direct users to enable "ClawdCursor" instead of "Terminal/Node"
- Screen Recording path — updated to "Screen & System Audio Recording" (macOS Sequoia naming)
- Installer comments updated — example version references now point to v0.7.8
- Installers default to main branch: install.sh and install.ps1 now use `main` instead of hardcoded non-existent tag
- macOS installer builds native helper: install.sh now runs `./native/build.sh` on Darwin if Swift is available
- Version override support: `VERSION=v0.7.7 curl ... | bash` or `$env:VERSION='v0.7.7'` to install specific release
- Auto-pull on update: installers now run `git pull` after checkout to get latest changes
- macOS Host App (ClawdCursorHost) — new native Swift executable that runs as the app bundle's main process, owning all TCC permissions (Accessibility, Screen Recording) under a single app identity
- Localhost IPC server: host app exposes `GET /health`, `GET /status`, `POST /rpc` on `127.0.0.1:3848` for CLI→host communication
- Token-based authentication: `~/.clawdcursor/host-token` (mode 0600) secures the IPC channel
- Auto-launch/stop: `clawdcursor start` ensures host is running; `clawdcursor stop` gracefully quits it
- New Swift helper methods: `moveMouse`, `dragMouse`, `captureScreen` for smoother native macOS automation
- Menu bar presence: host app shows 🐾 icon in menu bar for visibility
- Localhost-only binding: IPC server uses `NWParameters.requiredLocalEndpoint` to bind to `127.0.0.1` only, rejecting connections from other machines
- Token file permissions: host-token created with mode 0600 (owner read/write only)
- `src/native-helper.ts`: routes all macOS desktop operations through host IPC instead of direct stdio
- `src/native-desktop.ts`: 11 platform-guarded code paths delegate to host on macOS
- `src/index.ts`: start/stop commands manage host app lifecycle
- `native/ClawdCursor.app/Contents/Info.plist`: bundle identifier changed to `com.clawdcursor.app`, executable to `ClawdCursorHost`
- Windows/Linux: all macOS code behind `IS_MAC && this.helper` guards; no behavior changes on other platforms
- 172 tests pass: full test suite unchanged
- LLM-based universal task pre-processor: one cheap text LLM call decomposes any natural language into `{app, navigate, task, contextHints}`, replacing brittle regex parsing
- Multi-app workflow support: copy/paste between apps (e.g. Wikipedia → Notepad) with 6-checkpoint tracking: first_app_focused → first_app_action_done → content_copied → second_app_opened → content_pasted → result_visible
- Site-specific keyboard shortcuts — Reddit (j/k/a/c), Twitter/X (j/k/l/t/r), YouTube (Space/f/m), Gmail (j/k/e/r/c), GitHub (s/t/l), Slack (Ctrl+k), plus generic hints
- OS-level default browser detection — reads Windows registry (HKCU ProgId) or macOS LaunchServices instead of hardcoded Edge/Safari
- 3 verification retries with step log analysis — when verification fails, builds a digest of recent actions + checkpoint status so the vision LLM can fix the specific missed step
- Mixed-provider pipeline support — e.g. kimi for text, anthropic for Computer Use, with per-layer API key resolution from OpenClaw auth-profiles
- `ComputerUseOverrides` interface: apiKey, model, baseUrl per-layer for mixed-provider setups
- `resolveProviderApiKey()` helper: reads OpenClaw auth-profiles to find the right API key per provider
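
The 6-checkpoint tracking for multi-app workflows listed above can be pictured with a small tracker like the one below. Only the checkpoint names come from the changelog; the `CheckpointTracker` class and its methods are hypothetical and say nothing about how the real pipeline detects each milestone.

```typescript
// Checkpoint names in the order given in the changelog.
const CHECKPOINTS = [
  "first_app_focused",
  "first_app_action_done",
  "content_copied",
  "second_app_opened",
  "content_pasted",
  "result_visible",
] as const;

type Checkpoint = (typeof CHECKPOINTS)[number];

// Hypothetical tracker: records which milestones were observed and reports
// what is still missing, e.g. to build a retry digest for the vision LLM.
class CheckpointTracker {
  private readonly done = new Set<Checkpoint>();

  mark(checkpoint: Checkpoint): void {
    this.done.add(checkpoint);
  }

  missing(): Checkpoint[] {
    return CHECKPOINTS.filter((cp) => !this.done.has(cp));
  }

  completionRatio(): number {
    return this.done.size / CHECKPOINTS.length;
  }

  isComplete(): boolean {
    return this.done.size === CHECKPOINTS.length;
  }
}
```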
- Checkpoint system overhaul — removed auto-termination (completionRatio ≥ 0.90 early exit and isComplete() mid-loop kill), strict detection: content_pasted requires Ctrl+V, content_copied requires Ctrl+C, second_app_opened detects any window switch universally
- Pipeline context passing: `priorContext[]` accumulator flows from pre-processing through to Computer Use (no more amnesia between layers)
- Credential resolution order: .clawdcursor-config → auth-profiles.json → openclaw.json (with template expansion) → env vars
- `loadPipelineConfig()` path resolution: checks package dir first, then cwd (fixes global npm installs)
- Smart Interaction model lookup: uses `PROVIDERS` registry instead of hardcoded model/baseUrl maps; fixes stale `claude-haiku-3-5-20241022` fallback
- Scroll behavior: system prompts instruct PageDown/Space instead of tiny mouse scrolls; default scroll delta 3 → 15
- Provider-agnostic internals — all comments and logs say "vision LLM" instead of "Claude"
- Verification retry limit — max 3 retries prevents infinite verification loops
- Universal checkpoint detection: no hardcoded app lists; `detectTaskType()` uses action patterns only
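
The credential resolution order noted above (.clawdcursor-config → auth-profiles.json → openclaw.json → env vars) amounts to a first-match lookup over ordered sources. The sketch below shows that idea only; `CredentialSource` and `resolveApiKey` are hypothetical names, and how each source is actually read (including template expansion) is left out.

```typescript
// Hypothetical abstraction over one place a key can come from.
interface CredentialSource {
  name: string;
  lookup(provider: string): string | undefined;
}

// Walk the sources in priority order and return the first non-empty key,
// along with the source it came from (useful for logging).
function resolveApiKey(
  provider: string,
  sources: CredentialSource[],
): { key: string; source: string } | undefined {
  for (const source of sources) {
    const key = source.lookup(provider);
    if (key !== undefined && key.trim() !== "") {
      return { key, source: source.name };
    }
  }
  return undefined;
}
```

Passing the sources in the documented order means a key in `.clawdcursor-config` wins over one in the environment.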
- Pipeline architecture: LLM Pre-processor → Pre-open app + navigate → L0 Browser → L1 Action Router + Shortcuts → L1.5 Smart Interaction → L2 A11y Reasoner → L3 Computer Use
- Pre-processor prompt hardened with NEVER rules (never summarize, never drop steps) and VALIDATION RULE
- MULTI-APP WORKFLOWS section added to both Mac and Windows Computer Use system prompts
- Checkpoint thresholds tightened: early completion 75% → 90%, skip-verification 50% → 80%
- Checkpoint-based task completion — Computer Use tracks milestones (compose opened → fields filled → send pressed → compose closed) and stops when all checkpoints are met. No more wasted calls after successful completion.
- Task type detection — auto-classifies tasks (email, form, navigate, draw, file_save) and applies appropriate checkpoint templates.
- Smart early termination — when Claude says "done" and ≥75% checkpoints confirmed, accepts completion immediately.
- Auto-config on first run: `clawdcursor start` auto-detects providers without needing `clawdcursor doctor`.
- Universal provider support: any OpenAI-compatible endpoint works via `--base-url`.
- CLI model selection: `--text-model` and `--vision-model` flags.
- Email domain extraction bug — "send to user@hotmail.com" no longer navigates to hotmail.com. Email addresses are stripped before URL matching.
- Verification override bug — verification no longer contradicts confirmed checkpoint completion. Skipped when ≥50% checkpoints met.
- Context loss between layers — Computer Use now receives full context of what pre-processing already did.
- Drawing quality — minimum 50px drag distances enforced via system prompt.
- OpenClaw credential discovery — multi-provider scan, template variable resolution, no false overrides.
- Pipeline gate — Action Router always runs, shortcuts work everywhere.
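
The email domain extraction fix above (strip email addresses before URL matching) can be illustrated with the sketch below. Both regular expressions and the `extractDomain` name are simplified assumptions for illustration, not the patterns used in the project.

```typescript
// Simplified patterns for illustration; real-world matching is stricter.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const DOMAIN_RE = /\b(?:https?:\/\/)?(?:[a-z0-9-]+\.)+[a-z]{2,}\b/i;

// Remove email addresses first so their domain part can never be mistaken
// for a site the user wants to open, then look for a URL or bare domain.
function extractDomain(task: string): string | null {
  const withoutEmails = task.replace(EMAIL_RE, " ");
  const match = withoutEmails.match(DOMAIN_RE);
  return match ? (match[0] ?? null) : null;
}
```

With this ordering, "send to user@hotmail.com" yields no navigation target, while a task that names both a site and a recipient still resolves to the site.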
- Pipeline pre-processes "open X and Y" tasks — opens app via Action Router (free), then hands remaining task to deeper layers.
- Smart Interaction detects visual loop tasks (draw, paint) and skips to Computer Use.
- Computer Use system prompt includes Snap Assist handling and drawing guidelines.
- Auto-config on first run: `clawdcursor start` auto-detects and configures providers without needing `clawdcursor doctor` first. Doctor is now optional for fine-tuning.
- Universal provider support: any OpenAI-compatible endpoint works. Not limited to 7 hardcoded providers. Use `--base-url` + `--api-key` for custom endpoints.
- CLI model selection: `--text-model` and `--vision-model` flags on start command.
- Dynamic OpenClaw provider mapping: reads ALL providers from OpenClaw config, not just known ones. NVIDIA, Fireworks, Mistral, etc. work automatically.
- `clawdcursor start` now auto-runs setup if no config exists (non-interactive)
- Provider detection accepts any provider name, falling back to OpenAI-compatible API
- `detectProvider()` returns 'generic' for unknown providers instead of defaulting to 'openai'
- Keyboard shortcuts registry (`src/shortcuts.ts`): 30+ common actions mapped to direct keystrokes. Scroll, copy, paste, undo, reddit upvote/downvote, browser shortcuts, and more. Zero LLM calls.
- Fuzzy shortcut matching: "scroll the page down" fuzzy-matches to scroll-down shortcut. Context-aware matching for social media actions.
- Router telemetry — Action Router now logs match type, confidence, and shortcut hits.
- CDP→UIDriver fallback — Smart Interaction falls back to accessibility tree automation when browser CDP path fails.
- Gmail, Outlook, Hotmail added to Browser Layer site map.
- Pipeline gate bug: Action Router was gated behind `!isBrowserTask`, causing shortcuts to be skipped for browser-context tasks (e.g., "reddit upvote" matched browser regex but should use shortcut). Action Router now always runs after Browser Layer.
- URL extraction false positives: "open gmail and send email to foo@bar.com" no longer extracts `bar.com`. URL extraction now isolates the navigation clause before matching.
- Reliable force-stop: `clawdcursor stop` now force-kills lingering processes via PID file.
- Provider label inference: startup logs now clearly show text and vision provider names separately.
- Pipeline order: Browser Layer (L0) → Action Router + Shortcuts (L1) → Smart Interaction (L1.5) → A11y Reasoner (L2) → Vision (L3). Action Router no longer gated.
- `extractUrl()` uses navigation clause isolation instead of matching against full task text.
- OpenClaw credential integration: auto-discovers all configured providers from OpenClaw's `auth-profiles.json` and `openclaw.json`. No separate API key needed when running as an OpenClaw skill.
- Universal provider support: added Groq, Together AI, DeepSeek as first-class providers with profiles, env var detection, and key prefix recognition.
- Auto-detection as default: provider defaults to `auto` instead of hardcoding Anthropic. Doctor picks the best available provider automatically.
- Mixed provider pipelines: use Ollama for text (free) + any cloud provider for vision (best quality). Vision credentials preserved when brain reconfigures for text.
- Dynamic Ollama model selection: doctor picks the best available Ollama model instead of hardcoding `qwen2.5:7b`.
- Anthropic vision routing fix: detects Anthropic vision by key prefix (`sk-ant-`) independently of the main provider field, so split-provider setups work correctly.
- Default config no longer assumes any specific provider or model
- Provider scan loop iterates all registered providers dynamically
- Help text and doctor output are provider-agnostic
- `--provider` CLI flag accepts any string (not limited to 4 providers)
- README updated with 7-provider compatibility table
- SKILL.md hardened — removed aggressive autonomy language ("use without asking", "be independent")
- Sensitive App Policy — agents must ask the user before accessing email, banking, messaging, or password managers
- Safety tiers as hard rules — 🔴 Confirm actions must never be self-approved by agents
- Data flow transparency — expanded security section documents network isolation, per-provider data flow, and Ollama = fully offline
- No credentials in skill directory — OpenClaw users get auto-discovery from local config; no keys stored in skill files
- Vision model crash when main provider set to Ollama but vision uses Anthropic (`model not found` error)
- Brain reconfiguration was wiping vision credentials; now preserved
- Fluid LLM task decomposition — decompose prompt now tells the LLM to reason about what ANY app needs. No more hardcoded examples. "Write me a sentence about dogs" generates actual content instead of typing the literal instruction.
- Interactive doctor onboarding — after scanning providers, doctor shows all working TEXT and VISION LLM options with ★ recommendations. User picks by number, Enter for default. Shows GPU info (VRAM via nvidia-smi) to help decide local vs cloud.
- Cloud provider guidance — doctor shows unconfigured providers with signup URLs and lets you paste an API key inline (auto-detects provider, saves to .env).
- Smart vision fallback for compound tasks — when Router or Reasoner handles part of a multi-step task but fails midway, ALL remaining subtasks are bundled and handed to Computer Use (vision). Prevents false-success trapping in cheap layers.
- Ollama auto-detection: brain auto-reconfigures to use local Ollama for decomposition when no cloud API key is set. `hasApiKey` now recognizes local LLMs.
- Compound task guard: action router detects multi-step/compound tasks (commas, "then", "and then") and skips to deeper layers.
- Case-preserving action router — all regex matches against raw (unmodified) task text. Typed text and URLs no longer get lowercased.
- Flexible click matching: `click Blank document` works without quotes (was requiring `click "Blank document"`). Single unified regex for quoted and unquoted element names.
- PowerShell encoding: replaced emoji (🐾) and em dash (—) in task console title that broke on Windows PowerShell due to encoding.
- Stale config: `.clawdcursor-config.json` now correctly reflects Ollama when doctor detects it (was stuck on Anthropic).
- Brain provider mismatch: decomposition no longer calls Anthropic API when only Ollama is available.
- `npm run setup`: new script that builds and registers `clawdcursor` as a global command via `npm link`. Works on Windows, macOS, and Linux.
- Stop/kill port validation: port input is now sanitized (parseInt + range check 1-65535) to prevent command injection
- Kill health verification: kill command now verifies `/health` returns a Clawd Cursor response before force-killing
- Install instructions updated: README and docs now use `npm run setup`
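
The stop/kill port validation above (parseInt + range check 1-65535) might be sketched like this. The `parsePort` name is hypothetical, and the digits-only pre-check is an added assumption that is stricter than the changelog describes: `parseInt` alone would accept a string such as `"3847; rm -rf /"` and return 3847.

```typescript
// Parse a user-supplied port. Returns null for anything that is not a plain
// decimal integer in the valid TCP port range.
function parsePort(input: string): number | null {
  const trimmed = input.trim();
  // Assumption: reject any non-digit characters outright, so nothing but
  // the number itself can ever reach a shell command.
  if (!/^\d+$/.test(trimmed)) return null;
  const port = Number.parseInt(trimmed, 10);
  return port >= 1 && port <= 65535 ? port : null;
}
```

Callers should interpolate the returned number, never the original string, into any command they build.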
| Task | Pipeline Path | Steps | LLM Calls | Time | Result |
|---|---|---|---|---|---|
| Open Notepad | Action Router | 1 | 0 | 1.5s | ✅ |
| Open Notepad + write haiku | Router → Smart Interaction → Computer Use | 6 | 7 | 58.8s | ✅ Verified |
| Open Google Doc in Edge + write sentence | Browser → Computer Use | 17 | 9 | 78.8s | ✅ Verified |
- `clawdcursor install`: one command to set up API key, configure pipeline, and register as OpenClaw skill
- `clawdcursor uninstall`: clean removal of all config, data, and OpenClaw skill registration
- Doctor auto-registers as OpenClaw skill: symlinks into `~/.openclaw/workspace/skills/clawdcursor`
- Doctor quick fix commands: shows exact commands for missing text LLM and vision LLM in summary
- Dashboard favorites — star commands to save them, click to re-run, persists across server restarts
- Credential detection — warns when starring tasks that contain API keys or passwords
- OS tabs on website — Windows/macOS/Linux with auto-detect
- Post-build help message: shows all available commands after `npm run build`
- Dynamic OS detection: system prompt uses actual OS instead of hardcoded "Windows 11" (thanks @molty)
- Windows skill detection: removed `requires.bins` from SKILL.md; OpenClaw's `hasBinary()` doesn't handle Windows PATHEXT (.exe/.cmd), causing the skill to show as "missing" even when node is installed
- SKILL.md rewritten — agent identity shift framing, trigger lists, CDP direct path, async polling, error recovery
- Security hardened — agents cannot self-approve confirm-tier actions, autonomous use scoped to read-only
- Privacy language clarified — explicit per-provider data flow
- Website Get Started simplified — 3 lines, commands shown in terminal post-build
- Anthropic text model updated: `claude-haiku-4-5` (was `claude-3-5-haiku-20241022`)
- Privacy language clarified — explicit per-provider data flow (Ollama = fully local, cloud = data to that API only)
- Added homepage and source URLs to skill metadata
- Removed hard-coded paths from SKILL.md
- Security section expanded — includes localhost bind verification command
- Security scan addressed — all flagged documentation gaps resolved
- SKILL.md rewritten — agents now understand they have full desktop control and stop asking users to do things they can do themselves
- Agent identity shift framing — blockquote at top overrides default "I can't do desktop things" behavior
- "When to Use This" trigger list — comprehensive decision framework for when to reach for Clawd Cursor
- Two paths documented — REST API (port 3847) for full desktop control, CDP Direct (port 9222) for fast browser reads
- Async flow clarified — concrete polling pattern agents can follow step-by-step
- Error recovery table — 8 common problems with exact solutions
- Expanded task examples — cross-app workflows, data extraction, verification scenarios
- README — added OpenClaw Integration section
- Web Dashboard — full single-page UI served at `GET /` (port 3847). Task submission, real-time logs, status indicators, approve/reject for safety confirmations, kill switch. Dark theme, fully responsive, zero external dependencies.
- `clawdcursor dashboard` — CLI command to open the dashboard in your default browser
- `clawdcursor kill` — CLI command to send a stop signal to the running server
- `GET /logs` — API endpoint returning last 200 log entries with timestamps and levels
- Browser foreground focus — Playwright navigation now brings Chrome to the front via `page.bringToFront()` + OS-level window activation (PowerShell `SetForegroundWindow` on Windows, `osascript` on macOS). The AI acts like a visible cursor — you see everything it does.
- Console hook — `hookConsole()` intercepts all server logs for the dashboard log feed with auto-classification (error/success/warn/info)
- Smart task handoff — Browser layer no longer uses regex word lists to detect multi-step tasks. Pure navigation ("open youtube") completes in browser layer; anything more complex falls through to SmartInteraction where the LLM plans the steps. No more missed verbs.
Layer 0: Browser (Playwright) — navigate + foreground focus
↓ more than navigation? → fall through
Layer 1: Action Router — regex patterns, zero LLM calls
↓ no match? → fall through
Layer 1.5: Smart Interaction — 1 LLM call plans steps, CDP/UIDriver executes
↓ failed? → fall through
Layer 2: Accessibility Reasoner — reads UI tree, cheap LLM
↓ failed? → fall through
Layer 3: Screenshot + Vision — full screenshot, Computer Use API
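
The fall-through order above can be sketched as a loop over layers. This is a minimal illustration, not the project's implementation: `Layer`, `LayerResult`, and `runPipeline` are hypothetical names, and the sketch is synchronous for brevity, whereas real layers that call an LLM or Playwright would be asynchronous.

```typescript
// Hypothetical types for illustration; none of these names come from the Clawd Cursor source.
type LayerResult = { handled: boolean };

interface Layer {
  name: string;
  run(task: string): LayerResult;
}

// Try each layer in order. The first layer that reports handled=true wins.
// A layer that throws is treated the same as one that declines the task.
function runPipeline(layers: Layer[], task: string): string | null {
  for (const layer of layers) {
    try {
      if (layer.run(task).handled) {
        return layer.name;
      }
    } catch {
      // Assumption: a failure simply falls through to the next, more expensive layer.
    }
  }
  return null; // no layer could handle the task
}
```

Ordering the array from cheapest to most expensive layer keeps LLM and screenshot costs as a last resort, which matches the intent of the diagram.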
- HD screenshots — LLM resolution increased from 1024px to 1280px (scale 2x instead of 2.5x). Claude can now reliably identify toolbar icons, buttons, and small UI elements.
- JPEG quality — bumped from 55 to 65 for clearer icon identification
- Window focus stability — `Win+D` minimizes all windows before task execution, preventing the Clawd terminal from stealing focus from target apps
- Paint drawing reliability — pencil tool guidance in system prompt, mandatory checkpoint after tool selection
- Stale file cleanup — restored `get-windows.ps1` shim (still referenced by accessibility.ts), removed dead `setup.ps1` and `get-ui-tree.ps1`
| Metric | v0.5.0 | v0.5.1 |
|---|---|---|
| Time | ~250s | 55s |
| API calls | 30 | 6 |
| Success rate | ~50% | ~90% |
- `clawdcursor doctor` — auto-diagnoses setup, tests models, configures optimal pipeline
- 3-layer pipeline — Action Router → Accessibility Reasoner → Screenshot fallback
- Layer 2: Accessibility Reasoner (`src/a11y-reasoner.ts`) — text-only LLM reads the UI tree, no screenshots needed. Uses cheap models (Haiku, Qwen, GPT-4o-mini).
- Batch action execution — Claude returns multiple actions per response (3.6 avg), skipping screenshots between batched actions. Drawing tasks execute 10+ actions in a single API call.
- Focus hints — each screenshot includes a FOCUS directive telling Claude where to look, reducing output tokens and decision time
- Auto-maximize — apps launched via Action Router are automatically maximized (`Win+Up`) for consistent layout
- Region capture — `captureRegionForLLM()` crops screenshots to specific areas (2-30KB vs 58KB full)
- Checkpoint strategy — screenshots only after critical state changes (app open, dialog appear), not after every action
- Multi-provider support — Anthropic, OpenAI, Ollama (local/free), Kimi. Same codebase, auto-detected.
- Provider model map (`src/providers.ts`) — auto-selects cheap/expensive models per provider
- Self-healing — doctor falls back if a model is unavailable (e.g., Haiku → Qwen). Circuit breaker disables failing layers at runtime.
- Streaming LLM responses — early JSON return saves 1-3s per call
- Combined accessibility script (`scripts/get-screen-context.ps1`) — 1 PowerShell spawn instead of 3
- Benchmark harness (`test-perf-comparison.ts`)
- Screenshots: 120KB → ~80KB, 1280px target (HD for reliable icon identification)
- JPEG quality: 70 → 65
- Delays: 200-1500ms → 50-600ms across the board
- System prompts: ~60% smaller (fewer tokens per call)
- Accessibility tree: filtered to interactive elements only, 3000 char cap
- Taskbar cache: 30s TTL (was queried every call)
- Screen context cache: 500ms → 2s TTL
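
The taskbar and screen-context caches above are time-to-live (TTL) caches. A minimal sketch of the idea follows; `TtlCache` and its injected clock are hypothetical and are not taken from the project source.

```typescript
// Hypothetical single-value TTL cache; the real cache in the project may look different.
class TtlCache<T> {
  private value: T | undefined = undefined;
  private hasValue = false;
  private expiresAt = 0;
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so the cache can be tested without real waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value while it is fresh; otherwise recompute and restart the TTL.
  get(compute: () => T): T {
    const t = this.now();
    if (this.hasValue && t < this.expiresAt) {
      return this.value as T;
    }
    const fresh = compute();
    this.value = fresh;
    this.hasValue = true;
    this.expiresAt = t + this.ttlMs;
    return fresh;
  }
}
```

With a 2000 ms TTL, repeated calls within two seconds reuse the previous result instead of spawning another expensive query; a 30000 ms TTL would correspond to the taskbar cache.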
| Task | v0.4 | v0.5 (Ollama, $0) | v0.5 (Anthropic) | v0.5 + Batch |
|---|---|---|---|---|
| Calculator | 43s | 2.6s | 20.1s | — |
| Notepad | 73s | 2.0s | 54.2s | — |
| File Explorer | 53s | 1.9s | 22.1s | — |
| Paint stickman | ~250s (30 calls) | — | ~124s (19 calls) | 101s (11 calls) |
| GitHub profile | — | — | ~106s (15 calls) | — |
VNC removed. Clawd Cursor now controls the desktop natively via @nut-tree-fork/nut-js. No VNC server required.
- `--vnc-host`, `--vnc-port`, `--vnc-password` CLI flags removed
- `VNC_PASSWORD`, `VNC_HOST`, `VNC_PORT` environment variables no longer used
- `rfb2` dependency removed
- `setup.ps1` no longer installs TightVNC
- `NativeDesktop` class (`src/native-desktop.ts`) — drop-in replacement for VNCClient
- Direct screen capture via @nut-tree-fork/nut-js (~50ms vs ~850ms)
- Direct mouse/keyboard control via OS-level APIs
- Simplified onboarding: `npm install && npm start`
- Screenshots: ~850ms → ~50ms (17× faster)
- Connect time: ~200ms → ~38ms (5× faster)
- Simple task (Google Docs sentence): ~120s → ~102s
- Complex task (GitHub → Notepad → save): ~200s → ~156s
- VNC server dependency (TightVNC)
- `rfb2` npm package
- VNC-related CLI flags and environment variables
- BGRA→RGBA color swap (nut-js returns RGBA natively)
- setup.ps1 now completes end-to-end in a single run on fresh systems, even in non-interactive/headless AI agent shells
- Generate random VNC password when `--vnc-password` not provided non-interactively
- Replace `Start-Process -NoNewWindow -Wait` with `-PassThru -WindowStyle Hidden` + try/catch (msiexec crash fix)
- Wrap `Start-Service` in its own try/catch (post-install crash fix)
- Replace all emoji with ASCII tags for cp1252 headless terminal compatibility
- Added YAML frontmatter, explicit credential declarations, privacy disclosure, and security considerations for ClawHub publishing.
- Screenshot hash cache — skips LLM calls when the screen hasn't changed
- Adaptive VNC frame wait — captures in ~200ms instead of fixed 800ms
- Parallel screenshot + accessibility fetch — runs concurrently via Promise.all
- Accessibility context cache — 500ms TTL eliminates redundant PowerShell queries
- Async debug writes — no longer blocks the event loop
- Exponential backoff with jitter — better retry resilience for API calls
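
As an illustration of the last item, here is a minimal sketch of exponential backoff with jitter. The changelog does not say which jitter variant is used; this sketch assumes "full jitter" (a uniform random delay below an exponentially growing ceiling). `backoffDelayMs`, `withRetry`, and the default base and cap values are hypothetical.

```typescript
// Full-jitter backoff: pick a delay in [0, min(capMs, baseMs * 2^attempt)).
// `rand` is injectable so the function is deterministic under test.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// Retry an async operation, sleeping for a jittered delay between attempts.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
  let lastError: unknown = new Error("withRetry: maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Randomizing the delay spreads retries out in time, so several clients that fail together do not all hit the API again at the same instant.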
Clawd Cursor now supports Anthropic's native Computer Use API (computer_20250124) as the primary execution path. This is a fundamentally different approach — the full task goes directly to Claude with native computer use tools. No decomposition, no routing. Claude sees screenshots, plans, and executes natively.
The agent now has two separate code paths selected by provider:
- Path A — Computer Use API (`--provider anthropic`): Full task sent to Claude with `computer_20250124` tool. Claude sees the screen, plans multi-step sequences, and executes them natively. Handles complex, multi-app workflows reliably.
- Path B — Decompose + Action Router (`--provider openai` / offline): Original approach from v0.1.0. Parse task → subtasks → Action Router (UI Automation, zero LLM) → Vision fallback. Faster and cheaper for simple tasks, works without an API key.
- Anthropic Computer Use integration — native `computer_20250124` tool type with `anthropic-beta: computer-use-2025-01-24` header
- Adaptive delays — per-action timing: 1000ms for app launch, 800ms for navigation, 100ms for typing, 300ms default
- Verification hints — post-action verification prompts after each Computer Use step
- Mouse drag — `mouseDrag`, `mouseDown`, `mouseUp` with smooth interpolation between points
- Bulletproof system prompt — planning rules, ctrl+l for URL navigation, recovery strategies for failed actions
- Display scaling — automatic resolution scaling to 1280×720 for Computer Use API compatibility
- Vision model — `claude-sonnet-4-20250514` for Computer Use path
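
The adaptive delays listed above amount to a lookup with a default. A minimal sketch follows, using the timings from this changelog; the action-kind keys and the `delayFor` helper are hypothetical, not the project's actual identifiers.

```typescript
// Timings come from the changelog entry; the key names are assumptions for illustration.
const ACTION_DELAY_MS = new Map<string, number>([
  ["app_launch", 1000],
  ["navigation", 800],
  ["typing", 100],
]);

const DEFAULT_DELAY_MS = 300;

// Return how long to wait after an action of the given kind before continuing.
function delayFor(actionKind: string): number {
  return ACTION_DELAY_MS.get(actionKind) ?? DEFAULT_DELAY_MS;
}
```

Slow operations such as launching an app get a longer settle time, while typing barely pauses, so the total wait tracks what each action actually needs.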
| Task | Time | API Calls | Result |
|---|---|---|---|
| Google Docs: open Chrome, go to Docs, write a paragraph | 187s | 14 | ✅ All succeeded |
| GitHub: open Chrome, navigate to profile, screenshot | 102s | — | ✅ All succeeded |
| Notepad: open, write haiku, save to desktop | ~180s | — | ✅ File saved correctly |
| Paint: draw a stick figure | ~90s | 16 | ✅ Drawing completed |
- Provider selection now determines execution path. `--provider anthropic` uses Computer Use API (Path A). `--provider openai` or no provider uses the original Decompose + Action Router pipeline (Path B). This is a fundamental change in behavior — the same task will execute via completely different code paths depending on the provider.
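
The provider-to-path rule can be sketched as a single function. `ExecutionPath` and `selectPath` are hypothetical names, and routing every non-Anthropic value to Path B is an assumption based on the wording "`--provider openai` or no provider".

```typescript
// Hypothetical labels for the two code paths described in this entry.
type ExecutionPath = "computer-use" | "decompose-router";

// Path A for Anthropic; Path B for OpenAI, offline use, or when no provider is given.
function selectPath(provider?: string): ExecutionPath {
  return provider === "anthropic" ? "computer-use" : "decompose-router";
}
```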
| | Path A (Computer Use) | Path B (Action Router) |
|---|---|---|
| Best for | Complex multi-step tasks | Simple single-action tasks |
| Reliability | Very high | Good for supported patterns |
| Speed | ~90–190s for complex tasks | ~2s for simple tasks |
| Cost | Higher (multiple API calls with screenshots) | Lower (1 text call or zero) |
| Offline | No | Yes (for common patterns) |
- Action Router with Windows UI Automation — 80% of common tasks with zero LLM calls
- Vision fallback for complex/unfamiliar UI
- Smart task decomposition (single text-only LLM call)
- Three-tier safety system (Auto / Preview / Confirm)
- REST API and CLI interface
- Windows setup script