Changelog

All notable changes to Clawd Cursor will be documented in this file.

[0.8.3] - 2026-04-19 — Hotfix: "Outlook keeps opening" + runaway guard

User reported Outlook launching repeatedly during a test. Root-cause diagnosis traced to three compounding failures: (1) PlatformAdapter.openApp spawned a new instance even when the app was already running, (2) the escalation ladder (router → blind → hybrid → vision) re-ran open_app at each rung because earlier rungs couldn't verify success through New Outlook's sparse WebView2 accessibility tree, (3) clawdcursor stop only killed the start process on port 3847, missing serve (different port / same port different process) and mcp (stdio, no port) entirely. A stale serve kept receiving MCP traffic after the user thought they'd stopped everything.

Fixed

  • openApp / launchApp idempotency (Windows + macOS + Linux). When the target app already has a visible window AND the caller didn't set alwaysNewInstance: true AND no url is passed, the adapter now focuses the existing window and returns its pid instead of spawning another instance. Match policy: case-insensitive exact processName → processName substring → title substring → UWP AppId tail. Closes the "N windows of Outlook stacking up" class of bug under any retry loop. src/v2/platform/{windows,macos,linux}.ts.
  • Agent runaway guard — if the agent calls the same tool + identical args ≥ 3 times within the last 6 turns, the loop exits with give_up and a targeted message suggesting detect_webview_apps when the target is likely Electron/WebView2. Prevents the generalized "retry-loop-because-a11y-is-opaque" anti-pattern. src/pipeline/agent/agent.ts.
  • clawdcursor stop now sweeps all modes. After the graceful /stop on port 3847, iterates every pidfile in ~/.clawdcursor/*.pid, SIGTERMs any live pid, SIGKILLs after 500ms if still running, and unlinks the pidfile. Catches mcp (stdio-only), zombie serve, and any start/serve on a non-default port. src/index.ts.
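
The runaway guard above (same tool + identical args at least 3 times within the last 6 turns) can be sketched roughly as follows. This is an illustration under assumptions, not the code in src/pipeline/agent/agent.ts: the ToolCall shape and the names callKey and isRunaway are hypothetical.

```typescript
// Hypothetical shape of one agent turn; the real types in agent.ts may differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Serialize a call so that identical tool + args compare equal.
// Top-level keys are sorted so {a:1,b:2} and {b:2,a:1} produce the same key;
// nested objects are compared in their insertion order.
function callKey(call: ToolCall): string {
  const sortedArgs = Object.keys(call.args)
    .sort()
    .map((k) => [k, call.args[k]]);
  return JSON.stringify([call.tool, sortedArgs]);
}

// True when the newest call appears `threshold` or more times within the
// last `windowSize` turns (3 and 6 are the values named in the changelog).
function isRunaway(history: ToolCall[], threshold = 3, windowSize = 6): boolean {
  const recent = history.slice(-windowSize);
  const last = recent[recent.length - 1];
  if (!last) return false;
  const latestKey = callKey(last);
  const repeats = recent.filter((c) => callKey(c) === latestKey).length;
  return repeats >= threshold;
}
```

A caller would append each tool call to the history and exit the loop with give_up as soon as isRunaway returns true.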

Notes

  • Stale-pidfile cleanup at startup was already correct via claimPidFile (checks isProcessAlive(existingPid) and overwrites when dead) — no code change needed there; the issue was exclusively stop.
  • Tests: 429 / 430 pass (1 skipped, same as 0.8.2). No schema snapshot change — these are behavioral fixes, not catalog changes.
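
For reference, the openApp match policy listed under Fixed (exact processName, then processName substring, then title substring, then UWP AppId tail) could look something like the sketch below. The WindowInfo fields and the function name are assumptions for illustration, and "AppId tail" is interpreted here as a suffix match.

```typescript
// Hypothetical window descriptor; field names are assumptions for this sketch.
interface WindowInfo {
  pid: number;
  processName: string;
  title: string;
  appId?: string; // UWP AppId, Windows only
}

// Return the first window matching the query, trying each rule in priority
// order. All comparisons are case-insensitive in this sketch.
function findExistingWindow(query: string, windows: WindowInfo[]): WindowInfo | undefined {
  const q = query.trim().toLowerCase();
  if (q === "") return undefined; // an empty query would match everything
  const rules: Array<(w: WindowInfo) => boolean> = [
    (w) => w.processName.toLowerCase() === q,
    (w) => w.processName.toLowerCase().includes(q),
    (w) => w.title.toLowerCase().includes(q),
    (w) => (w.appId || "").toLowerCase().endsWith(q),
  ];
  for (const rule of rules) {
    const hit = windows.find(rule);
    if (hit) return hit;
  }
  return undefined;
}
```

Trying the rules one at a time over the whole window list, rather than all rules per window, means an exact process match always wins over a looser match that happens to appear earlier in the list.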

[0.8.2] - 2026-04-19 — Session reliability, force-focus, Electron bridge

First-time-user review surfaced six concrete pain points. This release fixes every one.

Fixed

  • Silent 401 mid-session (the session-killer). Previous versions compared the incoming Bearer token against an in-memory SERVER_TOKEN only. A second clawdcursor process (stale pidfile takeover, or a concurrent mode) rewrote the token FILE without updating the first server's in-memory copy — clients reading the file silently lost auth. /health kept returning 200 so the failure was invisible. Fix: requireAuth now accepts EITHER the in-memory token OR the current on-disk token (mtime-cached, ~free). Drift is logged once with a recovery hint. src/server.ts.
  • focus_window force-to-front on Windows. Previous implementation called SetForegroundWindow, which the OS blocks when the caller isn't the current foreground process. New implementation uses the full sequence: ShowWindow(SW_RESTORE) → topmost-toggle → AttachThreadInput with the current foreground thread → AllowSetForegroundWindow(ASFW_ANY) → BringWindowToTop → SetForegroundWindow, with an Alt-key synthetic fallback. Raises any window through Windows' foreground lock. scripts/ps-bridge.ps1.
  • Richer validation errors. REST /execute rejections now carry the full expected tool signature. A missing param returns Missing required parameter "target". Expected smart_click(target: string, processId?: number). — agents no longer have to roundtrip to /docs. src/tool-server.ts.
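
The dual-token check from the silent-401 fix can be pictured with the minimal sketch below. It assumes a readDiskToken callback standing in for the mtime-cached file read and an onDrift callback standing in for the one-time log line; both names are hypothetical, and the actual requireAuth in src/server.ts may be structured differently.

```typescript
// Sketch only: `readDiskToken` stands in for an mtime-cached read of the
// token file, and `onDrift` for the one-time "drift detected" log message.
type TokenReader = () => string | null;

function makeAuthCheck(memoryToken: string, readDiskToken: TokenReader, onDrift?: () => void) {
  let driftReported = false;
  return function isAuthorized(authHeader: string | undefined): boolean {
    if (!authHeader || !authHeader.startsWith("Bearer ")) return false;
    const presented = authHeader.slice("Bearer ".length);
    if (presented.length === 0) return false;
    // Fast path: token matches what this process generated at startup.
    if (presented === memoryToken) return true;
    // Fallback: another process rewrote the token file after we started.
    const diskToken = readDiskToken();
    if (diskToken !== null && presented === diskToken) {
      if (!driftReported) {
        driftReported = true;
        if (onDrift) onDrift();
      }
      return true;
    }
    return false;
  };
}
```

Plain string equality keeps the sketch short; a production check would more likely use a constant-time comparison.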

Added

  • Electron / WebView2 detection. New MCP tools detect_webview_apps and relaunch_with_cdp (also exposed via compact system({"action":"detect_webview"}) / system({"action":"relaunch_with_cdp"})). Recognises olk (New Outlook), Teams, Discord, Slack, VS Code, GitHub Desktop, Notion, Obsidian, Spotify. When detected, probes ports 9222/9223/9229/8315 for a live CDP endpoint; if found, tells the agent to attach via browser({"action":"connect"}). If not, shows the exact relaunch command (e.g. discord --remote-debugging-port=9222) so CDP can be enabled and the sparse UIA tree bypassed entirely. src/tools/electron_bridge.ts.
  • drag_path documentation clarity. Existing mouse_drag_stepped / compact computer({"action":"drag_path","path":"[...]"}) now explicitly documented for freehand curve drawing (Paint, Figma, canvas apps). SKILL.md "Quick reference" covers when to use drag_path vs drag.

Changed

  • SKILL.md pushes compact mode harder. Top of doc now carries a directive callout: "If you are an LLM reading this: YOU SHOULD BE USING COMPACT MODE." with MCP config + REST URL. Granular stays available but is explicitly labeled the power-user / larger-prompt option.
  • SKILL.md web-app keyboard warning. Web-wrapped apps (Outlook, Teams, Gmail) treat Escape as "close dialog/modal" — sometimes closing the compose window. Documented: do not use Escape to dismiss autocompletes in web apps; use arrow keys + Enter or click-away.
  • Error-recovery table expanded with Electron-vs-true-canvas split, v0.8.2 auth recovery, v0.8.2 force-focus note, and the drag_path vs drag distinction.

Tests

  • 429 / 430 passing (one skipped, same as 0.8.0).
  • Schema snapshot regenerated → 74 granular tools (72 + 2 Electron bridge).
  • Live smoke: token auth survives a second clawdcursor serve; focus_window raises Paint through a full-screen window; detect_webview_apps correctly flags Outlook / Teams / VS Code when any are open.

Consolidates v0.8.1 (never tagged)

0.8.1-alpha.0 through -alpha.N shipped unified-pipeline + compact-MCP + Linux AT-SPI + Wayland routing on the feature branch. They roll into 0.8.2 as a single stable release. See the v0.8.1-alpha tag range in the git history for per-tranche detail; headline features:

  • Unified blind/hybrid/vision agent — one loop, three modes. Replaces the v0.8.0 split text-agent + vision-agent with a single harness using native tool_use (Anthropic) / tool_calls (OpenAI) / prose-JSON fallback.
  • Compact MCP surface — 6 compound tools (computer, accessibility, window, system, browser, task) that collapse the full capability into ~1,500 tokens of catalog. Anthropic-Computer-Use shape extended across the whole product. clawdcursor mcp --compact or GET /tools?mode=compact.
  • PlatformAdapter widened: mouseDown/Up, keyDown/Up, setWindowState, setWindowBounds, listDisplays, waitForElement, widened InvokeAction (expand/collapse/toggle/select/get-value), richer UiElement state flags.
  • Linux AT-SPI bridge — read-only first pass via python3-gi + gir1.2-atspi-2.0. Linux a11y methods (getUiTree, findElements, getFocusedElement, waitForElement) now return real data on boxes where the bridge dependencies are present. invokeElement still stubbed — tracked for a follow-up pass.
  • Linux Wayland input routing: ydotool (mouse + keyboard) or wtype (keyboard fallback) detected at init. X11 path unchanged; Wayland no longer silently mis-fires through nut-js.
  • Per-capability palettes + compound vision tools — text-agent turns now see a 6-10 tool scoped palette based on the subtask's capability (app_launch / text_input / navigation / form_fill / spatial / file_ops / window_mgmt / general). Vision-agent turns see 3 compound mouse / keyboard / window tools with action enums. ~12× fewer catalog tokens per turn.
  • Pretty TTY logs with HH:MM:SS timestamps — layer-tagged ([router], [blind], [vision], [safety], etc.), no per-line repetition, CLAWD_LOG=pretty default on TTY.
  • SKILL.md rewrite — reviewed by a Sonnet subagent against legacy v0.6.3/v0.7.14 tone, verified model-agnostic + OS-agnostic, restored "USE AS A FALLBACK" + "IMPORTANT — READ THIS BEFORE ANYTHING ELSE" directive callouts and Sensitive App Policy.

[0.8.0] - 2026-04-16 — V2 Architecture (opt-in)

A ground-up reimagining of the internal pipeline. Opt in with clawdcursor start --v2. The legacy pipeline is unchanged and remains the default.

Added

  • --v2 flag on clawdcursor start — activates the new 3-layer architecture: Router → VisionAgent → Verifier. No effect on MCP, serve, or legacy start.
  • src/v2/platform/ — platform abstraction. Single PlatformAdapter interface with macos.ts, windows.ts, linux.ts implementations. Replaces 142+ scattered if (process.platform === 'darwin') branches across 34 files. Business logic no longer sees process.platform. Adding a new OS = one file.
  • src/v2/verifier/: GroundTruthVerifier. Six independent signals decide whether a task actually completed: pixel diff, window change, focus change, OCR delta, task-specific assertions (send_email, navigate_url, open_app, type_text, search, compose_message, create_file), and anti-patterns (error dialogs, "cannot send", "draft saved", invalid recipient, auth failed). Weighted voting with hard-fail rules on anti-patterns. Cannot be fooled by LLM self-reported "done".
  • src/v2/agent/: VisionAgent, a single vision-first tool-use loop. 16 tools (screenshot, read_screen, list_windows, click, drag, scroll, type, key, invoke_element, set_field_value, open_app, focus_window, read_clipboard, write_clipboard, wait, done). 6-rule system prompt (down from 36). Model-agnostic via existing callVisionLLM.
  • src/v2/orchestrator.ts: PipelineV2 wires Router → VisionAgent → Verifier with before/after state capture.
  • Hardened JSON parser — tolerates trailing braces, markdown code fences, and other common LLM malformations. Balanced-brace extraction as fallback.
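
As an illustration of the hardened JSON parser idea (strip a markdown fence, try a strict parse, then fall back to balanced-brace extraction), here is a minimal sketch. The function names are hypothetical and the real parser may handle more malformations than these.

```typescript
// Remove a surrounding markdown code fence (three backticks) if present.
function stripFence(raw: string): string {
  const fence = "`".repeat(3);
  let s = raw.trim();
  if (!s.startsWith(fence)) return s;
  const firstNewline = s.indexOf("\n");
  s = firstNewline === -1 ? "" : s.slice(firstNewline + 1);
  if (s.endsWith(fence)) s = s.slice(0, -fence.length);
  return s.trim();
}

// Scan from the first "{" to its matching "}", ignoring braces that sit
// inside string literals. Returns null when no balanced object exists.
function extractBalancedObject(text: string): string | null {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return null;
}

// Strict parse first; on failure, parse only the first balanced object.
function parseLenient(raw: string): unknown {
  const unfenced = stripFence(raw);
  try {
    return JSON.parse(unfenced);
  } catch {
    const candidate = extractBalancedObject(unfenced);
    if (candidate === null) throw new Error("no JSON object found");
    return JSON.parse(candidate);
  }
}
```

With this approach a reply such as {"a":1}} (one trailing brace) or a JSON object wrapped in explanatory prose still yields the intended object.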

Fixed

  • False positives — legacy pipeline reports UNVERIFIED_SUCCESS when the agent claims "done" but the screen didn't change. V2 verifier catches this class: in a live email-send test the agent said "Email sent" but a "Cannot send" dialog was on screen. V2 correctly rejected the claim. (Legacy still does what it does; this fix only applies when --v2 is set.)
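
The "weighted voting with hard-fail rules on anti-patterns" behaviour that catches this case might be sketched as below. The signal names mirror the changelog, but the weights, the 0.5 pass ratio, and the Evidence shape are invented for illustration and are not taken from the verifier source.

```typescript
// Signal names follow the changelog; weights and threshold are assumptions.
type SignalName = "pixelDiff" | "windowChange" | "focusChange" | "ocrDelta" | "taskAssertion";

interface Evidence {
  signals: Record<SignalName, boolean>; // true = signal suggests the task progressed
  antiPatternHit: boolean;              // e.g. a "Cannot send" dialog is on screen
}

const WEIGHTS: Record<SignalName, number> = {
  pixelDiff: 1,
  windowChange: 1,
  focusChange: 1,
  ocrDelta: 2,
  taskAssertion: 3,
};

function verify(evidence: Evidence, passRatio = 0.5): { passed: boolean; score: number } {
  const names = Object.keys(WEIGHTS) as SignalName[];
  const total = names.reduce((sum, n) => sum + WEIGHTS[n], 0);
  const earned = names.reduce((sum, n) => sum + (evidence.signals[n] ? WEIGHTS[n] : 0), 0);
  const score = earned / total;
  // Hard-fail rule: an anti-pattern overrides any number of positive votes.
  if (evidence.antiPatternHit) return { passed: false, score };
  return { passed: score >= passRatio, score };
}
```

In the email-send example, every positive signal could fire and the verdict would still be a rejection, because the anti-pattern check runs before the score is compared to the threshold.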

Testing

Smoke-tested on macOS with Anthropic Claude Haiku (text) + Sonnet (vision):

| Task | Time | Verdict |
| --- | --- | --- |
| Open TextEdit and type | 30s | ✅ (4/6 signals) |
| Calculator: 47+53=100 | 65s | ✅ (5/6 signals, zero parse errors) |
| Safari → github.com | 45s | ✅ (6/6 signals) |
| Notes: create note | 182s | ✅ (6/6 signals) |
| Email send (failing server) | 86s | Correctly rejected (legacy would have reported success) |

Platform Safety

No legacy code modified. Windows, Linux, and MCP paths untouched. v2 code is entirely under src/v2/.

[0.7.14] - 2026-04-13 — Full macOS Keyboard Automation + Platform-Aware Pipeline

Fixed

  • macOS keystrokes silently dropped — root cause: CGEvent.post() from the Swift helper is blocked by macOS TCC when the helper is spawned as a child of Node.js. keyPress() and typeText() on macOS now route through osascript + System Events (the Apple-sanctioned method). All keyboard shortcuts (Cmd+V, Cmd+N, Shift+Cmd+D, etc.) now work correctly.
  • Single-char keys losing modifiers: keycodeForCharacter() lookup added to ClawdCursorHelper; modifiers are no longer discarded for Cmd+letter combos.
  • asDouble() coercion — click/drag coordinates sent as integers (common from some LLMs) no longer fail with a type mismatch in the Swift helper.
  • keycodeForCharacter fallback — now returns an error for unmapped characters instead of silently falling back to the 'v' keycode.
  • Permission check inconsistency: doctor, status, and readiness.ts now all query the same canonical path (Host /status → permission-check binary → direct fallback). No more false "granted" reports.
  • Screenshot capture CPU spin — replaced CGWindowListCreateImage (triggers ReplayKit CPU spin bug on macOS 14+) with a delegated screenshot-helper subprocess.
  • A11y false positive: isShellAvailable() now tests actual window access (p.windows.length) instead of processes.length, which worked without Accessibility permission.
  • Node.js v25 crash: EINVAL/setTypeOfService socket error from undici's internal QoS call is now caught and suppressed (non-fatal).
  • Dock click zone — reduced from 60px to 30px on macOS (Dock is thinner than the Windows taskbar).
  • Browser URL bar shortcut: Cmd+L used on macOS (was Ctrl+L, which does nothing in macOS browsers).

Added

  • macMailEmailFlow — deterministic email flow for macOS Mail.app (Cmd+N, Tab to subject/body, Cmd+Shift+D to send).
  • clawdcursor grant command — triggers macOS system permission dialogs directly from the CLI.
  • 115 Apple shortcuts — Mail, Safari, Notes, Messages, Terminal added to the shortcut database.
  • scripts/test-macos-fixes.sh — one-shot E2E verification script: rebuild, binary check, permission consistency, screenshot capture, doctor cross-check.
  • --request-screen-recording flag on permission-check binary — optional TCC dialog trigger for Screen Recording.
  • processPath + bundleId in all permission check responses — aids TCC debugging.
  • 30s TTL cache on A11y shell availability — permission grants mid-session are now detected without restart.
  • macOS native binary verification in scripts/verify-install.js — warns on missing binaries at npm install time.
  • setup script auto-builds native binaries on macOS (inside npm run setup).
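
The 30s TTL cache on A11y shell availability follows a common pattern that can be sketched as below. The cachedProbe name and its injectable clock are assumptions made so the example is self-contained and testable; they are not the project's API.

```typescript
// Cache a boolean probe result for `ttlMs` milliseconds. `now` is injectable
// so the behaviour can be exercised without waiting on the real clock.
function cachedProbe(
  probe: () => boolean,
  ttlMs = 30000,
  now: () => number = () => Date.now(),
): () => boolean {
  let cachedAt = -Infinity;
  let value = false;
  return (): boolean => {
    const t = now();
    // Re-run the (expensive) probe only when the cached value has expired.
    if (t - cachedAt >= ttlMs) {
      value = probe();
      cachedAt = t;
    }
    return value;
  };
}
```

With a 30-second TTL, a permission granted mid-session is picked up on the first check after the cached entry expires, without restarting the process.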

Changed

  • build.sh — marked executable in git, fails fast on missing binaries (was silently warning), better error guidance.
  • Installer — verifies all 4 required binaries (not just ClawdCursorHost), uses bash ./build.sh for portability.
  • doctor.ts — permission check unified via native-helper module; triggers system permission dialogs if denied.
  • Email flow keyboard shortcuts are now platform-aware: Ctrl+Enter → Shift+Cmd+D on macOS, Ctrl+H → Cmd+Option+F for Find & Replace.
  • sharp bumped ^0.33.0 → ^0.33.5.
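
One way to picture the platform-aware shortcut selection is a small lookup table, as in the sketch below. The key combinations are the ones named in this release's notes; the Action names, the table layout, and shortcutFor are hypothetical.

```typescript
type Platform = "darwin" | "win32" | "linux";
type Action = "send_email" | "find_replace" | "focus_url_bar";

// Key combos come from the changelog; the table structure is an assumption.
const SHORTCUTS: Record<Action, { mac: string; other: string }> = {
  send_email: { mac: "Shift+Cmd+D", other: "Ctrl+Enter" },
  find_replace: { mac: "Cmd+Option+F", other: "Ctrl+H" },
  focus_url_bar: { mac: "Cmd+L", other: "Ctrl+L" },
};

// Pick the macOS variant on darwin and the default variant everywhere else.
function shortcutFor(action: Action, platform: Platform): string {
  const entry = SHORTCUTS[action];
  return platform === "darwin" ? entry.mac : entry.other;
}
```

Keeping both variants in one table means Windows and Linux behaviour stays unchanged while the macOS path gets its own bindings.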

Platform Safety

No Windows or Linux code paths affected. All macOS changes are gated behind IS_MAC / process.platform === 'darwin' / isMacOS().

[0.7.13] - 2026-04-10 — Unified Permission Checks + Screenshot Helper

Fixed

  • Permission check fragmentation: doctor, status, and readiness each used different permission APIs, producing contradictory results. All now route through ClawdCursorHost /status → permission-check binary → direct AXIsProcessTrusted fallback.
  • Screenshot CPU spin — delegated takeScreenshot() to screenshot-helper subprocess, eliminating the ReplayKit CPU spike on macOS 14+.
  • Installer binary verification — now checks all 4 required binaries (ClawdCursorHost, clawdcursor-helper, screenshot-helper, permission-check) instead of just ClawdCursorHost.
  • build.sh silent failures: swift build errors now fail the build immediately with actionable guidance.

Added

  • clawdcursor grant command — triggers macOS system permission dialogs for Accessibility and Screen Recording.
  • processPath + bundleId in permission check responses for TCC debugging.
  • --request-screen-recording flag on permission-check binary.

[0.7.12] - 2026-04-09 — Comprehensive macOS TCC Fix

Fixed

  • Bash pipeline bug: set -o pipefail added; build failures are now properly detected (previously they passed silently because of a pipeline exit status bug)
  • Ad-hoc signing by default — build.sh now always signs the app (required for TCC on macOS 26+ Tahoe where unsigned binaries don't appear in privacy settings)
  • Build error capture — uses temp file instead of pipe to properly capture exit status
  • TCC permission check — runs permission-check after build to show current accessibility/screen recording status

Changed

  • build.sh rewritten — cleaner structure, ad-hoc signing is default (not optional), signature verification added
  • Codesign uses --deep — ensures all nested binaries are signed
  • Installer shows TCC status — tells user exactly which permissions need to be granted and where

Technical Details

The core issue was that TCC (Transparency, Consent, and Control) on macOS binds permissions to the code signing identity. Without signing:

  • On macOS 26+ (Tahoe), unsigned binaries don't appear in System Settings privacy panels at all
  • Users saw "ClawdCursorHost binary not found" errors even though install appeared to succeed

Reference: mediar-ai/mcp-server-macos-use for TCC permission handling patterns.

[0.7.11] - 2026-04-09 — macOS Installer Fix

Fixed

  • macOS installer now fails loudly if native host build fails — was silently swallowing build errors and claiming "optional fallback" that doesn't exist
  • Added verification step — installer explicitly checks ClawdCursorHost binary exists before declaring success
  • Show build output — Swift build errors are now visible instead of redirected to /dev/null
  • Clear error messages — tells users exactly what went wrong and how to fix it (xcode-select --install, manual rebuild, etc.)

Changed

  • macOS native host is now correctly marked as REQUIRED, not optional
  • Installer exits with error code 1 if native build fails on macOS

[0.7.10] - 2026-04-08 — Guided Setup Flow

Changed

  • Installer shows next steps: after install, it displays the recommended sequence clawdcursor doctor → clawdcursor start
  • Doctor shows run options — after passing all checks, shows both start (full agent) and serve (tools-only) modes
  • Consent shows next step — after granting consent, directs users to clawdcursor doctor

[0.7.9] - 2026-04-08 — UX Improvements

Changed

  • macOS permission messages — now direct users to enable "ClawdCursor" instead of "Terminal/Node"
  • Screen Recording path — updated to "Screen & System Audio Recording" (macOS Sequoia naming)

[0.7.8] - 2026-04-08 — Documentation Fix

Fixed

  • Installer comments updated — example version references now point to v0.7.8

[0.7.7] - 2026-04-08 — Installer Fixes

Fixed

  • Installers default to main branch — install.sh and install.ps1 now use main instead of hardcoded non-existent tag
  • macOS installer builds native helper — install.sh now runs ./native/build.sh on Darwin if Swift is available
  • Version override support: VERSION=v0.7.7 curl ... | bash or $env:VERSION='v0.7.7' to install a specific release
  • Auto-pull on update — installers now run git pull after checkout to get latest changes

[0.7.6] - 2026-04-08 — macOS Native Host App

Added

  • macOS Host App (ClawdCursorHost) — new native Swift executable that runs as the app bundle's main process, owning all TCC permissions (Accessibility, Screen Recording) under a single app identity
  • Localhost IPC server — host app exposes GET /health, GET /status, POST /rpc on 127.0.0.1:3848 for CLI→host communication
  • Token-based authentication: ~/.clawdcursor/host-token (mode 0600) secures the IPC channel
  • Auto-launch/stop: clawdcursor start ensures host is running; clawdcursor stop gracefully quits it
  • New Swift helper methods: moveMouse, dragMouse, captureScreen for smoother native macOS automation
  • Menu bar presence — host app shows 🐾 icon in menu bar for visibility

Security

  • Localhost-only binding — IPC server uses NWParameters.requiredLocalEndpoint to bind to 127.0.0.1 only, rejecting connections from other machines
  • Token file permissions — host-token created with mode 0600 (owner read/write only)

Changed

  • src/native-helper.ts — routes all macOS desktop operations through host IPC instead of direct stdio
  • src/native-desktop.ts — 11 platform-guarded code paths delegate to host on macOS
  • src/index.ts — start/stop commands manage host app lifecycle
  • native/ClawdCursor.app/Contents/Info.plist — bundle identifier changed to com.clawdcursor.app, executable to ClawdCursorHost

Unchanged

  • Windows/Linux — all macOS code behind IS_MAC && this.helper guards; no behavior changes on other platforms
  • 172 tests pass — full test suite unchanged

[0.6.3] - 2026-03-01 — Universal Pipeline, Multi-App Workflows, Provider-Agnostic

Added

  • LLM-based universal task pre-processor — one cheap text LLM call decomposes any natural language into {app, navigate, task, contextHints}, replacing brittle regex parsing
  • Multi-app workflow support — copy/paste between apps (e.g. Wikipedia → Notepad) with 6-checkpoint tracking: first_app_focused → first_app_action_done → content_copied → second_app_opened → content_pasted → result_visible
  • Site-specific keyboard shortcuts — Reddit (j/k/a/c), Twitter/X (j/k/l/t/r), YouTube (Space/f/m), Gmail (j/k/e/r/c), GitHub (s/t/l), Slack (Ctrl+k), plus generic hints
  • OS-level default browser detection — reads Windows registry (HKCU ProgId) or macOS LaunchServices instead of hardcoded Edge/Safari
  • 3 verification retries with step log analysis — when verification fails, builds a digest of recent actions + checkpoint status so the vision LLM can fix the specific missed step
  • Mixed-provider pipeline support — e.g. kimi for text, anthropic for Computer Use, with per-layer API key resolution from OpenClaw auth-profiles
  • ComputerUseOverrides interface — apiKey, model, baseUrl per-layer for mixed-provider setups
  • resolveProviderApiKey() helper — reads OpenClaw auth-profiles to find the right API key per provider

Fixed

  • Checkpoint system overhaul — removed auto-termination (completionRatio ≥ 0.90 early exit and isComplete() mid-loop kill), strict detection: content_pasted requires Ctrl+V, content_copied requires Ctrl+C, second_app_opened detects any window switch universally
  • Pipeline context passing: priorContext[] accumulator flows from pre-processing through to Computer Use (no more amnesia between layers)
  • Credential resolution order — .clawdcursor-config → auth-profiles.json → openclaw.json (with template expansion) → env vars
  • loadPipelineConfig() path resolution — checks package dir first, then cwd (fixes global npm installs)
  • Smart Interaction model lookup — uses PROVIDERS registry instead of hardcoded model/baseUrl maps; fixes stale claude-haiku-3-5-20241022 fallback
  • Scroll behavior — system prompts instruct PageDown/Space instead of tiny mouse scrolls; default scroll delta 3 → 15
  • Provider-agnostic internals — all comments and logs say "vision LLM" instead of "Claude"
  • Verification retry limit — max 3 retries prevents infinite verification loops
  • Universal checkpoint detection — no hardcoded app lists; detectTaskType() uses action patterns only
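
The credential resolution order listed above (.clawdcursor-config, then auth-profiles.json, then openclaw.json, then env vars) amounts to a first-match search over prioritized sources. A minimal sketch, assuming hypothetical loader callbacks in place of the real file and environment readers:

```typescript
// Each source is tried in array order; `load` is a hypothetical stand-in for
// reading a config file or environment variable.
type CredentialSource = { name: string; load: () => string | undefined };

function resolveApiKey(sources: CredentialSource[]): { key: string; from: string } | undefined {
  for (const source of sources) {
    const key = source.load();
    // Skip sources that are missing or contain only whitespace.
    if (key !== undefined && key.trim() !== "") return { key, from: source.name };
  }
  return undefined;
}

// Example wiring in the documented priority order (loaders are placeholders).
const exampleSources: CredentialSource[] = [
  { name: ".clawdcursor-config", load: () => undefined },
  { name: "auth-profiles.json", load: () => "key-from-auth-profiles" },
  { name: "openclaw.json", load: () => "key-from-openclaw" },
  { name: "env", load: () => "key-from-env" },
];
```

Returning the source name alongside the key makes it easy to log which layer supplied a credential when debugging mixed-provider setups.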

Changed

  • Pipeline architecture: LLM Pre-processor → Pre-open app + navigate → L0 Browser → L1 Action Router + Shortcuts → L1.5 Smart Interaction → L2 A11y Reasoner → L3 Computer Use
  • Pre-processor prompt hardened with NEVER rules (never summarize, never drop steps) and VALIDATION RULE
  • MULTI-APP WORKFLOWS section added to both Mac and Windows Computer Use system prompts
  • Checkpoint thresholds tightened: early completion 75% → 90%, skip-verification 50% → 80%

[0.6.5] - 2026-02-28 — Checkpoint System, Task Completion Detection

Added

  • Checkpoint-based task completion — Computer Use tracks milestones (compose opened → fields filled → send pressed → compose closed) and stops when all checkpoints are met. No more wasted calls after successful completion.
  • Task type detection — auto-classifies tasks (email, form, navigate, draw, file_save) and applies appropriate checkpoint templates.
  • Smart early termination — when Claude says "done" and ≥75% checkpoints confirmed, accepts completion immediately.
  • Auto-config on first run: clawdcursor start auto-detects providers without needing clawdcursor doctor.
  • Universal provider support — any OpenAI-compatible endpoint works via --base-url.
  • CLI model selection: --text-model and --vision-model flags.

Fixed

  • Email domain extraction bug — "send to user@hotmail.com" no longer navigates to hotmail.com. Email addresses are stripped before URL matching.
  • Verification override bug — verification no longer contradicts confirmed checkpoint completion. Skipped when ≥50% checkpoints met.
  • Context loss between layers — Computer Use now receives full context of what pre-processing already did.
  • Drawing quality — minimum 50px drag distances enforced via system prompt.
  • OpenClaw credential discovery — multi-provider scan, template variable resolution, no false overrides.
  • Pipeline gate — Action Router always runs, shortcuts work everywhere.
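
The email domain extraction fix (strip email addresses before URL matching) can be illustrated with the sketch below. Both regular expressions and the function name are simplified assumptions written for this example, not the project's actual patterns.

```typescript
// Simplified patterns for illustration only.
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const DOMAIN_RE = /\b(?:https?:\/\/)?((?:[a-z0-9-]+\.)+[a-z]{2,})(?:\/\S*)?/i;

// Remove email addresses first so their domain part cannot be mistaken
// for a navigation target, then look for a bare domain or URL.
function extractNavigationDomain(task: string): string | null {
  const withoutEmails = task.replace(EMAIL_RE, " ");
  const match = DOMAIN_RE.exec(withoutEmails);
  const domain = match ? match[1] : undefined;
  return domain ? domain.toLowerCase() : null;
}
```

Without the stripping step, the domain pattern would match hotmail.com inside user@hotmail.com, which is the misnavigation described in the fix.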

Changed

  • Pipeline pre-processes "open X and Y" tasks — opens app via Action Router (free), then hands remaining task to deeper layers.
  • Smart Interaction detects visual loop tasks (draw, paint) and skips to Computer Use.
  • Computer Use system prompt includes Snap Assist handling and drawing guidelines.

[0.6.2] - 2026-02-28 — Universal Provider Support, Auto-Config

Added

  • Auto-config on first run: clawdcursor start auto-detects and configures providers without needing clawdcursor doctor first. Doctor is now optional for fine-tuning.
  • Universal provider support — any OpenAI-compatible endpoint works. Not limited to 7 hardcoded providers. Use --base-url + --api-key for custom endpoints.
  • CLI model selection: --text-model and --vision-model flags on start command.
  • Dynamic OpenClaw provider mapping — reads ALL providers from OpenClaw config, not just known ones. NVIDIA, Fireworks, Mistral, etc. work automatically.

Changed

  • clawdcursor start now auto-runs setup if no config exists (non-interactive)
  • Provider detection accepts any provider name, falling back to OpenAI-compatible API
  • detectProvider() returns 'generic' for unknown providers instead of defaulting to 'openai'

[0.6.1] - 2026-02-28 — Keyboard Shortcuts, Pipeline Fixes

Added

  • Keyboard shortcuts registry (src/shortcuts.ts) — 30+ common actions mapped to direct keystrokes. Scroll, copy, paste, undo, reddit upvote/downvote, browser shortcuts, and more. Zero LLM calls.
  • Fuzzy shortcut matching — "scroll the page down" fuzzy-matches to scroll-down shortcut. Context-aware matching for social media actions.
  • Router telemetry — Action Router now logs match type, confidence, and shortcut hits.
  • CDP→UIDriver fallback — Smart Interaction falls back to accessibility tree automation when browser CDP path fails.
  • Gmail, Outlook, Hotmail added to Browser Layer site map.

Fixed

  • Pipeline gate bug — Action Router was gated behind !isBrowserTask, causing shortcuts to be skipped for browser-context tasks (e.g., "reddit upvote" matched browser regex but should use shortcut). Action Router now always runs after Browser Layer.
  • URL extraction false positives — "open gmail and send email to foo@bar.com" no longer extracts bar.com. URL extraction now isolates the navigation clause before matching.
  • Reliable force-stop: clawdcursor stop now force-kills lingering processes via PID file.
  • Provider label inference — startup logs now clearly show text and vision provider names separately.

Changed

  • Pipeline order: Browser Layer (L0) → Action Router + Shortcuts (L1) → Smart Interaction (L1.5) → A11y Reasoner (L2) → Vision (L3). Action Router no longer gated.
  • extractUrl() uses navigation clause isolation instead of matching against full task text.

[0.6.0] - 2026-02-28 — Universal Provider Support, OpenClaw Integration

Added

  • OpenClaw credential integration — auto-discovers all configured providers from OpenClaw's auth-profiles.json and openclaw.json. No separate API key needed when running as an OpenClaw skill.
  • Universal provider support — added Groq, Together AI, DeepSeek as first-class providers with profiles, env var detection, and key prefix recognition.
  • Auto-detection as default — provider defaults to auto instead of hardcoding Anthropic. Doctor picks the best available provider automatically.
  • Mixed provider pipelines — use Ollama for text (free) + any cloud provider for vision (best quality). Vision credentials preserved when brain reconfigures for text.
  • Dynamic Ollama model selection — doctor picks the best available Ollama model instead of hardcoding qwen2.5:7b.
  • Anthropic vision routing fix — detects Anthropic vision by key prefix (sk-ant-) independently of the main provider field, so split-provider setups work correctly.

Changed

  • Default config no longer assumes any specific provider or model
  • Provider scan loop iterates all registered providers dynamically
  • Help text and doctor output are provider-agnostic
  • --provider CLI flag accepts any string (not limited to 4 providers)
  • README updated with 7-provider compatibility table

Security

  • SKILL.md hardened — removed aggressive autonomy language ("use without asking", "be independent")
  • Sensitive App Policy — agents must ask the user before accessing email, banking, messaging, or password managers
  • Safety tiers as hard rules — 🔴 Confirm actions must never be self-approved by agents
  • Data flow transparency — expanded security section documents network isolation, per-provider data flow, and Ollama = fully offline
  • No credentials in skill directory — OpenClaw users get auto-discovery from local config; no keys stored in skill files

Fixed

  • Vision model crash when main provider set to Ollama but vision uses Anthropic (model not found error)
  • Brain reconfiguration was wiping vision credentials — now preserved

[0.5.6] - 2026-02-27 — Fluid Decomposition, Interactive Doctor, Smart Vision Fallback

Added

  • Fluid LLM task decomposition — decompose prompt now tells the LLM to reason about what ANY app needs. No more hardcoded examples. "Write me a sentence about dogs" generates actual content instead of typing the literal instruction.
  • Interactive doctor onboarding — after scanning providers, doctor shows all working TEXT and VISION LLM options with ★ recommendations. User picks by number, Enter for default. Shows GPU info (VRAM via nvidia-smi) to help decide local vs cloud.
  • Cloud provider guidance — doctor shows unconfigured providers with signup URLs and lets you paste an API key inline (auto-detects provider, saves to .env).
  • Smart vision fallback for compound tasks — when Router or Reasoner handles part of a multi-step task but fails midway, ALL remaining subtasks are bundled and handed to Computer Use (vision). Prevents false-success trapping in cheap layers.
  • Ollama auto-detection — brain auto-reconfigures to use local Ollama for decomposition when no cloud API key is set. hasApiKey now recognizes local LLMs.
  • Compound task guard — action router detects multi-step/compound tasks (commas, "then", "and then") and skips to deeper layers.
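
A rough sketch of the compound task guard, using only the markers the entry names (commas, "then", "and then"). The function names and the step-splitting helper are illustrative assumptions; the real router may apply further heuristics.

```typescript
// Markers come from the changelog: commas, "then", "and then".
// "and then" is already covered by the word-boundary match on "then".
function isCompoundTask(task: string): boolean {
  return /,|\bthen\b/i.test(task);
}

// Split a compound task into its individual steps on the same markers.
function splitSteps(task: string): string[] {
  return task
    .split(/\s*(?:,|\band then\b|\bthen\b)\s*/i)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```

A guard this simple will also fire on commas or the word "then" inside text the user wants typed, so it errs on the side of handing work to deeper layers.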

Fixed

  • Case-preserving action router — all regex matches against raw (unmodified) task text. Typed text and URLs no longer get lowercased.
  • Flexible click matching: click Blank document works without quotes (was requiring click "Blank document"). Single unified regex for quoted and unquoted element names.
  • PowerShell encoding — replaced emoji (🐾) and em dash (—) in task console title that broke on Windows PowerShell due to encoding.
  • Stale config: .clawdcursor-config.json now correctly reflects Ollama when doctor detects it (was stuck on Anthropic).
  • Brain provider mismatch — decomposition no longer calls Anthropic API when only Ollama is available.
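
To make the "single unified regex for quoted and unquoted element names" concrete, here is one possible pattern. It is an assumption written for illustration (the actual expression in the action router is not shown in this changelog), and it matches against the raw task text so the element name keeps its original case.

```typescript
// Accepts: click "Blank document" | click 'Blank document' | click Blank document
const CLICK_RE = /^click\s+(?:"([^"]+)"|'([^']+)'|(.+?))\s*$/i;

// Return the element name with its original casing, or null if the task
// is not a click command.
function parseClickTarget(task: string): string | null {
  const m = CLICK_RE.exec(task.trim());
  if (!m) return null;
  return m[1] || m[2] || m[3] || null;
}
```

Only the keyword is matched case-insensitively; the captured target is returned untouched, in line with the case-preserving router fix.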

Changed

  • npm run setup — new script that builds and registers clawdcursor as a global command via npm link. Works on Windows, macOS, and Linux.
  • Stop/kill port validation — port input is now sanitized (parseInt + range check 1-65535) to prevent command injection
  • Kill health verification — kill command now verifies /health returns a Clawd Cursor response before force-killing
  • Install instructions updated — README and docs now use npm run setup
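
The port sanitization described above (parseInt plus a 1-65535 range check) might look like the following sketch. The digits-only pre-check is an addition of this example rather than something the changelog states; it is there because parseInt alone accepts a numeric prefix followed by arbitrary text.

```typescript
// Return a valid TCP port number, or null if the input is not acceptable.
function parsePort(input: string): number | null {
  const trimmed = input.trim();
  // Require digits only, so "3847; rm -rf /" is rejected instead of being
  // truncated to 3847 by parseInt's lenient prefix parsing.
  if (!/^\d{1,5}$/.test(trimmed)) return null;
  const port = parseInt(trimmed, 10);
  return port >= 1 && port <= 65535 ? port : null;
}
```

Passing the returned number, never the raw string, into any shell command is what closes the injection path.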

Test Results

| Task | Pipeline Path | Steps | LLM Calls | Time | Result |
| --- | --- | --- | --- | --- | --- |
| Open Notepad | Action Router | 1 | 0 | 1.5s | |
| Open Notepad + write haiku | Router → Smart Interaction → Computer Use | 6 | 7 | 58.8s | ✅ Verified |
| Open Google Doc in Edge + write sentence | Browser → Computer Use | 17 | 9 | 78.8s | ✅ Verified |

[0.5.5] - 2026-02-26 — Install/Uninstall, OpenClaw Auto-Registration, Doctor UX

Added

  • clawdcursor install — one command to set up API key, configure pipeline, and register as OpenClaw skill
  • clawdcursor uninstall — clean removal of all config, data, and OpenClaw skill registration
  • Doctor auto-registers as OpenClaw skill — symlinks into ~/.openclaw/workspace/skills/clawdcursor
  • Doctor quick fix commands — shows exact commands for missing text LLM and vision LLM in summary
  • Dashboard favorites — star commands to save them, click to re-run, persists across server restarts
  • Credential detection — warns when starring tasks that contain API keys or passwords
  • OS tabs on website — Windows/macOS/Linux with auto-detect
  • Post-build help message — shows all available commands after npm run build
  • Dynamic OS detection — system prompt uses actual OS instead of hardcoded "Windows 11" (thanks @molty)

Fixed

  • Windows skill detection — removed requires.bins from SKILL.md; OpenClaw's hasBinary() doesn't handle Windows PATHEXT (.exe/.cmd), causing the skill to show as "missing" even when node is installed
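
For background, Windows resolves a bare command name by trying the extensions listed in PATHEXT. The sketch below shows the candidate expansion a PATHEXT-aware lookup would need. It is illustrative only: `windowsCandidates` is a hypothetical helper and does not reproduce OpenClaw's `hasBinary()`.

```typescript
// Hypothetical helper: expand a command name into the file names that a
// PATHEXT-aware lookup would try in each PATH directory.
function windowsCandidates(bin: string, pathext: string): string[] {
  const exts = pathext
    .split(";")
    .map((ext) => ext.trim().toLowerCase())
    .filter((ext) => ext.length > 0);
  // A name that already ends in a listed extension is used as-is.
  const lower = bin.toLowerCase();
  if (exts.some((ext) => lower.endsWith(ext))) return [bin];
  return exts.map((ext) => bin + ext);
}

console.log(windowsCandidates("node", ".COM;.EXE;.BAT;.CMD"));
```

A lookup that only checks for a file literally named `node` misses `node.exe`, which matches the "missing" symptom described in the entry.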

Changed

  • SKILL.md rewritten — agent identity shift framing, trigger lists, CDP direct path, async polling, error recovery
  • Security hardened — agents cannot self-approve confirm-tier actions, autonomous use scoped to read-only
  • Privacy language clarified — explicit per-provider data flow
  • Website Get Started simplified — 3 lines, commands shown in terminal post-build
  • Anthropic text model updated — claude-haiku-4-5 (was claude-3-5-haiku-20241022)

[0.5.4] - 2026-02-25 — SKILL.md Rewrite + Security Hardening

Changed

  • Privacy language clarified — explicit per-provider data flow (Ollama = fully local, cloud = data to that API only)
  • Added homepage and source URLs to skill metadata
  • Removed hard-coded paths from SKILL.md
  • Security section expanded — includes localhost bind verification command
  • Security scan addressed — all flagged documentation gaps resolved

[0.5.3] - 2026-02-25 — SKILL.md Rewrite for Agent Autonomy

Changed

  • SKILL.md rewritten — agents now understand they have full desktop control and stop asking users to do things they can do themselves
  • Agent identity shift framing — blockquote at top overrides default "I can't do desktop things" behavior
  • "When to Use This" trigger list — comprehensive decision framework for when to reach for Clawd Cursor
  • Two paths documented — REST API (port 3847) for full desktop control, CDP Direct (port 9222) for fast browser reads
  • Async flow clarified — concrete polling pattern agents can follow step-by-step
  • Error recovery table — 8 common problems with exact solutions
  • Expanded task examples — cross-app workflows, data extraction, verification scenarios
  • README — added OpenClaw Integration section

[0.5.2] - 2026-02-25 — Web Dashboard + Browser Foreground Focus

Added

  • Web Dashboard — full single-page UI served at GET / (port 3847). Task submission, real-time logs, status indicators, approve/reject for safety confirmations, kill switch. Dark theme, fully responsive, zero external dependencies.
  • clawdcursor dashboard — CLI command to open the dashboard in your default browser
  • clawdcursor kill — CLI command to send a stop signal to the running server
  • GET /logs — API endpoint returning last 200 log entries with timestamps and levels
  • Browser foreground focus — Playwright navigation now brings Chrome to the front via page.bringToFront() + OS-level window activation (PowerShell SetForegroundWindow on Windows, osascript on macOS). The AI acts like a visible cursor — you see everything it does.
  • Console hook — hookConsole() intercepts all server logs for the dashboard log feed with auto-classification (error/success/warn/info)
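
The console hook and the 200-entry log feed could be built roughly as follows. This is a sketch under assumptions: the keyword rules, the class `LogBufferSketch`, and the function `hookConsoleSketch` are hypothetical and are not the project's `hookConsole()`.

```typescript
// Hypothetical sketch: a capped log buffer with keyword-based classification.
type LogLevelSketch = "error" | "success" | "warn" | "info";

interface LogEntrySketch {
  timestamp: string; // ISO 8601
  level: LogLevelSketch;
  message: string;
}

// Assumed keyword rules; the real classifier may use different signals.
function classifyLog(message: string): LogLevelSketch {
  if (/\b(error|failed|failure)\b/i.test(message)) return "error";
  if (/\bwarn(ing)?\b/i.test(message)) return "warn";
  if (/\b(success|succeeded|done)\b/i.test(message)) return "success";
  return "info";
}

class LogBufferSketch {
  private entries: LogEntrySketch[] = [];
  private readonly limit: number;

  constructor(limit: number = 200) {
    this.limit = limit;
  }

  push(message: string): void {
    this.entries.push({
      timestamp: new Date().toISOString(),
      level: classifyLog(message),
      message,
    });
    // Drop the oldest entry once the cap is exceeded.
    if (this.entries.length > this.limit) this.entries.shift();
  }

  snapshot(): LogEntrySketch[] {
    return this.entries.slice();
  }
}

// Wraps console.log so every message is also recorded; returns an undo function.
function hookConsoleSketch(buffer: LogBufferSketch): () => void {
  const original = console.log;
  console.log = (...args: unknown[]): void => {
    buffer.push(args.map((a) => String(a)).join(" "));
    original.apply(console, args);
  };
  return () => {
    console.log = original;
  };
}
```

An endpoint such as GET /logs would then only need to serialize `snapshot()`.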

Changed

  • Smart task handoff — Browser layer no longer uses regex word lists to detect multi-step tasks. Pure navigation ("open youtube") completes in browser layer; anything more complex falls through to SmartInteraction where the LLM plans the steps. No more missed verbs.

Architecture

Layer 0: Browser (Playwright) — navigate + foreground focus
    ↓ more than navigation? → fall through
Layer 1: Action Router — regex patterns, zero LLM calls
    ↓ no match? → fall through
Layer 1.5: Smart Interaction — 1 LLM call plans steps, CDP/UIDriver executes
    ↓ failed? → fall through
Layer 2: Accessibility Reasoner — reads UI tree, cheap LLM
    ↓ failed? → fall through
Layer 3: Screenshot + Vision — full screenshot, Computer Use API
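
The fall-through order in the diagram can be sketched as a loop over layers in which each layer either handles the task or declines. The interface and stub handlers below are hypothetical. Real layers would be asynchronous and far richer; they are synchronous here for brevity.

```typescript
// Hypothetical sketch of the fall-through order. Layer names come from the
// diagram; the handlers are stubs for illustration.
interface LayerSketch {
  name: string;
  // Returns a result string when the layer handled the task, or null to fall through.
  tryHandle(task: string): string | null;
}

function runLayers(
  layers: LayerSketch[],
  task: string
): { layer: string; result: string } | null {
  for (const layer of layers) {
    const result = layer.tryHandle(task);
    if (result !== null) {
      return { layer: layer.name, result };
    }
  }
  return null; // every layer declined
}

const demoLayers: LayerSketch[] = [
  { name: "Action Router", tryHandle: (t) => (/^open \w+$/i.test(t) ? "launched" : null) },
  { name: "Screenshot + Vision", tryHandle: () => "handled via screenshots" },
];

console.log(runLayers(demoLayers, "open notepad")?.layer);        // Action Router
console.log(runLayers(demoLayers, "draw a cat in Paint")?.layer); // Screenshot + Vision
```

Ordering the list from cheapest to most expensive means an LLM or screenshot is only paid for when every cheaper layer has declined.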

[0.5.1] - 2026-02-23 — HD Screenshots + Focus Stability

Fixed

  • HD screenshots — LLM resolution increased from 1024px to 1280px (scale 2x instead of 2.5x). Claude can now reliably identify toolbar icons, buttons, and small UI elements.
  • JPEG quality — bumped from 55 to 65 for clearer icon identification
  • Window focus stability — Win+D minimizes all windows before task execution, preventing the Clawd terminal from stealing focus from target apps
  • Paint drawing reliability — pencil tool guidance in system prompt, mandatory checkpoint after tool selection
  • Stale file cleanup — restored get-windows.ps1 shim (still referenced by accessibility.ts), removed dead setup.ps1 and get-ui-tree.ps1
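
The resolution change can be read as a change in the downscale factor between the physical screen and the image sent to the model. The sketch below assumes a 2560px-wide display, which is what the 2x and 2.5x figures imply but is not stated in the entry. The function names are hypothetical.

```typescript
// Hypothetical sketch: downscale factor and mapping model coordinates
// back to physical screen coordinates.
interface PointSketch {
  x: number;
  y: number;
}

function llmScaleFactor(screenWidthPx: number, targetWidthPx: number = 1280): number {
  return screenWidthPx / targetWidthPx;
}

function toScreenPoint(p: PointSketch, scale: number): PointSketch {
  return { x: Math.round(p.x * scale), y: Math.round(p.y * scale) };
}

console.log(llmScaleFactor(2560, 1024)); // 2.5
console.log(llmScaleFactor(2560));       // 2
console.log(toScreenPoint({ x: 640, y: 360 }, llmScaleFactor(2560)));
```

A smaller factor leaves more pixels per icon in the image the model sees, which fits the improved icon identification the entry reports.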

Performance (Paint stickman benchmark)

| Metric | v0.5.0 | v0.5.1 |
|--------|--------|--------|
| Time | ~250s | 55s |
| API calls | 30 | 6 |
| Success rate | ~50% | ~90% |

[0.5.0] - 2026-02-23 — Smart Pipeline + Doctor + Batch Execution

Added

  • clawdcursor doctor — auto-diagnoses setup, tests models, configures optimal pipeline
  • 3-layer pipeline — Action Router → Accessibility Reasoner → Screenshot fallback
  • Layer 2: Accessibility Reasoner (src/a11y-reasoner.ts) — text-only LLM reads the UI tree, no screenshots needed. Uses cheap models (Haiku, Qwen, GPT-4o-mini).
  • Batch action execution — Claude returns multiple actions per response (3.6 avg), skipping screenshots between batched actions. Drawing tasks execute 10+ actions in a single API call.
  • Focus hints — each screenshot includes a FOCUS directive telling Claude where to look, reducing output tokens and decision time
  • Auto-maximize — apps launched via Action Router are automatically maximized (Win+Up) for consistent layout
  • Region capture — captureRegionForLLM() crops screenshots to specific areas (2-30KB vs 58KB full)
  • Checkpoint strategy — screenshots only after critical state changes (app open, dialog appear), not after every action
  • Multi-provider support — Anthropic, OpenAI, Ollama (local/free), Kimi. Same codebase, auto-detected.
  • Provider model map (src/providers.ts) — auto-selects cheap/expensive models per provider
  • Self-healing — doctor falls back if a model is unavailable (e.g., Haiku → Qwen). Circuit breaker disables failing layers at runtime.
  • Streaming LLM responses — early JSON return saves 1-3s per call
  • Combined accessibility script (scripts/get-screen-context.ps1) — 1 PowerShell spawn instead of 3
  • Benchmark harness (test-perf-comparison.ts)
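
The runtime circuit breaker mentioned under self-healing could look roughly like this. It is a sketch under assumptions: the class name, the consecutive-failure rule, and the threshold of 3 are hypothetical and are not taken from the project.

```typescript
// Hypothetical sketch: disable a pipeline layer after repeated consecutive failures.
class LayerCircuitBreaker {
  private readonly threshold: number;
  private readonly failures = new Map<string, number>();

  // Assumption: three consecutive failures trip the breaker.
  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  recordFailure(layer: string): void {
    this.failures.set(layer, (this.failures.get(layer) ?? 0) + 1);
  }

  // A success resets the consecutive-failure count for that layer.
  recordSuccess(layer: string): void {
    this.failures.delete(layer);
  }

  isDisabled(layer: string): boolean {
    return (this.failures.get(layer) ?? 0) >= this.threshold;
  }
}

const breaker = new LayerCircuitBreaker();
breaker.recordFailure("a11y-reasoner");
breaker.recordFailure("a11y-reasoner");
breaker.recordFailure("a11y-reasoner");
console.log(breaker.isDisabled("a11y-reasoner")); // true
```

A pipeline would check `isDisabled` before invoking a layer and skip straight to the next one when the breaker is open.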

Performance

  • Screenshots: 120KB → ~80KB, 1280px target (HD for reliable icon identification)
  • JPEG quality: 70 → 65
  • Delays: 200-1500ms → 50-600ms across the board
  • System prompts: ~60% smaller (fewer tokens per call)
  • Accessibility tree: filtered to interactive elements only, 3000 char cap
  • Taskbar cache: 30s TTL (was queried every call)
  • Screen context cache: 500ms → 2s TTL
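
The taskbar and screen-context caches above are time-to-live caches. A minimal single-value version is sketched below, with an injectable clock so expiry is easy to test. The class `TtlCacheSketch` is hypothetical and the project's cache may be structured differently.

```typescript
// Hypothetical sketch: a single-value cache that reloads after ttlMs.
class TtlCacheSketch<T> {
  private value: T | undefined = undefined;
  private hasValue = false;
  private expiresAt = 0;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value while it is fresh; otherwise calls load() and stores the result.
  get(load: () => T): T {
    const t = this.now();
    if (this.hasValue && t < this.expiresAt) {
      return this.value as T;
    }
    const fresh = load();
    this.value = fresh;
    this.hasValue = true;
    this.expiresAt = t + this.ttlMs;
    return fresh;
  }
}

// Usage with the 2s screen-context TTL from the list above.
const screenContextCache = new TtlCacheSketch<string>(2000);
console.log(screenContextCache.get(() => "ui-tree snapshot"));
```

A longer TTL trades freshness for fewer expensive queries, which suits slow-changing data like the taskbar.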

Benchmarks

Task v0.4 v0.5 (Ollama, $0) v0.5 (Anthropic) v0.5 + Batch
Calculator 43s 2.6s 20.1s
Notepad 73s 2.0s 54.2s
File Explorer 53s 1.9s 22.1s
Paint stickman ~250s (30 calls) ~124s (19 calls) 101s (11 calls)
GitHub profile ~106s (15 calls)

[0.4.0] - 2026-02-22 — Native Desktop Control

VNC removed. Clawd Cursor now controls the desktop natively via @nut-tree-fork/nut-js. No VNC server required.

Breaking Changes

  • --vnc-host, --vnc-port, --vnc-password CLI flags removed
  • VNC_PASSWORD, VNC_HOST, VNC_PORT environment variables no longer used
  • rfb2 dependency removed
  • setup.ps1 no longer installs TightVNC

Added

  • NativeDesktop class (src/native-desktop.ts) — drop-in replacement for VNCClient
  • Direct screen capture via @nut-tree-fork/nut-js (~50ms vs ~850ms)
  • Direct mouse/keyboard control via OS-level APIs
  • Simplified onboarding: npm install && npm start

Performance

  • Screenshots: ~850ms → ~50ms (17× faster)
  • Connect time: ~200ms → ~38ms (5× faster)
  • Simple task (Google Docs sentence): ~120s → ~102s
  • Complex task (GitHub → Notepad → save): ~200s → ~156s

Removed

  • VNC server dependency (TightVNC)
  • rfb2 npm package
  • VNC-related CLI flags and environment variables
  • BGRA→RGBA color swap (nut-js returns RGBA natively)

[0.3.3] - 2025-03-15

Bulletproof Headless Setup

  • setup.ps1 now completes end-to-end in a single run on fresh systems, even in non-interactive/headless AI agent shells
  • Generate random VNC password when --vnc-password not provided non-interactively
  • Replace Start-Process -NoNewWindow -Wait with -PassThru -WindowStyle Hidden + try/catch (msiexec crash fix)
  • Wrap Start-Service in its own try/catch (post-install crash fix)
  • Replace all emoji with ASCII tags for cp1252 headless terminal compatibility

[0.3.1] - 2025-03-10

SKILL.md Security Hardening

  • Added YAML frontmatter, explicit credential declarations, privacy disclosure, and security considerations for ClawHub publishing.

[0.3.0] - 2025-03-01

Performance Optimizations (~70% faster)

  • Screenshot hash cache — skips LLM calls when the screen hasn't changed
  • Adaptive VNC frame wait — captures in ~200ms instead of fixed 800ms
  • Parallel screenshot + accessibility fetch — runs concurrently via Promise.all
  • Accessibility context cache — 500ms TTL eliminates redundant PowerShell queries
  • Async debug writes — no longer blocks the event loop
  • Exponential backoff with jitter — better retry resilience for API calls
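
Exponential backoff with jitter, the last item above, is commonly implemented as a capped exponential ceiling with a random delay drawn below it (the "full jitter" variant). The sketch below shows that variant. The base delay, the cap, and the function name are assumptions, and the project may use a different jitter scheme.

```typescript
// Hypothetical sketch of "full jitter" backoff: pick a random delay in
// [0, min(cap, base * 2^attempt)). The random source is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 8000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Delay ceilings grow 500, 1000, 2000, 4000, then stay capped at 8000.
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
```

The jitter spreads retries from concurrent callers over time, so they do not all hit the API at the same instant after a failure.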

[0.2.0] - 2025-02-21

🚀 Major: Anthropic Computer Use API

Clawd Cursor now supports Anthropic's native Computer Use API (computer_20250124) as the primary execution path. This is a fundamentally different approach — the full task goes directly to Claude with native computer use tools. No decomposition, no routing. Claude sees screenshots, plans, and executes natively.

Dual Execution Paths

The agent now has two separate code paths selected by provider:

  • Path A — Computer Use API (--provider anthropic): Full task sent to Claude with computer_20250124 tool. Claude sees the screen, plans multi-step sequences, and executes them natively. Handles complex, multi-app workflows reliably.
  • Path B — Decompose + Action Router (--provider openai / offline): Original approach from v0.1.0. Parse task → subtasks → Action Router (UI Automation, zero LLM) → Vision fallback. Faster and cheaper for simple tasks, works without an API key.
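
The provider-based selection between the two paths reduces to a small dispatch. The sketch below assumes the provider arrives as an optional string. The type and function names are hypothetical.

```typescript
// Hypothetical sketch: choose the execution path from the provider.
type ExecutionPathSketch = "computer-use" | "decompose-router";

function selectExecutionPath(provider?: string): ExecutionPathSketch {
  // Path A only for Anthropic; any other provider, or none, takes Path B.
  return provider === "anthropic" ? "computer-use" : "decompose-router";
}

console.log(selectExecutionPath("anthropic")); // computer-use
console.log(selectExecutionPath("openai"));    // decompose-router
console.log(selectExecutionPath());            // decompose-router
```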

Added

  • Anthropic Computer Use integration — native computer_20250124 tool type with anthropic-beta: computer-use-2025-01-24 header
  • Adaptive delays — per-action timing: 1000ms for app launch, 800ms for navigation, 100ms for typing, 300ms default
  • Verification hints — post-action verification prompts after each Computer Use step
  • Mouse drag — mouseDrag, mouseDown, mouseUp with smooth interpolation between points
  • Bulletproof system prompt — planning rules, ctrl+l for URL navigation, recovery strategies for failed actions
  • Display scaling — automatic resolution scaling to 1280×720 for Computer Use API compatibility
  • Vision model — claude-sonnet-4-20250514 for Computer Use path
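
The adaptive delays listed above amount to a lookup with a default. In the sketch below the millisecond values come from the entry, while the action-kind keys and the function name are hypothetical.

```typescript
// Hypothetical sketch: per-action delay lookup with a 300ms default.
const ACTION_DELAY_MS = new Map<string, number>([
  ["app_launch", 1000],
  ["navigation", 800],
  ["typing", 100],
]);
const DEFAULT_DELAY_MS = 300;

function delayForAction(kind: string): number {
  return ACTION_DELAY_MS.get(kind) ?? DEFAULT_DELAY_MS;
}

console.log(delayForAction("app_launch")); // 1000
console.log(delayForAction("click"));      // 300
```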

Test Results

| Task | Time | API Calls | Result |
|------|------|-----------|--------|
| Google Docs: open Chrome, go to Docs, write a paragraph | 187s | 14 | ✅ All succeeded |
| GitHub: open Chrome, navigate to profile, screenshot | 102s | | ✅ All succeeded |
| Notepad: open, write haiku, save to desktop | ~180s | | ✅ File saved correctly |
| Paint: draw a stick figure | ~90s | 16 | ✅ Drawing completed |

Breaking Changes

  • Provider selection now determines execution path. --provider anthropic uses Computer Use API (Path A). --provider openai or no provider uses the original Decompose + Action Router pipeline (Path B). This is a fundamental change in behavior — the same task will execute via completely different code paths depending on the provider.

Performance Characteristics

| | Path A (Computer Use) | Path B (Action Router) |
|---|---|---|
| Best for | Complex multi-step tasks | Simple single-action tasks |
| Reliability | Very high | Good for supported patterns |
| Speed | ~90–190s for complex tasks | ~2s for simple tasks |
| Cost | Higher (multiple API calls with screenshots) | Lower (1 text call or zero) |
| Offline | No | Yes (for common patterns) |

[0.1.0] - 2025-01-15

Initial Release

  • Action Router with Windows UI Automation — 80% of common tasks with zero LLM calls
  • Vision fallback for complex/unfamiliar UI
  • Smart task decomposition (single text-only LLM call)
  • Three-tier safety system (Auto / Preview / Confirm)
  • REST API and CLI interface
  • Windows setup script