38 commits
a28496b
v0.7.0: Architecture overhaul — multi-layer pipeline, universal tool …
AmrDab Mar 10, 2026
045ab74
docs: rewrite README for v0.7.0 — model-agnostic, 3 transport modes, …
AmrDab Mar 10, 2026
1d4d757
docs: add v0.6.3 vs v0.7.0 comparison table and "glove for any AI han…
AmrDab Mar 10, 2026
2a29dc4
Add ASCII art banner and styled onboarding disclaimer
AmrDab Mar 10, 2026
75ee114
fix: consent gate for MCP, --accept flag, consent subcommand
AmrDab Mar 10, 2026
afd451f
docs: v0.7.0 website — glove for any AI hand
AmrDab Mar 10, 2026
adab060
docs: simplify v0.7.0 website — match v0.6.3 style exactly
AmrDab Mar 10, 2026
4205a8f
fix: doctor silent skill registration + website plain-language copy
AmrDab Mar 10, 2026
7805e15
fix: remove built-in agent language from modes section
AmrDab Mar 10, 2026
3018e6b
fix: add consent --accept to SKILL.md install steps for non-interacti…
AmrDab Mar 10, 2026
ae3c30c
feat: LLM-backed semantic verifier with full attempt logging
AmrDab Mar 10, 2026
47f1676
fix: banner uses unicode escapes to survive Windows code-page encoding
AmrDab Mar 11, 2026
a4cffa1
feat: move big banner to onboarding (first-run only)
AmrDab Mar 11, 2026
866cd00
feat: production hardening — auth, CORS, timeout, terminal safety, mu…
AmrDab Mar 11, 2026
304c2c3
test: 55 passing unit tests for action router, safety, verifier, coor…
AmrDab Mar 11, 2026
a9401ba
feat: universal L3 vision fallback for all providers (GPT-4o, Gemini,…
AmrDab Mar 11, 2026
2585aaa
fix: CDP tab selection skips OEM widget pages, opens new tab as last …
AmrDab Mar 11, 2026
ba8a1b0
feat: add cdp_scroll action to L2 a11y-reasoner
AmrDab Mar 11, 2026
4d19d0f
docs: add v0.8.0 Claude Code spec — OCR-first, skill cache, vision LL…
AmrDab Mar 14, 2026
c792589
feat: add OCR engine bridge (v0.8.0 Step 1) — Windows.Media.Ocr via P…
AmrDab Mar 14, 2026
ba419ff
fix: sanitize control characters in OCR output before JSON parsing
AmrDab Mar 14, 2026
ae1d8c6
feat: expose shortcuts database via MCP tools (shortcuts_list, shortc…
AmrDab Mar 14, 2026
fd38767
feat: wire OCR-first pipeline with skill cache (v0.8.0 Steps 2-5)
AmrDab Mar 14, 2026
f537855
feat: add smart MCP tools for blind agent operation (smart_click, sma…
AmrDab Mar 14, 2026
84dcb24
refactor: OCR-first pipeline for smart tools — OCR primary, a11y supp…
AmrDab Mar 15, 2026
9280fb3
fix: audit hardening — input validation, safe JSON, silent catches, p…
AmrDab Mar 15, 2026
04d276e
fix: doctor smoke tests — real verification, not just pings
AmrDab Mar 15, 2026
89c46d3
docs: rewrite SKILL.md with agent-grade decision trees, remove test r…
AmrDab Mar 15, 2026
d70f7de
feat: cross-platform OCR and focused element support (macOS, Linux, A…
AmrDab Mar 15, 2026
bf111f2
docs: merge v0.6.3 clarity with v0.7.0 tool precision in SKILL.md
AmrDab Mar 16, 2026
123b4d4
fix: audit blockers — CLI auth, serve auth, emoji gate, v0.8.0 rebran…
AmrDab Mar 17, 2026
b71d014
rename: clawd-cursor → clawdcursor (remove hyphen everywhere)
AmrDab Mar 17, 2026
f2066d7
fix: remove all remaining hyphens from clawd-cursor identifiers
AmrDab Mar 17, 2026
27c6895
fix: auth hardening — protect /execute endpoints, lazy token init, de…
AmrDab Mar 17, 2026
cc29414
fix: coerce number/boolean tool params and fix server.tool() overload
itsuzef Mar 18, 2026
51ecf33
Stored the timer handle in timeoutHandle
itsuzef Mar 18, 2026
e720984
fix: add single-instance pidfile lock and SIGINT/SIGTERM teardown
itsuzef Mar 18, 2026
723b803
fix: release sharp RGBA buffers after processing and clear EventEmitt…
itsuzef Mar 18, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,5 +5,7 @@ dist/
.DS_Store
Thumbs.db
debug/
.clawdcursor-config.json
.clawd-config.json
qa-tests/
.claude/
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -17,7 +17,7 @@ All notable changes to Clawd Cursor will be documented in this file.
### Fixed
- **Checkpoint system overhaul** — removed auto-termination (completionRatio ≥ 0.90 early exit and isComplete() mid-loop kill), strict detection: content_pasted requires Ctrl+V, content_copied requires Ctrl+C, second_app_opened detects any window switch universally
- **Pipeline context passing** — `priorContext[]` accumulator flows from pre-processing through to Computer Use (no more amnesia between layers)
- **Credential resolution order** — .clawd-config → auth-profiles.json → openclaw.json (with template expansion) → env vars
- **Credential resolution order** — .clawdcursor-config → auth-profiles.json → openclaw.json (with template expansion) → env vars
- **`loadPipelineConfig()` path resolution** — checks package dir first, then cwd (fixes global npm installs)
- **Smart Interaction model lookup** — uses `PROVIDERS` registry instead of hardcoded model/baseUrl maps; fixes stale `claude-haiku-3-5-20241022` fallback
- **Scroll behavior** — system prompts instruct PageDown/Space instead of tiny mouse scrolls; default scroll delta 3 → 15
@@ -130,7 +130,7 @@ All notable changes to Clawd Cursor will be documented in this file.
- **Case-preserving action router** — all regex matches against raw (unmodified) task text. Typed text and URLs no longer get lowercased.
- **Flexible click matching** — `click Blank document` works without quotes (was requiring `click "Blank document"`). Single unified regex for quoted and unquoted element names.
- **PowerShell encoding** — replaced emoji (🐾) and em dash (—) in task console title that broke on Windows PowerShell due to encoding.
- **Stale config** — `.clawd-config.json` now correctly reflects Ollama when doctor detects it (was stuck on Anthropic).
- **Stale config** — `.clawdcursor-config.json` now correctly reflects Ollama when doctor detects it (was stuck on Anthropic).
- **Brain provider mismatch** — decomposition no longer calls Anthropic API when only Ollama is available.

### Changed
@@ -235,7 +235,7 @@ Layer 3: Screenshot + Vision — full screenshot, Computer Use API
## [0.5.0] - 2026-02-23 — Smart Pipeline + Doctor + Batch Execution

### Added
- **`clawd-cursor doctor`** — auto-diagnoses setup, tests models, configures optimal pipeline
- **`clawdcursor doctor`** — auto-diagnoses setup, tests models, configures optimal pipeline
- **3-layer pipeline** — Action Router → Accessibility Reasoner → Screenshot fallback
- **Layer 2: Accessibility Reasoner** (`src/a11y-reasoner.ts`) — text-only LLM reads the UI tree, no screenshots needed. Uses cheap models (Haiku, Qwen, GPT-4o-mini).
- **Batch action execution** — Claude returns multiple actions per response (3.6 avg), skipping screenshots between batched actions. Drawing tasks execute 10+ actions in a single API call.
578 changes: 246 additions & 332 deletions README.md

Large diffs are not rendered by default.

769 changes: 484 additions & 285 deletions SKILL.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/ACCESSIBILITY-RESEARCH.md
@@ -1,7 +1,7 @@
# Windows UI Automation from Node.js — Research Report

**Date:** 2026-02-19
**Context:** clawd-cursor desktop AI agent currently uses VNC + screenshot + vision for every action.
**Context:** clawdcursor desktop AI agent currently uses VNC + screenshot + vision for every action.
**Goal:** Add a Windows accessibility layer to enumerate UI elements, read properties, and interact by reference (not pixel coordinates).

---
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/MACOS-SETUP.md
@@ -81,8 +81,8 @@ If it works, you'll see something like: `{"name":"Terminal","pid":12345}`
## 2. Install & Build

```bash
git clone https://github.com/AmrDab/clawd-cursor.git
cd clawd-cursor && npm install && npm run build
git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor && npm install && npm run build
```

### Make macOS scripts executable
205 changes: 205 additions & 0 deletions docs/agent-guide.md
@@ -0,0 +1,205 @@
# clawdcursor Agent Guide

> This document teaches AI models how to use clawdcursor tools effectively.
> Include this in your system prompt or reference it when connecting to the tool server.

## What is clawdcursor?

clawdcursor is an OS-level desktop automation server. It gives you (the AI model)
eyes, hands, and ears on a real computer desktop. You can see the screen, click,
type, read UI elements, interact with browsers, and control any application.

**You are the brain. clawdcursor is the body.**

## Quick Start

```
1. read_screen → See what's on screen (text, fast, structured)
2. Decide what to do → Your reasoning
3. Execute an action → mouse_click, key_press, type_text, cdp_click, etc.
4. read_screen again → Verify the action worked
5. Repeat until done
```
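
The loop above can be sketched in TypeScript. This is only an illustration: `ToolCall`, `CallTool`, `Decide`, and `runTask` are hypothetical names, and the assumption that every tool returns plain text is made for brevity, not taken from the server's actual API.

```typescript
// Hypothetical shapes: this guide does not specify the transport or result format.
type ToolCall = { tool: string; args?: Record<string, unknown> };
type CallTool = (call: ToolCall) => Promise<string>;
// Returns the next action, or null once the model judges the task complete.
type Decide = (screenText: string) => ToolCall | null;

async function runTask(
  callTool: CallTool,
  decide: Decide,
  maxSteps = 20, // assumed safety budget so a confused model cannot loop forever
): Promise<number> {
  let steps = 0;
  while (steps < maxSteps) {
    // Steps 1 and 4: read the screen, which also verifies the previous action.
    const screenText = await callTool({ tool: "read_screen" });
    // Step 2: the model's reasoning picks the next action.
    const next = decide(screenText);
    if (next === null) break;
    // Step 3: execute the chosen action.
    await callTool(next);
    steps += 1;
  }
  return steps; // number of actions executed
}
```

Because each iteration starts with `read_screen`, the verification of one action doubles as the perception step for the next.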

## Core Principles

### 1. Text First, Vision Second
Always call `read_screen` before `desktop_screenshot`. The accessibility tree is:
- **Fast**: ~100ms vs ~500ms for a screenshot
- **Structured**: Named buttons, input fields, text values
- **Small**: A few KB of text vs a large image

Only use `desktop_screenshot` when:
- You need to see visual layout (charts, images, colors)
- The accessibility tree is empty or unhelpful (canvas apps, games)
- You need to verify visual state
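
One way to encode this rule on the client side is a small decision function. The sketch below is an assumption-laden illustration: `PerceptionNeed` and `choosePerceptionTool` are hypothetical names, and treating an all-whitespace tree as "empty" is a simplification.

```typescript
type PerceptionTool = "read_screen" | "desktop_screenshot";

// Hypothetical input describing what the model currently knows and needs.
interface PerceptionNeed {
  a11yTreeText: string | null; // result of a prior read_screen, or null if not called yet
  needsVisualDetail: boolean;  // charts, images, colors, or visual state checks
}

function choosePerceptionTool(need: PerceptionNeed): PerceptionTool {
  // Always start with the accessibility tree.
  if (need.a11yTreeText === null) return "read_screen";
  // Fall back to a screenshot only for visual questions...
  if (need.needsVisualDetail) return "desktop_screenshot";
  // ...or when the tree came back empty (canvas apps, games).
  if (need.a11yTreeText.trim().length === 0) return "desktop_screenshot";
  return "read_screen";
}
```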

### 2. CDP for Browsers, A11y for Native Apps
When working with a browser (Edge, Chrome):
- Call `navigate_browser` to open a URL with CDP enabled
- Call `cdp_connect` to establish the connection
- Use `cdp_click`, `cdp_type`, `cdp_read_text` for all interactions
- CDP is faster and more reliable than mouse clicks for web pages

When working with native apps (Notepad, Excel, File Explorer):
- Use `read_screen` to see the UI tree
- Use `mouse_click` at coordinates from the accessibility tree
- Use `key_press` for keyboard shortcuts
- Use `type_text` for entering text
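
As a rough sketch, the choice between the two tool families could be keyed on the process name of the target window. The names `interactionModeFor` and `preferredTools` are hypothetical, and the process list is an assumption: `msedge` appears in this guide, while `chrome` is a guess at Chrome's process name.

```typescript
type InteractionMode = "cdp" | "a11y";

// Illustrative only: "msedge" appears in this guide, "chrome" is assumed.
const BROWSER_PROCESSES = new Set<string>(["msedge", "chrome"]);

const TOOLS_BY_MODE: Record<InteractionMode, readonly string[]> = {
  cdp: ["cdp_click", "cdp_type", "cdp_read_text"],
  a11y: ["read_screen", "mouse_click", "key_press", "type_text"],
};

function interactionModeFor(processName: string): InteractionMode {
  // Normalize "Chrome.exe" and "chrome" to the same key.
  const normalized = processName.trim().toLowerCase().replace(/\.exe$/, "");
  return BROWSER_PROCESSES.has(normalized) ? "cdp" : "a11y";
}

function preferredTools(processName: string): readonly string[] {
  return TOOLS_BY_MODE[interactionModeFor(processName)];
}
```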

### 3. Verify Every Action
After every action, read the screen again to confirm it worked.
Don't assume success — verify it.

```
Bad: click "Save" → assume saved → done
Good: click "Save" → read_screen → confirm save dialog closed → done
```
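
The "Good" pattern can be wrapped in a helper that never reports success without a fresh `read_screen`. This is a minimal sketch under assumptions: `ToolCall`, `CallTool`, and `actAndVerify` are hypothetical, and the check is a caller-supplied predicate over the screen text.

```typescript
// Hypothetical shapes: the real tool client and result format may differ.
type ToolCall = { tool: string; args?: Record<string, unknown> };
type CallTool = (call: ToolCall) => Promise<string>;

async function actAndVerify(
  callTool: CallTool,
  action: ToolCall,
  looksDone: (screenText: string) => boolean,
): Promise<boolean> {
  await callTool(action);
  // Re-read the screen instead of trusting the action's own return value.
  const screenText = await callTool({ tool: "read_screen" });
  return looksDone(screenText);
}
```

For the save example, `looksDone` might check that the screen text no longer mentions the save dialog.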

### 4. Use Keyboard Shortcuts
Keyboard shortcuts are faster and more reliable than clicking:
- `ctrl+s` to save
- `ctrl+a` to select all
- `ctrl+c` / `ctrl+v` for copy/paste
- `ctrl+n` for new document
- `alt+tab` to switch windows
- `ctrl+w` to close tab
- `Return` to confirm dialogs

## Tool Categories

### Perception (see the screen)
| Tool | When to use |
|------|-------------|
| `read_screen` | Always start here. Returns accessibility tree. |
| `desktop_screenshot` | When you need visual confirmation or a11y tree is empty. |
| `desktop_screenshot_region` | Zoom into a specific area for detail. |
| `get_screen_size` | Get screen dimensions and DPI info. |
| `get_windows` | List all open windows. |
| `get_active_window` | Check which window has focus. |
| `get_focused_element` | Check which UI element has keyboard focus. |

### Actions (control the computer)
| Tool | When to use |
|------|-------------|
| `mouse_click` | Click a UI element at image-space coordinates. |
| `mouse_double_click` | Open files, select words. |
| `mouse_right_click` | Open context menus. |
| `mouse_scroll` | Scroll pages, lists, documents. |
| `mouse_drag` | Select text, move objects, resize. |
| `mouse_hover` | Reveal tooltips or hover menus. |
| `key_press` | Keyboard shortcuts and special keys. |
| `type_text` | Enter text into focused input. |

### Window Management
| Tool | When to use |
|------|-------------|
| `focus_window` | Bring a window to front (by name, PID, or title). |
| `find_element` | Search for a specific UI element by name or type. |
| `open_app` | Launch an application. |

### Browser (CDP)
| Tool | When to use |
|------|-------------|
| `navigate_browser` | Open a URL (launches browser with CDP). |
| `cdp_connect` | Connect to browser's DevTools Protocol. |
| `cdp_page_context` | List interactive elements (buttons, inputs, links). |
| `cdp_read_text` | Extract text from a page or element. |
| `cdp_click` | Click by CSS selector or visible text. |
| `cdp_type` | Type into input by selector or label. |
| `cdp_select_option` | Select dropdown option. |
| `cdp_evaluate` | Run arbitrary JavaScript. |
| `cdp_wait_for_selector` | Wait for element to appear. |
| `cdp_list_tabs` | List open browser tabs. |
| `cdp_switch_tab` | Switch to a different tab. |

### Clipboard
| Tool | When to use |
|------|-------------|
| `read_clipboard` | Read clipboard contents. |
| `write_clipboard` | Write text to clipboard. |

### Orchestration
| Tool | When to use |
|------|-------------|
| `delegate_to_agent` | Hand off a complex task to the autonomous pipeline. |
| `wait` | Pause after animations, page loads, transitions. |

## Common Patterns

### Open an app and type something
```
1. open_app("notepad")
2. wait(2)
3. type_text("Hello, world!")
```

### Search the web
```
1. navigate_browser("https://google.com")
2. cdp_connect()
3. cdp_type(selector: "textarea[name='q']", text: "clawdcursor")
4. key_press("Return")
5. wait(2)
6. cdp_read_text() → extract search results
```

### Copy text between apps
```
1. focus_window(processName: "msedge")
2. key_press("ctrl+a") → select all
3. key_press("ctrl+c") → copy
4. read_clipboard() → verify content
5. focus_window(processName: "notepad")
6. key_press("ctrl+v") → paste
```

### Fill out a web form
```
1. navigate_browser("https://example.com/form")
2. cdp_connect()
3. cdp_page_context() → see all inputs
4. cdp_type(label: "Name", text: "John Doe")
5. cdp_type(label: "Email", text: "john@example.com")
6. cdp_select_option(selector: "#country", value: "US")
7. cdp_click(text: "Submit")
```

### Multi-app workflow (web research → document)
```
1. navigate_browser("https://en.wikipedia.org/wiki/Tokyo")
2. cdp_connect()
3. cdp_read_text(selector: "#mw-content-text") → extract info
4. open_app("notepad")
5. wait(2)
6. type_text("Tokyo Research Notes\n\n" + extracted_info)
7. key_press("ctrl+s")
```

## Coordinate System

All mouse tools use **image-space coordinates** — these match the 1280px-wide
screenshots from `desktop_screenshot`. The server automatically converts to
the correct OS coordinates (handling DPI scaling).

You do NOT need to worry about DPI, physical pixels, or logical pixels.
Just use the coordinates you see in screenshots.
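
For intuition only, the arithmetic behind such a conversion might look like the sketch below. It assumes a uniform scale factor derived from the screen width; the server's real conversion is not shown in this guide and may differ, and `imageToScreen` is a hypothetical name.

```typescript
const SCREENSHOT_WIDTH = 1280; // image-space width stated in this guide

interface Point {
  x: number;
  y: number;
}

// Assumption: the screenshot is a uniformly scaled copy of the screen,
// so one factor applies to both axes.
function imageToScreen(p: Point, screenWidthPx: number): Point {
  const scale = screenWidthPx / SCREENSHOT_WIDTH;
  return { x: Math.round(p.x * scale), y: Math.round(p.y * scale) };
}

// On a 2560px-wide screen the factor is 2.
const center = imageToScreen({ x: 640, y: 360 }, 2560); // { x: 1280, y: 720 }
```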

## Safety

- `alt+f4`, `ctrl+alt+delete` are blocked
- The server only binds to localhost (127.0.0.1)
- `type_text` uses clipboard paste (reliable, no dropped characters)
- All actions are logged
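
A client can avoid wasted calls by checking a key combination against the blocked list before sending it. The sketch below is hypothetical: `normalizeCombo` and `isBlockedCombo` are illustrative names, and the server's own matching rules (case, key order, aliases) are not specified in this guide.

```typescript
// Normalize case, whitespace, and key order so "Ctrl+Alt+Delete"
// and "alt+ctrl+delete" compare equal.
function normalizeCombo(combo: string): string {
  return combo
    .toLowerCase()
    .split("+")
    .map((key) => key.trim())
    .filter((key) => key.length > 0)
    .sort()
    .join("+");
}

// The two combinations this guide lists as blocked.
const BLOCKED_COMBOS = new Set<string>(["alt+f4", "ctrl+alt+delete"].map(normalizeCombo));

function isBlockedCombo(combo: string): boolean {
  return BLOCKED_COMBOS.has(normalizeCombo(combo));
}
```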

## Error Handling

If a tool returns `isError: true`:
1. Read the error message
2. Try an alternative approach
3. Don't repeat the same failing action more than twice

Common errors:
- "Not connected to CDP" → call `cdp_connect` first
- "No window found" → check `get_windows` for the correct process name
- "Click failed" → verify coordinates with `read_screen` or `desktop_screenshot`
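
These rules can be sketched as a retry helper with a cap of two tries per approach, plus a lookup for the common errors listed above. `ToolResult`, `tryApproaches`, and `suggestRecovery` are hypothetical names; only the `isError` flag and the error strings come from this guide, and the `text` field is an assumption.

```typescript
// Hypothetical result shape: only `isError` is named in this guide.
interface ToolResult {
  isError: boolean;
  text: string;
}
type Attempt = () => Promise<ToolResult>;

// Try each approach at most twice, then move on to the next alternative.
async function tryApproaches(approaches: Attempt[], maxTriesPerApproach = 2): Promise<ToolResult> {
  let last: ToolResult = { isError: true, text: "no approaches supplied" };
  for (const attempt of approaches) {
    for (let i = 0; i < maxTriesPerApproach; i++) {
      last = await attempt();
      if (!last.isError) return last;
    }
  }
  return last; // every approach failed; surface the final error
}

// Map the common error messages to the tool to call next.
function suggestRecovery(errorText: string): string | null {
  if (errorText.includes("Not connected to CDP")) return "cdp_connect";
  if (errorText.includes("No window found")) return "get_windows";
  if (errorText.includes("Click failed")) return "read_screen";
  return null;
}
```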