feat: add open-operator — AI agent browser automation #614

Open
jackwener wants to merge 1 commit into main from open-operator

@jackwener
Owner

Summary

Add opencli operate command — an AI agent that autonomously controls the browser to complete tasks described in natural language, then saves successful operations as reusable TypeScript CLI skills.

End-to-end verified: Successfully tested on example.com (data extraction) and Twitter/X (DM listing with login state reuse).

Key capabilities

  • opencli operate "task" — LLM-driven browser control loop (observe → reason → act → repeat)
  • opencli operate --save-as site/name — Saves successful operations as .ts adapters via LLM code generation
  • Native CDP input: Input.dispatchMouseEvent/KeyEvent for isTrusted:true events, with automatic JS injection fallback
  • Rich trace capture — Network interceptor captures API responses for intelligent strategy selection
  • API discovery — Analyzes captured requests to find the "golden API", recommends optimal strategy (PUBLIC/COOKIE/UI)
  • Self-repair — Generated adapters are syntax-validated; errors are fed back to LLM for auto-fix
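
As a rough illustration of the native CDP input path, the sketch below builds the Input.dispatchMouseEvent payloads for one left click. The payload fields (type, x, y, button, clickCount) come from the Chrome DevTools Protocol; the helper names clickEvents and dispatchClick and the send callback are hypothetical and are not taken from this PR.

```typescript
// Subset of the parameters accepted by CDP Input.dispatchMouseEvent.
interface MouseEventParams {
  type: "mouseMoved" | "mousePressed" | "mouseReleased";
  x: number;
  y: number;
  button: "none" | "left";
  clickCount: number;
}

// Hypothetical helper: the three events that make up one left click at (x, y).
function clickEvents(x: number, y: number): MouseEventParams[] {
  return [
    { type: "mouseMoved", x, y, button: "none", clickCount: 0 },
    { type: "mousePressed", x, y, button: "left", clickCount: 1 },
    { type: "mouseReleased", x, y, button: "left", clickCount: 1 },
  ];
}

// Hypothetical transport: `send` stands in for whatever forwards a CDP command.
type CdpSend = (method: string, params: object) => void;

// Dispatch the click as a move, press, release sequence.
function dispatchClick(send: CdpSend, x: number, y: number): void {
  for (const params of clickEvents(x, y)) {
    send("Input.dispatchMouseEvent", params);
  }
}
```

In this sketch a caller would fall back to JS injection only when send throws, which mirrors the fallback described in the list above.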

Architecture

CLI → Agent Loop → DOM Snapshot + LLM → Execute Actions → Observe → Repeat
                                                              ↓
                                              Rich Trace (actions + network + auth)
                                                              ↓
                                              API Discovery → LLM Code Gen → .ts Adapter
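
The control loop in the diagram could look roughly like the following. This is a minimal sketch with injected dependencies so it can run against stubs; Observation, Decision, LoopDeps, and runAgentLoop are hypothetical names, and the real schemas in src/agent/types.ts are likely richer.

```typescript
// Hypothetical shapes; not the PR's actual types.
interface Observation { url: string; dom: string; }
type Decision =
  | { kind: "act"; action: string }
  | { kind: "done"; result: string };

interface LoopDeps {
  observe: () => Promise<Observation>; // take a DOM snapshot
  decide: (task: string, obs: Observation, history: string[]) => Promise<Decision>; // LLM call
  act: (action: string) => Promise<void>; // execute the action in the browser
}

interface LoopResult { success: boolean; result?: string; trace: string[]; }

// Observe, reason, act, and repeat until the model reports completion
// or the step budget runs out.
async function runAgentLoop(task: string, deps: LoopDeps, maxSteps = 10): Promise<LoopResult> {
  const trace: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const obs = await deps.observe();
    const decision = await deps.decide(task, obs, trace);
    if (decision.kind === "done") {
      return { success: true, result: decision.result, trace };
    }
    await deps.act(decision.action);
    trace.push(decision.action); // the trace later feeds skill generation
  }
  return { success: false, trace };
}
```

Keeping the browser and the LLM behind the LoopDeps interface is a design choice of this sketch, made so the loop can be exercised without a live browser.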

New files

| File | Purpose |
| --- | --- |
| src/agent/agent-loop.ts | Core LLM-driven control loop |
| src/agent/action-executor.ts | Action dispatch with CDP→JS fallback |
| src/agent/dom-context.ts | DOM snapshot + element coordinate map |
| src/agent/prompts.ts | System prompt & step message builder |
| src/agent/llm-client.ts | Anthropic SDK wrapper (supports ANTHROPIC_BASE_URL) |
| src/agent/types.ts | Zod schemas for actions & responses |
| src/agent/trace-recorder.ts | Rich context capture (network + auth + thinking) |
| src/agent/api-discovery.ts | API scoring & strategy recommendation |
| src/agent/skill-saver.ts | LLM-powered TS adapter generation + validation |

Security

  • CDP passthrough uses a method allowlist (22 permitted methods)
  • Network response bodies are sanitized before writing to disk (tokens/passwords redacted)
  • Auth tokens stored as boolean flags only, never actual values
  • --save-as names validated against path traversal ([a-zA-Z0-9_-] only)
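
The checks above might be implemented along these lines. This is a sketch under assumptions: parseSaveAs, assertCdpAllowed, and redact are hypothetical names, the allowlist shown is an illustrative subset (the PR's 22 methods are not listed in this description), and the sensitive-key pattern is a guess at what "tokens/passwords" covers.

```typescript
// Rule from the description: each segment of site/name may contain only [a-zA-Z0-9_-].
const SEGMENT = /^[a-zA-Z0-9_-]+$/;

// Parse "site/name" and reject anything that could escape the adapter directory.
function parseSaveAs(value: string): { site: string; name: string } {
  const parts = value.split("/");
  if (parts.length !== 2 || !parts.every((p) => SEGMENT.test(p))) {
    throw new Error(`Invalid --save-as value: ${JSON.stringify(value)}`);
  }
  return { site: parts[0], name: parts[1] };
}

// Illustrative subset only; these are real CDP methods, but the PR's list is not shown.
const ALLOWED_CDP = new Set<string>([
  "Input.dispatchMouseEvent",
  "Input.dispatchKeyEvent",
  "Page.captureScreenshot",
]);

// Refuse to forward any CDP method that is not explicitly allowed.
function assertCdpAllowed(method: string): void {
  if (!ALLOWED_CDP.has(method)) {
    throw new Error(`CDP method not allowed: ${method}`);
  }
}

// Assumed pattern for keys whose values should never reach disk.
const SENSITIVE_KEY = /token|password|secret|authorization|cookie/i;

// Recursively replace sensitive values with a fixed marker.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = SENSITIVE_KEY.test(k) ? "[REDACTED]" : redact(v);
    }
    return out;
  }
  return value;
}
```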

Dependencies

  • zod — Runtime validation of LLM structured output
  • @anthropic-ai/sdk — Anthropic API client

Test plan

  • opencli operate --help shows correct usage
  • opencli operate --url https://example.com "extract heading" — completes in 1 step
  • opencli operate --url https://x.com "get my DMs" — navigates, extracts 10 DMs with login state
  • opencli operate --save-as test/heading "extract heading" — generates .ts adapter
  • TypeScript compilation passes (main project + extension)
  • npm run build succeeds (392 manifest entries)
  • Replay generated adapter: opencli test heading

@jackwener force-pushed the open-operator branch 4 times, most recently from 68e5e94 to 7fde454 on April 1, 2026 17:15
Complete AI agent browser automation system for OpenCLI:

Core Agent (src/agent/):
- LLM-driven browser control loop with 14 action types
- Planning system with plan CRUD and replan nudges
- Sliding-window loop detection with page fingerprinting
- Autocomplete field detection with 3-layer protocol
- Value mismatch detection (post-type readback)
- Final response after failure (captures partial results)
- Message compaction with epistemic guard (unverified labels)
- Sensitive data masking before LLM
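
One way to read "sliding-window loop detection with page fingerprinting" is sketched below. The hash function, window size, and repeat threshold are assumptions chosen for illustration; fingerprint and LoopDetector are hypothetical names, not the PR's code.

```typescript
// Cheap non-cryptographic fingerprint (djb2-style) of URL plus DOM text.
function fingerprint(url: string, domText: string): string {
  const input = `${url}\n${domText}`;
  let h = 5381;
  for (let i = 0; i < input.length; i++) {
    h = ((h * 33) ^ input.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h.toString(16);
}

class LoopDetector {
  private readonly window: string[] = [];
  private readonly size: number;
  private readonly threshold: number;

  constructor(size = 8, threshold = 3) {
    this.size = size;
    this.threshold = threshold;
  }

  // Record one step. Returns true when the same action was taken on the same
  // page state at least `threshold` times within the last `size` steps.
  record(action: string, pageFingerprint: string): boolean {
    const key = `${action}@${pageFingerprint}`;
    this.window.push(key);
    if (this.window.length > this.size) this.window.shift();
    const repeats = this.window.filter((k) => k === key).length;
    return repeats >= this.threshold;
  }
}
```

Pairing the action with the page fingerprint means that repeating an action is only flagged when the page did not change in response, which is the case where a replan nudge is useful.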

Multi-Provider LLM:
- OPENCLI_PROVIDER (anthropic/openai) + OPENCLI_MODEL + OPENCLI_API_KEY
- Model aliases: sonnet, opus, haiku, gpt-5.4, gpt-4.1, o3
- HTML proxy detection with actionable error messages
- Prompt caching for Anthropic, token tracking for both
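
The HTML proxy detection mentioned above could work roughly as follows: an LLM API is expected to answer with JSON, so an HTML body usually means a proxy, login page, or wrong base URL responded instead. looksLikeHtml and parseLlmResponse are hypothetical names, and the error wording is illustrative.

```typescript
// Heuristic: treat the response as HTML if the content type says so
// or the body starts with a doctype or <html> tag.
function looksLikeHtml(body: string, contentType = ""): boolean {
  if (/text\/html/i.test(contentType)) return true;
  return /^\s*(<!doctype html|<html[\s>])/i.test(body);
}

// Parse a raw LLM API response, failing with an actionable message on HTML.
function parseLlmResponse(body: string, contentType: string, baseUrl: string): unknown {
  if (looksLikeHtml(body, contentType)) {
    throw new Error(
      `Received HTML instead of JSON from ${baseUrl}. ` +
        "Check that the base URL points at the API endpoint and that no proxy " +
        "or login page is intercepting the request."
    );
  }
  return JSON.parse(body);
}
```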

Browser Control:
- CDP passthrough with method allowlist (22 permitted methods)
- scrollIntoView before click/type (fixes off-viewport elements)
- Three-strategy click fallback: CDP → page.click → evaluate JS
- Two-layer retry for extension interference (operate: 5x/1.5s, normal: 2x/500ms)
- about:blank for blank tabs (prevents extension hijacking)
- opencli browse commands for Claude Code skill integration
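
The click fallback and the retry layer can be sketched as two small helpers. This assumes "5x/1.5s" means five attempts with a 1.5 s delay between them; firstSuccessful and withRetry are hypothetical names, and the strategies are passed in as callbacks rather than tied to any real browser API.

```typescript
interface ClickStrategy {
  name: string;
  run: () => Promise<void>;
}

// Try each strategy in order and return the name of the first that succeeds.
async function firstSuccessful(strategies: ClickStrategy[]): Promise<string> {
  const errors: string[] = [];
  for (const s of strategies) {
    try {
      await s.run();
      return s.name;
    } catch (err) {
      errors.push(`${s.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All click strategies failed: ${errors.join("; ")}`);
}

// Retry a flaky operation a fixed number of times with a fixed delay between attempts.
async function withRetry<T>(fn: () => Promise<T>, attempts: number, delayMs: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

Under the assumption above, the operate path would call withRetry(fn, 5, 1500) and the normal path withRetry(fn, 2, 500), each wrapping a firstSuccessful call over the CDP, page.click, and evaluate strategies.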

Skill Generation:
- Rich trace capture (network interceptor + auth context + thinking log)
- API discovery: score requests to find golden API, recommend strategy
- LLM-powered TS adapter generation with validation + self-repair
- node_modules symlink for user TS adapter package resolution
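
API discovery, as described, scores captured requests and then recommends PUBLIC, COOKIE, or UI. The scoring weights, the threshold, and the sentCookies flag below are invented for illustration; the actual heuristics in src/agent/api-discovery.ts are not shown in this PR description.

```typescript
interface CapturedRequest {
  url: string;
  status: number;
  contentType: string;
  bodyBytes: number;
  sentCookies: boolean; // assumed flag: did the request carry cookies or auth headers?
}

type AdapterStrategy = "PUBLIC" | "COOKIE" | "UI";

// Hypothetical scoring: favor successful JSON responses from API-looking URLs,
// penalize static assets.
function scoreRequest(r: CapturedRequest): number {
  let score = 0;
  if (/json/i.test(r.contentType)) score += 3;
  if (r.status >= 200 && r.status < 300) score += 2;
  if (/\/api\/|graphql/i.test(r.url)) score += 2;
  if (r.bodyBytes > 500) score += 1;
  if (/\.(js|css|png|jpg|svg|woff2?)(\?|$)/i.test(r.url)) score -= 5;
  return score;
}

// Pick the best-scoring request; fall back to UI replay when nothing qualifies.
function recommend(
  requests: CapturedRequest[],
  threshold = 5
): { strategy: AdapterStrategy; goldenApi?: CapturedRequest } {
  let best: CapturedRequest | undefined;
  let bestScore = -Infinity;
  for (const r of requests) {
    const s = scoreRequest(r);
    if (s > bestScore) {
      best = r;
      bestScore = s;
    }
  }
  if (!best || bestScore < threshold) return { strategy: "UI" };
  return { strategy: best.sentCookies ? "COOKIE" : "PUBLIC", goldenApi: best };
}
```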

AutoResearch:
- 59-task eval framework (49 train + 10 test, 8 categories)
- Claude Code research instructions (program.md)
- Baseline: 56/59 (95%), train 48/49 (98%)

Documentation:
- OPERATE.md: full user guide with configuration, costs, troubleshooting
- SKILL.md split into 3 specialized skills (cli, operate, adapter-dev)
- README updated with AI Agent section

New CLI commands:
- opencli operate|op <task> — AI agent browser automation
- opencli browse open/state/click/type/eval/screenshot/scroll/back/close
- opencli doctor — now shows LLM provider status

Dependencies: zod, @anthropic-ai/sdk, openai