feat(accessibility): vision-click fallback for Electron/partial-AX apps (8/8 of #3307)#3362
Conversation
📝 WalkthroughWalkthroughThis PR adds a vision-based fallback for automate: screenshot capture, vision-model locate parsing, pixel→screen mapping, main-thread guarded absolute clicks, integration into the automate loop with new Action.description and backend hooks, comprehensive tests, and feature-tracker documentation updates. ChangesVision-Click Accessibility Fallback
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
…fallback) in the tracker header
…trator Registers the AutomateTool (multi-step UI flows in one call) and the ax_interact denylist/opt-in plumbing; adds the catalog toggle, tool definition, and orchestrator prompt guidance (automate + screenshot/ mouse/keyboard fallback for Electron apps with empty AX trees). Slice 3/7 of tinyhumansai#3307 (tool wiring + prompts).
Continuous cpal mic → VAD segmenter → STT → agent with no hotkey, opt-in via voice_server.always_on_enabled, 'Hey Tiny' wake word (English-forced STT + fuzzy match), and screen-lock privacy pause. Adds the config schema, live-apply on the settings RPC, start_if_enabled wiring, and a JSON-RPC roundtrip E2E. Slice 4/7 of tinyhumansai#3307 (always-on core).
Surfaces the always-on listening toggle in the reachable Voice panel, adds the VoiceDebugPanel, the voice tauri-command wrapper, and the RPC client method. Adds all voice.debug.* and notch.* i18n keys across the 14 locales (notch keys land here as inert strings; the notch UI that consumes them ships in slice 6). Slice 5/7 of tinyhumansai#3307 (always-on frontend).
Transparent NSPanel + WKWebView anchored at the top-centre of the primary screen showing live Ready/Listening/Processing state; automate streams step progress to it via the overlay:attention socket bridge. macOS only; no-op elsewhere. Slice 6/7 of tinyhumansai#3307 (notch status pill).
Routes always-on utterances through a fast intent classifier before the chat model, wired into always-on delivery; ties the notch indicator visibility to always-on listening. Adds the window tauri-command wrapper and the core-process permission entry. Slice 7/7 of tinyhumansai#3307 (Phase 3 fast routing).
…ps (Phase 1.5)
Adds a model-chosen `vision_click { description }` action to the `automate`
loop for apps that expose no usable accessibility tree (Slack, Discord,
VS Code). Flow: screenshot the app window -> ask the main vision model for the
target's pixel coordinates (via the existing `[IMAGE:]` marker path) -> map
image pixels to absolute screen points -> guarded left-click.
- New `accessibility/vision_click.rs`: pure `image_to_screen` coordinate
transform (folds in the deferred F2 mapping -- the px->pt ratio absorbs the
capture downscale + Retina backing scale), tolerant locate-response parser,
capture geometry, and the main-thread guarded click (`run_input_on_main`,
Change 1.15).
- Section 1.8 safety guard: only clicks when the target app is frontmost;
refuses on positive evidence another app is focused, so synthetic input never
lands on OpenHuman's own CEF window.
- Reuses the main `chat` vision provider -- no new inference API, no new tool,
no new approval surface (inherits `automate`'s Dangerous + mutations gate).
- 19 new unit tests (pure transform/parse + scripted-backend loop integration,
incl. the frontmost-refusal guard). All 25 automate + vision_click tests green.
Closes the last open Phase 1.5 item (tinyhumansai#3148). Stacks on tinyhumansai#3340-tinyhumansai#3346.
…fallback) in the tracker header
919b3d1 to
8828ce2
Compare
Take main's merged slice-1..7 versions; keep slice-8's vision_click work (automate vision verb + accessibility/vision_click.rs) and the forward Phase 1.5/Change 1.16 docs. Drop the duplicated desktop-control prompt section + the spurious agent.toml re-add (vision_click is an automate action, not a named tool).
Independent review (beyond the CodeRabbit pass)Reviewed the vision-click fallback — the Reviewed clean
The OS-bound No correctness issues. LGTM once CI is green. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/openhuman/accessibility/automate_tests.rs (1)
356-376: 💤 Low valueTest assertion for re-foreground attempt is missing.
The test verifies that no screenshot or click happens when another app is frontmost, which is correct. However, it doesn't verify that the implementation attempted to re-foreground the target app (via
act_launch) before refusing. This would confirm the full guard flow: detect mismatch → re-foreground → check again → refuse.💡 Optional: assert re-foreground attempt
let acts = backend.acts(); assert!( !acts.iter().any(|a| a.starts_with("click:")), "must not click into a non-target app: {acts:?}" ); assert!( !acts.iter().any(|a| a.starts_with("screenshot:")), "must not even screenshot when refused: {acts:?}" ); + // The guard should have attempted to re-foreground before refusing. + assert!( + acts.iter().filter(|a| *a == "launch:Slack").count() >= 2, + "expected re-foreground attempt after initial launch: {acts:?}" + ); let _ = out; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/accessibility/automate_tests.rs` around lines 356 - 376, Add an assertion in the vision_click_refused_when_other_app_frontmost test to verify the implementation attempted to re-foreground the target app before refusing; after collecting acts from ScriptedBackend::acts() assert that there's an entry indicating a launch/re-foreground attempt (e.g. an act that starts with "act_launch:" or "launch:") so the sequence is detected (detect mismatch → re-foreground attempt via act_launch → final refusal with no screenshot/click).src/openhuman/accessibility/automate.rs (1)
458-520: 💤 Low valueConsider adding debug-level entry/exit logging for the vision_click handler.
The vision_click handler logs warnings for edge cases (re-foreground) but lacks debug-level logging at the handler entry point. This would help with grep-friendly tracing during development and debugging. As per coding guidelines for
src/**/*.rs: "Add substantial debug-level logs... at entry/exit points, branch decisions".🔧 Optional: add entry debug log
"vision_click" => { let description = action.description.trim(); + log::debug!( + "{LOG_PREFIX} vision_click: app={target_app:?} description={description:?}" + ); if description.is_empty() { steps.push("vision_click skipped: empty description".to_string()); continue; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/accessibility/automate.rs` around lines 458 - 520, Add debug-level entry/exit and key-branch logs for the "vision_click" match arm: at the start of the "vision_click" handler log a debug message (using log::debug! and LOG_PREFIX) that includes target_app and trimmed description, and on exit log success/failure with the final step reason; also add debug logs before/after the frontmost_app check (including when re-foregrounding), before/after backend.screenshot(), before/after backend.locate() (including locate None vs Some), and before/after backend.click() so the flow through backend.frontmost_app(), backend.act_launch(), backend.screenshot(), backend.locate(), and backend.click() is traceable for grep-friendly debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/voice-system-actions.md`:
- Around line 365-372: Update the contradictory section to state the blocker is
resolved: replace the present-tense "Fix required (not yet done)" paragraph
about enigo/TSMGetInputSourceProperty with a past-tense note that the issue was
fixed by dispatching enigo calls to the Tauri main thread via the new
run_input_on_main helper (implemented in main_thread.rs and exposed through the
native-registry handler), and mention that keyboard/mouse tools and vision_click
now run safely without TSM traps.
---
Nitpick comments:
In `@src/openhuman/accessibility/automate_tests.rs`:
- Around line 356-376: Add an assertion in the
vision_click_refused_when_other_app_frontmost test to verify the implementation
attempted to re-foreground the target app before refusing; after collecting acts
from ScriptedBackend::acts() assert that there's an entry indicating a
launch/re-foreground attempt (e.g. an act that starts with "act_launch:" or
"launch:") so the sequence is detected (detect mismatch → re-foreground attempt
via act_launch → final refusal with no screenshot/click).
In `@src/openhuman/accessibility/automate.rs`:
- Around line 458-520: Add debug-level entry/exit and key-branch logs for the
"vision_click" match arm: at the start of the "vision_click" handler log a debug
message (using log::debug! and LOG_PREFIX) that includes target_app and trimmed
description, and on exit log success/failure with the final step reason; also
add debug logs before/after the frontmost_app check (including when
re-foregrounding), before/after backend.screenshot(), before/after
backend.locate() (including locate None vs Some), and before/after
backend.click() so the flow through backend.frontmost_app(),
backend.act_launch(), backend.screenshot(), backend.locate(), and
backend.click() is traceable for grep-friendly debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 38870e05-56d2-45a2-825c-10a924d267c7
📒 Files selected for processing (6)
docs/voice-system-actions.mdsrc/openhuman/accessibility/automate.rssrc/openhuman/accessibility/automate_tests.rssrc/openhuman/accessibility/mod.rssrc/openhuman/accessibility/vision_click.rssrc/openhuman/accessibility/vision_click_tests.rs
Re-apply the tinyhumansai#3346 reconciliation lost when taking slice-8's docs: the 'Fix required (not yet done) / keep disabled' paragraph contradicted the '✅ Crash fixed' status. Now past-tense root-cause-fixed (run_input_on_main on the main thread + catch_unwind); covers vision_click too. Tag the trace fence as text (MD040).
Summary
vision_click { description }action to theautomateloop so it can drive apps that expose no usable accessibility tree — Electron/Chromium apps (Slack, Discord, VS Code) where the perceive→press loop had nothing to act on.image_to_screentransform whose px→pt ratio absorbs both the capture downscale and the Retina backing scale (no explicit scale factor needed).[IMAGE:]marker → image part, Attachments trigger Something went wrong in chat #3205) and the mainchatvision provider — no new inference API, no new tool, no new approval surface (inheritsautomate's Dangerous + mutations gate).This is the last open Phase 1.5 item of the voice→system-action feature; see
docs/voice-system-actions.mdChange 1.16.Problem
automate's inner loop reads a filtered accessibility snapshot and presses elements by label. Electron/Chromium apps expose little or no AX/UIA, soperceivereturns an empty list and the loop is stuck — there's no label to press. The planned answer (tracker §1.5) was screenshot → vision-locate → guarded click, blocked by two things: the fast inner-loop model is text-only, and the screenshot is windowed + downscaled (Retina 2×) whilemouseexpects absolute screen points (the deferred F2 mapping gap), so a vision-returned coordinate would click the wrong spot.Solution
src/openhuman/accessibility/vision_click.rs(new):CaptureGeometry+ pureimage_to_screen(the coordinate transform), tolerant locate-response parser,capture_window_geometry,locate_via_vision, and the main-threadguarded_click(run_input_on_main, Change 1.15 — off-thread enigo traps TSM).automate.rs: newAction.description, avision_clicksystem-prompt verb ("use when the element list is EMPTY"), the no-progress signature extended withdescription, andRealBackend::{screenshot, locate, frontmost_app, click}. Vision-locate usescreate_chat_provider("chat", …)and embeds the screenshot via the[IMAGE:<data-uri>]marker.vision_clickre-foregrounds the target once if it isn't frontmost and refuses if it still isn't — never clicking into a non-target window.None(can't determine) is best-effort since the loop already foregrounded the app at start.Design decisions (agreed before implementation): reuse the main vision model rather than the fast tier or a new config knob (fallback fires rarely, so latency is fine); fold the F2 coordinate-transform into this PR since safe clicks depend on it.
Submission Checklist
image_to_screen(downscale / Retina 2× / origin offset / out-of-range + negative clamp / zero-dim),parse_locate_response(found / not-found / fenced / prose / garbage), marker build, PNG-dims round-trip; loop integration via the scripted backend incl. the frontmost-refusal guard, not-found (no click), and empty-description skip. 25/25automate+vision_clicktests green.vision_clickloop dispatch are fully covered; the OS-boundRealBackendglue (capture/vision/click) and the native capture/click helpers are integration-only and exercised manually on macOS (screen-recording + AX dependent, not CI-runnable). Same untestable-native-glue shape as the rest of this stack.N/A: adds a fallback path within the existingautomatetool, no new feature row.chatprovider via the existing mock-backed factory.N/A: macOS AX/screen-dependent; not a release-cut surface.Impact
screencapture,foreground_context, main-thread enigo). All cross-platform-compiles:capture/foreground_contextreturn a clean runtime error /Noneoff-macOS; non-vision/headless backends opt out via the new trait defaults.automate's mutations gate + sensitive-app denylist. No new approval surface, no new tool, no new injected JS.Related
docs/voice-system-actions.mdChange 1.16.Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests