feat(accessibility): vision-click fallback for Electron/partial-AX apps (8/8 of #3307) #3362

Merged
M3gA-Mind merged 9 commits into tinyhumansai:main from M3gA-Mind:feat/voice-split-8-vision-fallback
Jun 4, 2026
Conversation

@M3gA-Mind M3gA-Mind commented Jun 4, 2026

Collaborator

Summary

  • Adds a model-chosen vision_click { description } action to the automate loop so it can drive apps that expose no usable accessibility tree — Electron/Chromium apps (Slack, Discord, VS Code) where the perceive→press loop had nothing to act on.
  • Flow: screenshot the app window → ask the main vision model for the target's pixel coordinates → map image pixels to absolute screen points → guarded left-click.
  • Folds in the deferred F2 coordinate fix: a pure image_to_screen transform whose px→pt ratio absorbs both the capture downscale and the Retina backing scale (no explicit scale factor needed).
  • §1.8 safety guard: only clicks when the target app is frontmost; refuses on positive evidence another app is focused, so synthetic input never lands on OpenHuman's own CEF window.
  • Reuses the existing multimodal path ([IMAGE:] marker → image part, #3205) and the main chat vision provider: no new inference API, no new tool, no new approval surface (inherits automate's Dangerous + mutations gate).
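The px→pt mapping described in the bullets above can be sketched as follows. This is a minimal illustration rather than the code in vision_click.rs: the field names on CaptureGeometry and the exact rounding are assumptions. The pixel is expressed as a fraction of the image size and projected onto the window rect in points, which is why no explicit Retina or downscale factor appears.

```rust
/// Window rect in absolute screen points plus the screenshot's pixel size.
/// Field names are assumptions made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CaptureGeometry {
    pub rect_x: f64,
    pub rect_y: f64,
    pub rect_w: f64,
    pub rect_h: f64,
    pub img_w: u32,
    pub img_h: u32,
}

/// Map an image pixel to an absolute screen point.
pub fn image_to_screen(geom: &CaptureGeometry, px: f64, py: f64) -> (i32, i32) {
    // Guard against a zero-sized image so the division is always defined.
    let img_w = geom.img_w.max(1) as f64;
    let img_h = geom.img_h.max(1) as f64;

    // Express the pixel as a 0..1 fraction of the image; this absorbs both
    // the capture downscale and the Retina backing scale.
    let fx = (px / img_w).clamp(0.0, 1.0);
    let fy = (py / img_h).clamp(0.0, 1.0);

    // Project the fraction onto the window rect, in points.
    let x = geom.rect_x + fx * geom.rect_w;
    let y = geom.rect_y + fy * geom.rect_h;

    // Keep the result inside the rect (rect_x..=rect_x + rect_w - 1).
    let max_x = geom.rect_x + (geom.rect_w - 1.0).max(0.0);
    let max_y = geom.rect_y + (geom.rect_h - 1.0).max(0.0);
    (
        x.clamp(geom.rect_x, max_x).round() as i32,
        y.clamp(geom.rect_y, max_y).round() as i32,
    )
}
```

For an 800×600 pt window at (100, 50) captured at 1600×1200 px, the image centre (800, 600) maps to the screen point (500, 350); the same window captured at 400×300 px maps its centre (200, 150) to the same point.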

This is the last open Phase 1.5 item of the voice→system-action feature; see docs/voice-system-actions.md Change 1.16.

Problem

automate's inner loop reads a filtered accessibility snapshot and presses elements by label. Electron/Chromium apps expose little or no AX/UIA, so perceive returns an empty list and the loop is stuck — there's no label to press. The planned answer (tracker §1.5) was screenshot → vision-locate → guarded click, blocked by two things: the fast inner-loop model is text-only, and the screenshot is windowed + downscaled (Retina 2×) while mouse expects absolute screen points (the deferred F2 mapping gap), so a vision-returned coordinate would click the wrong spot.

Solution

  • src/openhuman/accessibility/vision_click.rs (new): CaptureGeometry + pure image_to_screen (the coordinate transform), tolerant locate-response parser, capture_window_geometry, locate_via_vision, and the main-thread guarded_click (run_input_on_main, Change 1.15 — off-thread enigo traps TSM).
  • automate.rs: new Action.description, a vision_click system-prompt verb ("use when the element list is EMPTY"), the no-progress signature extended with description, and RealBackend::{screenshot, locate, frontmost_app, click}. Vision-locate uses create_chat_provider("chat", …) and embeds the screenshot via the [IMAGE:<data-uri>] marker.
  • Guard: a vision_click re-foregrounds the target once if it isn't frontmost and refuses if it still isn't, so it never clicks into a non-target window. When the frontmost app cannot be determined (None), the click proceeds best-effort, since the loop already foregrounded the app at start.
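The guard's decision logic can be sketched as a pure function. The names GuardDecision and frontmost_guard, the closure-based shape, and the case-insensitive name comparison are all assumptions for illustration; the PR wires this through RealBackend::frontmost_app and a re-foreground call instead.

```rust
/// Outcome of the frontmost-app check (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum GuardDecision {
    Proceed,
    Refuse,
}

/// `frontmost` reports the frontmost app name, or None when it cannot be
/// determined. `refocus` tries to bring the target app to the front.
pub fn frontmost_guard(
    target: &str,
    mut frontmost: impl FnMut() -> Option<String>,
    mut refocus: impl FnMut(),
) -> GuardDecision {
    match frontmost() {
        // Cannot tell: proceed best-effort, the loop foregrounded the app earlier.
        None => GuardDecision::Proceed,
        // Assumption: names are compared case-insensitively.
        Some(name) if name.eq_ignore_ascii_case(target) => GuardDecision::Proceed,
        Some(_) => {
            // Positive evidence of a mismatch: re-foreground once, then re-check.
            refocus();
            match frontmost() {
                Some(name) if !name.eq_ignore_ascii_case(target) => GuardDecision::Refuse,
                // Target is now frontmost, or the state is unknown (assumed best-effort).
                _ => GuardDecision::Proceed,
            }
        }
    }
}
```

Keeping the decision separate from the OS calls is what lets a scripted backend exercise the refusal path without a real window server.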

Design decisions (agreed before implementation): reuse the main vision model rather than the fast tier or a new config knob (fallback fires rarely, so latency is fine); fold the F2 coordinate-transform into this PR since safe clicks depend on it.
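To make the [IMAGE:<data-uri>] marker from the Solution bullets concrete, here is a hedged sketch of how a locate message might be assembled. Only the marker shape comes from the PR; the function signature and the instruction wording are assumptions, not the prompt that build_locate_user actually emits.

```rust
/// Build the user message for the vision-locate call.
/// The `[IMAGE:<data-uri>]` marker shape is from the PR description; the
/// instruction text that follows it is a placeholder for this sketch.
pub fn build_locate_user(description: &str, data_uri: &str) -> String {
    format!(
        "[IMAGE:{data_uri}]\nFind this element in the screenshot: {}",
        description.trim()
    )
}
```

Because the screenshot travels inside the message text, the existing multimodal path can turn the marker into an image part without a separate inference API.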

Submission Checklist

  • Tests added or updated (happy path + failure/edge) — 19 new unit tests: pure image_to_screen (downscale / Retina 2× / origin offset / out-of-range + negative clamp / zero-dim), parse_locate_response (found / not-found / fenced / prose / garbage), marker build, PNG-dims round-trip; loop integration via the scripted backend incl. the frontmost-refusal guard, not-found (no click), and empty-description skip. 25/25 automate + vision_click tests green.
  • Diff coverage ≥ 80% — pure transform/parse logic and the vision_click loop dispatch are fully covered; the OS-bound RealBackend glue (capture/vision/click) and the native capture/click helpers are integration-only and exercised manually on macOS (screen-recording + AX dependent, not CI-runnable). Same untestable-native-glue shape as the rest of this stack.
  • Coverage matrix — N/A: adds a fallback path within the existing automate tool, no new feature row.
  • No new external network dependencies — vision-locate rides the already-configured chat provider via the existing mock-backed factory.
  • Manual smoke checklist — N/A: macOS AX/screen-dependent; not a release-cut surface.
  • Linked issue — completes Phase 1.5 of feat: always-on voice command → system action (listen, understand, execute) #3148 (other phases remain, so not a closing keyword).

Impact

  • Desktop macOS for the live path (windowed screencapture, foreground_context, main-thread enigo). Everything still compiles cross-platform: off macOS, capture and foreground_context return a clean runtime error or None, and non-vision/headless backends opt out via the new trait defaults.
  • Opt-in: inherits automate's mutations gate + sensitive-app denylist. No new approval surface, no new tool, no new injected JS.
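The "opt out via trait defaults" idea can be illustrated with a simplified trait. This is a sketch under assumptions: the real AutomateBackend is likely async and has other required methods, and the signatures and error strings below are invented for illustration.

```rust
/// Simplified, synchronous stand-in for the backend trait. Every vision
/// hook has a default body, so existing backends keep compiling unchanged.
pub trait AutomateBackend {
    /// Returns a screenshot data URI for the app window (assumed shape).
    fn screenshot(&self, _app: &str) -> Result<String, String> {
        Err("screenshot is not supported by this backend".to_string())
    }

    /// Returns image-pixel coordinates, or None when the target is not found.
    fn locate(&self, _data_uri: &str, _description: &str) -> Result<Option<(f64, f64)>, String> {
        Err("vision locate is not supported by this backend".to_string())
    }

    /// None means "cannot determine", which the guard treats as best-effort.
    fn frontmost_app(&self) -> Option<String> {
        None
    }

    /// Absolute-screen-point click.
    fn click(&self, _x: i32, _y: i32) -> Result<(), String> {
        Err("click is not supported by this backend".to_string())
    }
}

/// A headless backend opts out simply by not overriding anything.
pub struct HeadlessBackend;

impl AutomateBackend for HeadlessBackend {}
```

With this shape, a vision_click on a headless backend fails with a clean error at the screenshot step instead of requiring platform-specific stubs.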

Related

Summary by CodeRabbit

  • New Features

    • Vision-based automation fallback for locating and clicking on screen elements when standard accessibility data is missing.
    • Frontmost-app safety guard and reliable coordinate mapping for safer visual clicks.
  • Bug Fixes

    • Resolved a computer-control crash affecting click dispatch.
  • Documentation

    • Updated feature tracker: multiple automation phases marked shipped; Music fast-path, progress streaming, and local routing noted complete.
  • Tests

    • Added tests covering vision-based locate/click behaviors and geometry parsing.

@M3gA-Mind M3gA-Mind requested a review from a team June 4, 2026 11:31
@M3gA-Mind M3gA-Mind marked this pull request as draft June 4, 2026 11:32
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Contributor

Walkthrough

This PR adds a vision-based fallback for automate: screenshot capture, vision-model locate parsing, pixel→screen mapping, main-thread guarded absolute clicks, integration into the automate loop with new Action.description and backend hooks, comprehensive tests, and feature-tracker documentation updates.

Changes

Vision-Click Accessibility Fallback

| Layer | File(s) | Summary |
|---|---|---|
| Vision-click data structures and coordinate mapping | src/openhuman/accessibility/vision_click.rs, src/openhuman/accessibility/mod.rs | CaptureGeometry encodes window rect and screenshot pixel dims. image_to_screen clamps pixel coords into screen space. image_dims_from_data_uri decodes PNG dims and parsing of locate responses tolerates surrounding text. |
| Screenshot capture, vision locate, and guarded click primitives | src/openhuman/accessibility/vision_click.rs | capture_window_geometry captures data URI + geometry. locate_via_vision builds prompts, calls provider, and parses JSON locate responses. guarded_click dispatches enigo absolute clicks on the main thread. |
| Automate loop action contract and vision-click dispatcher | src/openhuman/accessibility/automate.rs | Adds Action.description, expands AutomateBackend with screenshot/locate/frontmost_app/click defaults, updates fast-model prompt for vision_click, extends no-progress signature to include description, and handles vision_click with frontmost-app guard → screenshot → locate → click flow and step outcomes. |
| Test backend and vision-click scenario coverage | src/openhuman/accessibility/automate_tests.rs | Extends ScriptedBackend with frontmost and locate_coord, implements scripted vision methods and dummy_geom(), and adds async tests for frontmost matching, unknown frontmost, refusal when another app is frontmost, locate not-found, and empty-description skipping. |
| Vision-click coordinate and parsing unit tests | src/openhuman/accessibility/vision_click_tests.rs | Unit tests for image_to_screen transforms, parse_locate_response found/not-found/error cases and code-fence tolerance, build_locate_user content, and image_dims_from_data_uri using generated PNG data URIs. |
| Feature tracker status and vision-click documentation | docs/voice-system-actions.md | Refreshes header metadata and updates phases: automate M1–M5 shipped, Phase 1.5 vision fallback (vision_click) marked complete with main-thread click fix and frontmost-app guard, Phase 3 routing wired, and consolidated checklist updated. |
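As an illustration of what reading PNG dimensions out of a data URI involves, here is a dependency-free sketch. It is not the PR's image_dims_from_data_uri: the accepted media-type prefix and the hand-rolled base64 step are assumptions, and a real implementation would more likely use a base64 or image crate. It relies only on the PNG layout: an 8-byte signature, then the IHDR chunk whose width and height are big-endian u32 values at byte offsets 16 and 20.

```rust
/// Decode one base64 alphabet character to its 6-bit value.
fn b64_val(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some((c - b'A') as u32),
        b'a'..=b'z' => Some((c - b'a') as u32 + 26),
        b'0'..=b'9' => Some((c - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Read (width, height) from a PNG data URI, or None if it is not one.
pub fn image_dims_from_data_uri(uri: &str) -> Option<(u32, u32)> {
    // Assumption: only this exact prefix is accepted in the sketch.
    let payload = uri.strip_prefix("data:image/png;base64,")?;
    // 32 base64 chars decode to the first 24 bytes: signature + IHDR header.
    let head = payload.as_bytes().get(..32)?;

    let mut bytes = [0u8; 24];
    for (i, quad) in head.chunks(4).enumerate() {
        let mut acc = 0u32;
        for &c in quad {
            acc = (acc << 6) | b64_val(c)?;
        }
        bytes[i * 3] = (acc >> 16) as u8;
        bytes[i * 3 + 1] = (acc >> 8) as u8;
        bytes[i * 3 + 2] = acc as u8;
    }

    const SIG: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if bytes[..8] != SIG || bytes[12..16] != *b"IHDR" {
        return None;
    }
    let w = u32::from_be_bytes([bytes[16], bytes[17], bytes[18], bytes[19]]);
    let h = u32::from_be_bytes([bytes[20], bytes[21], bytes[22], bytes[23]]);
    Some((w, h))
}
```

These pixel dimensions are what the coordinate transform divides by, so they must come from the image actually sent to the model rather than from the window size.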

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 A vision of clicks so clear and true,
From screenshots mapped through models new,
The fallback path when AX won't play,
Makes mouse moves work the vision way! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main feature being added: a vision-click fallback for Electron/partial-AX apps, and contextualizes it as the final part of a multi-PR series (#3307). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@M3gA-Mind

Collaborator Author

📚 Stacked PR series (8 total) — split from #3307

Merge bottom-up; each builds on the one above it:

  1. #3340 feat(computer): main-thread synthetic-input executor + CEF crash fix (1/8 of #3307)
  2. #3341 feat(accessibility): AX/UIA perception + automate engine (2/8 of #3307)
  3. #3342 feat(agent): wire automate/ax_interact computer tools (3/8 of #3307)
  4. #3343 feat(voice): Phase 2 always-on listening engine + RPC (4/8 of #3307)
  5. #3344 feat(voice): always-on Settings toggle + debug panel + i18n (5/8 of #3307)
  6. #3345 feat(notch): always-visible macOS notch status pill (6/8 of #3307)
  7. #3346 feat(voice): Phase 3 fast command router (7/8 of #3307)
  8. #3362 feat(accessibility): vision-click fallback for Electron/partial-AX apps (8/8 of #3307; Phase 1.5 complete)

Tracker: docs/voice-system-actions.md.

M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jun 4, 2026
M3gA-Mind added 7 commits June 4, 2026 18:45
…trator

Registers the AutomateTool (multi-step UI flows in one call) and the
ax_interact denylist/opt-in plumbing; adds the catalog toggle, tool
definition, and orchestrator prompt guidance (automate + screenshot/
mouse/keyboard fallback for Electron apps with empty AX trees).

Slice 3/7 of tinyhumansai#3307 (tool wiring + prompts).

Continuous cpal mic → VAD segmenter → STT → agent with no hotkey, opt-in
via voice_server.always_on_enabled, 'Hey Tiny' wake word (English-forced
STT + fuzzy match), and screen-lock privacy pause. Adds the config schema,
live-apply on the settings RPC, start_if_enabled wiring, and a JSON-RPC
roundtrip E2E.

Slice 4/7 of tinyhumansai#3307 (always-on core).

Surfaces the always-on listening toggle in the reachable Voice panel,
adds the VoiceDebugPanel, the voice tauri-command wrapper, and the RPC
client method. Adds all voice.debug.* and notch.* i18n keys across the
14 locales (notch keys land here as inert strings; the notch UI that
consumes them ships in slice 6).

Slice 5/7 of tinyhumansai#3307 (always-on frontend).

Transparent NSPanel + WKWebView anchored at the top-centre of the primary
screen showing live Ready/Listening/Processing state; automate streams
step progress to it via the overlay:attention socket bridge. macOS only;
no-op elsewhere.

Slice 6/7 of tinyhumansai#3307 (notch status pill).

Routes always-on utterances through a fast intent classifier before the
chat model, wired into always-on delivery; ties the notch indicator
visibility to always-on listening. Adds the window tauri-command wrapper
and the core-process permission entry.

Slice 7/7 of tinyhumansai#3307 (Phase 3 fast routing).

…ps (Phase 1.5)

Adds a model-chosen `vision_click { description }` action to the `automate`
loop for apps that expose no usable accessibility tree (Slack, Discord,
VS Code). Flow: screenshot the app window -> ask the main vision model for the
target's pixel coordinates (via the existing `[IMAGE:]` marker path) -> map
image pixels to absolute screen points -> guarded left-click.

- New `accessibility/vision_click.rs`: pure `image_to_screen` coordinate
  transform (folds in the deferred F2 mapping -- the px->pt ratio absorbs the
  capture downscale + Retina backing scale), tolerant locate-response parser,
  capture geometry, and the main-thread guarded click (`run_input_on_main`,
  Change 1.15).
- Section 1.8 safety guard: only clicks when the target app is frontmost;
  refuses on positive evidence another app is focused, so synthetic input never
  lands on OpenHuman's own CEF window.
- Reuses the main `chat` vision provider -- no new inference API, no new tool,
  no new approval surface (inherits `automate`'s Dangerous + mutations gate).
- 19 new unit tests (pure transform/parse + scripted-backend loop integration,
  incl. the frontmost-refusal guard). All 25 automate + vision_click tests green.

Closes the last open Phase 1.5 item (tinyhumansai#3148). Stacks on tinyhumansai#3340-tinyhumansai#3346.
@M3gA-Mind M3gA-Mind force-pushed the feat/voice-split-8-vision-fallback branch from 919b3d1 to 8828ce2 Compare June 4, 2026 13:17
Take main's merged slice-1..7 versions; keep slice-8's vision_click work
(automate vision verb + accessibility/vision_click.rs) and the forward
Phase 1.5/Change 1.16 docs. Drop the duplicated desktop-control prompt
section + the spurious agent.toml re-add (vision_click is an automate
action, not a named tool).
@M3gA-Mind M3gA-Mind marked this pull request as ready for review June 4, 2026 18:49
@M3gA-Mind

Collaborator Author

Independent review (beyond the CodeRabbit pass)

Reviewed the vision-click fallback — the image_to_screen transform (the F2 fix), parse_locate_response, the vision_click loop verb, and the §1.8 frontmost guard.

Reviewed clean

  • Coordinate transform: image_to_screen expresses the sampled pixel as a 0..1 fraction of the image dims and maps it onto the window rect in points, so the px→pt ratio absorbs both the capture downscale and the Retina backing scale with no explicit scale factor. Clamps the result strictly inside the rect (rect_x..=rect_x+rect_w-1), and guards div-by-zero (img_w.max(1)) + negatives. Exhaustively covered (downscale / Retina 2× / origin offset / out-of-range / negative / zero-dim).
  • §1.8 safety guard: vision_click only clicks when the target app is frontmost. It re-foregrounds once on a mismatch, re-checks, and refuses (no click) on positive evidence another app is focused; None (can't tell) proceeds best-effort since the loop already foregrounded. So synthetic input can never land on OpenHuman's own CEF window. Covered by the frontmost-refusal integration test.
  • Locate parser: tolerant (raw JSON → first {..} span → found flag → coords or None; garbage → Err), so an unparseable vision response fails safe rather than clicking a guess.
  • Reuse: rides the existing chat vision provider + [IMAGE:] multimodal marker (#3205); no new tool / inference API / approval surface (inherits automate's Dangerous + mutations gate + sensitive-app denylist).
  • Cross-platform: native capture/click/frontmost are macOS-gated; non-macOS returns clean errors / None.
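The tolerant locate parser reviewed above can be sketched without external crates. This is a simplified stand-in, not the PR's parse_locate_response: the JSON field names ("found", "x", "y"), the Located type, and the hand-rolled flat-object field extraction are assumptions, and production code would normally use a real JSON parser.

```rust
/// Image-pixel coordinates returned by the vision model (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Located {
    pub x: f64,
    pub y: f64,
}

/// Pull the raw text of a top-level field out of a flat JSON object body.
/// Only handles flat objects with scalar values, which is all this sketch needs.
fn field<'a>(body: &'a str, key: &str) -> Option<&'a str> {
    let needle = format!("\"{key}\"");
    let after_key = &body[body.find(&needle)? + needle.len()..];
    let value = after_key.trim_start().strip_prefix(':')?;
    Some(value.split(',').next()?.trim())
}

fn number(body: &str, key: &str) -> Result<f64, String> {
    field(body, key)
        .and_then(|v| v.parse::<f64>().ok())
        .ok_or_else(|| format!("missing or non-numeric `{key}`"))
}

/// Ok(Some(..)) when found, Ok(None) when the model reports not-found,
/// Err on anything unparseable so the caller never clicks a guess.
pub fn parse_locate_response(raw: &str) -> Result<Option<Located>, String> {
    // Tolerate prose or code fences around the object: take the first {..} span.
    let start = raw.find('{').ok_or("no JSON object in response")?;
    let end = raw[start..]
        .find('}')
        .map(|i| start + i)
        .ok_or("unterminated JSON object")?;
    let body = &raw[start + 1..end];

    match field(body, "found").ok_or("missing `found` flag")? {
        "true" => Ok(Some(Located {
            x: number(body, "x")?,
            y: number(body, "y")?,
        })),
        "false" => Ok(None),
        other => Err(format!("unexpected `found` value: {other}")),
    }
}
```

Separating "not found" (Ok(None)) from "unparseable" (Err) lets the loop record a clean no-click step in the first case and fail safe in the second.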

The OS-bound RealBackend glue (screencapture / osascript / enigo) is integration-only (not CI-runnable) — same untestable-native shape as the rest of the stack; the pure transform/parse/loop-dispatch logic is fully unit-tested (16 vision_click + 20 automate green).

No correctness issues. LGTM once CI is green.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/openhuman/accessibility/automate_tests.rs (1)

356-376: 💤 Low value

Test assertion for re-foreground attempt is missing.

The test verifies that no screenshot or click happens when another app is frontmost, which is correct. However, it doesn't verify that the implementation attempted to re-foreground the target app (via act_launch) before refusing. This would confirm the full guard flow: detect mismatch → re-foreground → check again → refuse.

💡 Optional: assert re-foreground attempt
     let acts = backend.acts();
     assert!(
         !acts.iter().any(|a| a.starts_with("click:")),
         "must not click into a non-target app: {acts:?}"
     );
     assert!(
         !acts.iter().any(|a| a.starts_with("screenshot:")),
         "must not even screenshot when refused: {acts:?}"
     );
+    // The guard should have attempted to re-foreground before refusing.
+    assert!(
+        acts.iter().filter(|a| *a == "launch:Slack").count() >= 2,
+        "expected re-foreground attempt after initial launch: {acts:?}"
+    );
     let _ = out;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/accessibility/automate_tests.rs` around lines 356 - 376, Add an
assertion in the vision_click_refused_when_other_app_frontmost test to verify
the implementation attempted to re-foreground the target app before refusing;
after collecting acts from ScriptedBackend::acts() assert that there's an entry
indicating a launch/re-foreground attempt (e.g. an act that starts with
"act_launch:" or "launch:") so the sequence is detected (detect mismatch →
re-foreground attempt via act_launch → final refusal with no screenshot/click).
src/openhuman/accessibility/automate.rs (1)

458-520: 💤 Low value

Consider adding debug-level entry/exit logging for the vision_click handler.

The vision_click handler logs warnings for edge cases (re-foreground) but lacks debug-level logging at the handler entry point. This would help with grep-friendly tracing during development and debugging. As per coding guidelines for src/**/*.rs: "Add substantial debug-level logs... at entry/exit points, branch decisions".

🔧 Optional: add entry debug log
         "vision_click" => {
             let description = action.description.trim();
+            log::debug!(
+                "{LOG_PREFIX} vision_click: app={target_app:?} description={description:?}"
+            );
             if description.is_empty() {
                 steps.push("vision_click skipped: empty description".to_string());
                 continue;
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/accessibility/automate.rs` around lines 458 - 520, Add
debug-level entry/exit and key-branch logs for the "vision_click" match arm: at
the start of the "vision_click" handler log a debug message (using log::debug!
and LOG_PREFIX) that includes target_app and trimmed description, and on exit
log success/failure with the final step reason; also add debug logs before/after
the frontmost_app check (including when re-foregrounding), before/after
backend.screenshot(), before/after backend.locate() (including locate None vs
Some), and before/after backend.click() so the flow through
backend.frontmost_app(), backend.act_launch(), backend.screenshot(),
backend.locate(), and backend.click() is traceable for grep-friendly debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/voice-system-actions.md`:
- Around line 365-372: Update the contradictory section to state the blocker is
resolved: replace the present-tense "Fix required (not yet done)" paragraph
about enigo/TSMGetInputSourceProperty with a past-tense note that the issue was
fixed by dispatching enigo calls to the Tauri main thread via the new
run_input_on_main helper (implemented in main_thread.rs and exposed through the
native-registry handler), and mention that keyboard/mouse tools and vision_click
now run safely without TSM traps.

---

Nitpick comments:
In `@src/openhuman/accessibility/automate_tests.rs`:
- Around line 356-376: Add an assertion in the
vision_click_refused_when_other_app_frontmost test to verify the implementation
attempted to re-foreground the target app before refusing; after collecting acts
from ScriptedBackend::acts() assert that there's an entry indicating a
launch/re-foreground attempt (e.g. an act that starts with "act_launch:" or
"launch:") so the sequence is detected (detect mismatch → re-foreground attempt
via act_launch → final refusal with no screenshot/click).

In `@src/openhuman/accessibility/automate.rs`:
- Around line 458-520: Add debug-level entry/exit and key-branch logs for the
"vision_click" match arm: at the start of the "vision_click" handler log a debug
message (using log::debug! and LOG_PREFIX) that includes target_app and trimmed
description, and on exit log success/failure with the final step reason; also
add debug logs before/after the frontmost_app check (including when
re-foregrounding), before/after backend.screenshot(), before/after
backend.locate() (including locate None vs Some), and before/after
backend.click() so the flow through backend.frontmost_app(),
backend.act_launch(), backend.screenshot(), backend.locate(), and
backend.click() is traceable for grep-friendly debugging.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38870e05-56d2-45a2-825c-10a924d267c7

📥 Commits

Reviewing files that changed from the base of the PR and between 769e8ef and d1a5aad.

📒 Files selected for processing (6)
  • docs/voice-system-actions.md
  • src/openhuman/accessibility/automate.rs
  • src/openhuman/accessibility/automate_tests.rs
  • src/openhuman/accessibility/mod.rs
  • src/openhuman/accessibility/vision_click.rs
  • src/openhuman/accessibility/vision_click_tests.rs

Comment thread docs/voice-system-actions.md Outdated
Re-apply the tinyhumansai#3346 reconciliation lost when taking slice-8's docs: the
'Fix required (not yet done) / keep disabled' paragraph contradicted the
'✅ Crash fixed' status. Now past-tense root-cause-fixed (run_input_on_main
on the main thread + catch_unwind); covers vision_click too. Tag the trace
fence as text (MD040).
@M3gA-Mind M3gA-Mind merged commit 3338582 into tinyhumansai:main Jun 4, 2026
23 of 25 checks passed
senamakel pushed a commit to senamakel/openhuman that referenced this pull request Jun 6, 2026