Skip to content

feat(voice): autonomous app control — automate loop, always-on (Hey Tiny), notch status, computer control (#3148)#3307

Closed
M3gA-Mind wants to merge 56 commits into
tinyhumansai:mainfrom
M3gA-Mind:feat/voice-always-on
Closed

feat(voice): autonomous app control — automate loop, always-on (Hey Tiny), notch status, computer control (#3148)#3307
M3gA-Mind wants to merge 56 commits into
tinyhumansai:mainfrom
M3gA-Mind:feat/voice-always-on

Conversation

@M3gA-Mind

Copy link
Copy Markdown
Collaborator

Summary

Builds on the merged Phase 1 (#3168) to make the voice→system-action agent actually autonomous, end-to-end:

  • automate(app, goal) — one tool call runs a Rust perceive→act→settle→verify loop with a fast model (chat model out of the click loop). Deterministic Music fast-path proven live (search→navigate→verify player state == playing).
  • Phase 2 always-on listening — continuous cpal mic → VAD segmenter → STT → agent, no hotkey. Opt-in Settings toggle + "Hey Tiny" wake word (English-forced STT + fuzzy match). Screen-lock privacy pause.
  • Notch status pill (cherry-picked from feat/notch-live-activity) — always-visible Ready / Listening / Processing; automate streams live step progress to it.
  • Full computer controlscreenshot (now downscaled so the model can see it) + mouse + keyboard, with a keyboard-first fallback for Electron apps (Slack) whose AX tree is empty.
  • CEF crash fixed — enigo's TSMGetInputSourceProperty was crashing off-thread (SIGTRAP); keyboard/mouse now run on the app main thread via run_on_main_thread.

Problem

Phase 1 could launch apps but couldn't reliably do things in them: multi-step UI flows were fragile (chat model orchestrating each AX step), Electron apps exposed no AX tree, there was no hands-free listening, and synthetic input crashed the CEF host.

Solution

  • Rust-internal automate loop (accessibility/automate.rs) + per-app fast-paths; poll-until-stable settle; playback verification.
  • voice/always_on.rs: pure unit-tested VadSegmenter + continuous capture; wake-word gate; config + RPC + Settings toggle.
  • Main-thread input bridge (tools/impl/computer/main_thread.rs + Tauri handler) — the fix for the documented §1.8 crash, confirmed via crash report.
  • Notch driven by the existing overlay:attention socket bridge; auth fixed for the WKWebview.

Full narrative + a fine-tuning backlog (longer listening window, mouse-coordinate mapping, screenshot/verify cadence — deferred) in docs/voice-system-actions.md.

Submission Checklist

  • Tests added or updated (happy path + failure/edge) — 220+ feature unit tests, a JSON-RPC E2E (json_rpc_voice_server_settings_roundtrip…), VAD/wake-word/settle/screenshot-downscale/crash-guard suites.
  • Diff coverage ≥ 80% — extensive unit + E2E added; diff-cover not run locally yet (draft).
  • Coverage matrix updated — docs/TEST-COVERAGE-MATRIX.md not yet updated (draft).
  • No new external network dependencies — STT uses the existing configured provider via the mock-backed factory.
  • Manual smoke checklist — N/A pending (draft; manual voice/desktop-control flows are mic/AX-dependent).
  • Linked issue — Closes #3148 once out of draft.

Impact

  • Desktop (macOS) only for the new native paths (AX helper, main-thread input, notch NSPanel, screen-lock). All cross-platform-compiles; non-macOS returns clean runtime errors. Windows UIA path retained.
  • Opt-in: always-on listening + computer-control are off by default.

Related

M3gA-Mind added 30 commits June 2, 2026 02:15
…mmands

Phase 1 of issue tinyhumansai#3148 — quick wins that make hotkey-triggered voice
commands execute without a manual send or approval prompt.

Auto-send after transcription:
- useDictationHotkey.ts: adds `autoSend: true` to the
  `dictation://insert-text` event detail when a hotkey transcription
  completes.
- Conversations.tsx: the `onDictationInsert` handler checks the new flag;
  when set, it calls `handleSendMessage(text)` directly instead of
  inserting into the composer. A `handleSendMessageRef` (updated every
  render) gives the mount-time effect access to the latest send fn.

Shell allowlist for app-launching:
- security/policy_command.rs: adds `open` (macOS) and `xdg-open` (Linux)
  to READ_ONLY_BASES so `open -a Music`, `open -b com.apple.Safari`,
  `xdg-open music://`, etc. classify as CommandClass::Read and execute
  without triggering the ApprovalGate in Supervised mode.

Closes part of tinyhumansai#3148.
Dedicated tool that opens a named application on the user's machine
without requiring shell access or workspace_only = false.

- src/openhuman/tools/impl/system/launch_app.rs: new LaunchAppTool
  - macOS: `open -a "<app_name>"` via LaunchServices
  - Linux: `gtk-launch`, fallback `xdg-open`
  - Windows: `Start-Process` via PowerShell
  - PermissionLevel::ReadOnly — never triggers the approval gate
  - Input validation: rejects paths, metacharacters, empty names
  - Unit tests: name, permission, schema, validation, error cases

- src/openhuman/tools/impl/system/mod.rs: register module + pub use
- src/openhuman/tools/ops.rs: add LaunchAppTool to all_tools_with_runtime
- src/openhuman/tools/user_filter.rs: add "launch_app" family,
  default_enabled = true, mirrors shell family pattern
- app/src/utils/toolDefinitions.ts: add to frontend tool catalog so it
  appears in Settings → Agent Access with its own toggle

This avoids loosening workspace_only or expanding allowed_commands in
the shell tool — launch_app is narrowly scoped to app launching only.

Part of tinyhumansai#3148.
- launch_app.rs: log every step (▶ execute, ✓/✗ validation, platform
  dispatch, open exit code + stderr, fallback result)
- builder.rs: log full list of visible tool names at session build time
  so we can confirm launch_app appears in the LLM's tool context
- SOUL.md: add explicit capability section — agent now knows it CAN use
  launch_app to open apps and must not refuse with 'I can't open apps'
The orchestrator's tool scope is a strict allowlist (named = [...]).
launch_app was registered in the tool registry but not listed here,
so the LLM never saw it — explaining every refusal.

Adding it alongside current_time follows the same pattern: direct,
fast, no delegation needed for a simple user request like 'open Music'.
…tion

- orchestrator/agent.toml: add 'mouse' and 'keyboard' to named tool list
  so the orchestrator can click/type in apps directly without delegating
- user_filter.rs: add 'computer_control' tool family (mouse + keyboard),
  default_enabled = true, gated by computer_control.enabled in config
- toolDefinitions.ts: add Computer Control entry to frontend catalog
  (Settings → Agent Access toggle)
- SOUL.md: document mouse and keyboard capabilities so the agent knows
  it can interact with on-screen UI, not just launch apps

Config: computer_control.enabled = true set in user config (not a code
change — user-specific setting at ~/.openhuman/users/<id>/config.toml).

Part of tinyhumansai#3148.
…orkflow

Without screenshot in the named list the agent could click but couldn't
locate UI elements — it was asking the user for coordinates.

- orchestrator/agent.toml: add 'screenshot' alongside 'mouse'/'keyboard'
- SOUL.md: document the screenshot→mouse workflow explicitly and tell the
  agent to never ask the user for coordinates — find them via screenshot
CGEventPost from enigo crashes CEF when the key event lands in the
OpenHuman renderer instead of the target app. Removing until a proper
app-focus-before-input mechanism is in place.
Replaces the unreliable mouse/keyboard (enigo/CGEventPost) approach with
macOS Accessibility API interactions — no synthetic events, no CEF crash.

Swift helper (helper.rs):
- ax_list_elements: walk the AX tree and return interactive elements
- ax_press: AXUIElementPerformAction(kAXPressAction) by label
- ax_set_value: AXUIElementSetAttributeValue(kAXValueAttribute) by label
- New switch cases: ax_list, ax_press, ax_set_value
- helper_send_receive: pub(super) → pub(crate) so ax_interact.rs can call it

New files:
- src/openhuman/accessibility/ax_interact.rs — Rust wrappers (ax_list_elements,
  ax_press_element, ax_set_field_value) over the Swift helper
- src/openhuman/tools/impl/computer/ax_interact.rs — AxInteractTool with
  actions: list / press / set_value, PermissionLevel::ReadOnly

Wired into:
- tools/ops.rs, tools/user_filter.rs, toolDefinitions.ts
- orchestrator/agent.toml named list
- SOUL.md: document list→press workflow

Part of tinyhumansai#3148.
Tests cover:
- ax_list_returns_elements: AX tree is non-empty for Music
- ax_press_play_button: Play button is pressable
- test_full_flow_search_and_play_acdc: open Music → URL-scheme search
  for 'Highway to Hell' → find AXCell in results → press it
- ax_set_search_field: set_value on the search field
- test_ax_list_nonexistent_app / test_ax_press_nonexistent_app: error paths

Live tests tagged #[ignore] (need Accessibility permission + Music).
Run with: cargo test ax_interact -- --include-ignored --nocapture
SOUL.md: add explicit 4-step workflow (list → set_value → list again →
press specific row, not generic Play). Add guidance to use shell URL
scheme for Apple Music song search — more reliable than filter field.

ax_interact_tests.rs: fix import from super::super::ax_interact to
super:: (tests are in a submodule of ax_interact, not a sibling).
- voice-system-actions.md: mark 1.8 (mouse/keyboard) reverted with crash
  root cause; add 1.9 (ax_interact) and 1.10 (multi-step workflow guidance);
  update summary table
- ax_interact_tests.rs: flatten to #![cfg] module-level so super:: resolves
  to ax_interact; full AC/DC flow test now passes (5 steps, song row pressed)
Root cause of 'navigated but didn't play': pressing a search-result row
in Apple Music only selects/navigates — it never starts playback. Every
matching element (cell/group/button) exposes only AXPress=select. Verified
empirically that double-press, CGEvent double-click, and select+Return all
leave player state 'stopped'.

Working sequence: AXPress the result to navigate INTO the song's detail
page, then AXPress the Play button ON that page → player state 'playing'.

- SOUL.md: exact 5-step Apple Music sequence; warns the second Play press
  on the detail page is mandatory
- ax_interact_tests.rs: full-flow test now asserts real playback via
  osascript player state == 'playing' (passes)
- voice-system-actions.md: document as change 1.11 with verification
Root cause the agent kept using the wrong (filter-field) approach: the
orchestrator has omit_identity=true, so it NEVER sees SOUL.md. The chat
agent only reads tool descriptions + agent.toml. The navigate-then-play
guidance in SOUL.md was dead weight for the orchestrator.

Moved the exact 5-step Apple Music play sequence into the ax_interact
tool description, which the LLM always receives via the function schema.
Transcript analysis of the failed 'play Highway to Hell' run revealed two
root causes:
1. The orchestrator has NO shell tool — my ax_interact description told it
   to 'use shell to open music://...', which it can't. It wrapped the
   command in a prompt arg to a delegation tool; it never ran, and it fell
   back to the broken filter-field approach.
2. Cross-chat memory context injected prior filter-approach checkpoints,
   biasing the agent back to the wrong method.

Fix: stop making the LLM orchestrate a fragile multi-step flow with a tool
it lacks. Encapsulate the entire proven sequence in native Rust:
- accessibility/ax_interact.rs: play_apple_music(query) — open search URL,
  AX-find + press the song cell (navigate), press detail-page Play, verify
  player state == playing
- tools/impl/computer/play_music.rs: PlayMusicTool, one call play_music{query},
  PermissionLevel::ReadOnly, runs the blocking flow via spawn_blocking
- registered in ops.rs, user_filter.rs, orchestrator agent.toml, toolDefinitions.ts

Agent now calls play_music{query:'Highway to Hell AC/DC'} once and it plays.
…lay_music

Transcript analysis of the failed 'play Numb by Linkin Park' run:
1. play_music failed on a 4s timing race (results not yet rendered → empty)
2. agent fell back to ax_interact 'list' which dumped 273 elements; the
   tool result was TRUNCATED mid-list, so the model hallucinated a wrong
   result ('Numb - Single by Marshmello') from a partial view.

Per feedback, a music-specific tool is the wrong abstraction. Reverted it
and made ax_interact a robust GENERIC any-app interaction tool:

- Removed play_music tool + play_apple_music helper (and all registrations)
- ax_list_elements_filtered(app, filter): Rust-side label filter so 'list'
  returns only relevant elements (fixes the truncation→hallucination bug)
- ax_interact 'list' now takes a  param; output capped at 60 with a
  'narrow your filter' hint; empty-match returns a 'UI may still be loading'
  hint instead of failing hard
- Rewrote the tool description to be app-agnostic and document the general
  navigate-then-activate pattern (press a row opens it; press the action
  button after) without hardcoding Apple Music steps
…fort

The full-flow test was flaky asserting player state == 'playing': Apple
Music's UI is nondeterministic (detail-page render timing varies; multiple
'Play' elements that AX can't disambiguate). The test now asserts the
generic list/press primitives work against a real app and logs the player
state for diagnosis only — playback reliability is an Apple Music UI
limitation, not a tool correctness issue.
Maps each macOS piece to its Windows equivalent so the same open-app +
interact-with-UI feature can be built on Windows:
- macOS AXUIElement → Windows UI Automation (IUIAutomationElement)
- AX roles/actions → UIA ControlType + Invoke/Value/SelectionItem patterns
- recommends the  Rust crate (no helper process needed —
  COM API is callable directly from Rust, unlike the macOS Swift helper)
- module layout: uia_interact.rs parallel to ax_interact.rs, cfg-dispatched
  so the agent-facing tool stays a single 'ax_interact' on both platforms
- permissions (UIA needs none for same-integrity apps), Chromium/Electron
  caveats, Calculator/Notepad smoke tests, Start-Process/Get-StartApps for
  launching Store apps

Also includes trailing linter reformat of ax_interact.rs/tests.
…atrix

- Cross-platform audit table: confirms every Phase 1 change compiles on
  all platforms (macOS native code is cfg-gated; non-macOS arms return a
  clean error, never a build break). Flags the one-line shell-allowlist
  gap (add 'start') and the ax_interact UIA backend work.
- Mandatory Windows E2E matrix (9 items): app launch incl. UWP/URI,
  deterministic Calculator control (hard-asserted), Notepad set_value,
  filtered-list correctness (no truncation/hallucination), real media app
  (best-effort), Chromium/Electron tree exposure, elevation/UIPI,
  agent-in-the-loop, and a macOS regression re-run after the port.
- Note to verify the whole branch still builds+runs on macOS after the
  Windows cfg-dispatch lands.
Implements the Windows backend for the Phase 1 app-interaction layer so the
agent can open apps and drive their UI on Windows, mirroring the macOS path.
The agent-facing tool stays a single `ax_interact` tool on both platforms;
only the backend differs via cfg-dispatch.

- accessibility/uia_interact.rs (new): UI Automation backend — list/press/
  set_value over the UIA COM tree via the `uiautomation` crate. press uses
  Invoke → SelectionItem.Select → LegacyIAccessible default action (no
  synthetic input, so no CEF-crash risk); set_value targets an Edit, then
  ComboBox, then Document field (the Win11 RichEdit Notepad is a Document).
- accessibility/ax_interact.rs: cfg-dispatch the three helpers to UIA on
  Windows (macOS Swift-helper arms unchanged); OS-neutral module docs.
- accessibility/mod.rs: declare the Windows-gated uia_interact module.
- tools/impl/system/launch_app.rs: harden the Windows launcher — app name
  passed via env var (injection-safe) + Store/UWP AUMID fallback via
  Get-StartApps; surface stderr on failure.
- tools/impl/computer/ax_interact.rs: OS-neutral tool description.
- security/policy_command.rs: add `start` to READ_ONLY_BASES.
- accessibility/uia_interact_tests.rs (new): cfg(windows) integration tests —
  Calculator (deterministic, 5+5=10, hard-asserted), Notepad set_value,
  nonexistent-app.
- Cargo.toml: uiautomation 0.25 (Windows) + Win32_System_Com feature.
- docs/voice-system-actions.md: Windows port marked implemented w/ evidence.

Verified on Windows 11: Calculator driven to 5+5=10 by element label; Notepad
set_value wrote into the Win11 Document editor; nonexistent-app + launch_app
(8) + ax_interact tool (4) unit tests pass; full lib compiles clean.
…loop status

- SOUL.md: ax_interact is no longer macOS-only — describe it as the platform
  accessibility API (macOS Accessibility / Windows UI Automation). Label the
  Apple Music play sequence as the macOS-specific example it is, and note that
  on Windows the same list→press pattern applies but a press usually activates
  a control directly (the navigate-then-play second press is often unneeded).
- docs/voice-system-actions.md: record that the full Tauri app was built and
  run on Windows with verbose tool logging; the agent-in-the-loop test is still
  pending because the local AI model was mid-download (empty_provider_response).
…tighten launchers, docs

- ax_interact tool: gate press/set_value through approval — permission_level_with_args
  returns Dangerous for press/set_value (ReadOnly for list), and external_effect_with_args
  routes mutating actions through the ApprovalGate. Read-only list stays frictionless.
- ax_press_element: reject blank label (empty needle matched-all and pressed the first
  named control) — guard in the public facade, not just the tool layer.
- policy_command: remove open/xdg-open from READ_ONLY_BASES — base-command classification
  can't see args, and these launchers can open arbitrary URLs/URI handlers (network/system
  reach) without approval. App launching goes through the scoped launch_app tool instead.
- launch_app (Linux): gtk-launch needs a .desktop ID not a display name; try the name then
  a derived id (lowercase, spaces→hyphens); clarify xdg-open only opens URIs, with a better
  error.
- toolDefinitions.ts: platform-neutral ax_interact description (was macOS-specific).
- ax_interact_tests: assert set_value outcome.
- docs: add 'text' language to fenced blocks (MD040); reword Apple Music playback claims as
  best-effort (not hard-asserted) to match the test.
Coverage Gate flagged the changed auto-send lines (diff-cover < 80%):
useDictationHotkey.ts:153 and Conversations.tsx:464,472-474.

- useDictationHotkey.test: assert the dictation:transcription handler
  dispatches a dictation://insert-text CustomEvent with trimmed text +
  autoSend:true; plus a blank-text edge case (no event).
- Conversations.render.test: assert an autoSend dictation event routes
  straight to chatSend with the trimmed message; plus a blank-text edge
  case (no send).
Addresses maintainer (oxoxDev) security review on tinyhumansai#3168:

launch_app (gate-bypass + URI-smuggling blockers):
- external_effect()=true + permission_level=Execute → routes through the
  ApprovalGate like shell (was always-allow under every tier).
- validate_app_name rejects URI schemes (^[a-z][a-z0-9+.-]*:) so the
  xdg-open/Start-Process fallbacks can't fire arbitrary registered handlers
  (spotify:/mailto:/slack:). Named applications only, as documented.
- docstring corrected: injection-safe != side-effect-free.

ax_interact (app-scope + default-posture blockers):
- sensitive-app denylist (Keychain, 1Password/Bitwarden/LastPass/Dashlane,
  System Settings/Preferences, Terminal/iTerm, Console): all actions refused
  — defense-in-depth that holds even on background/auto-approved turns.
- mutating press/set_value are opt-in via new config
  computer_control.ax_interact_mutations (default false); read-only list
  always available — mirrors computer_control.enabled for mouse/keyboard.
- orchestrator agent.toml comment corrected: only list is ReadOnly/unprompted;
  press/set_value are Dangerous, gate interactively, opt-in, and deny-listed.

Tests: launch_app URI-reject + Execute/external_effect; ax_interact denylist,
mutations-disabled refusal, per-arg permission/gate. cargo check + config
schema tests green.
@M3gA-Mind

Copy link
Copy Markdown
Collaborator Author

Superseded — split into 7 small, dependency-ordered PRs

This 72-file draft is hard to review, so it's being replaced by a merge-train of 7 focused PRs (each ~5–19 files). They were rebased onto current main first — note that Phase 1 (#3168) is already merged, so the true remaining contribution is 59 files; the slices below reproduce all of it byte-for-byte (verified).

# PR Area Files
1 #3340 Computer-control input primitives + CEF main-thread crash fix 6
2 #3341 Accessibility AX/UIA perception + automate engine 13
3 #3342 Wire automate/ax_interact tools into the orchestrator 9
4 #3343 Phase 2 always-on listening engine + config + RPC 9
5 #3344 Always-on Settings toggle + debug panel + i18n 19 (14 locale one-liners)
6 #3345 macOS notch status pill 5
7 #3346 Phase 3 fast command router 11

Merge order: 1 → 7 (each is stacked on the previous). #3340 is ready for review now; #3341#3346 are drafts and will be rebased onto main (collapsing each to its own slice) and marked ready as their predecessors merge.

Closes #3148 moves to #3346 (the final slice). Closing this in favour of the split.

@M3gA-Mind M3gA-Mind closed this Jun 4, 2026
M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jun 4, 2026
…trator

Registers the AutomateTool (multi-step UI flows in one call) and the
ax_interact denylist/opt-in plumbing; adds the catalog toggle, tool
definition, and orchestrator prompt guidance (automate + screenshot/
mouse/keyboard fallback for Electron apps with empty AX trees).

Slice 3/7 of tinyhumansai#3307 (tool wiring + prompts).
M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jun 4, 2026
Continuous cpal mic → VAD segmenter → STT → agent with no hotkey, opt-in
via voice_server.always_on_enabled, 'Hey Tiny' wake word (English-forced
STT + fuzzy match), and screen-lock privacy pause. Adds the config schema,
live-apply on the settings RPC, start_if_enabled wiring, and a JSON-RPC
roundtrip E2E.

Slice 4/7 of tinyhumansai#3307 (always-on core).
M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jun 4, 2026
Surfaces the always-on listening toggle in the reachable Voice panel,
adds the VoiceDebugPanel, the voice tauri-command wrapper, and the RPC
client method. Adds all voice.debug.* and notch.* i18n keys across the
14 locales (notch keys land here as inert strings; the notch UI that
consumes them ships in slice 6).

Slice 5/7 of tinyhumansai#3307 (always-on frontend).
M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jun 4, 2026
Transparent NSPanel + WKWebView anchored at the top-centre of the primary
screen showing live Ready/Listening/Processing state; automate streams
step progress to it via the overlay:attention socket bridge. macOS only;
no-op elsewhere.

Slice 6/7 of tinyhumansai#3307 (notch status pill).
M3gA-Mind added a commit to M3gA-Mind/openhuman that referenced this pull request Jun 4, 2026
Routes always-on utterances through a fast intent classifier before the
chat model, wired into always-on delivery; ties the notch indicator
visibility to always-on listening. Adds the window tauri-command wrapper
and the core-process permission entry.

Slice 7/7 of tinyhumansai#3307 (Phase 3 fast routing).
senamakel pushed a commit to senamakel/openhuman that referenced this pull request Jun 6, 2026
senamakel pushed a commit to senamakel/openhuman that referenced this pull request Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant