diff --git a/docs/voice-automate-plan.md b/docs/voice-automate-plan.md new file mode 100644 index 0000000000..3802a47696 --- /dev/null +++ b/docs/voice-automate-plan.md @@ -0,0 +1,151 @@ +# Phase 1.5 Implementation Plan — `automate(app, goal)` + +**Parent tracker:** [`voice-system-actions.md`](voice-system-actions.md) (Change 1.14 / Phase 1.5) +**Decided approach:** Rust inner loop + fast model (chat LLM out of the click loop) +**First proof target:** Music — "play ``" end-to-end +**Status:** Plan — awaiting approval before code + +--- + +## 1. Goal + +Turn a single high-level intent ("play Numb by Linkin Park") into a multi-step UI +automation that completes in **one tool call from the orchestrator**, runs fast, +and self-corrects — instead of N separate chat-LLM turns over the raw +`ax_interact` primitives (today's flow; see tracker §1.10–1.13 for why that's +slow and fragile). + +## 2. Architecture + +```text + orchestrator (chat LLM) + │ one call: automate{ app, goal } + ▼ + AutomateTool (tools/impl/computer/automate.rs) + │ delegates to + ▼ + accessibility::automate::run(app, goal) ← the inner loop (Rust) + │ + ├─ fast-path dispatch ── app_fastpaths/{music,spotify,slack}.rs + │ (deterministic; skip the loop entirely when available) + │ + └─ general loop ──► perceive → decide → act → settle → verify ──┐ + ▲ │ + └────────────── repeat until done / fail / budget ───────┘ + perceive: ax_list_elements_filtered (existing) + decide: create_chat_provider("automation", cfg) → JSON action + act: ax_press_element / ax_set_field_value / launch_app (existing) + settle: helper "ax_wait_settled" (new) — AXObserver, not sleep + verify: re-read state; confirm the action took effect +``` + +The **chat model is invoked once** (to pick `automate` and its `goal`). The +**fast model** runs the inner loop with a tiny context (goal + current filtered +snapshot + last result), so each step is ~0.5–1s and cheap. + +## 3. 
Inner-loop algorithm + +State carried across iterations: `goal`, `app`, `history: Vec`, `budget`. + +Each iteration: +1. **Perceive** — `ax_list_elements_filtered(app, last_filter_or_"")`, capped/filtered + exactly as the `ax_interact` tool does today (≤60 elements, never a raw dump). +2. **Decide** — call the fast model with a strict system prompt + the JSON action + schema (below). Parse one action. +3. **Act** — execute via existing helpers. `launch` → `launch_app`; `press` → + `ax_press_element`; `set_value` → `ax_set_field_value`; `list` → just re-perceive + with a new filter. +4. **Settle** — `ax_wait_settled(app, timeout)` (new helper): block until the AX + tree stops changing (debounced AXObserver notifications) or timeout. Removes the + timing-race class deterministically. +5. **Verify** — re-read; confirm the expected post-condition (e.g. a new control + appeared, focus changed, a value was set). Record success/failure in `history`. +6. **Loop** until the model emits `done`/`fail`, or the step budget (e.g. 12) is hit. + +### Action schema (fast model output — strict JSON) +```jsonc +{ + "thought": "short reasoning", + "action": "launch | list | press | set_value | done | fail", + "app": "Music", // optional override; defaults to the task app + "filter": "Highway", // for list + "label": "Play", // for press / set_value + "value": "Highway to Hell", // for set_value + "summary": "what happened / why done" // for done|fail +} +``` +Invalid JSON or unknown action → one repair retry, then `fail` with the raw text +logged (never act on a guess — this is the §1.13 hallucination lesson). + +## 4. New files & changes (grounded in current layout) + +**New** +- `src/openhuman/accessibility/automate.rs` — `run(app, goal, opts) -> Result`; the loop, action schema (serde), fast-model call, step budget, structured `history`. +- `src/openhuman/accessibility/app_fastpaths/mod.rs` + `music.rs` (Spotify/Slack land later) — `try_fastpath(app, goal) -> Option>`. 
+- `src/openhuman/tools/impl/computer/automate.rs` — `AutomateTool { allow_mutations }`; reuses the `ax_interact` gating posture (mutations opt-in, `SENSITIVE_APPS` denylist, `permission_level_with_args` = Dangerous, `external_effect_with_args` = true). +- `src/openhuman/accessibility/automate_tests.rs` — unit tests for the loop (mock perceive/act/decide), schema parse/repair, budget, fast-path dispatch. + +**Changed** +- `accessibility/helper.rs` (macOS Swift) — add `ax_wait_settled` (AXObserver on `kAXValueChanged`/`kAXFocusedUIElementChanged`/`kAXCreated`, debounce ~150ms, bounded ~3s) and return richer element fields (enabled / on-screen / supported actions) from `ax_list`. +- `accessibility/ax_interact.rs` — surface a `ax_wait_settled` Rust wrapper; extend `AXElement` with the new optional fields (back-compat: `#[serde(default)]`). +- `accessibility/mod.rs` — declare `automate`, `app_fastpaths`. +- `inference/provider/factory.rs` — add an `"automation"` role (falls back to the fast/summarization tier) so the loop's model is independently configurable. +- `tools/ops.rs` (`all_tools_with_runtime`), `tools/user_filter.rs` (new `"automate"` family), `agent_registry/agents/orchestrator/agent.toml` (`named` list), `app/src/utils/toolDefinitions.ts` (Settings → Agent Access toggle). +- Tracker: flip Change 1.14 / Phase 1.5 rows from ⏳ Planned → in progress as milestones land. + +## 5. Fast-model call + +`create_chat_provider("automation", &cfg)` → `(provider, model)`; build a +`ChatRequest { messages, tools: None, stream: None }` with a system prompt that +pins the JSON schema and a user message carrying `{goal, snapshot, history_tail}`. +No tools array — we want a single JSON object back, parsed by us, executed by us. +Temperature low. Token budget small (snapshot is already ≤60 elements). + +## 6. Music proof (first target) + +`app_fastpaths/music.rs` encodes the §1.11 proven sequence behind one entry: +1. `launch_app("Music")` +2. 
open `music://music.apple.com/search?term=` (URL scheme) +3. `ax_wait_settled` +4. `ax_list_elements_filtered("Music", )` → find the song row +5. `ax_press_element` the row (navigate into detail) +6. `ax_wait_settled` → `ax_list` the detail page → `ax_press_element("Play")` +7. verify `osascript … get player state == playing` (best-effort, logged) + +If the fast-path can't find the row (timing/locale), fall through to the **general +loop**, which is what proves the architecture is app-agnostic. + +## 7. Progress streaming + +Emit a `DomainEvent` per step (`AutomateProgress { app, step, action, ok }`) on the +event bus; a subscriber bridges to the existing notch/voice status surface +(PR #3166) so the user sees "Opening Music → searching → playing" live. Reuses the +`ApprovalSurfaceSubscriber` bridging pattern. + +## 8. Testing + +- **Unit** (`automate_tests.rs`, CI-safe): action JSON parse + repair; budget exhaustion → `fail`; fast-path dispatch chosen over loop; verify-failure triggers retry/alternate. Perceive/act/decide are trait-injected so tests need no mic/AX/LLM. +- **Integration** (`#[ignore]`, run on a real Mac): the Music flow end-to-end (mirrors `ax_interact_tests::test_full_flow_search_and_play_acdc`); tool-level success hard-asserted, playback best-effort. +- **Agent-in-the-loop**: ask the running app "play ``", confirm it picks `automate` and the song plays; watch `[automate]` logs. + +## 9. Milestones (sequenced) + +1. **M1** — `automate.rs` loop skeleton + action schema + fast-model call + `AutomateTool` (gated, registered). Loop runs against existing (non-settled) `ax_interact` helpers. Unit tests. *Compiles + agent can call it.* +2. **M2** — `ax_wait_settled` (helper + wrapper) + verify step wired into the loop. Kills the timing-race class. +3. **M3** — Music fast-path; prove the flow end-to-end on a Mac. +4. **M4** — progress streaming to the notch surface. +5. **M5** — richer element model (enabled/onscreen/actions) for better matching. +6. 
*(later)* Spotify + Slack fast-paths; vision fallback for Electron; Windows UIA settle parity. + +## 10. Risks / open questions + +- **Fast model availability** — if no fast tier is configured, fall back to the + chat model for the loop (still one tool call; just slower). The `"automation"` + role makes this a config decision, not a hard dependency. +- **AXObserver from the Swift helper** — needs a short run-loop pump; if flaky, + fall back to a polling settle (count-stable-for-150ms) behind the same wrapper. +- **macOS-only first** — Windows UIA settle/verify parity is M6, gated like the + existing cfg-dispatch; non-mac/non-win returns the existing clean runtime error. +- **Safety** — `automate` is a mutating tool: same opt-in + `SENSITIVE_APPS` + denylist + ApprovalGate routing as `ax_interact`; the inner loop may not target a + denylisted app even if the model asks. diff --git a/src/openhuman/accessibility/app_fastpaths/fastpaths_tests.rs b/src/openhuman/accessibility/app_fastpaths/fastpaths_tests.rs new file mode 100644 index 0000000000..29ead72c25 --- /dev/null +++ b/src/openhuman/accessibility/app_fastpaths/fastpaths_tests.rs @@ -0,0 +1,215 @@ +//! Tests for the app fast-paths: pure query parsing + the Music sequence via a +//! scripted backend (no live Music, no model). + +use super::super::automate::{AutomateBackend, AutomateOutcome}; +use super::super::ax_interact::AXElement; +use super::music; +use async_trait::async_trait; +use std::sync::Mutex; + +// ── Pure parser tests ─────────────────────────────────────────────── + +#[test] +fn matches_music_play_intents() { + assert!(music::matches("Music", "play Numb by Linkin Park")); + assert!(music::matches("Apple Music", "play Highway to Hell")); + assert!(music::matches("music", "launch music and play Numb")); + // Not a play intent → no fast-path. + assert!(!music::matches("Music", "pause")); + // Not Music → no fast-path. 
+ assert!(!music::matches("Slack", "play Numb")); +} + +#[test] +fn extract_query_basic() { + assert_eq!( + music::extract_play_query("play Numb by Linkin Park").as_deref(), + Some("Numb Linkin Park") + ); +} + +#[test] +fn extract_query_strips_filler_and_suffix() { + assert_eq!( + music::extract_play_query("play the song Highway to Hell by AC/DC").as_deref(), + Some("Highway to Hell AC/DC") + ); + assert_eq!( + music::extract_play_query("play Numb in Apple Music").as_deref(), + Some("Numb") + ); +} + +#[test] +fn extract_query_after_launch_clause() { + assert_eq!( + music::extract_play_query("launch Music and play Numb").as_deref(), + Some("Numb") + ); +} + +#[test] +fn extract_query_rejects_non_play() { + assert_eq!(music::extract_play_query("pause the music"), None); + assert_eq!(music::extract_play_query("display settings"), None); // "play" inside "display" + assert_eq!(music::extract_play_query("play"), None); // nothing after + // Right boundary: "play" must be a whole word, not a prefix of "playback". + assert_eq!(music::extract_play_query("open playback settings"), None); + assert!(!music::matches("Music", "show playback options")); +} + +#[test] +fn extract_query_handles_unicode_without_panicking() { + // `to_lowercase()` can change byte lengths for non-ASCII text; the parser + // (and replace_ci's " by " rewrite) must never slice mid-codepoint. + assert_eq!( + music::extract_play_query("play Café del Mar by Renée").as_deref(), + Some("Café del Mar Renée") + ); +} + +#[test] +fn extract_query_from_quoted_title_with_artist() { + // The exact goal that failed live: song quoted earlier, sentence ends "…play it". + assert_eq!( + music::extract_play_query( + "launch Music app, search for \"Highway to Hell\" by AC/DC, and play it" + ) + .as_deref(), + Some("Highway to Hell AC/DC") + ); + assert_eq!( + music::extract_play_query("play \"Numb\" by Linkin Park").as_deref(), + Some("Numb Linkin Park") + ); + // Quoted title, no artist. 
+ assert_eq!( + music::extract_play_query("please play \"Bohemian Rhapsody\"").as_deref(), + Some("Bohemian Rhapsody") + ); +} + +#[test] +fn extract_query_rejects_bare_pronoun() { + // No song name anywhere → decline (let the general loop / a clarifier handle it). + assert_eq!(music::extract_play_query("play it"), None); + assert_eq!(music::extract_play_query("play something"), None); + assert!(!music::matches("Music", "play it")); +} + +// ── Sequence test via scripted backend ────────────────────────────── + +struct Backend { + acts: Mutex>, + /// Elements returned by perceive (the search results screen). + elements: Vec, + press_fail_on: Option, +} + +impl Backend { + fn new(elements: Vec) -> Self { + Self { + acts: Mutex::new(Vec::new()), + elements, + press_fail_on: None, + } + } + fn acts(&self) -> Vec { + self.acts.lock().unwrap().clone() + } +} + +#[async_trait] +impl AutomateBackend for Backend { + async fn perceive(&self, _app: &str, _filter: &str) -> Result, String> { + Ok(self.elements.clone()) + } + async fn decide(&self, _system: &str, _user: &str) -> Result { + Err("fast-path must not call the model".into()) + } + async fn act_launch(&self, app: &str) -> Result { + self.acts.lock().unwrap().push(format!("launch:{app}")); + Ok("ok".into()) + } + async fn act_press(&self, app: &str, label: &str) -> Result { + self.acts + .lock() + .unwrap() + .push(format!("press:{app}:{label}")); + if self.press_fail_on.as_deref() == Some(label) { + return Err("press failed".into()); + } + Ok("ok".into()) + } + async fn act_set_value(&self, _a: &str, _l: &str, _v: &str) -> Result { + Ok("ok".into()) + } + async fn open_url(&self, url: &str) -> Result { + self.acts.lock().unwrap().push(format!("open_url:{url}")); + Ok("ok".into()) + } + async fn settle(&self, _app: &str) {} + async fn wait(&self, _ms: u64) {} +} + +fn song_row(label: &str) -> AXElement { + AXElement::new("AXCell", label) +} + +#[tokio::test] +async fn music_fastpath_full_sequence() { + let backend = 
Backend::new(vec![song_row("Numb"), AXElement::new("AXButton", "Play")]); + let out = music::run("play Numb by Linkin Park", &backend).await; + assert!(out.success, "expected success: {out:?}"); + let acts = backend.acts(); + // launch → open search url → press the row → press detail Play. + assert_eq!(acts[0], "launch:Music"); + assert!(acts[1].starts_with("open_url:music://"), "got {}", acts[1]); + assert!(acts.contains(&"press:Music:Numb".to_string()), "{acts:?}"); + assert!(acts.contains(&"press:Music:Play".to_string()), "{acts:?}"); +} + +#[tokio::test] +async fn music_fastpath_no_row_fails_for_fallthrough() { + // Search screen has nothing matching → fast-path fails (loop falls through). + let backend = Backend::new(vec![AXElement::new("AXButton", "Some Unrelated Button")]); + let out = music::run("play Numb", &backend).await; + assert!(!out.success); + assert!(out.summary.contains("no matching song"), "{}", out.summary); +} + +#[tokio::test] +async fn music_fastpath_presses_row_even_if_reported_disabled() { + // Apple Music reports pressable result rows as enabled=Some(false); the + // fast-path must still press them (regression guard for the M5 mis-gate). + let mut row = AXElement::new("AXCell", "Numb"); + row.enabled = Some(false); + let backend = Backend::new(vec![row, AXElement::new("AXButton", "Play")]); + let out = music::run("play Numb", &backend).await; + assert!(out.success, "must press a 'disabled'-reported row: {out:?}"); + assert!(backend.acts().contains(&"press:Music:Numb".to_string())); +} + +#[tokio::test] +async fn try_fastpath_dispatches_music_and_skips_others() { + let backend = Backend::new(vec![song_row("Numb")]); + // Non-music app → None (general loop handles it). + assert!(super::try_fastpath("Slack", "play Numb", &backend) + .await + .is_none()); + // Music + play → Some. 
+ assert!(super::try_fastpath("Music", "play Numb", &backend) + .await + .is_some()); +} + +// Outcome type sanity: fast-paths build the same outcome the loop returns. +#[test] +fn outcome_shape() { + let o = AutomateOutcome { + success: true, + summary: "x".into(), + steps: vec![], + }; + assert!(o.success); +} diff --git a/src/openhuman/accessibility/app_fastpaths/mod.rs b/src/openhuman/accessibility/app_fastpaths/mod.rs new file mode 100644 index 0000000000..534d7299b5 --- /dev/null +++ b/src/openhuman/accessibility/app_fastpaths/mod.rs @@ -0,0 +1,34 @@ +//! Deterministic per-app accelerators for the `automate` loop. +//! +//! A fast-path encodes a *proven* native sequence for a common (app, intent) +//! pair so the loop doesn't have to rediscover it with the model every time. +//! [`try_fastpath`] is consulted **before** the general loop and returns: +//! - `Some(success)` → the loop returns it directly, +//! - `Some(failure)` → the loop logs and falls through to the model loop, +//! - `None` → no fast-path applies; straight to the model loop. +//! +//! So a fast-path can only *help*. This is deliberately different from the +//! removed `play_music` tool (tracker §1.13): that was a separate tool the LLM +//! had to choose (and chose wrong); this is internal to `automate`, transparent, +//! and always backed by the general loop. + +use super::automate::AutomateBackend; +use super::automate::AutomateOutcome; + +mod music; + +/// Try every registered fast-path; return the first that claims the (app, goal). 
+pub async fn try_fastpath( + app: &str, + goal: &str, + backend: &dyn AutomateBackend, +) -> Option { + if music::matches(app, goal) { + return Some(music::run(goal, backend).await); + } + None +} + +#[cfg(test)] +#[path = "fastpaths_tests.rs"] +mod tests; diff --git a/src/openhuman/accessibility/app_fastpaths/music.rs b/src/openhuman/accessibility/app_fastpaths/music.rs new file mode 100644 index 0000000000..02f281f965 --- /dev/null +++ b/src/openhuman/accessibility/app_fastpaths/music.rs @@ -0,0 +1,535 @@ +//! Apple Music fast-path: "play ``". +//! +//! Encodes the sequence empirically proven in tracker §1.11: open the Music +//! search URL scheme, press the matching song row to **navigate** into it, then +//! press the detail-page **Play** (a search-result press only selects/navigates; +//! the second Play press is what actually starts playback). All steps go through +//! the injectable [`AutomateBackend`], so the whole flow is unit-testable with a +//! scripted backend — no live Music, no model. + +use super::AutomateBackend; +use super::AutomateOutcome; + +const APP: &str = "Music"; + +/// Element roles that represent a tappable search result / song row. +const ROW_ROLES: &[&str] = &["AXCell", "AXRow", "ListItem", "AXButton", "AXStaticText"]; + +/// Does this (app, goal) look like an Apple Music "play X" request? +pub fn matches(app: &str, goal: &str) -> bool { + is_music_app(app) && extract_play_query(goal).is_some() +} + +/// True for the Apple Music app under its common display names. +fn is_music_app(app: &str) -> bool { + let a = app.trim().to_lowercase(); + a == "music" || a == "apple music" || a == "itunes" +} + +/// Pull the search query out of a "play …" goal, or `None` if it isn't one. +/// +/// Two strategies, in order: +/// 1. **Quoted title** — the orchestrator usually quotes the song, e.g. +/// `search for "Highway to Hell" by AC/DC, and play it`. Use the first +/// quoted span, plus any `by ` that immediately follows it. 
This is +/// robust to where "play" sits in the sentence (this was the bug: a goal +/// ending in "…and play it" made the after-"play" strategy extract "it"). +/// 2. **After "play"** — `play Numb by Linkin Park`, `play the song X`, etc. +/// +/// Either way: drop leading `the song`/`track` filler, a trailing +/// `in/on (apple) music`, rewrite ` by ` to a space (better catalog recall), +/// and reject bare pronouns ("it"/"this"/…) that carry no song name. +pub fn extract_play_query(goal: &str) -> Option { + // Strategy 1: first quoted title (+ trailing "by artist"). + if let Some((title, rest)) = first_quoted(goal) { + let mut q = title.trim().to_string(); + if let Some(artist) = trailing_by_artist(rest) { + q.push(' '); + q.push_str(&artist); + } + let q = clean_query(&q); + if !q.is_empty() && !is_pronoun(&q) { + return Some(q); + } + } + + // Strategy 2: text after the last "play", which must be a whole word. + // ASCII-only lowercasing keeps byte offsets identical to `goal`, so + // `idx`/`after_idx` are safe to reuse on the original string below + // (`to_lowercase` can change byte lengths for non-ASCII text). + let lower = goal.to_ascii_lowercase(); + let idx = lower.rfind("play")?; + let before_ok = idx == 0 + || !lower[..idx] + .chars() + .next_back() + .map(|c| c.is_alphabetic()) + .unwrap_or(false); + let after_idx = idx + "play".len(); + // Right boundary too, so "playback …" isn't parsed as a play intent. + let after_ok = lower[after_idx..] + .chars() + .next() + .map(|c| !c.is_alphabetic()) + .unwrap_or(true); + if !(before_ok && after_ok) { + return None; + } + let after = &goal[after_idx..]; + let mut q = after.trim().to_string(); + for filler in ["the song ", "the track ", "song ", "track ", "me "] { + // ASCII lowercasing again: a match guarantees the first + // `filler.len()` bytes of `q` are ASCII, so the slice is safe. + if q.to_ascii_lowercase().starts_with(filler) { + q = q[filler.len()..].to_string(); + break; + } + } + let q = clean_query(&q); + if q.is_empty() || is_pronoun(&q) { + None + } else { + Some(q) + } +} + +/// Strip a trailing "(in|on) [apple] music" and rewrite " by " → " ". 
+fn clean_query(q: &str) -> String { + let mut q = q.trim().to_string(); + let ql = q.to_lowercase(); + for suffix in [ + " in apple music", + " on apple music", + " in music", + " on music", + ] { + if ql.ends_with(suffix) { + q.truncate(q.len() - suffix.len()); + break; + } + } + replace_ci(&q, " by ", " ").trim().to_string() +} + +/// A query that's just a pronoun / generic noun carries no song — reject it so +/// the fast-path declines and the general loop (or a clarifying reply) handles it. +fn is_pronoun(q: &str) -> bool { + matches!( + q.trim().to_lowercase().as_str(), + "it" | "this" | "that" | "them" | "something" | "some music" | "music" | "a song" | "songs" + ) +} + +/// Return the first double-quoted span (straight or curly quotes) and the text +/// after its close. Single quotes are not treated as delimiters. +fn first_quoted(s: &str) -> Option<(String, &str)> { + // Support straight and curly double quotes. + let opens = ['"', '\u{201C}']; + let closes = ['"', '\u{201D}']; + let start = s.find(|c| opens.contains(&c))?; + let after_open = start + s[start..].chars().next()?.len_utf8(); + let rel = s[after_open..].find(|c| closes.contains(&c))?; + let inner = &s[after_open..after_open + rel]; + let close_end = after_open + rel + s[after_open + rel..].chars().next()?.len_utf8(); + if inner.trim().is_empty() { + return None; + } + Some((inner.to_string(), &s[close_end..])) +} + +/// If `rest` begins with `by `, capture the artist up to the next +/// clause boundary ("," / " and " / " then " / " in " / " on " / end). +fn trailing_by_artist(rest: &str) -> Option { + let t = rest.trim_start(); + // ASCII-only lowercasing keeps `lower` the same byte length as `t`, so the + // offset computed from `after.len()` below is valid for `t`. + let lower = t.to_ascii_lowercase(); + let after = lower.strip_prefix("by ")?; + let artist_region = &t[t.len() - after.len()..]; + // Cut at the first clause boundary. 
+ let mut end = artist_region.len(); + for delim in [",", " and ", " then ", " in ", " on "] { + // ASCII lowercasing preserves byte offsets, so `p` indexes + // `artist_region` safely even when the artist has non-ASCII text. + if let Some(p) = artist_region.to_ascii_lowercase().find(delim) { + end = end.min(p); + } + } + let artist = artist_region[..end].trim().to_string(); + if artist.is_empty() { + None + } else { + Some(artist) + } +} + +/// Case-insensitive replace of `needle` with `repl` in `haystack`. +fn replace_ci(haystack: &str, needle: &str, repl: &str) -> String { + if needle.is_empty() { + return haystack.to_string(); + } + let nl = needle.to_lowercase(); + let mut out = String::with_capacity(haystack.len()); + let mut rest = haystack; + while !rest.is_empty() { + // Compare on `rest` itself (never index the lowercased copy with + // original byte offsets — `to_lowercase` can change byte lengths for + // Unicode, which would slice mid-codepoint and panic). + if rest.len() >= needle.len() + && rest.is_char_boundary(needle.len()) + && rest[..needle.len()].to_lowercase() == nl + { + out.push_str(repl); + rest = &rest[needle.len()..]; + } else { + let ch = rest.chars().next().unwrap(); + out.push(ch); + rest = &rest[ch.len_utf8()..]; + } + } + out +} + +/// Build the Apple Music search URL (`music://` scheme) for `query`. +fn search_url(query: &str) -> String { + format!( + "music://music.apple.com/search?term={}", + percent_encode(query) + ) +} + +/// Percent-encode every byte outside the RFC 3986 unreserved set +/// (ASCII letters, digits, `-`, `_`, `.`, `~`), so spaces, URL delimiters and +/// non-ASCII UTF-8 bytes are all escaped. Enough for app URL schemes; not a +/// general-purpose URL encoder. +fn percent_encode(s: &str) -> String { + let mut out = String::with_capacity(s.len()); + for b in s.bytes() { + match b { + b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => { + out.push(b as char) + } + _ => out.push_str(&format!("%{b:02X}")), + } + } + out +} + +/// The first query token worth filtering on (length > 2 so "to"/"by" don't +/// match everything). 
Used as the perceive filter: the snapshot's substring +/// filter can't match a whole multi-word title, so we narrow by one strong +/// token and let `pick_row` do the full token match. +fn first_token(query: &str) -> String { + query + .split_whitespace() + .find(|t| t.len() > 2) + .unwrap_or("") + .to_string() +} + +/// Choose the best matching row from a perceive snapshot: an exact label match +/// first, else the first row-role element whose label shares a word with the +/// query. Returns the element label to press. +fn pick_row(elements: &[super::super::ax_interact::AXElement], query: &str) -> Option { + let ql = query.to_lowercase(); + // Exact label match wins. (We deliberately do NOT skip elements whose + // reported `enabled` is false — Apple Music marks pressable result rows as + // disabled; see AXElement::enabled docs.) + if let Some(e) = elements.iter().find(|e| e.label.to_lowercase() == ql) { + return Some(e.label.clone()); + } + let tokens: Vec<&str> = ql.split_whitespace().filter(|t| t.len() > 2).collect(); + elements + .iter() + .filter(|e| ROW_ROLES.iter().any(|r| e.role.contains(r))) + .find(|e| { + let l = e.label.to_lowercase(); + tokens.iter().any(|t| l.contains(t)) + }) + .map(|e| e.label.clone()) +} + +/// Run the play fast-path. Returns a failed [`AutomateOutcome`] (not a panic) +/// whenever a step can't proceed, so the caller falls through to the general +/// loop. +pub async fn run(goal: &str, backend: &dyn AutomateBackend) -> AutomateOutcome { + let mut steps: Vec = Vec::new(); + let query = match extract_play_query(goal) { + Some(q) => q, + None => { + return fail("not a play request", steps); + } + }; + log::info!("[automate::music] ▶ play query={query:?}"); + use super::super::automate::progress; + use crate::openhuman::overlay::OverlayAttentionTone; + progress( + format!("Searching Music for {query}…"), + OverlayAttentionTone::Accent, + ); + + // 1. Launch Music. 
+ match backend.act_launch(APP).await { + Ok(m) => steps.push(format!("launch: {m}")), + Err(e) => steps.push(format!("launch FAILED: {e}")), + } + backend.settle(APP).await; + + // 2. Open the search URL. + let url = search_url(&query); + match backend.open_url(&url).await { + Ok(m) => steps.push(format!("search: {m}")), + Err(e) => { + steps.push(format!("search url FAILED: {e}")); + return fail("could not open Music search", steps); + } + } + // 3. Find the song row and press it to navigate in. Search results render + // asynchronously (the §1.13 timing race), so retry across settles, and + // filter the snapshot by one strong token (a substring filter can't + // match a whole multi-word title). + let filter = first_token(&query); + let mut row = None; + for attempt in 0..6 { + backend.settle(APP).await; + let els = backend.perceive(APP, &filter).await.unwrap_or_default(); + if let Some(r) = pick_row(&els, &query) { + row = Some(r); + break; + } + // Catalog search results arrive asynchronously (~3-4s); element-count + // settle can report "stable" while the network fetch is still pending, + // so wait real time between attempts rather than spinning instantly. + log::info!("[automate::music] search results not ready (attempt {attempt}), waiting"); + backend.wait(800).await; + } + let row = match row { + Some(r) => r, + None => return fail("no matching song row found", steps), + }; + // Baseline count of "Play" controls *before* navigating, so we can tell + // when the song's detail-page Play has actually rendered (vs. only the + // toolbar transport Play that's always present). + let plays_before = count_play_buttons(backend).await; + + match backend.act_press(APP, &row).await { + Ok(m) => steps.push(format!("open song: {m}")), + Err(e) => { + steps.push(format!("open song FAILED: {e}")); + return fail("could not open the song", steps); + } + } + + // 4. Wait for the detail-page Play to appear. 
Pressing too early hits only + // the toolbar transport (empty queue → silence) — the exact false-success + // we hit live. Poll until a new Play control shows up (or give up after a + // few settles and try anyway). + for _ in 0..5 { + backend.settle(APP).await; + if count_play_buttons(backend).await > plays_before { + break; + } + } + + // 5. Press Play, then VERIFY real playback. If it didn't start, the press + // landed on the wrong Play — wait and retry a couple of times. Only + // report success when player state is actually "playing" (or the backend + // can't verify, in which case it's best-effort). + let mut verified: Option = None; + for attempt in 0..3 { + match backend.act_press(APP, "Play").await { + Ok(m) => steps.push(format!("play press (attempt {attempt}): {m}")), + Err(e) => steps.push(format!("play press FAILED: {e}")), + } + backend.settle(APP).await; + match backend.verify_playing().await { + Some(true) => { + verified = Some(true); + break; + } + Some(false) => { + verified = Some(false); + // Give the detail page a beat to settle, then retry. + tokio::time::sleep(std::time::Duration::from_millis(700)).await; + } + None => { + // Can't verify (non-macOS) — accept best-effort and stop. + verified = None; + break; + } + } + } + + match verified { + Some(false) => { + steps.push("verify: player state never reached 'playing'".to_string()); + fail("opened the song but playback didn't start", steps) + } + Some(true) => { + steps.push("verify: playing ✓".to_string()); + progress(format!("Playing {query}"), OverlayAttentionTone::Success); + AutomateOutcome { + success: true, + summary: format!("Playing '{query}' in Music."), + steps, + } + } + None => AutomateOutcome { + success: true, + summary: format!("Started '{query}' in Music (playback unverified)."), + steps, + }, + } +} + +/// Count "Play"-labelled controls currently visible (toolbar + any detail-page +/// Play). Used to detect when navigation has rendered the song's own Play. 
+async fn count_play_buttons(backend: &dyn AutomateBackend) -> usize { + backend + .perceive(APP, "Play") + .await + .map(|els| { + els.iter() + .filter(|e| e.label.eq_ignore_ascii_case("Play")) + .count() + }) + .unwrap_or(0) +} + +fn fail(msg: &str, steps: Vec) -> AutomateOutcome { + AutomateOutcome { + success: false, + summary: format!("Music fast-path: {msg}"), + steps, + } +} + +#[cfg(test)] +mod unit { + use super::*; + + #[test] + fn first_token_skips_short_words() { + assert_eq!(first_token("Highway to Hell AC/DC"), "Highway"); + assert_eq!(first_token("Numb Linkin Park"), "Numb"); + // All-short → empty (perceive then falls back to a broad list). + assert_eq!(first_token("a x"), ""); + } + + #[test] + fn percent_encode_escapes_reserved() { + assert_eq!(percent_encode("Highway to Hell"), "Highway%20to%20Hell"); + // The slash in AC/DC must be encoded (this was the live-run bug). + assert_eq!(percent_encode("AC/DC"), "AC%2FDC"); + assert_eq!(percent_encode("rock&roll"), "rock%26roll"); + } + + #[test] + fn search_url_is_well_formed() { + let u = search_url("Highway to Hell AC/DC"); + assert_eq!( + u, + "music://music.apple.com/search?term=Highway%20to%20Hell%20AC%2FDC" + ); + } + + #[test] + fn pick_row_prefers_exact_then_token() { + use super::super::super::ax_interact::AXElement; + let els = vec![ + AXElement::new("AXCell", "Highway to Hell"), + AXElement::new("AXButton", "Play"), + ]; + // Token match (query has extra "AC/DC" the row label lacks). + assert_eq!( + pick_row(&els, "Highway to Hell AC/DC").as_deref(), + Some("Highway to Hell") + ); + } +} + +/// Live integration test — drives the real Apple Music app. Ignored by default +/// (needs macOS, the Music app, and Accessibility permission for the runner). 
+/// +/// Run on a Mac with: +/// cargo test --lib music_fastpath_live -- --ignored --nocapture +#[cfg(all(test, target_os = "macos"))] +mod live { + use super::run; + use crate::openhuman::accessibility::automate::RealBackend; + + #[tokio::test] + #[ignore = "requires macOS + Music app + Accessibility permission"] + async fn music_fastpath_live() { + let backend = RealBackend::new(crate::openhuman::config::Config::default()); + let out = run("play Highway to Hell by AC/DC", &backend).await; + // Print the outcome and step trace before asserting, so a flaky run + // (Apple Music's UI is nondeterministic; tracker §1.11/§1.13) is debuggable. + println!( + "[music_fastpath_live] success={} summary={}", + out.success, out.summary + ); + for s in &out.steps { + println!(" - {s}"); + } + let state = player_state(); + println!("[music_fastpath_live] player_state={state}"); + // Now that the flow verifies playback, hold it to the real bar: + // the song must actually be playing. + assert!(out.success, "fast-path reported failure: {}", out.summary); + assert_eq!(state, "playing", "Music did not actually start playing"); + } + + /// `osascript` ground-truth for whether audio is actually playing. + fn player_state() -> String { + std::process::Command::new("osascript") + .args(["-e", "tell application \"Music\" to player state as string"]) + .output() + .ok() + .map(|o| String::from_utf8_lossy(&o.stdout).trim().to_string()) + .unwrap_or_else(|| "(osascript failed)".into()) + } + + /// Empirical probe (not an assertion): open the search, dump what Music's + /// AX tree actually exposes, and report player state before/after each + /// candidate press. Used to design the real play sequence.
+ #[tokio::test] + #[ignore = "probe — run manually to inspect Music's AX tree"] + async fn music_probe() { + use crate::openhuman::accessibility::ax_interact as ax; + let q = "Highway to Hell"; + let _ = std::process::Command::new("open") + .arg("-a") + .arg("Music") + .status(); + std::thread::sleep(std::time::Duration::from_secs(3)); + let _ = std::process::Command::new("open") + .arg(format!( + "music://music.apple.com/search?term={}", + q.replace(' ', "%20") + )) + .status(); + std::thread::sleep(std::time::Duration::from_secs(4)); + + println!("=== player state at start: {} ===", player_state()); + let dump = |label: &str, filter: &str| match ax::ax_list_elements_filtered("Music", filter) + { + Ok(els) => { + println!( + "--- {label} (filter={filter:?}): {} elements ---", + els.len() + ); + for e in els.iter().take(60) { + println!(" [{}] {} enabled={:?}", e.role, e.label, e.enabled); + } + } + Err(e) => println!("--- {label}: ERROR {e} ---"), + }; + dump("after search", "Highway"); + dump("play buttons", "Play"); + + // Press the first search-result row → does it navigate / play? + println!("\n>>> pressing result 'Highway to Hell'"); + let _ = ax::ax_press_element("Music", "Highway to Hell"); + std::thread::sleep(std::time::Duration::from_secs(3)); + println!("=== player state after row press: {} ===", player_state()); + dump("detail page play", "Play"); + + // Try the detail-page Play (not the toolbar one) if still stopped. + if player_state() != "playing" { + println!("\n>>> pressing 'Play' after navigate"); + let _ = ax::ax_press_element("Music", "Play"); + std::thread::sleep(std::time::Duration::from_secs(3)); + println!("=== player state after Play press: {} ===", player_state()); + } + } +} diff --git a/src/openhuman/accessibility/automate.rs b/src/openhuman/accessibility/automate.rs new file mode 100644 index 0000000000..910d2f8e57 --- /dev/null +++ b/src/openhuman/accessibility/automate.rs @@ -0,0 +1,540 @@ +//! 
`automate` — Rust-driven multi-step UI automation loop. +//! +//! Phase 1.5 (see `docs/voice-automate-plan.md`). The chat orchestrator calls +//! `automate{app, goal}` **once**; this module then runs the whole multi-step +//! flow internally with a *fast* model, so the heavy chat model never sits +//! inside the click loop. Each iteration is **perceive → decide → act → +//! settle → verify**: +//! +//! - **perceive** — read a small, filtered accessibility snapshot of the app +//! (`ax_interact::ax_list_elements_filtered`, capped — never a raw dump, +//! which is what made the chat model hallucinate; tracker §1.13). +//! - **decide** — ask the fast model for exactly one JSON action. +//! - **act** — run it via the existing AX primitives / `launch_app`. +//! - **settle** — wait for the UI to stop changing (M2 makes this real; the +//! M1 backend uses a short fixed wait). +//! - **verify** — fold the post-action snapshot back into the next prompt. +//! +//! The loop is generic over an [`AutomateBackend`] so the decision model, the +//! accessibility calls, and the launcher are all injectable — the unit tests +//! drive a scripted backend with no mic, no AX tree, and no LLM. + +use super::ax_interact as ax; +use crate::openhuman::overlay::{publish_attention, OverlayAttentionEvent, OverlayAttentionTone}; +use async_trait::async_trait; +use serde::Deserialize; + +const LOG_PREFIX: &str = "[automate]"; + +/// Push a one-line progress message to the notch / overlay so the user sees the +/// automation happening live (M4). Fire-and-forget: a no-op when nothing is +/// subscribed (e.g. unit tests, or the notch window isn't running). +pub(crate) fn progress(message: impl Into<String>, tone: OverlayAttentionTone) { + let _ = publish_attention( + OverlayAttentionEvent::new(message) + .with_source("automate") + .with_tone(tone) + .with_ttl_ms(5000), + ); +} + +/// Default ceiling on loop iterations.
Each iteration is one fast-model call +/// plus one action, so this bounds latency and cost even if the model never +/// emits `done`. +pub const DEFAULT_STEP_BUDGET: u32 = 12; + +/// How many elements a perceive snapshot renders into the prompt. Mirrors the +/// `ax_interact` tool cap so a broad/empty filter can't overflow the model's +/// context and trigger the truncation→hallucination failure (tracker §1.13). +const MAX_SNAPSHOT: usize = 40; + +/// One decoded action from the fast model. +#[derive(Debug, Clone, Deserialize, Default, PartialEq)] +pub struct Action { + /// The model's short reasoning. Logged, never executed. + #[serde(default)] + pub thought: String, + /// One of: `launch`, `list`, `press`, `set_value`, `done`, `fail`. + pub action: String, + /// Optional per-action app override; defaults to the task's app. + #[serde(default)] + pub app: Option<String>, + /// Substring filter for `list`. + #[serde(default)] + pub filter: String, + /// Element label for `press` / `set_value`. + #[serde(default)] + pub label: String, + /// Text to enter for `set_value`. + #[serde(default)] + pub value: String, + /// Final message for `done` / `fail`. + #[serde(default)] + pub summary: String, +} + +/// The result of a completed (or budget-exhausted) automation run. +#[derive(Debug, Clone, PartialEq)] +pub struct AutomateOutcome { + pub success: bool, + pub summary: String, + /// One human-readable line per executed step — surfaced back to the chat + /// agent and useful in logs. + pub steps: Vec<String>, +} + +impl AutomateOutcome { + fn fail(summary: impl Into<String>, steps: Vec<String>) -> Self { + Self { + success: false, + summary: summary.into(), + steps, + } + } +} + +/// Injectable side-effects for the loop. The production impl +/// ([`RealBackend`]) talks to the OS accessibility tree and a fast LLM; tests +/// supply a scripted impl. +#[async_trait] +pub trait AutomateBackend: Send + Sync { + /// Read interactive elements in `app` whose label contains `filter`.
+ async fn perceive(&self, app: &str, filter: &str) -> Result<Vec<ax::AXElement>, String>; + /// Ask the decision model for one JSON action. `system` pins the schema; + /// `user` carries the goal + current snapshot + recent step history. + async fn decide(&self, system: &str, user: &str) -> Result<String, String>; + async fn act_launch(&self, app: &str) -> Result<String, String>; + async fn act_press(&self, app: &str, label: &str) -> Result<String, String>; + async fn act_set_value(&self, app: &str, label: &str, value: &str) -> Result<String, String>; + /// Open a URL / URI-scheme (e.g. `music://…search?term=…`) via the OS opener. + /// Used by deterministic app fast-paths; the general loop does not call it. + async fn open_url(&self, url: &str) -> Result<String, String>; + /// Best-effort: is media currently playing? `None` when the backend can't + /// tell (non-macOS, or not applicable). Media fast-paths use this to confirm + /// an action *actually started playback* rather than just succeeding at the + /// AX level — the false-success that made "play" silently no-op (§1.11). + async fn verify_playing(&self) -> Option<bool> { + None + } + /// Block until the UI settles after an action. + async fn settle(&self, app: &str); + /// Wait ~`ms` of real time. Used by fast-paths to let asynchronous content + /// (e.g. network search results) render between perceive attempts. Default + /// is a real sleep; test backends override it to a no-op so suites stay fast. + async fn wait(&self, ms: u64) { + tokio::time::sleep(std::time::Duration::from_millis(ms)).await; + } +} + +/// Tuning for a run. +#[derive(Debug, Clone, Copy)] +pub struct AutomateOptions { + pub step_budget: u32, +} + +impl Default for AutomateOptions { + fn default() -> Self { + Self { + step_budget: DEFAULT_STEP_BUDGET, + } + } +} + +/// System prompt pinning the action contract for the fast model. +fn system_prompt() -> String { + "You drive a desktop app's UI to accomplish a goal.
You see a list of the \ + app's interactive elements (each as `[role] label`) and act one step at a \ + time.\n\ + \n\ + Respond with EXACTLY ONE JSON object and nothing else:\n\ + {\"thought\":\"...\",\"action\":\"\",\"app\":\"\",\ + \"filter\":\"...\",\"label\":\"...\",\"value\":\"...\",\"summary\":\"...\"}\n\ + \n\ + Verbs:\n\ + • launch — open the app (use first if it isn't showing any elements)\n\ + • list — re-read elements; set `filter` to a substring to narrow them\n\ + • press — activate the element whose label matches `label`\n\ + • set_value — type `value` into the field matching `label` (omit label = first field)\n\ + • done — goal achieved; put a short result in `summary`\n\ + • fail — goal cannot be achieved; explain in `summary`\n\ + \n\ + Rules:\n\ + - Pressing a LIST ROW or SEARCH RESULT usually only selects/opens it. To \ + trigger playback or submission you must then press the actual action button \ + (e.g. open a song, THEN press its 'Play'). After such a press, `list` again \ + to see the new screen.\n\ + - Prefer an exact label match. Keep `filter` specific so the snapshot stays small.\n\ + - Output JSON only — no prose, no code fences." + .to_string() +} + +/// Render a perceive snapshot into compact prompt text. +fn render_snapshot(app: &str, filter: &str, elements: &[ax::AXElement]) -> String { + if elements.is_empty() { + return format!( + "App '{app}' shows no elements matching filter '{filter}' (it may still be \ + loading, or needs launching)." + ); + } + let shown = elements.len().min(MAX_SNAPSHOT); + let mut out = format!( + "App '{app}' elements (filter '{filter}', showing {shown} of {}):\n", + elements.len() + ); + for e in elements.iter().take(MAX_SNAPSHOT) { + // NB: we don't annotate `enabled` here — AXEnabled is unreliable + // per-app (Apple Music marks pressable rows disabled), so surfacing it + // would mislead the model into avoiding real controls. 
+ out.push_str(&format!(" [{}] {}\n", e.role, e.label)); + } + out +} + +/// Parse one action from raw model text, tolerating code fences and surrounding +/// prose by extracting the span from the first `{` to the last `}`. Returns `Err` so the +/// caller can issue a single repair retry before giving up — we never *act* on +/// an unparseable guess (tracker §1.13 hallucination lesson). +fn parse_action(raw: &str) -> Result<Action, String> { + let trimmed = raw.trim(); + if let Ok(a) = serde_json::from_str::<Action>(trimmed) { + return Ok(a); + } + // Extract the outermost {...} span (first '{' to last '}') and retry. + if let (Some(start), Some(end)) = (trimmed.find('{'), trimmed.rfind('}')) { + if end > start { + if let Ok(a) = serde_json::from_str::<Action>(&trimmed[start..=end]) { + return Ok(a); + } + } + } + Err(format!( + "could not parse an action from model output: {trimmed:?}" + )) +} + +/// Run the automation loop until the goal is met, it fails, or the step budget +/// is exhausted. +pub async fn run( + app: &str, + goal: &str, + backend: &dyn AutomateBackend, + opts: AutomateOptions, +) -> AutomateOutcome { + log::info!( + "{LOG_PREFIX} ▶ run app={app:?} goal={goal:?} budget={}", + opts.step_budget + ); + + // Foreground the target app FIRST, always. This guarantees the app is + // frontmost before we perceive or act — so AX reads the right window and any + // synthetic input (keyboard/mouse) lands on it, not on OpenHuman's own + // window (which is what crashed CEF in §1.8). `act_launch` is `open -a`, + // which both opens and activates; idempotent if already running. + match backend.act_launch(app).await { + Ok(m) => log::info!("{LOG_PREFIX} foregrounded: {m}"), + Err(e) => log::warn!("{LOG_PREFIX} foreground failed for {app:?}: {e}"), + } + backend.settle(app).await; + + // Deterministic accelerator: if a known app + intent has a proven native + // sequence, run it first.
On `None` (no fast-path) or a failed fast-path we + // fall through to the general model-driven loop — so the fast-path can only + // help, never block. (Structurally different from the removed `play_music` + // tool, §1.13: this is internal to `automate`, not a tool the LLM selects.) + if let Some(outcome) = super::app_fastpaths::try_fastpath(app, goal, backend).await { + if outcome.success { + log::info!("{LOG_PREFIX} fast-path succeeded for app={app:?}"); + return outcome; + } + log::info!("{LOG_PREFIX} fast-path did not complete; falling through to general loop"); + } + + let system = system_prompt(); + let mut steps: Vec<String> = Vec::new(); + let mut last_filter = String::new(); + // One repair retry budget for unparseable model output. + let mut repair_left = 1u32; + // No-progress guard: track the last actionable signature so a model that + // keeps issuing the same call (e.g. pressing 'Search' over and over) bails + // instead of burning the whole step budget. + let mut last_sig = String::new(); + let mut repeat_count = 0u32; + + for step in 0..opts.step_budget { + // ── perceive ── + let snapshot = match backend.perceive(app, &last_filter).await { + Ok(els) => render_snapshot(app, &last_filter, &els), + Err(e) => { + log::warn!("{LOG_PREFIX} perceive failed: {e}"); + format!("(perceive error: {e})") + } + }; + + // ── decide ── + let user = format!( + "Goal: {goal}\nApp: {app}\n\nCurrent screen:\n{snapshot}\n\nSteps so far:\n{}\n\n\ + Reply with the next single JSON action.", + if steps.is_empty() { + " (none yet)".to_string() + } else { + steps + .iter() + .map(|s| format!(" - {s}")) + .collect::<Vec<_>>() + .join("\n") + } + ); + let raw = match backend.decide(&system, &user).await { + Ok(t) => t, + Err(e) => { + log::warn!("{LOG_PREFIX} decide failed: {e}"); + return AutomateOutcome::fail(format!("decision model error: {e}"), steps); + } + }; + + let action = match parse_action(&raw) { + Ok(a) => a, + Err(e) => { + if repair_left > 0 { + repair_left -= 1; +
log::warn!("{LOG_PREFIX} step={step} unparseable action, retrying: {e}"); + steps.push("(model produced unparseable output; retried)".to_string()); + continue; + } + return AutomateOutcome::fail(format!("model output unparseable: {e}"), steps); + } + }; + + let target_app = action + .app + .as_deref() + .filter(|s| !s.is_empty()) + .unwrap_or(app); + log::info!( + "{LOG_PREFIX} step={step} action={:?} app={target_app:?} label={:?} filter={:?}", + action.action, + action.label, + action.filter + ); + + // ── no-progress guard ── + if !matches!(action.action.as_str(), "done" | "fail") { + let sig = format!("{}|{}|{}", action.action, action.label, action.filter); + if sig == last_sig { + repeat_count += 1; + } else { + repeat_count = 0; + last_sig = sig; + } + // initial + 2 repeats = 3 identical actions in a row. + if repeat_count >= 2 { + log::warn!("{LOG_PREFIX} no progress: action repeated 3× ({last_sig}); aborting"); + steps.push(format!( + "aborted: repeated '{}' 3× with no progress", + action.action + )); + return AutomateOutcome::fail( + "Got stuck repeating the same action with no progress.", + steps, + ); + } + } + + // ── act ── + match action.action.as_str() { + "done" => { + let summary = if action.summary.is_empty() { + "Goal completed.".to_string() + } else { + action.summary.clone() + }; + log::info!("{LOG_PREFIX} ✓ done: {summary}"); + progress(&summary, OverlayAttentionTone::Success); + return AutomateOutcome { + success: true, + summary, + steps, + }; + } + "fail" => { + let summary = if action.summary.is_empty() { + "Goal could not be completed.".to_string() + } else { + action.summary.clone() + }; + log::info!("{LOG_PREFIX} ✗ model gave up: {summary}"); + progress(&summary, OverlayAttentionTone::Neutral); + return AutomateOutcome::fail(summary, steps); + } + "list" => { + last_filter = action.filter.clone(); + steps.push(format!("list filter={:?}", last_filter)); + } + "launch" => { + progress( + format!("Opening {target_app}…"), + 
OverlayAttentionTone::Accent, + ); + match backend.act_launch(target_app).await { + Ok(msg) => steps.push(format!("launch: {msg}")), + Err(e) => steps.push(format!("launch FAILED: {e}")), + } + backend.settle(target_app).await; + } + "press" => { + if action.label.trim().is_empty() { + steps.push("press skipped: empty label".to_string()); + continue; + } + progress( + format!("Pressing {}…", action.label), + OverlayAttentionTone::Accent, + ); + match backend.act_press(target_app, &action.label).await { + Ok(msg) => steps.push(format!("press: {msg}")), + Err(e) => steps.push(format!("press FAILED: {e}")), + } + backend.settle(target_app).await; + } + "set_value" => { + if action.value.is_empty() { + steps.push("set_value skipped: empty value".to_string()); + continue; + } + progress("Typing…", OverlayAttentionTone::Accent); + match backend + .act_set_value(target_app, &action.label, &action.value) + .await + { + Ok(msg) => steps.push(format!("set_value: {msg}")), + Err(e) => steps.push(format!("set_value FAILED: {e}")), + } + backend.settle(target_app).await; + } + other => { + steps.push(format!("unknown action {other:?} ignored")); + } + } + } + + log::info!("{LOG_PREFIX} step budget ({}) exhausted", opts.step_budget); + AutomateOutcome::fail( + format!( + "Step budget ({}) exhausted before the goal was confirmed complete.", + opts.step_budget + ), + steps, + ) +} + +/// Production backend: real AX primitives + a fast LLM for decisions. +pub struct RealBackend { + config: crate::openhuman::config::Config, +} + +impl RealBackend { + pub fn new(config: crate::openhuman::config::Config) -> Self { + Self { config } + } +} + +#[async_trait] +impl AutomateBackend for RealBackend { + async fn perceive(&self, app: &str, filter: &str) -> Result<Vec<ax::AXElement>, String> { + ax::ax_list_elements_filtered(app, filter) + } + + async fn decide(&self, system: &str, user: &str) -> Result<String, String> { + // Fast tier: the `memory` role maps to `memory_provider` — a cheap, + // quick model class.
A dedicated `automation` provider knob is a + // follow-up (see plan §5); routing through `memory` keeps M1 free of + // Config-schema churn while still keeping the chat model out of the loop. + let (provider, model) = + crate::openhuman::inference::provider::create_chat_provider("memory", &self.config) + .map_err(|e| format!("fast-model provider unavailable: {e}"))?; + provider + .chat_with_system(Some(system), user, &model, 0.0) + .await + .map_err(|e| format!("fast-model call failed: {e}")) + } + + async fn act_launch(&self, app: &str) -> Result<String, String> { + crate::openhuman::tools::implementations::system::launch_platform(app).await + } + + async fn act_press(&self, app: &str, label: &str) -> Result<String, String> { + ax::ax_press_element(app, label) + } + + async fn act_set_value(&self, app: &str, label: &str, value: &str) -> Result<String, String> { + ax::ax_set_field_value(app, label, value) + } + + async fn open_url(&self, url: &str) -> Result<String, String> { + // Cross-platform URI opener. macOS `open`, Linux `xdg-open`, Windows + // `cmd /C start`. Only invoked by fast-paths with app-controlled URLs + // (never user free-text), so there's no untrusted-URL surface here. + #[cfg(target_os = "macos")] + let mut cmd = { + let mut c = tokio::process::Command::new("open"); + c.arg(url); + c + }; + #[cfg(target_os = "linux")] + let mut cmd = { + let mut c = tokio::process::Command::new("xdg-open"); + c.arg(url); + c + }; + #[cfg(target_os = "windows")] + let mut cmd = { + let mut c = tokio::process::Command::new("cmd"); + c.args(["/C", "start", "", url]); + c + }; + match cmd.output().await { + Ok(o) if o.status.success() => Ok(format!("Opened {url}")), + Ok(o) => Err(format!( + "opener exited {}: {}", + o.status, + String::from_utf8_lossy(&o.stderr).trim() + )), + Err(e) => Err(format!("failed to launch opener: {e}")), + } + } + + async fn verify_playing(&self) -> Option<bool> { + // macOS: ask Apple Music for ground-truth player state. Other OSes can't + // verify this way → None (fast-path treats None as best-effort).
+ #[cfg(target_os = "macos")] + { + let out = tokio::process::Command::new("osascript") + .args(["-e", "tell application \"Music\" to player state as string"]) + .output() + .await + .ok()?; + let state = String::from_utf8_lossy(&out.stdout).trim().to_lowercase(); + Some(state == "playing") + } + #[cfg(not(target_os = "macos"))] + { + None + } + } + + async fn settle(&self, app: &str) { + // M2: poll the element count until the UI stops changing (≤2s), instead + // of a blind fixed wait. Removes the timing-race class (tracker §1.11/ + // §1.13) — the next perceive sees a settled tree. `ax_wait_settled` is + // blocking (synchronous helper IPC), so run it off the async runtime. + let app = app.to_string(); + let _ = tokio::task::spawn_blocking(move || { + ax::ax_wait_settled(&app, 240, 2000); + }) + .await; + } +} + +#[cfg(test)] +#[path = "automate_tests.rs"] +mod tests; diff --git a/src/openhuman/accessibility/automate_tests.rs b/src/openhuman/accessibility/automate_tests.rs new file mode 100644 index 0000000000..6b169e98b7 --- /dev/null +++ b/src/openhuman/accessibility/automate_tests.rs @@ -0,0 +1,266 @@ +//! Unit tests for the `automate` loop. A scripted [`AutomateBackend`] feeds +//! canned model responses and records every action, so the loop is exercised +//! with no mic, no AX tree, and no LLM. + +use super::*; +use std::sync::Mutex; + +/// Scripted backend: `decide` returns the next queued response each call; +/// perceive/act are stubbed and recorded. +struct ScriptedBackend { + /// Queued raw model outputs, consumed in order. + responses: Mutex<std::collections::VecDeque<String>>, + /// Elements every `perceive` returns. + elements: Vec<ax::AXElement>, + /// Record of act calls, for assertions. + acts: Mutex<Vec<String>>, + /// Force act_press to error (to exercise the failure-recording path).
+ press_errors: bool, +} + +impl ScriptedBackend { + fn new(responses: &[&str]) -> Self { + Self { + responses: Mutex::new(responses.iter().map(|s| s.to_string()).collect()), + elements: vec![ + ax::AXElement::new("AXButton", "Play"), + ax::AXElement::new("AXTextField", "Search"), + ], + acts: Mutex::new(Vec::new()), + press_errors: false, + } + } + fn acts(&self) -> Vec<String> { + self.acts.lock().unwrap().clone() + } +} + +#[async_trait] +impl AutomateBackend for ScriptedBackend { + async fn perceive(&self, _app: &str, _filter: &str) -> Result<Vec<ax::AXElement>, String> { + Ok(self.elements.clone()) + } + async fn decide(&self, _system: &str, _user: &str) -> Result<String, String> { + Ok(self + .responses + .lock() + .unwrap() + .pop_front() + // When the script runs dry, keep listing so the budget guard is what + // ends the run (rather than a decide error). + .unwrap_or_else(|| r#"{"action":"list","filter":""}"#.to_string())) + } + async fn act_launch(&self, app: &str) -> Result<String, String> { + self.acts.lock().unwrap().push(format!("launch:{app}")); + Ok(format!("Opened '{app}'.")) + } + async fn act_press(&self, app: &str, label: &str) -> Result<String, String> { + self.acts + .lock() + .unwrap() + .push(format!("press:{app}:{label}")); + if self.press_errors { + return Err("no such element".into()); + } + Ok(format!("Pressed '{label}' in '{app}'.")) + } + async fn act_set_value(&self, app: &str, label: &str, value: &str) -> Result<String, String> { + self.acts + .lock() + .unwrap() + .push(format!("set_value:{app}:{label}={value}")); + Ok(format!("Set '{label}' in '{app}'.")) + } + async fn open_url(&self, url: &str) -> Result<String, String> { + self.acts.lock().unwrap().push(format!("open_url:{url}")); + Ok(format!("Opened {url}")) + } + async fn settle(&self, _app: &str) {} + async fn wait(&self, _ms: u64) {} +} + +fn opts(budget: u32) -> AutomateOptions { + AutomateOptions { + step_budget: budget, + } +} + +#[tokio::test] +async fn happy_path_launch_list_press_done() { + // Use a non-fast-path app/goal so the GENERAL loop is what runs.
+ // run() foregrounds (launch) the app first, so the model needn't. + let backend = ScriptedBackend::new(&[ + r#"{"action":"list","filter":"Play"}"#, + r#"{"action":"press","label":"Play"}"#, + r#"{"action":"done","summary":"Playing."}"#, + ]); + let out = run("Notes", "do a thing", &backend, opts(8)).await; + assert!(out.success, "expected success, got {out:?}"); + assert_eq!(out.summary, "Playing."); + let acts = backend.acts(); + // Leading launch is the foreground-first guarantee. + assert_eq!(acts, vec!["launch:Notes", "press:Notes:Play"]); +} + +#[tokio::test] +async fn navigate_then_activate_sequence() { + // Press the row (navigates), then press the detail Play, then done. + // Non-fast-path app so this exercises the general loop's two-press flow. + let backend = ScriptedBackend::new(&[ + r#"{"action":"press","label":"Highway to Hell"}"#, + r#"{"action":"press","label":"Play"}"#, + r#"{"action":"done","summary":"ok"}"#, + ]); + let out = run("Photos", "open the top album", &backend, opts(8)).await; + assert!(out.success); + assert_eq!( + backend.acts(), + vec![ + "launch:Photos", // foreground-first + "press:Photos:Highway to Hell", + "press:Photos:Play" + ] + ); +} + +#[tokio::test] +async fn set_value_routes_app_override() { + let backend = ScriptedBackend::new(&[ + r#"{"action":"set_value","app":"Slack","label":"message","value":"hi"}"#, + r#"{"action":"done"}"#, + ]); + let out = run("Slack", "message Steven hi", &backend, opts(5)).await; + assert!(out.success); + assert_eq!( + backend.acts(), + vec!["launch:Slack", "set_value:Slack:message=hi"] // foreground-first + ); +} + +#[tokio::test] +async fn budget_exhaustion_fails() { + // Script always lists → never done → budget guard ends the run. 
+ let backend = ScriptedBackend::new(&[r#"{"action":"list","filter":"x"}"#]); + let out = run("Music", "never finishes", &backend, opts(3)).await; + assert!(!out.success); + assert!(out.summary.contains("budget"), "got: {}", out.summary); +} + +#[tokio::test] +async fn no_progress_guard_aborts_repeated_action() { + // Model keeps pressing the same control (the live "Search ×11" pathology). + let backend = ScriptedBackend::new(&[ + r#"{"action":"press","label":"Search"}"#, + r#"{"action":"press","label":"Search"}"#, + r#"{"action":"press","label":"Search"}"#, + r#"{"action":"press","label":"Search"}"#, + ]); + let out = run("Photos", "do something", &backend, opts(10)).await; + assert!(!out.success); + assert!( + out.summary.contains("stuck repeating"), + "got: {}", + out.summary + ); + // foreground launch, then acted twice; the 3rd identical action aborts. + assert_eq!( + backend.acts(), + vec![ + "launch:Photos", + "press:Photos:Search", + "press:Photos:Search" + ] + ); +} + +#[tokio::test] +async fn one_repair_retry_then_succeeds() { + let backend = ScriptedBackend::new(&[ + "garbage not json", + r#"{"action":"done","summary":"recovered"}"#, + ]); + let out = run("Music", "g", &backend, opts(5)).await; + assert!(out.success, "should recover after one repair: {out:?}"); + assert_eq!(out.summary, "recovered"); +} + +#[tokio::test] +async fn two_unparseable_outputs_fail() { + let backend = ScriptedBackend::new(&["garbage one", "garbage two"]); + let out = run("Music", "g", &backend, opts(5)).await; + assert!(!out.success); + assert!(out.summary.contains("unparseable"), "got: {}", out.summary); +} + +#[tokio::test] +async fn explicit_fail_action_propagates() { + let backend = ScriptedBackend::new(&[r#"{"action":"fail","summary":"app not installed"}"#]); + let out = run("Music", "x", &backend, opts(5)).await; + assert!(!out.success); + assert_eq!(out.summary, "app not installed"); +} + +#[tokio::test] +async fn press_failure_is_recorded_not_fatal() { + let mut 
backend = ScriptedBackend::new(&[ + r#"{"action":"press","label":"Play"}"#, + r#"{"action":"done","summary":"tried"}"#, + ]); + backend.press_errors = true; + let out = run("Music", "x", &backend, opts(5)).await; + assert!(out.success); // the run continues; the press failure is just logged + assert!( + out.steps.iter().any(|s| s.contains("press FAILED")), + "steps: {:?}", + out.steps + ); +} + +#[test] +fn parse_action_plain_json() { + let a = parse_action(r#"{"action":"press","label":"Play"}"#).unwrap(); + assert_eq!(a.action, "press"); + assert_eq!(a.label, "Play"); +} + +#[test] +fn parse_action_strips_code_fence_and_prose() { + let raw = "Sure!\n```json\n{\"action\":\"done\",\"summary\":\"ok\"}\n```\n"; + let a = parse_action(raw).unwrap(); + assert_eq!(a.action, "done"); + assert_eq!(a.summary, "ok"); +} + +#[test] +fn parse_action_rejects_garbage() { + assert!(parse_action("not json at all").is_err()); + assert!(parse_action("").is_err()); +} + +#[test] +fn render_snapshot_caps_and_labels() { + let many: Vec<ax::AXElement> = (0..100) + .map(|i| ax::AXElement::new("AXButton", format!("btn{i}"))) + .collect(); + let s = render_snapshot("Music", "btn", &many); + assert!(s.contains("showing 40 of 100")); + assert!(s.contains("btn0")); + assert!(!s.contains("btn50"), "should be capped at 40"); +} + +#[test] +fn render_snapshot_does_not_annotate_enabled() { + // AXEnabled is unreliable per-app, so the snapshot must not surface it + // (would mislead the model into avoiding pressable controls).
+ let mut disabled = ax::AXElement::new("AXButton", "Play"); + disabled.enabled = Some(false); + let s = render_snapshot("Music", "", &[disabled]); + assert!(!s.contains("disabled"), "got: {s}"); + assert!(s.contains("[AXButton] Play")); +} + +#[test] +fn render_snapshot_empty_hint() { + let s = render_snapshot("Music", "zzz", &[]); + assert!(s.contains("no elements")); +} diff --git a/src/openhuman/accessibility/ax_interact.rs b/src/openhuman/accessibility/ax_interact.rs index dda9724e05..cb3ad21bb0 100644 --- a/src/openhuman/accessibility/ax_interact.rs +++ b/src/openhuman/accessibility/ax_interact.rs @@ -21,10 +21,68 @@ mod tests; #[path = "uia_interact_tests.rs"] mod uia_tests; -#[derive(Debug, Clone, Deserialize)] +// Portable (non-OS-gated) unit tests for the pure settle core. The sibling +// `ax_interact_tests.rs` is macOS-only + #[ignore] (needs a live app); these +// run everywhere so the settle logic stays covered in CI. +#[cfg(test)] +mod settle_tests { + use super::counts_settled; + + #[test] + fn not_settled_until_enough_samples() { + assert!(!counts_settled(&[5], 3)); + assert!(!counts_settled(&[5, 5], 3)); + } + + #[test] + fn settled_when_tail_is_constant() { + assert!(counts_settled(&[1, 4, 7, 7, 7], 3)); + } + + #[test] + fn not_settled_when_still_changing() { + assert!(!counts_settled(&[7, 7, 8], 3)); + assert!(!counts_settled(&[2, 4, 6], 3)); + } + + #[test] + fn zero_or_one_required_settles_immediately() { + assert!(counts_settled(&[9], 1)); + assert!(counts_settled(&[9], 0)); + } + + #[test] + fn only_the_tail_matters() { + // Early churn doesn't matter once the last `need` samples agree. + assert!(counts_settled(&[0, 99, 3, 3], 2)); + } +} + +#[derive(Debug, Clone, Default, Deserialize)] pub struct AXElement { pub role: String, pub label: String, + /// The control's reported `AXEnabled` state, when the backend supplies it. 
+    ///
+    /// **Informational only — do NOT gate pressing on this.** Empirically
+    /// unreliable per-app: Apple Music reports its search-result rows as
+    /// `Some(false)` even though `AXPress` on them works. Kept for diagnostics
+    /// and for apps that report it faithfully; matchers must not skip elements
+    /// solely because this is `Some(false)`.
+    #[serde(default)]
+    pub enabled: Option<bool>,
+}
+
+impl AXElement {
+    /// Convenience constructor (enabled unknown). Keeps call sites terse and
+    /// insulated from future optional fields.
+    pub fn new(role: impl Into<String>, label: impl Into<String>) -> Self {
+        Self {
+            role: role.into(),
+            label: label.into(),
+            enabled: None,
+        }
+    }
 }
 
 /// List interactive UI elements (buttons, text fields, checkboxes, …) in `app_name`.
@@ -112,6 +170,64 @@ pub fn ax_press_element(app_name: &str, label: &str) -> Result {
     }
 }
 
+/// Decide, from a rolling history of element counts, whether the UI has
+/// settled — i.e. the most recent `stable_samples` counts are all identical
+/// (and there are at least that many samples). Pure so it can be unit-tested
+/// without any AX backend or real clock.
+///
+/// `stable_samples == 0` or `1` means "settled as soon as we have one sample".
+pub(crate) fn counts_settled(history: &[usize], stable_samples: usize) -> bool {
+    let need = stable_samples.max(1);
+    if history.len() < need {
+        return false;
+    }
+    let tail = &history[history.len() - need..];
+    tail.iter().all(|c| *c == tail[0])
+}
+
+/// Block until `app_name`'s interactive-element count stops changing for
+/// `stable_ms`, or `timeout_ms` elapses. Returns the final observed count.
+///
+/// This is the **settle** primitive for the `automate` loop: after an action
+/// (press / type / launch) the UI is mid-render, and reading it immediately is
+/// what caused the timing-race failures (tracker §1.11/§1.13). Polling the
+/// element count until it's stable is a portable replacement for a blind fixed
+/// sleep — it works on both backends because it rides on `ax_list_elements`,
+/// which already cfg-dispatches (macOS AX / Windows UIA).
+///
+/// Blocking (uses `std::thread::sleep` + synchronous helper IPC); async callers
+/// should run it via `spawn_blocking`. An AXObserver-driven settle is a later
+/// optimization that can sit behind this same signature.
+pub fn ax_wait_settled(app_name: &str, stable_ms: u64, timeout_ms: u64) -> usize {
+    use std::time::{Duration, Instant};
+    // Sample roughly every `poll_ms`; declare settled once the count has held
+    // for ceil(stable_ms / poll_ms) consecutive samples.
+    let poll_ms = 80u64;
+    let stable_samples = (stable_ms.div_ceil(poll_ms)).max(2) as usize;
+    let deadline = Instant::now() + Duration::from_millis(timeout_ms);
+    let mut history: Vec<usize> = Vec::new();
+
+    loop {
+        let count = ax_list_elements(app_name).map(|v| v.len()).unwrap_or(0);
+        history.push(count);
+        if counts_settled(&history, stable_samples) {
+            log::debug!(
+                "[ax_interact] settle: '{app_name}' stable at {count} elements after {} samples",
+                history.len()
+            );
+            return count;
+        }
+        if Instant::now() >= deadline {
+            log::debug!(
+                "[ax_interact] settle: '{app_name}' timed out after {} samples (last count={count})",
+                history.len()
+            );
+            return count;
+        }
+        std::thread::sleep(Duration::from_millis(poll_ms));
+    }
+}
+
 /// Set the value of the first text field in `app_name` whose label contains `label`.
 /// Pass an empty `label` to target the first available text field.
 pub fn ax_set_field_value(app_name: &str, label: &str, value: &str) -> Result {
diff --git a/src/openhuman/accessibility/ax_interact_tests.rs b/src/openhuman/accessibility/ax_interact_tests.rs
index 89f57906e7..ccb123dec2 100644
--- a/src/openhuman/accessibility/ax_interact_tests.rs
+++ b/src/openhuman/accessibility/ax_interact_tests.rs
@@ -165,3 +165,24 @@ fn test_ax_press_nonexistent_app() {
     let result = ax_press_element("NonExistentApp12345", "Play");
     assert!(result.is_err());
 }
+
+/// Env-driven AX dump probe: `AX_PROBE_APP="Slack" cargo test ax_probe_app -- --ignored --nocapture`.
+/// Lists interactive elements an app exposes via the macOS Accessibility API —
+/// used to diagnose Electron apps (Slack/Discord) whose tree may be empty
+/// unless accessibility is enabled.
+#[test]
+#[ignore = "manual AX probe — set AX_PROBE_APP"]
+fn ax_probe_app() {
+    let app = std::env::var("AX_PROBE_APP").unwrap_or_else(|_| "Slack".to_string());
+    let _ = Command::new("open").arg("-a").arg(&app).status();
+    sleep(Duration::from_secs(4));
+    match ax_list_elements(&app) {
+        Ok(els) => {
+            println!("[ax_probe] {app}: {} interactive elements", els.len());
+            for e in els.iter().take(80) {
+                println!("  [{}] {}", e.role, e.label);
+            }
+        }
+        Err(e) => println!("[ax_probe] {app}: ERROR {e}"),
+    }
+}
diff --git a/src/openhuman/accessibility/helper.rs b/src/openhuman/accessibility/helper.rs
index c271915aeb..97c9c1fbd7 100644
--- a/src/openhuman/accessibility/helper.rs
+++ b/src/openhuman/accessibility/helper.rs
@@ -693,16 +693,27 @@ func axListElements(appName: String, id: String?) -> [String: Any] {
         "AXCheckBox", "AXRadioButton", "AXSlider", "AXPopUpButton", "AXComboBox", "AXLink", "AXTab"
     ]
-    var elements: [[String: String]] = []
-    axWalk(axApp, maxDepth: 10) { _, role, label in
+    var elements: [[String: Any]] = []
+    axWalk(axApp, maxDepth: 10) { el, role, label in
         if interactiveRoles.contains(role) && !label.isEmpty {
-            elements.append(["role": role, "label": label])
+            elements.append(["role": role, "label": label, "enabled": axEnabled(el)])
         }
         return false
     }
     return ["type": "ax_list", "id": id ?? "", "ok": true, "error": NSNull(), "elements": elements]
 }
 
+/// Read the AXEnabled attribute; default to `true` when the attribute is absent
+/// (most static/text elements don't expose it, and we don't want to hide them).
+func axEnabled(_ element: AXUIElement) -> Bool {
+    var ref: AnyObject?
+    if AXUIElementCopyAttributeValue(element, kAXEnabledAttribute as CFString, &ref) == .success,
+        let b = ref as? Bool {
+        return b
+    }
+    return true
+}
+
 /// Collect all AX elements whose label contains `label` (case-insensitive).
 /// Returns matches sorted exact-first so "Play" beats "Playlist".
 struct AXCandidate {
diff --git a/src/openhuman/accessibility/mod.rs b/src/openhuman/accessibility/mod.rs
index ccd1dd1841..d90a7cb40c 100644
--- a/src/openhuman/accessibility/mod.rs
+++ b/src/openhuman/accessibility/mod.rs
@@ -5,6 +5,8 @@
 //! Consumer modules (autocomplete, screen_intelligence, voice) call into this module
 //! instead of owning platform-specific code directly.
+pub mod app_fastpaths;
+pub mod automate;
 mod automation_state;
 pub mod ax_interact;
 mod capture;
diff --git a/src/openhuman/accessibility/uia_interact.rs b/src/openhuman/accessibility/uia_interact.rs
index 4ec1060233..cfce495fb9 100644
--- a/src/openhuman/accessibility/uia_interact.rs
+++ b/src/openhuman/accessibility/uia_interact.rs
@@ -217,6 +217,9 @@ pub fn list(app_name: &str, filter: &str) -> Result<Vec<AXElement>, String> {
         out.push(AXElement {
             role: format!("{ct:?}"),
             label,
+            // TODO(windows): populate from UIA `IsEnabled` once verified on a
+            // Windows box; `None` = "assume enabled" (current behaviour).
+            enabled: None,
         });
     }
 
diff --git a/src/openhuman/tools/impl/system/launch_app.rs b/src/openhuman/tools/impl/system/launch_app.rs
index c423a34449..7766c2f833 100644
--- a/src/openhuman/tools/impl/system/launch_app.rs
+++ b/src/openhuman/tools/impl/system/launch_app.rs
@@ -176,7 +176,11 @@ impl Tool for LaunchAppTool {
 }
 
 /// Platform-specific launch dispatch. Returns a human-readable success message.
-async fn launch_platform(app_name: &str) -> Result {
+///
+/// `pub(crate)` so the `automate` inner loop (`accessibility::automate`) can
+/// launch an app as one of its steps without duplicating the platform branches
+/// or routing back through the full tool surface.
+pub(crate) async fn launch_platform(app_name: &str) -> Result {
     log::info!(
         "[launch_app] platform={} dispatching launch for app_name={app_name:?}",
         std::env::consts::OS
diff --git a/src/openhuman/tools/impl/system/mod.rs b/src/openhuman/tools/impl/system/mod.rs
index 118a567b30..a532e1f713 100644
--- a/src/openhuman/tools/impl/system/mod.rs
+++ b/src/openhuman/tools/impl/system/mod.rs
@@ -20,6 +20,8 @@ pub use detect_tools::DetectToolsTool;
 pub use insert_sql_record::InsertSqlRecordTool;
 pub use install_tool::InstallToolTool;
 pub use launch_app::LaunchAppTool;
+// Reused by the `automate` inner loop to launch an app mid-flow.
+pub(crate) use launch_app::launch_platform;
 pub use lsp::{lsp_capability_enabled, LspTool, LSP_ENABLED_ENV};
 pub use node_exec::NodeExecTool;
 pub use npm_exec::NpmExecTool;
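
Review note on the settle change in `ax_interact.rs`: the diff splits settling into a pure predicate (`counts_settled`) and a blocking poll loop (`ax_wait_settled`), and only the predicate is covered by the portable tests. A minimal sketch of how the loop shape itself could be exercised without a live app, assuming a hypothetical `wait_settled_with` helper that takes the sampler as a closure and a poll budget in place of the wall-clock deadline. Neither `wait_settled_with`, `scripted`, nor `max_polls` appears in the diff; `counts_settled` is copied from it unchanged.

```rust
/// Copied from the diff: true once the last `stable_samples` counts agree.
fn counts_settled(history: &[usize], stable_samples: usize) -> bool {
    let need = stable_samples.max(1);
    if history.len() < need {
        return false;
    }
    let tail = &history[history.len() - need..];
    tail.iter().all(|c| *c == tail[0])
}

/// Hypothetical test seam (not in the diff): same loop shape as
/// `ax_wait_settled`, but the element count comes from `sample` and the
/// deadline is a poll budget rather than a clock, so nothing sleeps.
/// Returns the last observed count and whether it actually settled.
fn wait_settled_with(
    mut sample: impl FnMut() -> usize,
    stable_samples: usize,
    max_polls: usize,
) -> (usize, bool) {
    let mut history: Vec<usize> = Vec::new();
    loop {
        let count = sample();
        history.push(count);
        if counts_settled(&history, stable_samples) {
            return (count, true);
        }
        if history.len() >= max_polls {
            // Budget exhausted: mirrors the timeout branch of the real loop.
            return (count, false);
        }
    }
}

/// Hypothetical scripted sampler: replays `counts`, then repeats the final
/// value. `counts` must be non-empty.
fn scripted(counts: Vec<usize>) -> impl FnMut() -> usize {
    let mut i = 0;
    move || {
        let c = counts[i.min(counts.len() - 1)];
        i += 1;
        c
    }
}
```

In the real function the closure would correspond to `ax_list_elements(app_name).map(|v| v.len()).unwrap_or(0)`. Returning a settled flag alongside the count is a design choice of this sketch only; `ax_wait_settled` as written returns the count in both the settled and timed-out cases.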