Commit 32e2b24

bobleer and bowen628 authored
feat(computer-use): update settings (#263)
Co-authored-by: bowen628 <bowen628@noreply.gitcode.com>
1 parent 4a8a8af commit 32e2b24

8 files changed: +239 -30 lines changed

src/crates/core/src/agentic/agents/prompts/claw_mode.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ When the `ComputerUse` tool is available, you may capture the screen and use mou
 - **Strict rule — no blind Enter, no blind click:** Before **`click`**, you **must** have a **fine** screenshot after the pointer is aligned: **`quadrant_navigation_click_ready`: true** (preferred: **`screenshot` + `screenshot_navigate_quadrant`** each step until the tool JSON says so) **or** a **point-crop `screenshot`** (~500×500 via `screenshot_crop_center_*`) when the exceptions above apply. A **full-screen-only** frame alone does **not** authorize **`click`**. Before **`key_chord` that includes Return or Enter**, you **must** call **`screenshot` first** and **visually confirm** focus and target. The only exception is when the user explicitly asks for an unverified / blind step.
 - For sending messages, payments, destructive actions, or anything sensitive, state the exact steps first and obtain clear user confirmation in chat before executing.
 - If Computer use is disabled or OS permissions are missing, tell the user what to enable in BitFun settings / system privacy instead of claiming success.
-- Screenshot results require the session primary model to use Anthropic API format so the image is attached to the tool result for vision. The JPEG matches **native display resolution** (no downscale): `coordinate_mode` `"image"` uses the same pixel grid as the bitmap.
+- Screenshot results require the session primary model to use Anthropic or OpenAI-compatible API format so the image is attached to the tool result for vision. The JPEG matches **native display resolution** (no downscale): `coordinate_mode` `"image"` uses the same pixel grid as the bitmap.
 - **Host-enforced screenshot (two cases):** The desktop host **rejects `click`** until the last `screenshot` after the last pointer move is a **valid fine basis**: **`quadrant_navigation_click_ready`: true** (quadrant drill until the region’s longest side is below the host threshold) **or** a **fresh point-crop** (`screenshot_crop_center_*`, ~500×500). **Full-screen-only** is **not** enough. It **rejects `key_chord` that includes Return or Enter** until a **fresh `screenshot`** since the last pointer move or click. **`mouse_move`** may use **`coordinate_mode` `\"image\"`** on any prior **`screenshot`**. Still **prefer `key_chord`** when it matches the step.
 - **Rulers vs zoom:** Full-frame JPEGs have **margin rulers** and a **grid** — use them to orient. For small controls, **default to quadrant drill** (`screenshot_navigate_quadrant` on each `screenshot` step); use **point crop** only as a **secondary** option (see default path above). Each quadrant step **adds padding on every side** (clamped) so controls on split lines stay in the JPEG. **Do not** rely only on huge full-display images when a smaller view answers the question.
 - **Click guard:** The host **rejects `click`** if there was **`mouse_move` / `pointer_nudge` / `pointer_move_rel` or a previous `click`** since the last `screenshot`, or if the last `screenshot` was **full-screen only** without **`quadrant_navigation_click_ready`**. **`screenshot`** before **Return/Enter** in **`key_chord`** when the outcome matters.
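
The click and Enter rules in the prompt bullets can be read as a small state machine. The sketch below is one possible reading, not the host's actual implementation, which this diff does not show. `Basis`, `ClickGuard`, and all method names are hypothetical, and it assumes that a session with no screenshot yet allows neither `click` nor Return/Enter.

```rust
/// Kind of frame the most recent `screenshot` produced.
/// Hypothetical names: the host's real types are not part of this diff.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Basis {
    FullScreen,
    QuadrantCoarse,     // quadrant drill that is not click-ready yet
    QuadrantClickReady, // tool JSON reported `quadrant_navigation_click_ready: true`
    PointCrop,          // ~500x500 crop via `screenshot_crop_center_*`
}

#[derive(Default)]
struct ClickGuard {
    /// `Some(basis)` while the last screenshot is still fresh;
    /// `None` once the pointer has moved or a click has happened.
    fresh: Option<Basis>,
}

impl ClickGuard {
    fn on_screenshot(&mut self, basis: Basis) {
        self.fresh = Some(basis);
    }

    /// Covers `mouse_move`, `pointer_nudge`, and `pointer_move_rel`.
    fn on_pointer_move(&mut self) {
        self.fresh = None;
    }

    /// A click needs a fresh *fine* frame; a successful click then
    /// invalidates that frame, so a second click needs a new screenshot.
    fn try_click(&mut self) -> bool {
        let ok = matches!(
            self.fresh,
            Some(Basis::QuadrantClickReady | Basis::PointCrop)
        );
        if ok {
            self.fresh = None;
        }
        ok
    }

    /// Return/Enter in a `key_chord` only needs *some* fresh screenshot.
    fn enter_allowed(&self) -> bool {
        self.fresh.is_some()
    }
}
```

Under this reading a full-screen frame is enough to press Enter but never enough to click, which matches the "Full-screen-only is not enough" wording.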

src/crates/core/src/agentic/tools/implementations/computer_use_tool.rs

Lines changed: 10 additions & 4 deletions
@@ -36,12 +36,18 @@ impl ComputerUseTool {
             .to_lowercase()
     }
 
-    fn require_anthropic_for_screenshot(ctx: &ToolUseContext) -> BitFunResult<()> {
-        if Self::primary_api_format(ctx) == "anthropic" {
+    /// Screenshot tool results attach JPEGs via `tool_image_attachments`; only providers whose
+    /// request converters emit multimodal tool output are supported (Anthropic + OpenAI-compatible).
+    fn require_multimodal_tool_output_for_screenshot(ctx: &ToolUseContext) -> BitFunResult<()> {
+        let f = Self::primary_api_format(ctx);
+        if matches!(
+            f.as_str(),
+            "anthropic" | "openai" | "response" | "responses"
+        ) {
             return Ok(());
         }
         Err(BitFunError::tool(
-            "Screenshot results include images in tool results; set the primary model to an Anthropic (Claude) API format. Other providers are not supported for screenshots yet.".to_string(),
+            "Screenshot results include images in tool results; set the primary model to Anthropic (Claude) or OpenAI-compatible API format. Other providers are not supported for screenshots yet.".to_string(),
         ))
     }
 
@@ -598,7 +604,7 @@ Each **`screenshot`** JPEG: **four-side margin coordinate scales** (numbers), **
 
         match action {
             "screenshot" => {
-                Self::require_anthropic_for_screenshot(context)?;
+                Self::require_multimodal_tool_output_for_screenshot(context)?;
                 let (params, ignored_crop_for_quadrant) = Self::parse_screenshot_params(input)?;
                 let crop_for_debug = params.crop_center;
                 let nav_debug = params.navigate_quadrant.map(|q| match q {
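
The gate above boils down to an allow-list check on the lowercased API format string. A minimal standalone sketch follows. It takes a plain `&str` and returns `Result<(), String>` instead of the crate's `ToolUseContext` and `BitFunResult`, which are not reproduced here, and the function names are illustrative only.

```rust
/// True when the primary API format is one whose request converter can carry
/// images in tool results, per the allow-list introduced in this commit.
fn supports_multimodal_tool_output(api_format: &str) -> bool {
    matches!(
        api_format.to_lowercase().as_str(),
        "anthropic" | "openai" | "response" | "responses"
    )
}

/// Stand-in for the real gate: `Ok(())` for supported formats, otherwise an
/// error message in the spirit of the one in the diff (wording shortened).
fn require_multimodal_tool_output(api_format: &str) -> Result<(), String> {
    if supports_multimodal_tool_output(api_format) {
        Ok(())
    } else {
        Err(format!(
            "screenshots need an Anthropic or OpenAI-compatible API format, got `{api_format}`"
        ))
    }
}
```

Keeping the predicate separate from the error construction makes the allow-list easy to unit-test on its own.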

src/crates/core/src/infrastructure/ai/providers/openai/message_converter.rs

Lines changed: 123 additions & 4 deletions
@@ -83,9 +83,37 @@ impl OpenAIMessageConverter {
 
     fn convert_tool_message_to_responses_item(msg: Message) -> Option<Value> {
         let call_id = msg.tool_call_id?;
-        let output = msg
-            .content
-            .unwrap_or_else(|| "Tool execution completed".to_string());
+        let text = msg.content.unwrap_or_default();
+
+        // Responses API: `output` may be a string or a list of input_text / input_image / input_file
+        // (see OpenAI FunctionCallOutput schema).
+        let output: Value = if let Some(attachments) = msg.tool_image_attachments.filter(|a| !a.is_empty()) {
+            let mut parts: Vec<Value> = attachments
+                .into_iter()
+                .map(|att| {
+                    let data_url = format!("data:{};base64,{}", att.mime_type, att.data_base64);
+                    json!({
+                        "type": "input_image",
+                        "image_url": data_url
+                    })
+                })
+                .collect();
+            parts.push(json!({
+                "type": "input_text",
+                "text": if text.is_empty() {
+                    "Tool execution completed".to_string()
+                } else {
+                    text
+                }
+            }));
+            json!(parts)
+        } else {
+            json!(if text.is_empty() {
+                "Tool execution completed".to_string()
+            } else {
+                text
+            })
+        };
 
         Some(json!({
             "type": "function_call_output",
@@ -171,6 +199,44 @@ impl OpenAIMessageConverter {
     }
 
     fn convert_single_message(msg: Message) -> Value {
+        // Chat Completions: multimodal tool message (e.g. GPT-4o vision + tools) — image parts + text.
+        if msg.role == "tool" {
+            if let Some(ref attachments) = msg.tool_image_attachments {
+                if !attachments.is_empty() {
+                    let mut parts: Vec<Value> = attachments
+                        .iter()
+                        .map(|att| {
+                            let url = format!("data:{};base64,{}", att.mime_type, att.data_base64);
+                            json!({
+                                "type": "image_url",
+                                "image_url": { "url": url, "detail": "auto" }
+                            })
+                        })
+                        .collect();
+                    let text = msg.content.clone().unwrap_or_default();
+                    if text.trim().is_empty() {
+                        parts.push(json!({
+                            "type": "text",
+                            "text": "Tool execution completed"
+                        }));
+                    } else {
+                        parts.push(json!({ "type": "text", "text": text }));
+                    }
+                    let mut openai_msg = json!({
+                        "role": "tool",
+                        "content": Value::Array(parts),
+                    });
+                    if let Some(id) = msg.tool_call_id {
+                        openai_msg["tool_call_id"] = Value::String(id);
+                    }
+                    if let Some(name) = msg.name {
+                        openai_msg["name"] = Value::String(name);
+                    }
+                    return openai_msg;
+                }
+            }
+        }
+
         let mut openai_msg = json!({
             "role": msg.role,
         });
@@ -282,7 +348,7 @@ impl OpenAIMessageConverter {
 #[cfg(test)]
 mod tests {
     use super::OpenAIMessageConverter;
-    use crate::util::types::{Message, ToolCall};
+    use crate::util::types::{Message, ToolCall, ToolImageAttachment};
     use serde_json::json;
     use std::collections::HashMap;
 
@@ -354,4 +420,57 @@ mod tests {
         assert_eq!(content[0]["type"], json!("input_image"));
         assert_eq!(content[1]["type"], json!("input_text"));
     }
+
+    #[test]
+    fn converts_tool_message_with_images_to_responses_function_call_output() {
+        let messages = vec![Message {
+            role: "tool".to_string(),
+            content: Some("Screen captured".to_string()),
+            reasoning_content: None,
+            thinking_signature: None,
+            tool_calls: None,
+            tool_call_id: Some("call_cu_1".to_string()),
+            name: Some("computer_use".to_string()),
+            tool_image_attachments: Some(vec![ToolImageAttachment {
+                mime_type: "image/jpeg".to_string(),
+                data_base64: "AAA".to_string(),
+            }]),
+        }];
+
+        let (_, input) = OpenAIMessageConverter::convert_messages_to_responses_input(messages);
+        let out = &input[0];
+        assert_eq!(out["type"], json!("function_call_output"));
+        assert_eq!(out["call_id"], json!("call_cu_1"));
+        let output = out["output"].as_array().expect("multimodal output");
+        assert_eq!(output[0]["type"], json!("input_image"));
+        assert!(output[0]["image_url"]
+            .as_str()
+            .unwrap()
+            .starts_with("data:image/jpeg;base64,"));
+        assert_eq!(output[1]["type"], json!("input_text"));
+        assert_eq!(output[1]["text"], json!("Screen captured"));
+    }
+
+    #[test]
+    fn converts_tool_message_with_images_to_chat_completions_content_parts() {
+        let msg = Message {
+            role: "tool".to_string(),
+            content: Some("ok".to_string()),
+            reasoning_content: None,
+            thinking_signature: None,
+            tool_calls: None,
+            tool_call_id: Some("call_1".to_string()),
+            name: Some("computer_use".to_string()),
+            tool_image_attachments: Some(vec![ToolImageAttachment {
+                mime_type: "image/jpeg".to_string(),
+                data_base64: "YmFi".to_string(),
+            }]),
+        };
+
+        let openai = OpenAIMessageConverter::convert_messages(vec![msg]);
+        let content = openai[0]["content"].as_array().expect("content parts");
+        assert_eq!(content[0]["type"], json!("image_url"));
+        assert_eq!(content[1]["type"], json!("text"));
+        assert_eq!(content[1]["text"], json!("ok"));
+    }
 }
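
Both converter paths in this file follow the same shape: one image part per attachment (as a `data:` URL), then a single text part that falls back to "Tool execution completed" when the tool produced no text. The sketch below restates that shape with plain Rust types so it runs without `serde_json`. `Part`, `Attachment`, `tool_result_parts`, and the `trim_before_check` flag are illustrative names, not part of the codebase. The flag models one difference visible in the diff: the Responses path tests `text.is_empty()`, while the Chat Completions path tests `text.trim().is_empty()`.

```rust
/// Simplified stand-in for the JSON parts built by the converters.
#[derive(Debug, PartialEq)]
enum Part {
    Image { data_url: String },
    Text { text: String },
}

/// Mirrors the two fields of `ToolImageAttachment` used in the diff.
struct Attachment {
    mime_type: String,
    data_base64: String,
}

const FALLBACK_TEXT: &str = "Tool execution completed";

/// Same format string as the converters: `data:<mime>;base64,<payload>`.
fn data_url(att: &Attachment) -> String {
    format!("data:{};base64,{}", att.mime_type, att.data_base64)
}

/// Images first, then exactly one text part.
/// `trim_before_check = false` models the Responses path,
/// `trim_before_check = true` models the Chat Completions path.
fn tool_result_parts(text: &str, attachments: &[Attachment], trim_before_check: bool) -> Vec<Part> {
    let mut parts: Vec<Part> = attachments
        .iter()
        .map(|att| Part::Image { data_url: data_url(att) })
        .collect();
    let is_blank = if trim_before_check {
        text.trim().is_empty()
    } else {
        text.is_empty()
    };
    let text = if is_blank { FALLBACK_TEXT.to_string() } else { text.to_string() };
    parts.push(Part::Text { text });
    parts
}
```

With a whitespace-only tool result, the two modes diverge: the trimmed check substitutes the fallback text, while the untrimmed check forwards the whitespace as-is. Whether that asymmetry is intentional is not stated in the commit.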

src/web-ui/src/infrastructure/config/components/AIFeaturesConfig.scss

Lines changed: 11 additions & 0 deletions
@@ -15,6 +15,17 @@
     gap: $size-gap-2;
   }
 
+  /** Computer use: authorized status label (Granted / 已授权) */
+  &__perm-status--granted {
+    color: var(--color-text-muted);
+  }
+
+  /** Computer use permission row: keep label short, avoid squashing action buttons (nowrap). */
+  &__row-action-btn {
+    flex-shrink: 0;
+    white-space: nowrap;
+  }
+
   &__row-control--model {
     width: 100%;
     align-items: stretch;

src/web-ui/src/infrastructure/config/components/SessionConfig.tsx

Lines changed: 81 additions & 17 deletions
@@ -520,31 +520,95 @@ const SessionConfig: React.FC = () => {
                 <span className="bitfun-func-agent-config__hint">{computerUseNote}</span>
               </ConfigPageRow>
             ) : null}
-            <ConfigPageRow label={t('computerUse.accessibility')} align="center">
-              <div className="bitfun-func-agent-config__row-control" style={{ display: 'flex', gap: 8, flexWrap: 'wrap' }}>
-                <span>{computerUseAccess ? t('computerUse.granted') : t('computerUse.notGranted')}</span>
-                <Button size="small" variant="secondary" disabled={computerUseBusy} onClick={() => void handleComputerUseRequestPermissions()}>
-                  {t('computerUse.request')}
-                </Button>
-                <Button size="small" variant="secondary" disabled={computerUseBusy} onClick={() => void handleComputerUseOpenSettings('accessibility')}>
+            <ConfigPageRow label={t('computerUse.accessibility')} align="center" balanced>
+              <div
+                className="bitfun-func-agent-config__row-control"
+                style={{
+                  display: 'flex',
+                  flexDirection: 'row',
+                  flexWrap: 'nowrap',
+                  alignItems: 'center',
+                  justifyContent: 'flex-end',
+                  gap: 8,
+                }}
+              >
+                <span style={{ display: 'inline-flex', alignItems: 'center', gap: 6, flexShrink: 0 }}>
+                  <span className={computerUseAccess ? 'bitfun-func-agent-config__perm-status--granted' : undefined}>
+                    {computerUseAccess ? t('computerUse.granted') : t('computerUse.notGranted')}
+                  </span>
+                  <IconButton
+                    type="button"
+                    size="small"
+                    variant="ghost"
+                    aria-label={t('computerUse.refreshStatus')}
+                    tooltip={t('computerUse.refreshStatus')}
+                    disabled={computerUseBusy}
+                    onClick={() => void refreshComputerUseStatus()}
+                  >
+                    <RefreshCw size={14} />
+                  </IconButton>
+                </span>
+                {!computerUseAccess ? (
+                  <Button
+                    className="bitfun-func-agent-config__row-action-btn"
+                    size="small"
+                    variant="secondary"
+                    disabled={computerUseBusy}
+                    onClick={() => void handleComputerUseRequestPermissions()}
+                  >
+                    {t('computerUse.request')}
+                  </Button>
+                ) : null}
+                <Button
+                  className="bitfun-func-agent-config__row-action-btn"
+                  size="small"
+                  variant="secondary"
+                  disabled={computerUseBusy}
+                  onClick={() => void handleComputerUseOpenSettings('accessibility')}
+                >
                   {t('computerUse.openSettings')}
                 </Button>
               </div>
             </ConfigPageRow>
-            <ConfigPageRow label={t('computerUse.screenCapture')} align="center">
-              <div className="bitfun-func-agent-config__row-control" style={{ display: 'flex', gap: 8, flexWrap: 'wrap' }}>
-                <span>{computerUseScreen ? t('computerUse.granted') : t('computerUse.notGranted')}</span>
-                <Button size="small" variant="secondary" disabled={computerUseBusy} onClick={() => void handleComputerUseOpenSettings('screen_capture')}>
+            <ConfigPageRow label={t('computerUse.screenCapture')} align="center" balanced>
+              <div
+                className="bitfun-func-agent-config__row-control"
+                style={{
+                  display: 'flex',
+                  flexDirection: 'row',
+                  flexWrap: 'nowrap',
+                  alignItems: 'center',
+                  justifyContent: 'flex-end',
+                  gap: 8,
+                }}
+              >
+                <span style={{ display: 'inline-flex', alignItems: 'center', gap: 6, flexShrink: 0 }}>
+                  <span className={computerUseScreen ? 'bitfun-func-agent-config__perm-status--granted' : undefined}>
+                    {computerUseScreen ? t('computerUse.granted') : t('computerUse.notGranted')}
+                  </span>
+                  <IconButton
+                    type="button"
+                    size="small"
+                    variant="ghost"
+                    aria-label={t('computerUse.refreshStatus')}
+                    tooltip={t('computerUse.refreshStatus')}
+                    disabled={computerUseBusy}
+                    onClick={() => void refreshComputerUseStatus()}
+                  >
+                    <RefreshCw size={14} />
+                  </IconButton>
+                </span>
+                <Button
+                  className="bitfun-func-agent-config__row-action-btn"
+                  size="small"
+                  variant="secondary"
+                  disabled={computerUseBusy}
+                  onClick={() => void handleComputerUseOpenSettings('screen_capture')}
+                >
                   {t('computerUse.openSettings')}
                 </Button>
               </div>
             </ConfigPageRow>
-            <ConfigPageRow label={t('computerUse.refreshStatus')} align="center">
-              <Button size="small" variant="secondary" disabled={computerUseBusy} onClick={() => void refreshComputerUseStatus()}>
-                <RefreshCw size={14} style={{ marginRight: 6 }} />
-                {t('computerUse.refreshStatus')}
-              </Button>
-            </ConfigPageRow>
           </>
         ) : null}
       </ConfigPageSection>

src/web-ui/src/infrastructure/config/components/common/ConfigPageLayout.tsx

Lines changed: 9 additions & 0 deletions
@@ -92,6 +92,11 @@ export interface ConfigPageRowProps {
   multiline?: boolean;
   /** Flip to 3/7 ratio giving the control column more space */
   wide?: boolean;
+  /**
+   * ~40% label / ~60% control — middle ground between default (7:3) and wide (2:8).
+   * Use when the label must stay on one line (e.g. two-word titles) and controls need room.
+   */
+  balanced?: boolean;
 }
 
 export const ConfigPageRow: React.FC<ConfigPageRowProps> = ({
@@ -102,17 +107,21 @@ export const ConfigPageRow: React.FC<ConfigPageRowProps> = ({
   align = 'start',
   multiline = false,
   wide = false,
+  balanced = false,
 }) => {
   const cls = [
     'bitfun-config-page-row',
     `bitfun-config-page-row--${align}`,
     multiline && 'bitfun-config-page-row--multiline',
     wide && 'bitfun-config-page-row--wide',
+    balanced && 'bitfun-config-page-row--balanced',
     className,
   ].filter(Boolean).join(' ');
 
   const gridStyle: React.CSSProperties | undefined = wide
     ? { gridTemplateColumns: 'minmax(0, 2fr) minmax(0, 8fr)' }
+    : balanced
+      ? { gridTemplateColumns: 'minmax(0, 2fr) minmax(0, 3fr)' }
     : multiline
       ? { gridTemplateColumns: '1fr' }
       : undefined;
src/web-ui/src/locales/en-US/settings/session-config.json

Lines changed: 2 additions & 2 deletions
@@ -14,15 +14,15 @@
   },
   "computerUse": {
     "sectionTitle": "Computer use (Claw)",
-    "sectionDescription": "Let the assistant capture the screen and control the mouse and keyboard in BitFun desktop. Requires macOS Accessibility and Screen Recording (or equivalent on other platforms). Screenshots in tool results need a primary model with Anthropic API format.",
+    "sectionDescription": "Let the assistant capture the screen and control the mouse and keyboard in BitFun desktop.",
     "enable": "Enable Computer use",
     "enableDesc": "When off, the ComputerUse tool stays disabled even in Claw mode.",
     "accessibility": "Accessibility",
     "screenCapture": "Screen recording",
     "granted": "Granted",
     "notGranted": "Not granted",
     "request": "Request",
-    "openSettings": "Open System Settings",
+    "openSettings": "Setting",
     "refreshStatus": "Refresh status",
     "desktopOnly": "Computer use settings are only available in the BitFun desktop app.",
     "platformNote": "Note"
