Submission: duvo-eye-1, Holo-3.1-35B-A3B + LoRA (72.9% overall, self-reported) by tomascupr · Pull Request #29 · likaixin2000/ScreenSpot-Pro-GUI-Grounding

tomascupr · 2026-06-12T21:34:24Z

duvo-eye-1 — 72.9% on ScreenSpot-Pro (self-reported)

Model: duvoai/duvo-eye-1 (public weights)
Authors: Duvo

Method

Holo-3.1-35B-A3B (3B active) with a LoRA adapter (rank 64, alpha 128, lr 7e-5, 1 epoch) trained on duvoai/SynthUI (14.9k synthetic enterprise-UI rows). No benchmark data in training. The model emits {"x": int, "y": int} in [0, 1000] relative to the input image; the point is scaled to absolute pixels via the original image dimensions.
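
The coordinate convention above can be sketched as follows. This is a minimal illustration, not the code from models/duvo_eye_1.py: the function name to_pixels is hypothetical, and treating out-of-range or non-numeric values as unparseable is an assumption made here for the example.

```python
import json


def to_pixels(raw_output, width, height):
    """Parse {"x": int, "y": int} on a 0-1000 grid and scale to pixel coordinates.

    Returns None when the output cannot be parsed (hypothetical helper).
    """
    try:
        data = json.loads(raw_output)
        x, y = data["x"], data["y"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    # Assumption: anything that is not a number in [0, 1000] is treated as unparseable.
    for value in (x, y):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 0 <= value <= 1000:
            return None
    # Scale from the normalized grid to the original image dimensions.
    return (x / 1000 * width, y / 1000 * height)


print(to_pixels('{"x": 500, "y": 250}', 3840, 2160))  # (1920.0, 540.0)
```

Because the grid is relative to the input image, the same model output maps to different pixel positions on screenshots of different resolutions.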

Results (full English split, all 1,581 samples)

Category	Icon	Text	Avg
Office	81.1	89.3	87.4
Scientific	54.5	82.6	70.5
Dev	67.6	84.4	76.3
CAD	45.3	71.6	65.1
Creative	49.0	79.3	66.6
OS	65.2	84.1	75.5
Overall	59.3	81.4	72.9

0.0% unparseable outputs (1,581/1,581 valid coordinate responses). For reference, H Company's published ScreenSpot-Pro number for the base Holo-3.1-35B-A3B is 71.5 (their model card).

Protocol (honest notes)

Self-reported, from our own harness (bench_eval.py, included in the predictions dataset), not this repo's eval_screenspot_pro.py. The adapter in this PR reproduces the protocol inside your harness.
Single-shot, temperature 0, max_tokens 64, thinking disabled (chat_template_kwargs: {"enable_thinking": false}), guided JSON decoding.
vLLM 0.22.1 with mm-processor-kwargs {"max_pixels": 8000000}.
Scoring: a prediction is a hit when the predicted point lies inside the ground-truth bbox; an unparseable output counts as a miss.
Per-sample predictions (raw output, parsed point, absolute pixels, hit/miss) are published for independent rescoring: duvoai/duvo-eye-1-evals (screenspot-pro/bench_sspro_highres_duvo-eye-1.predictions.jsonl). The category table above was recomputed from those raw outputs joined against the dataset annotations.
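
For anyone rescoring the published predictions, the scoring rule in the notes above can be sketched roughly as below. The names is_hit and accuracy are hypothetical, the bbox is assumed to be (x1, y1, x2, y2) in absolute pixels, and treating the box edges as inclusive is an assumption rather than something stated in this PR.

```python
def is_hit(point, bbox):
    """Return True when the point lies inside the bbox; an unparseable output (None) is a miss."""
    if point is None:
        return False
    x, y = point
    x1, y1, x2, y2 = bbox  # assumed layout: top-left and bottom-right corners
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy(points, bboxes):
    """Fraction of samples whose predicted point falls inside its ground-truth bbox."""
    if not bboxes:
        return 0.0
    hits = sum(is_hit(p, b) for p, b in zip(points, bboxes))
    return hits / len(bboxes)


# Three toy samples: one hit, one unparseable output, one point outside the box.
toy_points = [(10, 10), None, (30, 30)]
toy_bboxes = [(0, 0, 20, 20)] * 3
print(accuracy(toy_points, toy_bboxes))
```

Counting unparseable outputs in the denominator keeps the metric comparable across models with different failure rates.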

Files changed

models/duvo_eye_1.py — adapter implementing ground_only_positive() (same interface as models/holo1_5.py), so results can be reproduced with this repo's harness.
model_factory.py — added duvo_eye_1 model type.

Usage

python eval_screenspot_pro.py \
  --model_type duvo_eye_1 \
  --model_name_or_path duvoai/duvo-eye-1 \
  --screenspot_imgs /path/to/ScreenSpot-Pro/images \
  --screenspot_test /path/to/ScreenSpot-Pro/annotations \
  --task all --inst_style instruction --language en --gt_type positive \
  --log_path results/duvo_eye_1.json

tomascupr · 2026-06-13T06:51:30Z

Update: I ran this end-to-end under your eval_screenspot_pro.py harness on the canonical likaixin/ScreenSpot-Pro data (--task all --inst_style instruction --language en --gt_type positive). Result:

action_acc 0.7287 (1152/1581), text_acc 0.8147, icon_acc 0.5894, wrong_format_num 0

So 72.87% overall under your harness, matching the 72.9% reported in the PR (our harness, 1153/1581) to within a single sample, with zero malformed outputs.
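
The arithmetic behind that comparison can be checked directly from the hit counts quoted above (1152 under the official harness, 1153 under our own, out of 1,581 samples). The variable names are illustrative only.

```python
TOTAL = 1581
official_hits = 1152  # eval_screenspot_pro.py run
own_hits = 1153       # bench_eval.py run

official_acc = official_hits / TOTAL
own_acc = own_hits / TOTAL
gap_pp = (own_acc - official_acc) * 100  # one sample, in percentage points

print(f"{official_acc:.4f}")   # 0.7287
print(f"{own_acc * 100:.1f}")  # 72.9
print(f"{gap_pp:.3f}")         # 0.063
```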

Two small notes:

I pushed a one-line fix to the adapter — the prompt template's literal JSON braces {"x":...} needed escaping for .format() (was raising KeyError). It now runs cleanly under eval_screenspot_pro.py.
Full per-app breakdown + the official result JSON are public at https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (screenspot-pro/duvo_eye_1_official_harness.json), alongside the per-sample predictions.

Happy to adjust anything for the merge.

Submission: duvo-eye-1 (72.9% overall)

1e81cf1

tomascupr mentioned this pull request Jun 12, 2026

Add community-reported results section: duvo-eye-1 (78.0 standard subset / 70.6 full set, self-reported) xlang-ai/OSWorld-G#24

Open

Fix adapter: escape literal JSON braces in prompt template

5a7ed98

PROMPT_TEMPLATE contained {"x":...} which .format(instruction=...) parsed as a replacement field, raising KeyError: '"x"'. Double the braces so the JSON example survives .format(). Verified against eval_screenspot_pro.py.
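
A minimal reproduction of the failure and the fix, for reference. The template wording here is hypothetical and is not the actual PROMPT_TEMPLATE from the adapter; only the brace behaviour of str.format() is being illustrated.

```python
# Hypothetical templates; the real PROMPT_TEMPLATE text differs.
BROKEN_TEMPLATE = 'Task: {instruction}\nAnswer as {"x": int, "y": int}.'
FIXED_TEMPLATE = 'Task: {instruction}\nAnswer as {{"x": int, "y": int}}.'

# Single braces: str.format() reads {"x": ...} as a replacement field named '"x"'.
try:
    BROKEN_TEMPLATE.format(instruction="Click Save")
    error = None
except KeyError as exc:
    error = exc

# Doubled braces are emitted as literal { and }.
prompt = FIXED_TEMPLATE.format(instruction="Click Save")

print(repr(error))
print(prompt)
```

Doubling the braces keeps the template compatible with .format(); an alternative would be plain string concatenation or str.replace() on a placeholder, which avoids escaping altogether.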

tomascupr wants to merge 2 commits into likaixin2000:main from tomascupr:submission-duvo-eye-1
