Submission: duvo-eye-1, Holo-3.1-35B-A3B + LoRA (72.9% overall, self-reported)#29

Open
tomascupr wants to merge 2 commits into likaixin2000:main from tomascupr:submission-duvo-eye-1

@tomascupr

duvo-eye-1 — 72.9% on ScreenSpot-Pro (self-reported)

Model: duvoai/duvo-eye-1 (public weights)
Authors: Duvo

Method

Holo-3.1-35B-A3B (3B active parameters) with a LoRA adapter (rank 64, alpha 128, lr 7e-5, 1 epoch) trained on duvoai/SynthUI (14.9k synthetic enterprise-UI rows). No benchmark data was used in training. The model emits {"x": int, "y": int}, with both coordinates normalized to [0, 1000] relative to the input image; the point is then scaled to absolute pixels using the original image dimensions.
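The output convention described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the helper names parse_point and to_pixels are hypothetical, and the choice to reject out-of-range or non-integer values is an assumption.

```python
import json
from typing import Optional, Tuple


def parse_point(raw: str) -> Optional[Tuple[int, int]]:
    """Parse '{"x": int, "y": int}' into (x, y); return None if unparseable.

    Hypothetical helper: rejecting non-integers and values outside
    [0, 1000] is an assumption, not something the PR specifies.
    """
    try:
        obj = json.loads(raw)
        x, y = obj["x"], obj["y"]
    except (ValueError, KeyError, TypeError):
        return None
    if type(x) is not int or type(y) is not int:
        return None
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return None
    return x, y


def to_pixels(point: Tuple[int, int], width: int, height: int) -> Tuple[float, float]:
    """Scale a [0, 1000]-normalized point to absolute pixels of the original image."""
    x, y = point
    return x / 1000 * width, y / 1000 * height


point = parse_point('{"x": 500, "y": 250}')
if point is not None:
    print(to_pixels(point, width=3840, height=2160))
```

Scaling against the original image dimensions (rather than any resized copy fed to the model) keeps the result comparable with annotations given in absolute pixels.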

Results (full English split, all 1,581 samples)

Category     Icon   Text   Avg
Office       81.1   89.3   87.4
Scientific   54.5   82.6   70.5
Dev          67.6   84.4   76.3
CAD          45.3   71.6   65.1
Creative     49.0   79.3   66.6
OS           65.2   84.1   75.5
Overall      59.3   81.4   72.9

0.0% unparseable outputs (1,581/1,581 valid coordinate responses). For reference, H Company's published ScreenSpot-Pro number for the base Holo-3.1-35B-A3B is 71.5 (their model card).

Protocol (honest notes)

  • Self-reported, from our own harness (bench_eval.py, included in the predictions dataset), not this repo's eval_screenspot_pro.py. The adapter in this PR reproduces the protocol inside your harness.
  • Single-shot, temperature 0, max_tokens 64, thinking disabled (chat_template_kwargs: {"enable_thinking": false}), guided JSON decoding.
  • vLLM 0.22.1 with mm-processor-kwargs {"max_pixels": 8000000}.
  • Point-in-box scoring: a prediction counts as a hit only when the predicted point falls inside the ground-truth bbox; an unparseable output counts as a miss.
  • Per-sample predictions (raw output, parsed point, absolute pixels, hit/miss) are published for independent rescoring: duvoai/duvo-eye-1-evals (screenspot-pro/bench_sspro_highres_duvo-eye-1.predictions.jsonl). The category table above was recomputed from those raw outputs joined against the dataset annotations.
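For anyone rescoring the published predictions, the scoring rule in the notes above can be sketched roughly as below. The function names are hypothetical, and two details are assumptions rather than facts from this PR: that boxes are given as (x1, y1, x2, y2) in absolute pixels, and that points on the box edge count as hits.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
BBox = Sequence[float]  # assumed layout: (x1, y1, x2, y2) in absolute pixels


def is_hit(point: Optional[Point], bbox: BBox) -> bool:
    """Return True when the point lies inside the box; None (unparseable) is a miss."""
    if point is None:
        return False
    x, y = point
    x1, y1, x2, y2 = bbox
    # Inclusive edges are an assumption; the PR does not state edge handling.
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy(points: Sequence[Optional[Point]], bboxes: Sequence[BBox]) -> float:
    """Fraction of samples whose predicted point falls inside its ground-truth box."""
    hits = sum(is_hit(p, b) for p, b in zip(points, bboxes))
    return hits / len(bboxes)


preds = [(10.0, 10.0), None, (30.0, 5.0)]
boxes = [(0, 0, 20, 20)] * 3
print(accuracy(preds, boxes))
```

Counting unparseable outputs in the denominator means a malformed response lowers the score instead of being silently dropped.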

Files changed

  • models/duvo_eye_1.py — adapter implementing ground_only_positive() (same interface as models/holo1_5.py), so results can be reproduced with this repo's harness.
  • model_factory.py — added duvo_eye_1 model type.

Usage

python eval_screenspot_pro.py \
  --model_type duvo_eye_1 \
  --model_name_or_path duvoai/duvo-eye-1 \
  --screenspot_imgs /path/to/ScreenSpot-Pro/images \
  --screenspot_test /path/to/ScreenSpot-Pro/annotations \
  --task all --inst_style instruction --language en --gt_type positive \
  --log_path results/duvo_eye_1.json

PROMPT_TEMPLATE contained {"x":...} which .format(instruction=...) parsed
as a replacement field, raising KeyError: '"x"'. Double the braces so the
JSON example survives .format(). Verified against eval_screenspot_pro.py.
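The failure and the fix described in this commit message can be reproduced in isolation. The template strings below are hypothetical stand-ins, not the adapter's actual PROMPT_TEMPLATE; only the brace behavior of str.format() is the point.

```python
# Hypothetical templates for illustration only.
broken = 'Instruction: {instruction}\nAnswer as {"x": int, "y": int}'
fixed = 'Instruction: {instruction}\nAnswer as {{"x": int, "y": int}}'

error = None
try:
    # str.format() reads {"x": ...} as a replacement field named '"x"'.
    broken.format(instruction="click Save")
except KeyError as exc:
    error = exc

# Doubled braces are emitted as literal single braces.
rendered = fixed.format(instruction="click Save")

print(repr(error))
print(rendered)
```

An alternative that avoids escaping altogether is to build the prompt with plain concatenation or str.replace() on a sentinel, at the cost of a less readable template.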

@tomascupr (Author)

Update: I ran this end-to-end under your eval_screenspot_pro.py harness on the canonical likaixin/ScreenSpot-Pro data (--task all --inst_style instruction --language en --gt_type positive). Result:

action_acc 0.7287 (1152/1581), text_acc 0.8147, icon_acc 0.5894, wrong_format_num 0

So 72.87% overall under your harness, matching the 72.9% reported in the PR (our harness, 1153/1581) to within a single sample, with zero malformed outputs.
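The one-sample gap between the two harnesses can be checked with the counts quoted above; the variable names are illustrative only.

```python
total = 1581
official_hits = 1152  # eval_screenspot_pro.py run
own_hits = 1153       # bench_eval.py run reported in the PR

official_acc = official_hits / total
own_acc = own_hits / total

print(round(official_acc, 4))       # action_acc as reported by the harness
print(round(own_acc * 100, 1))      # percentage quoted in the PR title
print(own_hits - official_hits)     # number of samples that differ
```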

Two small notes:

  1. I pushed a one-line fix to the adapter — the prompt template's literal JSON braces {"x":...} needed escaping for .format() (was raising KeyError). It now runs cleanly under eval_screenspot_pro.py.
  2. Full per-app breakdown + the official result JSON are public at https://huggingface.co/datasets/duvoai/duvo-eye-1-evals (screenspot-pro/duvo_eye_1_official_harness.json), alongside the per-sample predictions.

Happy to adjust anything for the merge.
