feat: add dolphin #1772

Open

geoHeil wants to merge 4 commits into main from feat/add-dolphin

Conversation

@geoHeil geoHeil commented Jun 14, 2025

Issue resolved by this Pull Request:
Resolves #1622

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

mergify bot commented Jun 14, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
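
For illustration, the rule can be checked locally with Python's re module. This is a minimal sketch: the pattern is copied verbatim from the rule above, and the example titles are this PR's actual title plus made-up counterexamples.

```python
import re

# Pattern copied from the merge-protection rule above (anchored to the start of the PR title).
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)

assert TITLE_RULE.match("feat: add dolphin")        # this PR's title: accepted
assert TITLE_RULE.match("feat(vlm)!: add dolphin")  # optional scope and breaking-change marker: accepted
assert not TITLE_RULE.match("add dolphin support")  # missing type prefix: rejected
```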

@geoHeil geoHeil changed the title from "prepare for dolfin" to "feat: add dolphin" on Jun 14, 2025
@geoHeil geoHeil force-pushed the feat/add-dolphin branch from 03ce2d2 to f16de96 on June 14, 2025 07:00
@dolfim-ibm
Contributor

@geoHeil Thanks for the addition. It looks almost ready to me.

Note that here we also need:

  1. the commits to be signed off
  2. the code to be formatted with the pre-commit hooks, i.e. uv run pre-commit run --all-files.

@PeterStaar-IBM
Contributor

@geoHeil I think this is not how Dolphin works. In essence, you need to do a double pass (see here):

  1. Run the model once to get the layout
    • then parse the layout output to obtain bboxes and labels
    • crop the element images from the original page image
  2. Run the model again (with a different prompt) to obtain the OCR-ed parts

In this sense, Dolphin allows obtaining the bboxes, something the other VLMs cannot do.
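
To make the double pass concrete, here is a minimal sketch following the generic VisionEncoderDecoder usage in Hugging Face transformers. The checkpoint id is the public ByteDance/Dolphin model; the prompt strings, the layout-parsing placeholder, and the generation settings are assumptions, not the code from this PR or the exact Dolphin demo.

```python
# Sketch only: double-pass Dolphin inference (layout pass, then per-element OCR on crops).
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin")


def run_pass(image: Image.Image, prompt: str) -> str:
    """One generation pass: encode the image and condition the decoder on a text prompt."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    prompt_ids = processor.tokenizer(
        prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids
    output_ids = model.generate(
        pixel_values=pixel_values, decoder_input_ids=prompt_ids, max_length=4096
    )
    return processor.tokenizer.decode(output_ids[0], skip_special_tokens=True)


def parse_layout(raw: str) -> list[tuple[str, tuple[int, int, int, int]]]:
    """Placeholder: turn Dolphin's raw layout output into (label, bbox) pairs.
    The real parsing of the structured layout string is not reproduced here."""
    return []


page = Image.open("page.png").convert("RGB")

# Pass 1: full-page layout prediction (prompt string is a placeholder).
layout_raw = run_pass(page, "Parse the reading order of this document.")
elements = parse_layout(layout_raw)

# Pass 2: crop each element from the original page and run element-level recognition.
results = []
for label, bbox in elements:
    crop = page.crop(bbox)
    text = run_pass(crop, "Read text in the image.")  # placeholder prompt; varies by label
    results.append({"label": label, "bbox": bbox, "text": text})
```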

@geoHeil
Author

geoHeil commented Jun 16, 2025

@geoHeil I think this is not how Dolphin works. In essence, you need to do a double pass (see here):

1. Run the model once to get the layout

   * then parse the layout output to obtain bboxes and labels
   * crop the element images from the original page image

2. Run the model again (with a different prompt) to obtain the OCR-ed parts

In this sense, Dolphin allows obtaining the bboxes, something the other VLMs cannot do.

Are well-working prompts for these two tasks already known? Should they be added (either as examples or somewhere else)?

geoHeil added 3 commits June 17, 2025 08:25
@geoHeil geoHeil force-pushed the feat/add-dolphin branch from 63dc7e7 to 448c932 on June 17, 2025 06:25
@PeterStaar-IBM
Contributor

@geoHeil I think this is not how Dolphin works. In essence, you need to do a double pass (see here):

1. Run the model once to get the layout

   * then parse the layout output to obtain bboxes and labels
   * crop the element images from the original page image

2. Run the model again (with a different prompt) to obtain the OCR-ed parts

In this sense, Dolphin allows obtaining the bboxes, something the other VLMs cannot do.

Are well-working prompts for these two tasks already known? Should they be added (either as examples or somewhere else)?

Yes, if you look here,

  1. parse layout: https://github.com/bytedance/Dolphin/blob/master/demo_page.py#L63
  2. process elements: https://github.com/bytedance/Dolphin/blob/master/demo_page.py#L94

this is what is essentially happening.
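
To make that concrete, here is a rough sketch of the per-stage prompt selection. The exact prompt strings and label names are assumptions recalled from the linked demo, so verify them against demo_page.py before relying on them.

```python
# Sketch only: prompt selection for the two stages (strings are assumptions, check demo_page.py).
LAYOUT_PROMPT = "Parse the reading order of this document."

# Second stage: pick an element-level prompt based on the label from the layout pass.
ELEMENT_PROMPTS = {
    "table": "Parse the table in the image.",
    "formula": "Read formula in the image.",
}
DEFAULT_ELEMENT_PROMPT = "Read text in the image."


def prompt_for(label: str) -> str:
    """Return the second-pass prompt for a layout element label."""
    return ELEMENT_PROMPTS.get(label, DEFAULT_ELEMENT_PROMPT)
```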

In a sense, this breaks a bit with our current VLM pipeline, which processes a page with a single prediction. It would be good to actually have a "two-shot VLM pipeline" which would support this.
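
A rough outline of what such a "two-shot VLM pipeline" could look like is sketched below; the class and function names are hypothetical and do not correspond to existing Docling APIs.

```python
# Hypothetical outline of a "two-shot" VLM pipeline; none of these names exist in docling today.
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, List, Tuple

from PIL import Image


@dataclass
class LayoutElement:
    label: str
    bbox: Tuple[int, int, int, int]
    text: str | None = None


class TwoShotVlmPipeline:
    """First shot predicts the page layout; second shot recognizes each cropped element."""

    def __init__(
        self,
        predict: Callable[[Image.Image, str], str],       # one VLM generation: (image, prompt) -> raw text
        parse_layout: Callable[[str], List[LayoutElement]],
        layout_prompt: str,
        element_prompt: Callable[[str], str],             # maps an element label to its prompt
    ):
        self.predict = predict
        self.parse_layout = parse_layout
        self.layout_prompt = layout_prompt
        self.element_prompt = element_prompt

    def __call__(self, page: Image.Image) -> List[LayoutElement]:
        # Shot 1: layout prediction over the full page.
        elements = self.parse_layout(self.predict(page, self.layout_prompt))
        # Shot 2: per-element recognition on crops of the original page.
        for el in elements:
            el.text = self.predict(page.crop(el.bbox), self.element_prompt(el.label))
        return elements
```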

@cau-git @dolfim-ibm fyi: ^^

@geoHeil
Author

geoHeil commented Jun 18, 2025

How do we move forward here? Should we get this merged? And then explore a 2nd PR for a separate VLM pipeline?

@PeterStaar-IBM
Contributor

How do we move forward here? Should we get this merged? And then explore a 2nd PR for a separate VLM pipeline?

I want to check it out and run it myself first. My main worry is that we don't showcase the rich output that Dolphin currently provides (with layout boxes). But this PR might be good enough as a starting point. I would just love to give all the credit to the team that built the Dolphin model.

@PeterStaar-IBM
Contributor

@geoHeil I would like to merge this ASAP, but I see we fail the MyPy check (https://github.com/docling-project/docling/actions/runs/15699814942/job/44748636363?pr=1772).

Could you fix this quickly?


DCO Check Passed

Thanks @geoHeil, all your commits are properly signed off. 🎉

@geoHeil
Author

geoHeil commented Jun 30, 2025

Please re-run CI - it should work now.

codecov bot commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../models/vlm_models_inline/hf_transformers_model.py | 0.00% | 6 Missing ⚠️ |


Successfully merging this pull request may close these issues.

[Feature Request] Add ByteDance/Dolphin model for Docling