feat(html): add anchor tag support in HTML conversion #1402

Open
wants to merge 5 commits into
base: main

Conversation

ka-weihe

Description:
This pull request adds support for processing anchor (<a>) tags during the HTML document conversion. The changes introduce a new branch in the analyze_tag method to detect and handle anchor tags by extracting both the visible text and the associated hyperlink (href attribute). If the anchor contains visible text, it is combined with the hyperlink; otherwise, only the link is added as text. This enhancement improves the fidelity of the HTML conversion by ensuring links are properly captured in the document model.
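For illustration, the behavior described above can be sketched with the standard library alone. This is not the PR's code: it uses `html.parser` instead of the backend's own parsing, `AnchorCollector` is a hypothetical helper, and the `text (href)` output format is an assumption, since the description does not say how text and link are combined.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects one text entry per <a> tag."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._in_anchor = False
        self._href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._text_parts).strip()
            href = self._href or ""
            # Assumed format: "text (href)" when visible text exists,
            # otherwise only the link itself.
            entry = f"{text} ({href})" if text else href
            if entry:
                self.entries.append(entry)
            self._in_anchor = False


# One anchor with visible text and one without.
html = (
    '<p><a href="https://example.com/docs">Docs</a></p>'
    '<p><a href="https://example.com/raw"></a></p>'
)
collector = AnchorCollector()
collector.feed(html)
collector.close()
print(collector.entries)
```

The sample input covers both cases from the description: an anchor with visible text and an anchor that only carries a link.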

Commit Message:

feat(html): add anchor tag support in HTML document conversion

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Signed-off-by: ka-weihe <[email protected]>

mergify bot commented Apr 15, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
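The title rule above can be checked locally before pushing. A minimal sketch, assuming the rule is applied as a regular-expression search on the PR title; `is_conventional` is a hypothetical helper name, and the two titles are the ones this PR has carried.

```python
import re

# Pattern copied verbatim from the merge protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


titles = [
    "feat(html): add anchor tag support in HTML conversion",
    "Update html_backend.py",
]
results = {title: is_conventional(title) for title in titles}
for title, ok in results.items():
    print("pass" if ok else "fail", "-", title)
```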

@ka-weihe ka-weihe changed the title Update html_backend.py feat(html): add anchor tag support in HTML conversion Apr 15, 2025
Signed-off-by: ka-weihe <[email protected]>
Signed-off-by: ka-weihe <[email protected]>
@dolfim-ibm
Contributor

@ka-weihe could you please provide some simple (dummy) example file which uses this new feature? For example one with text and one without.

@ka-weihe
Author

ka-weihe commented May 4, 2025

> @ka-weihe could you please provide some simple (dummy) example file which uses this new feature? For example one with text and one without.

I have added an example file now.

@krrome

krrome commented May 7, 2025

I have also looked into this topic and have discovered this PR only now. Thanks to you all for your work! Looking at the suggested changes here I would like to point out that anchor tags can occur in any place of the HTML structure. In my understanding the code here parses the given example correctly, but it will not handle:

  • anchors in lists
  • anchors in headers
  • anchors in tables (a further problem here is that docling_core's TableCell doesn't support that)
  • ...

Please correct me if I am wrong.
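To make the nesting cases concrete, here is a small sketch that reports which tags enclose each anchor in a sample document. It uses the standard library's `html.parser` rather than the backend's parser, and `AnchorContextFinder` is a hypothetical helper used only for illustration.

```python
from html.parser import HTMLParser


class AnchorContextFinder(HTMLParser):
    """Hypothetical helper: records the open ancestor tags of every <a>."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.contexts = []  # (href, ancestors) per anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.contexts.append((dict(attrs).get("href"), tuple(self.stack)))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


sample = (
    '<h1><a href="#top">Title</a></h1>'
    '<ul><li><a href="/item">Item</a></li></ul>'
    '<table><tr><td><a href="/cell">Cell</a></td></tr></table>'
)
finder = AnchorContextFinder()
finder.feed(sample)
finder.close()
for href, ancestors in finder.contexts:
    print(href, "inside", " > ".join(ancestors))
```

Every anchor in this sample sits inside a header, a list item, or a table cell, which are the cases listed above.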

I would like to propose an alternative solution, should I create a separate PR? The code touches the same files, but looks quite a bit different.

@dolfim-ibm
Contributor

@krrome I agree the current implementation will still miss a few <a> components, i.e. when they are nested in another structure.

I propose to start with this implementation and build on top of it.

I also see we have another PR which is proposing a large refactoring of the HTML backend #1411 (cc @ceberam).

@ka-weihe From the checks I think you still have to

  1. Apply the code styling, e.g. via poetry run pre-commit run --all-files
  2. Generate the test results for your new test file, e.g. with DOCLING_GEN_TEST_DATA=1 poetry run pytest tests/test_backend_html.py

@krrome

krrome commented May 15, 2025

Thank you @dolfim-ibm for your reply. I see the point of taking a first step, but I wonder how this PR will help users in a real-world situation. At least for my documents, this PR would not change anything, because all anchor tags are inside paragraphs, list elements, headings, etc., much like hyperlinks are used on Wikipedia, for example.

My proposed solution handles anchor tags more like annotations of any kind of text, using a context manager for keeping track of the currently "active" hyperlink. Essentially something like:

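from contextlib import contextmanager
from urllib.parse import urljoin
...
class HTMLDocumentBackendWLinks(HTMLDocumentBackend):
...
    @override
    def analyze_tag(self, tag: Tag, doc: DoclingDocument) -> None:
        if tag.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
            self.handle_header(tag, doc)
...
        elif tag.name == "a":
            # Keep the link "active" while the anchor's children are walked.
            with self.use_hyperlink(tag):
                self.walk(tag, doc)
        else:
            self.walk(tag, doc)

    @contextmanager
    def use_hyperlink(self, tag):
        this_href = tag.get("href")
        # Resolve relative links against the document URL, if one is known.
        if this_href is not None and self.original_url is not None:
            this_href = urljoin(self.original_url, this_href)
        if not this_href:
            # No usable href: leave the currently active hyperlink untouched.
            yield None
            return
        old_hyperlink = self.hyperlink
        self.hyperlink = this_href
        try:
            yield None
        finally:
            # Restore the enclosing hyperlink, even if walking raised.
            self.hyperlink = old_hyperlink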

In addition, all element.text calls in the handle_* methods would have to be altered to recursively fetch the text from child nodes and check for anchor tags, which is roughly 5-10 lines of code.
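A self-contained sketch of that recursive text collection, under stated assumptions: it walks an `xml.etree.ElementTree` tree instead of BeautifulSoup (so it only accepts well-formed markup), and `LinkAwareTextCollector`, `collect`, and `segments` are hypothetical names, not part of the HTML backend.

```python
import xml.etree.ElementTree as ET
from contextlib import contextmanager
from urllib.parse import urljoin


class LinkAwareTextCollector:
    """Hypothetical helper: gathers (text, href) segments from an element tree."""

    def __init__(self, original_url=None):
        self.original_url = original_url
        self.hyperlink = None  # currently "active" link, if any
        self.segments = []

    @contextmanager
    def use_hyperlink(self, element):
        href = element.get("href")
        if href is not None and self.original_url is not None:
            href = urljoin(self.original_url, href)
        old_hyperlink = self.hyperlink
        if href:
            self.hyperlink = href
        try:
            yield
        finally:
            self.hyperlink = old_hyperlink

    def collect(self, element):
        if element.tag == "a":
            with self.use_hyperlink(element):
                self._collect_children(element)
        else:
            self._collect_children(element)

    def _collect_children(self, element):
        if element.text:
            self.segments.append((element.text, self.hyperlink))
        for child in element:
            self.collect(child)
            # Tail text follows the child, so the child's own link
            # is no longer active at this point.
            if child.tail:
                self.segments.append((child.tail, self.hyperlink))


root = ET.fromstring(
    '<p>See the <a href="/wiki/HTML">HTML <b>article</b></a> for details.</p>'
)
collector = LinkAwareTextCollector(original_url="https://example.org/")
collector.collect(root)
for text, href in collector.segments:
    print(repr(text), href)
```

Text nested deeper inside the anchor (the `<b>` element here) still carries the link, while the text before and after the anchor does not.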

The reason why I haven't proposed my solution as a PR yet is that DoclingDocument currently seems to lack a way to assign a hyperlink to only a portion of a given text in .add_text(). If I split the text by hyperlink and add the individual texts using .add_text(), the chunking will be garbled, because every link text becomes a separate paragraph. I will try to propose a solution in DoclingDocument as well as here.
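One possible shape for such a portion-level link is a single text plus character-offset spans. The sketch below is purely illustrative: `LinkSpan` and `merge_segments` are hypothetical names and do not reflect any existing docling-core API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSpan:
    """Hypothetical annotation: characters [start, end) of the text link to href."""

    start: int
    end: int
    href: str


def merge_segments(segments):
    """Join (text, href) segments into one string plus a list of link spans."""
    parts = []
    spans = []
    offset = 0
    for text, href in segments:
        end = offset + len(text)
        if href is not None:
            previous = spans[-1] if spans else None
            if previous and previous.href == href and previous.end == offset:
                # Adjacent segments with the same link form one span.
                spans[-1] = LinkSpan(previous.start, end, href)
            else:
                spans.append(LinkSpan(offset, end, href))
        parts.append(text)
        offset = end
    return "".join(parts), spans


segments = [
    ("See the ", None),
    ("HTML ", "https://example.org/wiki/HTML"),
    ("article", "https://example.org/wiki/HTML"),
    (" for details.", None),
]
text, spans = merge_segments(segments)
print(text)
print(spans)
```

With this representation the paragraph stays a single text item, so chunking would see one paragraph, and the link is recoverable by slicing the text with each span's offsets.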

After a quick look at #1411, it appears to simplify the code, which is great, but it seems to "handle" style and script tags by deleting them, while anchor tags remain untouched.

My free time permitting, I will propose the changes to docling-core and here within the next week and will try to open PRs.

5 participants