feat(html): add anchor tag support in HTML conversion #1402
Signed-off-by: ka-weihe <[email protected]>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
@ka-weihe could you please provide a simple (dummy) example file which uses this new feature? For example, one with text and one without.
I have added an example file now.
I have also looked into this topic and have discovered this PR only now. Thanks to you all for your work! Looking at the suggested changes here I would like to point out that anchor tags can occur in any place of the HTML structure. In my understanding the code here parses the given example correctly, but it will not handle:
Please correct me if I am wrong. I would like to propose an alternative solution; should I create a separate PR? The code touches the same files but looks quite a bit different.
@krrome I agree the current implementation will still miss a few I propose to start with this implementation and build on top of it. I also see we have another PR which is proposing a large refactoring of the HTML backend #1411 (cc @ceberam). @ka-weihe From the checks I think you still have to
Thank you @dolfim-ibm for your reply. I see the point of taking a first step; I just wonder how this PR will help users in a real-world situation. At least for my documents this PR would not change anything, because all anchor tags are inside paragraphs, list elements, headings, etc., much like hyperlinks are used on Wikipedia, for example. My proposed solution handles anchor tags more like annotations on any kind of text, using a context manager to keep track of the currently "active" hyperlink. Essentially something like:

    from contextlib import contextmanager
    from urllib.parse import urljoin

    class HTMLDocumentBackendWLinks(HTMLDocumentBackend):
        ...

        @override
        def analyze_tag(self, tag: Tag, doc: DoclingDocument) -> None:
            if tag.name in ["h1", "h2", "h3", "h4", "h5", "h6"]:
                self.handle_header(tag, doc)
            ...
            elif tag.name == "a":
                # Walk the anchor's children while its href is the active hyperlink.
                with self.use_hyperlink(tag):
                    self.walk(tag, doc)
            else:
                self.walk(tag, doc)

        @contextmanager
        def use_hyperlink(self, tag):
            this_href = tag.get("href")
            if this_href is None:
                # Anchor without href: leave the active hyperlink unchanged.
                yield None
            else:
                # Resolve relative links against the document URL, if known.
                if self.original_url is not None:
                    this_href = urljoin(self.original_url, this_href)
                if this_href:
                    old_hyperlink = self.hyperlink
                    self.hyperlink = this_href
                try:
                    yield None
                finally:
                    # Restore the previously active hyperlink on exit.
                    if this_href:
                        self.hyperlink = old_hyperlink

and additionally all

The reason why I haven't proposed my solution as a PR yet is that DoclingDocument currently seems to lack a way to assign a hyperlink to only a portion of a given text in .add_text(). If I split the text by hyperlink and add the individual texts using .add_text(), then the chunking will be garbled because every link text becomes a separate paragraph. I will try to propose a solution in DoclingDocument as well as here.

After a quick glimpse at #1411, it seems to be a simplification of the code, which is great, but it seems to "handle" style and script tags by deleting them, and anchor tags remain untouched. My free time permitting, I will propose the changes to docling-core and here within the next week and will try to open PRs.
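To make the "active hyperlink" idea above concrete, here is a minimal, standalone sketch. It does not use docling or BeautifulSoup: the `LinkTracker` class, its `runs` list, and the use of `xml.etree.ElementTree` on a well-formed snippet are all assumptions made for illustration, not part of either PR.

```python
from contextlib import contextmanager
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkTracker:
    """Hypothetical walker that records (text, hyperlink) runs."""

    def __init__(self, base_url=None):
        self.base_url = base_url
        self.hyperlink = None  # currently active link, if any
        self.runs = []

    @contextmanager
    def use_hyperlink(self, href):
        # No href: keep whatever link is currently active.
        if not href:
            yield
            return
        if self.base_url is not None:
            href = urljoin(self.base_url, href)
        previous = self.hyperlink
        self.hyperlink = href
        try:
            yield
        finally:
            # Restore the enclosing link, even if walking raised.
            self.hyperlink = previous

    def _emit(self, text):
        if text and text.strip():
            self.runs.append((text.strip(), self.hyperlink))

    def walk(self, element):
        href = element.get("href") if element.tag == "a" else None
        with self.use_hyperlink(href):
            self._emit(element.text)
            for child in element:
                self.walk(child)
                # Tail text follows the child but belongs to this element.
                self._emit(child.tail)
```

Walking `<p>See <a href="guide.html">the <b>guide</b></a> for details.</p>` with a base URL tags only the text inside the anchor, including the nested `<b>`, with the resolved link. Surrounding text keeps `None`, which is the per-span annotation the comment says `.add_text()` cannot yet express.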
Description:
This pull request adds support for processing anchor (<a>) tags during the HTML document conversion. The changes introduce a new branch in the analyze_tag method to detect and handle anchor tags by extracting both the visible text and the associated hyperlink (href attribute). If the anchor contains visible text, it is combined with the hyperlink; otherwise, only the link is added as text. This enhancement improves the fidelity of the HTML conversion by ensuring links are properly captured in the document model.
Commit Message:
Checklist:
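
The anchor handling outlined in the description (visible text combined with the href, or the href alone when there is no visible text) can be sketched with the standard library as follows. This is not the PR's code: `AnchorCollector`, `anchor_texts`, and the `"text (href)"` output format are assumptions chosen for illustration, since the description does not say how text and link are combined.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical collector: one output string per <a> element."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._in_anchor = False
        self._href = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._in_anchor:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            text = "".join(self._parts).strip()
            href = self._href or ""
            if text and href:
                # Assumed format for "text combined with the hyperlink".
                self.results.append(f"{text} ({href})")
            elif href:
                # No visible text: only the link is added.
                self.results.append(href)
            elif text:
                self.results.append(text)
            self._in_anchor = False


def anchor_texts(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.results
```

This sketch only looks at the anchors themselves. As the conversation points out, the surrounding paragraph or list text needs separate handling.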