First, a great package and thank you for making it available!
I am looking for a way to match a link to its bookmark. Is that possible in principle? Below is an example of what I want to achieve.
I defined an annotation for the id attribute. It returns the location and the text of the bookmark, but not the actual id, so it's not possible to know which bookmark has been identified.
Also, when the div element doesn't have any text or sub-elements, the annotation doesn't return anything.
Would I have to re-define the div tag handler (even though I don't know which tags may contain bookmarks)?
Also, it seems that I cannot define any custom attribute handlers, other than the three already defined?
Many thanks!
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis import get_annotated_text
doc = r"""
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""
annotation_rules = {"a": ["link"], "#id": ["target"]}
css = CSS_PROFILES['relaxed'].copy()
inscriptis_parser_config = ParserConfig(display_links=True, annotation_rules=annotation_rules, css=css)
html_tree = fromstring(doc)
parser = Inscriptis(html_tree, config=inscriptis_parser_config)
txt = parser.get_text()
ant = parser.get_annotations()
labels = [(a.start, a.end, a.metadata) for a in ant]
for ii, (start, end, label) in enumerate(labels):
    print(f"{ii} {label} {start} {txt[start:end]}")
The output is:
0 link 3 Part 1](#idd1)
1 link 21 Part 2](#idd2)
2 target 36 target with text
In this example, I am looking for the id of the last div element, as well as the id, location and text of the third div element.
(Note also that the annotated text of each link doesn't include the opening bracket [.)
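To make the expected result concrete, here is a minimal sketch of the link-to-bookmark mapping I am after. It deliberately does not use inscriptis: it relies only on the standard library's html.parser, and the names BookmarkCollector and match_links are made up for this illustration. It also does not give positions in the rendered text, which is the part I would still need from the annotations.

```python
from html.parser import HTMLParser


class BookmarkCollector(HTMLParser):
    """Hypothetical helper: collect in-page links and the text of elements with an id."""

    def __init__(self):
        super().__init__()
        self.links = []    # (link text, target id) for every <a href="#...">
        self.targets = {}  # id -> text content ("" for empty elements)
        self._open = []    # stack of (tag, id or None) for currently open elements
        self._link = None  # [text parts, target id] while inside an in-page link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        elem_id = attrs.get("id")
        if elem_id is not None:
            # Register the id even if the element turns out to be empty.
            self.targets.setdefault(elem_id, "")
        self._open.append((tag, elem_id))
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#"):
            self._link = [[], href[1:]]

    def handle_endtag(self, tag):
        if tag == "a" and self._link is not None:
            parts, target = self._link
            self.links.append(("".join(parts).strip(), target))
            self._link = None
        # Pop back to the matching open tag; ignore stray end tags.
        if any(open_tag == tag for open_tag, _ in self._open):
            while self._open.pop()[0] != tag:
                pass

    def handle_data(self, data):
        if self._link is not None:
            self._link[0].append(data)
        # Text belongs to every open ancestor that carries an id.
        for _, elem_id in self._open:
            if elem_id is not None:
                self.targets[elem_id] += data


def match_links(html):
    """Return (link text, target id, target text or None) for each in-page link."""
    collector = BookmarkCollector()
    collector.feed(html)
    collector.close()
    result = []
    for text, target in collector.links:
        target_text = collector.targets.get(target)
        if target_text is not None:
            target_text = target_text.strip()
        result.append((text, target, target_text))
    return result


doc = """
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""

for link_text, target_id, target_text in match_links(doc):
    print(link_text, target_id, repr(target_text))
```

With this, the empty div with id idd1 is still reported (with empty text), which is the behaviour I would hope to get from the id annotation as well.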