I was setting up a test site and playing with trafilatura when I found a weird bug.
Site URL: https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/
As this test site is only available for 2 days, I have also attached the simple Gutenberg block code below so that you can replicate the issue.
Command:
import trafilatura
url = "https://milkfriends.s1-tastewp.com/2024/06/27/ok-this/"
html = trafilatura.fetch_url(url, no_ssl=True)
ts = trafilatura.extract(html, output_format='xml', include_comments=False)
It is a very simple extraction, but I find that some elements are extracted twice.
The elements below "this is sample intro" appear twice, but not all of them do: some of the list elements show up only once.
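To pin down exactly which elements are repeated, the text segments in the XML output can be counted. This is only a minimal sketch, assuming the output is well-formed XML; the helper name `find_duplicate_segments` and the `sample` string are made up for illustration and are not the real extraction result.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def find_duplicate_segments(xml_string):
    """Return {text: count} for element texts that occur more than once."""
    root = ET.fromstring(xml_string)
    texts = []
    for element in root.iter():
        # Only the leading text of each element is considered here.
        text = (element.text or "").strip()
        if text:
            texts.append(text)
    counts = Counter(texts)
    return {text: n for text, n in counts.items() if n > 1}


# Hypothetical stand-in for the string returned by trafilatura.extract(...)
sample = (
    "<doc><main>"
    "<p>this is sample intro</p>"
    "<head>heading</head><p>para</p>"
    "<head>heading</head><p>para</p>"
    "<list><item>one</item></list>"
    "</main></doc>"
)
duplicates = find_duplicate_segments(sample)
print(duplicates)
```

With the real output, `find_duplicate_segments(ts)` should list the segments that were extracted more than once, which makes it easier to compare against the original blocks.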
The WordPress Gutenberg HTML is below:
See the extraction below: