After some digging on mediacloud/story-indexer#278, it looks like changing our Trafilatura integration to use `favor_precision=True` could help. In the sample code I put together, run against a few test cases from our researchers, it helped in 3 of 4 cases. This needs more vetting to gauge the impact before we consider rolling out the change.
Test case (toggle the `favor_precision` argument to compare results):
import trafilatura
import requests

MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'


def is_text_in_webpage_content(txt: str, url: str) -> bool:
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=True)
    content_text = parsed['text']
    return txt in content_text


print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
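For vetting this on a larger sample, one option is to run each page through both settings and tally where the unwanted text shows up. The following is only a rough sketch, not part of the story-indexer code: `Case`, `trafilatura_extract`, and `tally_improvements` are hypothetical names, it calls `trafilatura.extract` rather than `bare_extraction` for simplicity, and it assumes the HTML has already been fetched so that results don't depend on the network.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Optional


class Case(NamedTuple):
    unwanted_text: str         # sidebar/footer snippet that should NOT be extracted
    html: str                  # page HTML, fetched ahead of time
    url: Optional[str] = None  # passed through to the extractor as a hint


def trafilatura_extract(html: str, url: Optional[str], favor_precision: bool) -> str:
    """Extract main text with Trafilatura (imported lazily; returns '' on failure)."""
    import trafilatura
    text = trafilatura.extract(html, url=url, include_comments=False,
                               favor_precision=favor_precision)
    return text or ""


def tally_improvements(
    cases: Iterable[Case],
    extract: Callable[[str, Optional[str], bool], str] = trafilatura_extract,
) -> Dict[str, int]:
    """Count how each case behaves with favor_precision off vs. on."""
    tally = {"fixed": 0, "still_leaking": 0, "clean_either_way": 0, "regressed": 0}
    for case in cases:
        leaks_default = case.unwanted_text in extract(case.html, case.url, False)
        leaks_precise = case.unwanted_text in extract(case.html, case.url, True)
        if leaks_default and not leaks_precise:
            tally["fixed"] += 1
        elif leaks_default and leaks_precise:
            tally["still_leaking"] += 1
        elif leaks_precise:
            tally["regressed"] += 1
        else:
            tally["clean_either_way"] += 1
    return tally
```

A nonzero `regressed` count would flag pages where the stricter setting pulls in unwanted text that the default left out. This does not check whether `favor_precision=True` drops legitimate article text, which would need a separate set of "must be present" snippets.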