Assess tweaks to content extraction to remove headlines at end of article #86

rahulbot · 2024-05-07T14:53:13Z

After some digging on mediacloud/story-indexer#278, it looks like tweaking our Trafilatura integration to use favor_precision=True could help. In the sample code I provided, run against a few test cases from our researchers, it helped in 3 of 4 cases. This needs more vetting to gauge the impact before we consider rolling out the change.

Test case (change the favor_precision variable to see results):

import requests
import trafilatura

MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'

def is_text_in_webpage_content(txt: str, url: str) -> bool:
    # Fetch the page with the Media Cloud user agent
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    # Toggle favor_precision here to compare extraction results
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=True)
    if parsed is None:  # extraction failed, so nothing was extracted
        return False
    content_text = parsed['text']
    return txt in content_text

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
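
For the further vetting mentioned above, one option is to tally both settings side by side over a larger set of cases. The following is only a rough sketch, not code from the repository: `Case`, `Tally`, `Extractor`, and `compare_favor_precision` are hypothetical names, and the extractor is passed in as a callable so the tally can be run on pre-fetched HTML without hitting the network.

```python
from typing import Callable, Iterable, NamedTuple


class Case(NamedTuple):
    unwanted_text: str  # snippet that should NOT survive extraction
    html: str           # page HTML, fetched ahead of time


class Tally(NamedTuple):
    leaked_default: int    # leaks with favor_precision=False
    leaked_precision: int  # leaks with favor_precision=True
    total: int             # number of cases checked


# Hypothetical extractor signature: (html, favor_precision) -> extracted text
Extractor = Callable[[str, bool], str]


def compare_favor_precision(cases: Iterable[Case], extract: Extractor) -> Tally:
    """Count how often the unwanted text leaks into the output under each setting."""
    leaked_default = leaked_precision = total = 0
    for case in cases:
        total += 1
        if case.unwanted_text in extract(case.html, False):
            leaked_default += 1
        if case.unwanted_text in extract(case.html, True):
            leaked_precision += 1
    return Tally(leaked_default, leaked_precision, total)
```

With Trafilatura, the extractor could be a small wrapper such as `lambda html, flag: trafilatura.extract(html, favor_precision=flag) or ''`. Note that this only counts leaked boilerplate; it does not measure whether favor_precision=True also drops legitimate article text, which would need its own check.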


rahulbot · 2024-05-07T15:18:43Z

(started testing work on the feature-favor-precision branch)

rahulbot · 2024-07-17T18:17:16Z

@pgulley can you queue this up to reassess with the test code vis-a-vis the comment at adbar/trafilatura#584 (comment)?

rahulbot added the enhancement New feature or request label May 7, 2024

rahulbot added a commit that referenced this issue May 7, 2024

favor_precision with Trafilatura and add first tests (all pass) #86

58d480d

pgulley added this to Ingest + Index Infrastructure Jul 24, 2024

pgulley moved this to Todo in Ingest + Index Infrastructure Jul 24, 2024

pgulley modified the milestones: 4 - September, 3 - August Jul 24, 2024

pgulley self-assigned this Jul 31, 2024

pgulley modified the milestones: 3 - August, 4 - September Aug 28, 2024
