Assess tweaks to content extraction to remove headlines at end of article #86

Open
rahulbot opened this issue May 7, 2024 · 2 comments

rahulbot commented May 7, 2024

After some digging on mediacloud/story-indexer#278, it looks like tweaking our Trafilatura integration to use favor_precision=True could help. In the sample code below, run on a few test cases from our researchers, it helped in 3 of 4 cases. This needs more vetting to gauge the impact before we consider rolling out the change.

Test case (change the favor_precision argument to see results; each call prints True if the unwanted text is still in the extracted content):

import trafilatura
import requests
MEDIA_CLOUD_USER_AGENT = 'Mozilla/5.0 (compatible; mediacloud academic archive; mediacloud.org)'

def is_text_in_webpage_content(txt: str, url: str) -> bool:
    """Return True if txt appears in the article text Trafilatura extracts from url."""
    req = requests.get(url, headers={'User-Agent': MEDIA_CLOUD_USER_AGENT}, timeout=30)
    parsed = trafilatura.bare_extraction(req.text, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=True)
    if parsed is None:  # extraction failed, so there is no text to search
        return False
    content_text = parsed['text']
    return txt in content_text

print(is_text_in_webpage_content(
    'Thai Official',  # item on bottom of page in "Latest News" section
    'https://www.ibtimes.co.uk/falling-inflation-shifts-focus-when-ecb-could-cut-rates-1722106'))
print(is_text_in_webpage_content(
    'HIV from Terrence Higgins to Today',  # <li> under the "listen on sounds" banner after article
    'https://www.bbc.co.uk/sport/football/67640638'))
print(is_text_in_webpage_content(
    'Madhuri Dixit',  # title of an item in the featured movie below the main content area
    'https://timesofindia.indiatimes.com/videos/lifestyle/fashion/10-indian-saris-every-woman-should-have-in-her-wardrobe/videoshow/105809845.cms'))
print(is_text_in_webpage_content(
    'Immigration, Ukraine',  # title of an item in the "most popular" sidebar content
    'https://www.bfmtv.com/cote-d-azur/nice-25-personnes-expulsees-lors-d-operations-anti-squat-menees-dans-le-quartier-des-liserons_AN-202312150639.html'))
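
For the wider vetting, it may help to run both settings side by side instead of editing the flag by hand. The following is only a rough sketch, not the story-indexer integration: `trafilatura_extract`, `compare_precision`, and the `pages` mapping of URL to already-fetched HTML are hypothetical names introduced here. It also assumes that `bare_extraction` may return either a dict or an object with a `text` attribute, depending on the Trafilatura version installed.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# An extractor takes (html, url, favor_precision) and returns text, or None on failure.
Extractor = Callable[[str, str, bool], Optional[str]]


def trafilatura_extract(html: str, url: str, favor_precision: bool) -> Optional[str]:
    """Extract article text with the same options as the test case above."""
    import trafilatura  # imported lazily so the harness below can run without it

    parsed = trafilatura.bare_extraction(html, only_with_metadata=False, url=url,
                                         include_images=False, include_comments=False,
                                         favor_precision=favor_precision)
    if parsed is None:
        return None
    # Assumption: older releases return a dict, newer ones may return an object.
    if isinstance(parsed, dict):
        return parsed.get('text')
    return getattr(parsed, 'text', None)


def compare_precision(cases: Iterable[Tuple[str, str]], pages: Dict[str, str],
                      extract: Extractor) -> List[dict]:
    """For each (unwanted phrase, url) case, check the phrase with the flag off and on."""
    rows = []
    for phrase, url in cases:
        found = {}
        for flag in (False, True):
            text = extract(pages[url], url, flag) or ''
            found[flag] = phrase in text
        rows.append({
            'url': url,
            'phrase': phrase,
            'default': found[False],     # phrase present with favor_precision=False
            'precision': found[True],    # phrase present with favor_precision=True
            'fixed': found[False] and not found[True],
        })
    return rows
```

To use it, fetch each test URL once (for example with the `requests.get` call from the test case), store the HTML in `pages`, and call `compare_precision(cases, pages, trafilatura_extract)`. Counting the rows where `fixed` is true gives the same kind of tally as the 3 of 4 result, and it also shows cases where the phrase was never extracted in the first place.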
@rahulbot rahulbot added the enhancement New feature or request label May 7, 2024

rahulbot commented May 7, 2024

(started testing work on the feature-favor-precision branch)

@rahulbot

@pgulley can you queue this up to reassess with the test code, in light of the comment at adbar/trafilatura#584 (comment)?
