investigate appearance of other article headlines in article content #278

Closed
rahulbot opened this issue May 2, 2024 · 3 comments
@rahulbot
Contributor

rahulbot commented May 2, 2024

While conducting research we've come across a higher-than-desired number of headlines from other articles appearing in the text of an article. This is often caused by news layouts that surround an article with links to lots of related or timely articles, or that embed a "related" articles box in the middle of the article. We need to characterize whether this problem is due to poor extraction or is something we need to think about harder.

This is a well-known problem to us. In the legacy system we tokenized each story into sentences and stored those sentences. Then we would look for sentences that appeared in multiple stories and eliminate them from every story after the first one that used them. This significantly reduced the problem for researchers. We decided to eliminate that feature because it was very costly (in both storage and compute).
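For readers unfamiliar with the legacy approach, here is a minimal sketch of the idea described above. It is not the legacy system's code: the function names, the naive regex sentence splitter, and the in-memory `seen` set are all assumptions made for illustration (the real feature persisted sentences, which is where the storage cost came from).

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption): break after . ! ? plus whitespace, and on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def dedup_sentences(stories: List[str]) -> List[List[str]]:
    """Keep each sentence only the first time it is seen, in story order.

    Later stories lose any sentence that an earlier story already used.
    Note that a sentence repeated inside one story is also dropped.
    """
    seen: Set[str] = set()  # hypothetical stand-in for the stored-sentence table
    result: List[List[str]] = []
    for text in stories:
        kept: List[str] = []
        for sentence in split_sentences(text):
            # Normalize case and whitespace so trivial variations still match.
            key = " ".join(sentence.lower().split())
            if key in seen:
                continue
            seen.add(key)
            kept.append(sentence)
        result.append(kept)
    return result
```

The `seen` set grows with every distinct sentence ever processed, which gives a rough sense of why this was expensive at scale.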

Some sample data to work with from Emily:

@rahulbot rahulbot added question Further information is requested data-quality labels May 2, 2024
@rahulbot rahulbot self-assigned this May 2, 2024
@rahulbot rahulbot added this to the Production Beta 6 milestone May 2, 2024
@philbudne
Contributor

philbudne commented May 3, 2024 via email

@rahulbot
Contributor Author

rahulbot commented May 3, 2024

Right: custom solutions for specific sources are something we always shy away from as too brittle and unscalable.

First observation: I pulled some examples from the supplied files, and they were all "related" or "most popular" links at the end of the content, not embedded in the middle. With that in mind I created some sample code showing the error (from our perspective) and asked for advice on adbar/trafilatura#584.
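One way to check the "end of the content" observation across more samples is to measure where each foreign headline lands in the extracted text. The sketch below is only an illustration, not the sample code mentioned above: `locate_foreign_headlines` is a hypothetical helper, it assumes you already have the extracted article text and a list of headlines from other stories, and it uses plain case-insensitive substring matching.

```python
from typing import List, Tuple


def locate_foreign_headlines(text: str, headlines: List[str]) -> List[Tuple[str, float]]:
    """Return (headline, relative_position) for each headline found in text.

    relative_position is the offset of the first match divided by the text
    length: values near 0.0 are at the start, values near 1.0 are at the end.
    """
    haystack = text.lower()
    hits: List[Tuple[str, float]] = []
    for headline in headlines:
        needle = headline.strip().lower()
        if not needle:
            continue  # skip empty headlines; they would match everywhere
        idx = haystack.find(needle)
        if idx >= 0:
            hits.append((headline, idx / len(haystack)))
    return hits
```

If most hits sit near the end of the text, that supports treating the problem as trailing "related" blocks rather than mid-article embeds.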

@rahulbot
Contributor Author

rahulbot commented May 7, 2024

Moved suggested actions into metadata-lib repo for consideration.
