investigate appearance of other article headlines in article content #278

rahulbot · 2024-05-02T22:03:13Z

While conducting research we've come across a higher-than-desired number of headlines from other articles appearing in the text of an article. This is often caused by news layouts that link to lots of related or timely articles around an article, or that embed a box of "related" articles in the middle of an article. We need to characterize whether this problem is due to poor extraction or is something we need to think about harder.

This is a well-known problem to us. In the legacy system we tokenized each story into sentences and stored those sentences. We then looked for sentences that appeared in multiple stories and eliminated them from every story after the first instance of usage. This significantly reduced the problem for researchers. We decided to eliminate that feature because it was very costly (in both storage and compute).
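For reference, the legacy approach described above could be sketched roughly as follows. This is a minimal illustration, not the old system's code: `split_sentences` and `dedupe_sentences` are hypothetical names, the regex splitter is a naive stand-in for a real sentence tokenizer, and the in-memory `seen` set stands in for whatever sentence store the legacy system used.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? plus whitespace, and at newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def dedupe_sentences(stories):
    """Drop any sentence that already appeared in an earlier story.

    Stories are processed in order, so the first story to use a
    sentence keeps it and later stories lose it.
    """
    seen = set()  # normalized sentences from all earlier stories
    cleaned = []
    for story in stories:
        kept = []
        story_keys = set()
        for sentence in split_sentences(story):
            # Normalize case and whitespace so trivial variations still match
            key = " ".join(sentence.lower().split())
            story_keys.add(key)
            if key not in seen:
                kept.append(sentence)
        # Only register sentences once the whole story has been processed,
        # so repeats inside a single story are left alone
        seen.update(story_keys)
        cleaned.append("\n".join(kept))
    return cleaned


stories = [
    "Mayor resigns amid scandal\nThe mayor stepped down on Monday.",
    "The council passed a budget.\nMayor resigns amid scandal",
]
print(dedupe_sentences(stories))
```

In this sketch the second story loses the trailing headline because the first story already used it. The `seen` set grows with every story processed, which hints at why the storage and compute cost became a concern at scale.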

Some sample data to work with from Emily:

philbudne · 2024-05-03T00:04:00Z

I remember seeing that the old system had a plethora of fetcher classes, and I've wondered if that was to deal with extracting the "meat" from particular sites' pages (tho it could also have been about link extraction)... If that was the case, I wonder if supplying a list of XPath strings, with the domains to use/try them on, could be a quick and dirty solution? BUT, I would expect any solution that enshrines knowledge of page structure to break often, and, since it would fall back to inhaling everything (rather than nothing), it would require constant maintenance to keep operating in any useful way.
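A rough sketch of what such a per-domain XPath list might look like, purely to make the idea concrete. Everything here is hypothetical: `DOMAIN_XPATHS`, `extract_text`, the `example.com` entry, and the `article-body` class are made up for illustration. It uses the standard library's `xml.etree.ElementTree`, which only supports a limited XPath subset and only parses well-formed markup; real-world HTML would need a more lenient parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical mapping: domain -> XPath expressions to try, in order
DOMAIN_XPATHS = {
    "example.com": [".//div[@class='article-body']"],
}


def _text(node):
    """Collect all text under a node and collapse whitespace."""
    return " ".join(" ".join(node.itertext()).split())


def extract_text(url, xhtml):
    """Try the domain's XPaths; fall back to the whole page if none match."""
    root = ET.fromstring(xhtml)
    domain = urlparse(url).netloc
    if domain.startswith("www."):
        domain = domain[4:]
    for xpath in DOMAIN_XPATHS.get(domain, []):
        node = root.find(xpath)
        if node is not None:
            return _text(node)
    # Fallback: "inhale everything", related-links boxes included
    return _text(root)


page = (
    "<html><body>"
    "<div class='article-body'><p>Main story text.</p></div>"
    "<div class='related'><a>Other headline</a></div>"
    "</body></html>"
)
print(extract_text("https://www.example.com/a", page))
print(extract_text("https://other.example.org/a", page))
```

The second call shows the failure mode: when no rule matches (unknown domain, or a site redesign that renames the class), the fallback silently pulls the unrelated headline back in.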

rahulbot · 2024-05-03T17:21:48Z

Right: custom solutions for specific sources are always something we shy away from as too brittle and unscalable.

First observation: I pulled some examples from the supplied files and they were all things that were "related" or "most popular" links at the end of the content, not embedded in the middle. With that in mind I created some sample code showing the error (from our perspective) and asked for some advice on adbar/trafilatura#584.

rahulbot · 2024-05-07T14:53:46Z

Moved suggested actions into metadata-lib repo for consideration.

rahulbot added the question and data-quality labels May 2, 2024

rahulbot self-assigned this May 2, 2024

rahulbot added this to the Production Beta 6 milestone May 2, 2024

rahulbot mentioned this issue May 7, 2024

Assess tweaks to content extraction to remove headlines at end of article mediacloud/metadata-lib#86

Open

rahulbot closed this as completed May 7, 2024

rahulbot mentioned this issue May 7, 2024

Consider approaches to sentence-based deduplication mediacloud/sous-chef#18

Open
