While conducting research we've come across a higher-than-desired number of headlines from other articles appearing in the text of an article. This is often caused by news layouts that link to lots of related or timely articles around an article, or embed a "related articles" box in the middle of an article. We need to characterize whether this problem is due to poor extraction or is something we need to think about harder.
This is a well-known problem to us. In the legacy system we used to tokenize each story into sentences and store those sentences. We would then look for sentences that appeared in multiple stories and remove them from every story after the first one that used them. This significantly reduced the problem for researchers. We decided to eliminate that feature because it was very costly (in both storage and compute).
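For readers unfamiliar with that approach, here is a minimal sketch of the idea, not the legacy implementation. The function names, the naive regex sentence splitter, and the in-memory `seen` set are all assumptions for illustration; the real system stored sentences persistently, which is where the storage cost came from.

```python
import hashlib
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by
    # whitespace, and on newlines so unpunctuated headlines stand alone.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_key(sentence):
    # Normalize case and whitespace, then hash, so only a fixed-size
    # key has to be remembered per sentence.
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe_stories(stories):
    # Process stories in order; a sentence is kept only the first time
    # it is seen anywhere (including earlier in the same story).
    seen = set()
    cleaned = []
    for text in stories:
        kept = []
        for sentence in split_sentences(text):
            key = sentence_key(sentence)
            if key in seen:
                continue
            seen.add(key)
            kept.append(sentence)
        cleaned.append(" ".join(kept))
    return cleaned


stories = [
    "Mayor opens new bridge.\nStorm hits coast",
    "Storm hits coast\nThousands evacuated.",
]
cleaned = dedupe_stories(stories)
```

Note that the first story to carry a stray headline still keeps it; only later stories are cleaned, which matches the "after the first instance" behavior described above.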
I remember seeing that the old system had a plethora of fetcher
classes, and I've wondered if that was to deal with extracting the
"meat" from particular sites' pages (though it could also have been
about link extraction)...
If that was the case, I wonder if supplying a list of XPath strings,
with the domains to use/try them on, could be a quick and dirty solution?
BUT, I would expect any solution that enshrines knowledge of page
structure to break often, and even if it fell back to inhaling
everything (rather than nothing), it would require constant
maintenance to keep it operating in any useful way.
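To make the XPath idea concrete, a rough sketch follows. Everything in it is hypothetical: the `CONTENT_XPATHS` table, the `example.com` rule, and the `extract_text` helper are invented for illustration. It also uses the standard library's `xml.etree.ElementTree`, which only handles well-formed markup and a limited XPath subset; real news pages would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical per-domain rules: each domain maps to XPath expressions
# to try in order. These would need constant upkeep as layouts change.
CONTENT_XPATHS = {
    "example.com": [".//div[@class='article-body']"],
}


def _flatten(node):
    # Join all text under a node and collapse runs of whitespace.
    return " ".join(" ".join(node.itertext()).split())


def extract_text(url, markup):
    root = ET.fromstring(markup)  # assumes well-formed (XHTML-like) input
    domain = urlparse(url).netloc.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    for xpath in CONTENT_XPATHS.get(domain, []):
        nodes = root.findall(xpath)
        if nodes:
            return " ".join(_flatten(n) for n in nodes)
    # No rule for this domain, or no rule matched: inhale everything
    # rather than returning nothing.
    return _flatten(root)


page = (
    "<html><body>"
    "<div class='article-body'><p>Real story.</p></div>"
    "<div class='related'><a>Other headline</a></div>"
    "</body></html>"
)
with_rule = extract_text("https://www.example.com/a", page)
without_rule = extract_text("https://other.org/a", page)
```

The fallback branch is exactly the failure mode described above: once a site changes its markup, the rule silently stops matching and the related headlines come back.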
Right: custom solutions for specific sources are always something we shy away from as too brittle and unscalable.
First observation: I pulled some examples from the supplied files and they were all "related" or "most popular" links at the end of the content, not embedded in the middle. With that in mind I created some sample code showing the error (from our perspective) and asked for some advice on adbar/trafilatura#584.
Some sample data to work with from Emily: