Automatically redirect to articles with same checksum #33

Open
satyamtg opened this issue Jul 18, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@satyamtg
Contributor

As discussed in openzim/sotoki#162 (comment), it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system that keeps a single copy of a resource and creates redirects when that resource is duplicated (or fails intelligently so we can handle it).

@kelson42
Contributor

kelson42 commented Jul 18, 2020

To me, for the moment, such a feature would be better placed in python-scraperlib (or any higher-level library) because:

  • This is too smart to be done in libzim
  • I believe the scraper (not libzim) should basically be able to do things in a clean manner
  • I understand that, under certain special conditions, a high-level scraper might be better off relying on a lower-level smart feature like this for a certain range of articles

@kelson42 kelson42 transferred this issue from openzim/libzim Jul 19, 2020
@rgaudin
Member

rgaudin commented Jul 20, 2020

Also, as discussed with @kelson42, articles have no checksum in the ZIM. I was led to think they did based on zimcheck's duplicates output, but it is zimcheck that calculates those.

What we could do is have a helper in scraperlib that calculates checksums, stores them and compares them to adjust behavior (create redirects?).
That would be extra and should be enabled on a subset of articles via some filtering pattern.
The main use case would be zimit, where the scraper has no control over the content. In this case, if zimcheck reports duplicates, we could enable this mechanism in the recipe by specifying the filtering patterns.

This feature could have a HUGE impact on resources (CPU, RAM, potentially IO), so its goal will be to clear duplicates in cases where that cannot be done in the scraper. Non-generic scrapers should take care of duplicates themselves.
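
To make the proposal above concrete, here is a minimal sketch of such a helper. It is not scraperlib or libzim code: `DuplicateIndex`, its `check` method, and the plain dicts standing in for the ZIM creator are all hypothetical names chosen for illustration, and SHA-256 and glob-style patterns are assumptions rather than anything decided in this thread.

```python
import fnmatch
import hashlib
from typing import Dict, List, Optional


class DuplicateIndex:
    """Remembers the first path seen for each content digest (hypothetical helper)."""

    def __init__(self, patterns: List[str]):
        # Only paths matching one of these glob patterns are hashed,
        # which bounds the CPU and RAM cost of the feature.
        self.patterns = patterns
        self.seen: Dict[str, str] = {}  # digest -> first path with that content

    def is_watched(self, path: str) -> bool:
        return any(fnmatch.fnmatchcase(path, pattern) for pattern in self.patterns)

    def check(self, path: str, content: bytes) -> Optional[str]:
        """Return the path of an earlier identical entry, or None if there is none."""
        if not self.is_watched(path):
            return None
        digest = hashlib.sha256(content).hexdigest()
        # setdefault records `path` on first sight and returns the recorded path.
        first = self.seen.setdefault(digest, path)
        return None if first == path else first


# Plain dicts stand in for the real ZIM creator in this sketch.
index = DuplicateIndex(patterns=["images/*"])
entries: Dict[str, bytes] = {}
redirects: Dict[str, str] = {}

for path, content in [
    ("images/a.png", b"PNG"),
    ("images/b.png", b"PNG"),  # same bytes as images/a.png
    ("css/a.css", b"PNG"),     # same bytes, but outside the filtering pattern
]:
    target = index.check(path, content)
    if target is None:
        entries[path] = content
    else:
        redirects[path] = target
```

Storing only digests keeps memory proportional to the number of watched entries rather than their size, but every watched entry still has to be read and hashed in full, which is where the resource cost mentioned above comes from.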

@rgaudin rgaudin changed the title A way to automatically redirect to articles with same checksum Automatically redirect to articles with same checksum Jul 20, 2020
@kelson42 kelson42 added the enhancement New feature or request label Aug 9, 2020
@kelson42 kelson42 pinned this issue Aug 9, 2020
@stale

stale bot commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@benoit74
Collaborator

I will start working on an implementation of this issue and will open a PR once I have something ready to review. I will try to follow the advice given above.

@stale stale bot removed the stale label Apr 20, 2022
@kelson42 kelson42 assigned benoit74 and unassigned rgaudin and kelson42 Apr 21, 2022
@kelson42 kelson42 added this to the 1.5.0 milestone Apr 21, 2022
@kelson42 kelson42 removed this from the 1.7.0 milestone Jun 11, 2022
@stale

stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Aug 13, 2022
@kelson42
Contributor

Maybe we should better use aliases?

@stale stale bot removed the stale label Dec 16, 2023
@rgaudin
Member

rgaudin commented Dec 16, 2023

> Maybe we should better use aliases?

Doesn't solve anything. We still don't know, ahead of adding the entry, that it's a duplicate. If we did, we'd probably do things differently depending on the scraper: not include the resource, use an alias, or use a redirect.
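
The point can be sketched as follows: detection has to happen before the entry is added, and only then can a scraper-specific policy choose between the three outcomes. This is an illustration only; `OnDuplicate`, `plan_entry`, and the returned action strings are hypothetical and do not correspond to any existing scraperlib or libzim API.

```python
import hashlib
from enum import Enum
from typing import Dict, Optional, Tuple


class OnDuplicate(Enum):
    """What a given scraper wants to do with a duplicate (hypothetical policy)."""

    SKIP = "skip"          # do not include the resource at all
    ALIAS = "alias"        # second path pointing at the same content
    REDIRECT = "redirect"  # redirect entry to the first path


def plan_entry(
    path: str, content: bytes, seen: Dict[str, str], policy: OnDuplicate
) -> Tuple[str, Optional[str]]:
    """Decide, before adding anything, what to do with `path`.

    Returns ("add", None) for new content, otherwise (policy value, first path).
    """
    digest = hashlib.sha256(content).hexdigest()
    first = seen.get(digest)
    if first is None:
        seen[digest] = path  # remember where this content was first added
        return ("add", None)
    return (policy.value, first)


seen: Dict[str, str] = {}
first_plan = plan_entry("a.html", b"same", seen, OnDuplicate.REDIRECT)
second_plan = plan_entry("b.html", b"same", seen, OnDuplicate.REDIRECT)
third_plan = plan_entry("c.html", b"same", seen, OnDuplicate.SKIP)
```

Whichever action is chosen, the duplicate check itself is the same, which is why switching from redirects to aliases alone does not remove the need for it.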


5 participants