Automatically redirect to articles with same checksum #33
Comments
To me, for the moment, such a feature should better be in python-scraperlib (or any higher level library) because:
Also, as discussed with @kelson42, articles have no checksum in the ZIM. I was led to think they did based on zimcheck's duplicates output, but it is zimcheck that calculates those. What we could do is have a helper in scraperlib that calculates checksums, stores them and compares them so behavior can be adjusted (create redirects?). This feature could have a HUGE impact on resources (CPU, RAM, potentially IO), so its goal would be to clear duplicates only in cases where it cannot be done in the scraper. Non-generic scrapers should take care of duplicates themselves.
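A minimal sketch of what such a scraperlib helper could look like, assuming SHA-256 digests kept in memory and a zimscraperlib-style Creator. The names below (`DedupTracker`, `add_or_redirect`, `creator.add_redirect`, `creator.add_item_for`) are illustrative assumptions, not the actual scraperlib API:

```python
# Hypothetical de-duplication helper, sketched for illustration only.
import hashlib


class DedupTracker:
    """Remember the checksum of every payload added so far.

    One 32-byte SHA-256 digest per entry is where the RAM cost comes
    from; hashing every payload is the CPU cost mentioned above.
    """

    def __init__(self):
        self._seen: dict[bytes, str] = {}  # digest -> path of first copy

    def first_path_for(self, content: bytes) -> str | None:
        """Return the path of an identical, already-added entry (or None)."""
        return self._seen.get(hashlib.sha256(content).digest())

    def remember(self, content: bytes, path: str) -> None:
        self._seen[hashlib.sha256(content).digest()] = path


def add_or_redirect(creator, tracker, path, title, content, mimetype):
    """Add the entry, or only a redirect if an identical payload already exists."""
    original = tracker.first_path_for(content)
    if original is not None:
        # Assumed redirect primitive; the real Creator signature may differ.
        creator.add_redirect(path, original, title=title)
        return
    tracker.remember(content, path)
    # Assumed item-adding call; the real Creator signature may differ.
    creator.add_item_for(path=path, title=title, content=content, mimetype=mimetype)
```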
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
I will start to work on an implementation of this issue and will open a PR once I have something ready to review. I will try to follow the advice mentioned above.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Maybe we should use aliases instead?
That doesn't solve anything. We still don't know, before adding the entry, that it is a duplicate; otherwise we would probably do things differently depending on the scraper: not include the resource, use an alias, or use a redirect.
As discussed in openzim/sotoki#162 (comment), it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system that keeps a single copy of a resource and creates redirects when it is duplicated (or fails intelligently so the scraper can handle it).
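One way this could look from the scraper's point of view, as a hedged sketch: the scraper picks a policy for duplicates (redirect, skip, or fail so it can handle the case itself). The `DuplicateAction` enum and `handle_duplicate` function are hypothetical names, and `creator.add_redirect` is an assumed signature:

```python
# Hypothetical duplicate-handling policy, sketched for illustration only.
from enum import Enum


class DuplicateAction(Enum):
    REDIRECT = "redirect"  # keep a single copy and add a redirect to it
    SKIP = "skip"          # silently drop the duplicate entry
    FAIL = "fail"          # "fail intelligently": let the scraper decide


def handle_duplicate(creator, action, path, title, original_path):
    """React to a payload that duplicates an already-added entry."""
    if action is DuplicateAction.REDIRECT:
        # Assumed redirect primitive; the real Creator signature may differ.
        creator.add_redirect(path, original_path, title=title)
    elif action is DuplicateAction.SKIP:
        return
    else:
        raise RuntimeError(f"{path} duplicates already-added {original_path}")
```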