-
-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revisit handling of images processing and other fixes #2143
base: main
Are you sure you want to change the base?
Conversation
8ab34d2
to
8cf2027
Compare
@benoit74 Just to be clear, glad to see you working on the issue, but I don't think put webp content in path ending with .png (just an example) is a good idea at all. It is simply semantically wrong and we should not do that IMHO. Current approach works (modulo bugs - like always) and if we really want to do better we should keep track about the content mime-type (instead of relying on the extension). |
I agree, but this would mean a significant redesign of the scraper: with current architecture, as stated in the issue, we cannot know at HTML rewriting time what the result of image download/conversion will be ; for this we need to download the image and try the reencoding, which is currently done at a totally different stage. For now I prefer to have a scraper producing working ZIMs under all conditions with some semantic incoherence invisible to 99% of our users, rather than having non-working ZIMs like #2088. I do not mind to open an issue to fix this semantic incoherence on the medium / long term. For the record, this semantic incoherence is already present since "forever" in S3 keys used to cache image and we lived pretty well with it. |
Sample ZIMs:
|
It's not and should be any incoherence in S3 because the entry is tagged "webp" AFAIK. |
8cf2027
to
35b686a
Compare
S3 key is computed from online URL directly without any logic handling webp conversion: Line 591 in fc2af69
|
35b686a
to
c413004
Compare
c413004
to
88bbeec
Compare
Codecov ReportAttention: Patch coverage is
❌ Your patch check has failed because the patch coverage (75.55%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #2143 +/- ##
==========================================
- Coverage 75.98% 75.19% -0.79%
==========================================
Files 41 41
Lines 3202 3213 +11
Branches 706 704 -2
==========================================
- Hits 2433 2416 -17
- Misses 655 676 +21
- Partials 114 121 +7 ☔ View full report in Codecov by Sentry. |
This is kinda a significant PR to fix many issues around images processing.
Fix #2140
Fix #2136
Fix #2088
Fix #2138
Changes
isImageUrl
,getMimeType
and constants:IMAGE_URL_REGEX
mime-type
package.webp
suffix to the path of images which have been converted to webp.webp
path prefix on reencoded images is skewed #2140, and as observed in Do not rely on URL filename extension to detect images #2088, we cannot have this information at HTML processing timedownloader.downloadContent
when downloading content, instead of the whole response upstream (which could contain "anything")