-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RGBA PNG saved as PDF renders incorrectly in some applications #8074
Comments
Could you share the code that you're running to get that output from pdf.js? |
The pdf.js output (as well as the Evince one for Evince) is just a screenshot of what I see when opening the PDF file with Firefox where pdf.js is the default viewer. |
Pillow uses JPXDecode when saving RGBA PDFs. Two issues have been opened with pdf.js about rendering from this filter - mozilla/pdf.js#16782 and mozilla/pdf.js#17416 - so I don't think this is explicitly a Pillow bug. |
Thanks for the research. Given that Evince and Okular show the same behavior, it seems like this is not too uncommon for free tools. As a consequence, it seems like using Pillow to convert RGBA images to PDF files still has its limitations (although rather on the client side) after having been unsupported previously. Thus I am still left with either pasting the RGBA image onto a white background for the conversion or avoid the color space conversion altogether. |
As a heads up out of curiosity: Do we really need the image conversion from PNG to JPEG2000? Shouldn't we be able to just use the original PNG image inside the PDF file? |
You can see on page 31 of https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf that JPEG2000 is used in the JPXDecode filter. JPEG2000 is naturally supported in a way that PNGs are not. It does appear that we could split up the data into different roles and pass that to the PDF, but it's not as simple as JPEG2000 encoding. |
The PdfImage class in pdfrwx embeds images with transparency using PDF image xobject's soft masks — which is how it should be embedded according to PDF specs. (disclaimer: I am the author of pdfrwx; and I should also mention that the interface of the class will substantially change in the upcoming new version, so stay tuned). |
the approach of using JPX encoding is valid and in accordance with PDF spec : the issue is on definitely on pdf.js. does any want to raise up on mozilla/pdf.js#16782 ? 😳 |
I am not sure about that:
— this is from Adobe PDF Reference v. 1.7 page 88. If this is what you're referring to then the opacity mentioned above is not the same as transparency everywhere else (PNG etc.). This feature is specific to the JPEG2000 and has made its way into the standard simply for compatibility reasons (i.e., to be able to "just embed the JPEG2000 file"). In particular, please pay attention to the note above. This note should discourage any implementer to put anything into the JPEG2000 opacity channels other than in situations where you already have a JPEG2000 file in the first place and just want to keep it in the stream in exactly the state you got it in. And even in those circumstances, one should really just split JPEG2000 into a several image XObjects (one per channel with opacity). Because if you don't then you're at the mercy of the PDF processing application to interpret the opacity however it likes. |
The use of Opacity/Transparency in JPX is perfectly valid, even more, it provides capability to define 1 transparency information per a secondary channel which is not possible using masks. |
As I said, if you start with a JPEG2000 file (which possibly has channel-specific opacity) I might see reasons to keep it all inside one /JPXDecode -encoded dictionary stream. However, when starting with any other image format I know of, including PNG, 1) there's no benefit from the possibility of using channel-specific opacity since since we don't have any to begin with; 2) there's the downside of doing something which is not fully described by the spec, as opposed to something that is fully described by it. So, given this, could you explain the logic behind the decision to encode PNGs (or any other image format with transparency besides JPEG2000, for that matter) with /JPXDecode? Added: PNG specifically can use filters, which have their exact counter-parts in PDF (see PNG predictors in /FlateDecode). It makes all the more sense to encode PNGs as /FlateDecode. |
a) I remind that the issue is identified as an issue not yet solved of pdf.js : I see no reason to change pillow to fix it |
This is not an issue limited to pdf.js - It seems like most libraries/viewers/tools based upon poppler and/or cairo share the same limitation. Thus it depends on the internal implementation of each library. If Pillow is able to generate PDF files which render correctly with the existing libraries while ideally allowing for clean extraction with |
just reminding the current implementation:
|
I would strongly suggest changing the policy on the last line to using PDF image SMask-s. This is what SMask-s have been intended for in the first place. I also believe that for the case of "converting a JPEG2000 file to PDF" it should be possible to repackage JPEG2000 files in LA/RGBA modes as two separate files, one for the colors and one for the transparency, without the need to re-encode the pixel data, and then make a PDF image with two dictionary streams, for the image itself and for its SMask, encoded with /JPXDecode filters. But I'm not sure how relevant this issue is to Pillow in particular since looks like Pillow does not keep the original encoded JPEG2000 streams around when opening a JPEG2000 file (see #7896). To reiterate: I believe that PDF files with images containing streams of LA/RGBA images in the /JPXDecode filter should be avoided altogether. |
for me pillow deals with image manipulation /conversion and PDF is not an image format specially if you consider that sMask solution as 2 images. |
PDF treats images with SMask-s as one image. Dictionary-wise, SMask is just another entry in the image dictionary. It looks especially simple in pdfrw: image = IndirectPdfDict()
mask = IndirectPdfDict()
mask.Subtype = PdfName.Image
mask.stream = alpha_stream
image.Subtype = PdfName.Image
image.stream = colors_stream
image.SMask = mask That's it. Does Pillow already have a PyPDF dependency for other things? If not, I would suggest taking a look at pdfrw as it is trivial to create low-level objects with it. |
I would prefer Pillow just implement the SMask solution itself rather than add an external dependency. I've created #8097. See what you think. |
Apart from avoiding a third-party dependency, pdfrw has been unmaintained for years, thus adding a new dependency on it does not really feel future-proof. |
The PDF.js issue has now been resolved! mozilla/pdf.js#16782 It will still require another release of PDF.js, and then a release of Firefox to include that, but it is a positive development, and I would rather wait for that proper fix than the workaround of my PR. |
Although pdf.js might have a fix, it seems like poppler/cairo still has this issue and is widely used as well: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1486 |
poppler being used in Evince, and so is effectively the other software that you reported in the original comment. I'm not sold on the idea that if a viewer has a bug, then Pillow needs to workaround that - it seems like a slippery slope towards accepting responsibility for the problems of every image viewer out there. Sure, sometimes if a viewer isn't displaying an image correctly, that's a sign that Pillow has made a mistake, but pdf.js accepted the bug and fixed their end, so that isn't the case here. In the case of the image that you posted here, I don't see any transparency, so you could quite easily workaround this situation by converting the image to RGB. |
I am aware that you surely are not responsible for the rendering of other tooling. poppler just tends to be more or less the default library for most PDF viewers as far as I am aware. Personally, I see a general issue with converting images with an alpha channel to a fixed background, as this would require actual automated content analysis to choose the correct background color (screenshots and most common images tend to work with white indeed), but this is not directly related to this issue. |
Testing with the latest version of Firefox (129.0.2), it is now rendering the PDF correctly - so the PDF.js fix has made it through. |
Regardless of viewer issues, I feel like image transparency should generally be handled via |
The practical downside to this is that #8097 increases the PDF size of this issue's image from 15,721 bytes to 26,604 bytes.
I'm not following you here. Pillow currently offers a limited set of options when saving a PDF - users aren't able to arbitrarily choose their own filter (without change the image mode, that is). To what end is it helpful if the data is saved in a flexible manner? Are you planning on running other software over Pillow's PDFs, to... I don't know, remove the transparency, and that's easier if the mask is separate? |
The reason SMasks have been created as a separate structure in the PDF language (as opposed to just providing support in the standard for image streams that support transparency, such as PNG, TIFF and JPEG2000) is indeed that of flexibility. Here's one example: suppose you want to produce a version of a PDF for the web, which is was originally intended for print. You'd might need to convert CMYK images to RGB, and compress those in RGB to reduce the file size. The code that does the CMYK-RGB conversion is independent of the alpha-channel and can work on the colors channel only. Same with the compression code. Moreover, the compression is normally (e.g. in Acrobat) done on colors channel only as compressing the transparency mask can (and often does) introduce visible artifacts. In summary: there's a reason why SMask-s have been introduced, and before arguing against their utility I may suggest studying the history of the subject first; a chat with ChatGPT might be of use here. |
I was referring to the image from the start of this issue, saved with the code from the start of this issue -
My experience in #8097 is that the SMask is a separate image, and the output is noticeably larger. If you're able to put together an implementation demonstrating that I've missed a simple way to drastically reduce the file size, feel free to prove me wrong.
Ok, so the 'flexibility' argument is intended to help Pillow's internal function, and this isn't for the benefit of the end user? Except Pillow doesn't perform conversions like you're describing when writing the PDF images.
If you have an example of an image that Pillow is saving imperfectly, that would be helpful to demonstrate this. I'm not trying to argue that SMasks are bad - I'm asking what the benefit is. It doesn't seem to simplify the code, and from what I've seen so far, it increases file size, so I'm wondering why they're helpful to either Pillow or the end user. I think we need a reason to change Pillow's current behaviour by using explicit SMasks for LA/RGBA, not a reason not to change it. |
A thing to note is that this does not only compare separate vs. inline alpha, but also JPEG vs. JPEG2000, where the latter commonly gives a better quality:compression ratio, so it'd be interesting what part of that difference might just be caused by the other codec? However, if it's true that the increase is mainly due to the separate SMask, then that's a fair point. That said, if you don't mind the somewhat uncommon JPEG2000, why not use it for L/RGB/CMYK, too? That might be more consistent. |
I tried this with diff --git a/src/PIL/PdfImagePlugin.py b/src/PIL/PdfImagePlugin.py
index e9c20ddc1..2d91bd924 100644
--- a/src/PIL/PdfImagePlugin.py
+++ b/src/PIL/PdfImagePlugin.py
@@ -88,7 +88,7 @@ def _write_image(
dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceGray")
procset = "ImageB" # grayscale
elif im.mode == "L":
- decode_filter = "DCTDecode"
+ decode_filter = "JPXDecode"
# params = f"<< /Predictor 15 /Columns {width-2} >>"
dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceGray")
procset = "ImageB" # grayscale
@@ -116,7 +116,7 @@ def _write_image(
image_ref = _write_image(smask, filename, existing_pdf, image_refs)[0]
dict_obj["SMask"] = image_ref
elif im.mode == "RGB":
- decode_filter = "DCTDecode"
+ decode_filter = "JPXDecode"
dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceRGB")
procset = "ImageC" # color images
elif im.mode == "RGBA": import os
from PIL import Image
im = Image.open("Tests/images/hopper.png")
for mode in ("L", "RGB"):
im.convert(mode).save(mode+".pdf")
print(mode, os.stat(mode+".pdf").st_size) but it actually increased the filesize.
Testing the image from the start of this issue with #8097 but still with JPXDecode, diff --git a/src/PIL/PdfImagePlugin.py b/src/PIL/PdfImagePlugin.py
index 355eb55fb..1a2561bbc 100644
--- a/src/PIL/PdfImagePlugin.py
+++ b/src/PIL/PdfImagePlugin.py
@@ -106,7 +106,7 @@ def _write_image(im, filename, existing_pdf, image_refs):
if "transparency" in im.info:
smask = im.convert("LA")
elif im.mode in ("RGB", "RGBA"):
- filter = "DCTDecode"
+ filter = "JPXDecode"
dict_obj["ColorSpace"] = PdfParser.PdfName("DeviceRGB")
procset = "ImageC" # color images
if im.mode == "RGBA":
@@ -145,6 +145,9 @@ def _write_image(im, filename, existing_pdf, image_refs):
# use a single strip
strip_size=math.ceil(width / 8) * height,
)
+ elif filter == "JPXDecode":
+ del dict_obj["BitsPerComponent"]
+ Image.SAVE["JPEG2000"](im, op, filename)
elif filter == "DCTDecode":
Image.SAVE["JPEG"](im, op, filename)
else: I get 25,595 bytes - so the increase is indeed primarily due to the SMask, not the change in filter. |
Guys, this discussion, frankly, makes no sense to me: why would you do lossy compression (JPEG2000) on an image if the user is not asking for it? The user just wants to save an image in a particular format (PDF). The user expects that when reading back the saved PDF they will obtain exactly the same image as the original. What you are describing — using JPEG2000 to compress the image — is a behavior that is detrimental to the image quality and deceptive with respect to the user. The questions of "how easy it is to implement this", "what the file size is", "if it works then just let it work", "it's the fault of PDF.js" etc, etc — are all completely irrelevant. If you want to write a library that does what the user wants then do as the user wants. Otherwise no one will be using it. I personally would never use the
This looks wrong to me 1) why the need to convert? 2) this conversion produces a two-layer image, not a one-layer image needed for an SMask. The correct way is to split a PIL image and take the alpha-layer. Also, if for some reason you still really really want to compare things under (lossy) JPEG2000 compression then I'm assuming your realize that when trying to prove that "SMasks increase file size" you should compare a JPXDecode-encoded SMask (A) + JPXDecode-encoded image with the alpha-channel removed (B) vs JPXDecode-encoded image with the alpha-channel (C) [while the compression strength, block sizes and other parameters of the JPEG2000 compression are kept the same]. In other words, you did make sure that the size of (B) is smaller than the size of (C), right? |
|
When Pillow encodes JPEG2000 data for a PDF, it does so internally, and by default, Pillow's JPEG2000 images are lossless. See https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html#jpeg-2000-saving
So JPXDecode should be lossless for Pillow.
For clarity, the code you're referencing comes from #8097, which hasn't been merged. In that part of the code, it is dealing with a P mode image. You can't directly split the alpha channel out of a palette image, so a conversion is necessary. Later on in that PR, if smask:
smask = smask.getchannel("A") will only keep the alpha channel from the converted image.
To state this here for any who don't click on it, the link you've provided doesn't actually provide an example image. Rather, I think you are continuing the discussion about lossy saving. Hopefully I've already addressed that. |
What did you do?
Convert a PNG file (screenshots created on Gnome and on Windows) to PDF.
What did you expect to happen?
The rendered PDF looks correct.
What actually happened?
The output looks correct in MuPDF and Chromium, but incorrect in Evince and
pdf.js
.What are your OS, Python and Pillow versions?
Input:
Output: out2.pdf
Rendered output from Evince:
Rendered output from pdf.js:
The text was updated successfully, but these errors were encountered: