Handle unicode and str in exporters for py2 #274

Open
wants to merge 1 commit into
base: master

@guewen guewen commented Aug 23, 2018

Before this commit, in py2 only bytes strings were exported and in py3
only unicode strings were exported.

I'm not sure I'm doing it right, but it's at least an opening for a discussion.

Fixes #273

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@guewen
Author

guewen commented Aug 23, 2018

I signed it!

Hey, I signed the CLA :)

@googlebot

CLAs look good, thanks!

@guewen
Author

guewen commented Aug 23, 2018

I know the tests don't pass on py3, but before refining I'd like confirmation that I understood the goal correctly and that I'm not heading in the wrong direction.

@c24t
Member

c24t commented Jan 22, 2019

Thanks for the PR @guewen. Using the six types to solve the problem of exporting unicode attribute values looks right to me.

In general though this problem is a big can of worms, and this PR makes it clear that we need to be more careful about internal use of the str type.

If I understand correctly: it looks like the library assumes string-valued attribute values are always strs. This is usually a safe assumption, but it means that we can't store non-ASCII characters in python 2.x. This is a problem for code that naively uses unicodes in place of strs since we'll silently fail to export these attributes.
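As a rough illustration of the type check being discussed, here is a minimal sketch that accepts both kinds of string on python 2.x and plain strs on 3.x. It mirrors what `six.string_types` resolves to, but avoids the dependency so it runs standalone. The names `is_string_value` and `export_attribute` and the `string_value` key are hypothetical and not taken from the library.

```python
# Hypothetical helpers for illustration only; not the library's actual code.
try:
    string_types = (basestring,)  # Python 2: parent type of both str and unicode
except NameError:
    string_types = (str,)  # Python 3: str is already the text type


def is_string_value(value):
    """Return True for any string-like attribute value on either Python version."""
    return isinstance(value, string_types)


def export_attribute(value):
    """Wrap string values for export; return None for unsupported types."""
    if is_string_value(value):
        return {'string_value': value}
    # A check written as isinstance(value, str) would land here for
    # unicode values on Python 2, which is the silent drop described above.
    return None
```

Note that on python 3.x this sketch still rejects `bytes` values, so whether those should be decoded or dropped remains a separate decision.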

So in python 2.x, strs are byte strings, and ASCII is only the default codec used for implicit conversions. Decoding a byte string with any valid encoding gets you a unicode... which is not a str:

>>> type(b'abc')
str

>>> type(b'abc'.decode('utf-8'))
unicode

>>> isinstance(b'abc'.decode('utf-8'), str)
False

And in python 3.x strs are effectively python 2.x's unicodes, and byte strings are demoted to bytes with no implicit encoding:

>>> type(b'abc')
bytes

>>> type(b'abc'.decode('utf-8'))
str
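One way to paper over this difference is to normalize every string value to the text type before export. The sketch below assumes UTF-8 for byte strings, which is an assumption on my part since nothing here fixes an encoding, and `to_text` is a hypothetical name.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


def to_text(value, encoding='utf-8'):
    """Return value as text, decoding byte strings with the given encoding.

    The UTF-8 default is an assumption made for this sketch.
    """
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    raise TypeError('expected a text or byte string, got %r' % type(value))
```

With this, `to_text(b'abc')` and `to_text(u'abc')` produce the same text value on both versions, at the cost of adding a decode call where a byte string would otherwise stay a byte string.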

We have to support both versions of python, and have to support non-ASCII characters in attribute values. But the spec also says to truncate these strings to 256 bytes without specifying an encoding.

In both 2 and 3, decoding a byte string with any valid encoding gets you a unicode/str, which is itself stored internally as unicode, using up to 4 bytes per character depending on the python implementation. Among other problems, this means that we might truncate a 256 character string down to 64 characters even if it's possible to encode it with ASCII. This is a moot point for now since it doesn't look like we're actually truncating these strings, but it does suggest we have to be careful making changes like this one that add decode calls where byte strings would otherwise stay byte strings.
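If truncation were implemented, one option would be to measure the limit in encoded bytes rather than in characters. This is only a sketch under the assumption that UTF-8 is the wire encoding, which the spec does not state, and `utf8_truncate` is a hypothetical name.

```python
def utf8_truncate(text, max_bytes=256):
    """Truncate text so its UTF-8 encoding fits in max_bytes.

    Assumes UTF-8 as the target encoding; only whole characters are kept.
    """
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return text
    # Slicing may cut a multi-byte character in half; errors='ignore'
    # drops the incomplete trailing sequence instead of raising.
    return encoded[:max_bytes].decode('utf-8', 'ignore')
```

Under this approach a 300 character ASCII string keeps 256 characters, while a string made of 3-byte characters keeps 85, since only 255 of the 256 bytes form complete characters.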

@c24t
Member

c24t commented Jan 22, 2019

Which is all to say: the direction looks good, but there may be some unintended consequences.

@guewen
Author

guewen commented Feb 13, 2019

Thanks for your detailed answer. In particular, I wasn't aware of the 256-byte truncation (I'm new to the subject).

@guewen guewen requested review from aabmass, hectorhdzg, lzchen, songy23 and a team as code owners May 13, 2021 22:11