how to deal with unicode? #34
Comments
The formula parser depends on the Python lexer, and the python lexer works I'm not sure how much we can really do to fix this within py2. Sure you
|
I'm just trying to clean up a PR that has been sitting around for a while, and it tries to support unicode. It also dawned on us that we have no tests for unicode formula input, so I imagine it won't quite work for non-ascii characters. I'll let you go through the permutations, but like I said, AFAICT, this is the only thing that "works." It'd be nice if patsy did it under the hood, so I don't have to decode things on the way back out to return unicode, but you know better than me.
|
There's really no reliable way for patsy to somehow reach inside the 'data'
Two options that work now with the original DataFrame with unicode keys:
# Assumes source code is in utf-8
dmatrices("Q('àèéòù'.decode('utf-8')) ~ x", data=data)
# Works in general
dmatrices(u"Q(u'àèéòù') ~ x".encode("unicode-escape"), data=data)
Neither of these gives you nice term names, but that seems impossible
Some things that patsy could do:
I'm really reluctant to implement either of these because they're both On Mon, Feb 17, 2014 at 3:25 PM, Skipper Seabold
Nathaniel J. Smith |
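For readers following along, here is a minimal sketch of what the second workaround above (encoding the formula with "unicode-escape") does to the formula string. It uses only the standard library and does not call patsy; the variable names are illustrative. The point is that the result is pure ASCII, and the escaped string literal inside Q(...) still evaluates back to the original column name.

```python
# -*- coding: utf-8 -*-
import ast

# The formula from the comment above, as a unicode string.
formula = u"Q(u'àèéòù') ~ x"

# "unicode-escape" rewrites every non-ASCII character as a backslash escape,
# so the result is a byte string containing only ASCII bytes.
escaped = formula.encode("unicode-escape")
is_ascii = all(byte < 128 for byte in bytearray(escaped))

# Pull out the string literal between the parentheses of Q(...) and evaluate
# it as Python source: the escapes decode back to the original column name.
text = escaped.decode("ascii")
literal = text[text.index("(") + 1:text.rindex(")")]
column_name = ast.literal_eval(literal)

print(escaped)
print(is_ascii, column_name == u"àèéòù")
```

This is also why the term names come out looking ugly: the escaped form, not the readable one, is what the formula actually contains.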
On Mon, Feb 17, 2014 at 7:10 PM, njsmith wrote:
Yeah, we tried both and the latter was my "solution" given that it's easier
I agree that it's completely a least worst solution, and I understand if |
If you just put unicode characters into a string literal in py2, what even
|
Yikes, just had the same problem: I had a big list of column names, which were in unicode and then constructed the formula like If there is no proper solution: please make this more obvious by e.g. warning in |
Has there been any progress on this? Looking back through the comments here, I don't see an explanation of why patsy requires bytestrings in the first place. |
Patsy does at least provide more sensible/detailed error messages now: https://github.com/pydata/patsy/blob/master/patsy/highlevel.py#L49-L60 @BrenBarn: unfortunately, the bytestring requirement on py2 is baked into the language itself: patsy formulas contain python code, and on python 2, python code is bytestrings (specifically, if you try passing unicode to the |
Dealing with just Python 2 for now, I understand that patsy expects strings. But the data containers might not have this design. So what's the recommended way for handling this? Should we be messing with the data keys under the hood, or should patsy? The only way I can think to handle this (other than statsmodels doing it under the hood) is for patsy to accept unicode but also an encoding so the formula and the data keys can both be encoded correctly. E.g., this fails
But if we also encode the data keys, it's fine. So should dmatrices and whatever other entry points also take an encoding? Am I missing something?
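To make the proposal concrete, here is a rough sketch of the "encode both sides" idea using a plain dict in place of a DataFrame. The helper name `encode_formula_and_keys` and its `encoding` parameter are hypothetical and not part of patsy or statsmodels; the sketch only shows the formula and the data keys being encoded with the same codec so that they can match, and it does not call dmatrices.

```python
# -*- coding: utf-8 -*-

def encode_formula_and_keys(formula, data, encoding="utf-8"):
    """Hypothetical helper: encode a unicode formula and unicode data keys alike.

    Returns the formula as a byte string plus a new dict whose unicode keys
    have been encoded with the same codec. Values are left untouched.
    """
    encoded_formula = formula.encode(encoding)
    encoded_data = {}
    for key, value in data.items():
        # Only unicode keys need encoding; other keys pass through unchanged.
        if isinstance(key, type(u"")):
            key = key.encode(encoding)
        encoded_data[key] = value
    return encoded_formula, encoded_data


data = {u"àèéòù": [1, 2, 3], u"x": [4, 5, 6]}
formula_bytes, data_bytes = encode_formula_and_keys(u"Q('àèéòù') ~ x", data)

# The column name now exists in the data under exactly the bytes that the
# encoded formula contains.
print(u"àèéòù".encode("utf-8") in data_bytes)
```

Whether something like this belongs inside patsy, inside statsmodels, or in user code is the open question in this thread.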