Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tries to improve encoding handling in Extractor #169

Merged
merged 4 commits into from
Nov 5, 2014

Conversation

flavioamieiro
Copy link
Member

Instead of immediately falling back to returning an unencoded string, we try to
use utf-8 if we cannot accurately detect the content encoding.

This is far from ideal, but since returning an unencoded string brings problems
further down the pipeline (other workers are not able to decode it), this is a
step to reduce this problems.

@fccoelho can you please review and comment on this approach?

Instead of immediately falling back to returning an unencoded string, we try to
use utf-8 if we cannot accurately detect the content encoding.

This is far from ideal, but since returning an unencoded string brings problems
further down the pipeline (other workers are not able to decode it), this is a
step to reduce this problems.
@flavioamieiro
Copy link
Member Author

@fccoelho Did you get a chance to review this pull request? If it's OK, I want to merge it as soon as possible to work around the problem in production.

@fccoelho
Copy link
Member

@flavioamieiro, I think we should try other encodings before giving up. For example, 8859-1 , CP1252, etc.

Since the rest of our workflow depends on the text being a unicode object, we
need to make sure this is what we return from our decode process. If even after
trying different codecs we cannot successfully decode the string, we will do it
using utf-8 and replacing invalid chars with the unicode replacement character.

I had to make two decisions here that could be changed further down the line:
the encoding to use and the error handling.  I decided on utf-8 because I think
it is the most common encoding in our use cases and on `replace` instead of
`ignore` so we would have some kind of evidence of where the invalid characters
were (and that they existed in the first place).
@flavioamieiro
Copy link
Member Author

@fccoelho can you please take a look at the code? Also, please read the f65f2df commit message, in which I explain some of the decisions I made here.

@fccoelho
Copy link
Member

fccoelho commented Nov 5, 2014

This is a good idea but another possibility would be add something to the Exceptions property, but maybe that should be used strictly for failures.
I am ok with the rest, if this can handle guilherme's documents, I think we can merge.

@flavioamieiro flavioamieiro merged commit c8690b1 into NAMD:develop Nov 5, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants