Tries to improve encoding handling in `Extractor` #169

flavioamieiro · 2014-10-23T15:01:16Z

Instead of immediately falling back to returning an unencoded string, we try to
use utf-8 if we cannot accurately detect the content encoding.

This is far from ideal, but since returning an unencoded string brings problems
further down the pipeline (other workers are not able to decode it), this is a
step to reduce these problems.

@fccoelho can you please review and comment on this approach?

flavioamieiro · 2014-10-28T16:35:41Z

@fccoelho Did you get a chance to review this pull request? If it's OK, I want to merge it as soon as possible to work around the problem in production.

fccoelho · 2014-10-28T16:41:30Z

@flavioamieiro, I think we should try other encodings before giving up. For example, ISO-8859-1, CP1252, etc.
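A minimal sketch of what trying several encodings in turn could look like. The function name and the candidate list are assumptions made for illustration, not the code in this pull request. One detail worth keeping in mind when ordering the candidates: ISO-8859-1 maps every byte value to a character, so it never raises and should come last.

```python
# Hypothetical candidate list, strictest codec first. iso-8859-1 accepts
# any byte sequence, so nothing placed after it would ever be tried.
CANDIDATES = ("utf-8", "cp1252", "iso-8859-1")


def first_codec_that_decodes(raw, candidates=CANDIDATES):
    """Return (text, codec) for the first codec that decodes `raw` without error.

    Returns (None, None) if every candidate raises UnicodeDecodeError.
    """
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return None, None


# Bytes that are valid Latin-1/CP1252 but not valid UTF-8.
latin1_bytes = "coração".encode("iso-8859-1")
text, codec = first_codec_that_decodes(latin1_bytes)
```

Note that a successful decode only means the bytes were valid in that codec, not that the codec was the one the author used.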

Since the rest of our workflow depends on the text being a unicode object, we need to make sure this is what we return from our decode process. If even after trying different codecs we cannot successfully decode the string, we will do it using utf-8 and replacing invalid chars with the unicode replacement character. I had to make two decisions here that could be changed further down the line: the encoding to use and the error handling. I decided on utf-8 because I think it is the most common encoding in our use cases and on `replace` instead of `ignore` so we would have some kind of evidence of where the invalid characters were (and that they existed in the first place).
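The trade-off between `replace` and `ignore` described above can be sketched as follows. This is an illustration under assumptions (the helper name `decode_or_force` and its return shape are hypothetical), not the code from the commit.

```python
def decode_or_force(raw, codecs=("utf-8",)):
    """Return (text, forced), where `text` is always a unicode string.

    `forced` is True when no codec decoded `raw` cleanly and the text was
    produced with utf-8 and errors="replace".
    """
    for codec in codecs:
        try:
            return raw.decode(codec), False
        except UnicodeDecodeError:
            pass
    # Last resort: invalid bytes become U+FFFD, so their position stays visible.
    return raw.decode("utf-8", errors="replace"), True


broken = b"caf\xff bar"  # 0xFF can never appear in valid UTF-8

replaced = broken.decode("utf-8", errors="replace")  # keeps a marker: 'caf\ufffd bar'
ignored = broken.decode("utf-8", errors="ignore")    # drops the byte silently: 'caf bar'
```

With `ignore`, the output gives no hint that a byte was lost; with `replace`, a later worker can search for U+FFFD to find damaged documents.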

flavioamieiro · 2014-11-05T15:18:29Z

@fccoelho can you please take a look at the code? Also, please read the f65f2df commit message, in which I explain some of the decisions I made here.

fccoelho · 2014-11-05T17:00:42Z

This is a good idea, but another possibility would be to add something to the Exceptions property, although maybe that should be used strictly for failures.
I am OK with the rest. If this can handle guilherme's documents, I think we can merge.
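For readers following along, recording the forced decode as document metadata rather than as an exception might look roughly like this. The dictionary layout and the `forcibly_decoded` key are hypothetical names chosen for the sketch; the actual field added in the commit may differ.

```python
def extract_text(raw):
    """Decode `raw` and describe the result in a plain dict (hypothetical layout)."""
    try:
        return {"text": raw.decode("utf-8"), "forcibly_decoded": False}
    except UnicodeDecodeError:
        # Not treated as a failure: the document is still usable, so the
        # fact is recorded as metadata instead of being reported as an error.
        return {
            "text": raw.decode("utf-8", errors="replace"),
            "forcibly_decoded": True,
        }


clean = extract_text("ok".encode("utf-8"))
damaged = extract_text(b"ok\xff")
```

Keeping this in metadata leaves an exceptions field free to mean "the worker could not produce a result", while still letting downstream workers filter on documents whose text may contain replacement characters.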

flavioamieiro added 2 commits November 5, 2014 13:06

Also tries to decode text as iso-8859-1 if utf-8 fails to parse it

708abe5

Adds metadata to the document specifying if the text was forcibly decoded

c8690b1

flavioamieiro merged commit c8690b1 into NAMD:develop Nov 5, 2014
