use a visually more appealing encoding #261
christian-monch wants to merge 3 commits into datalad:main
This commit uses Unidecode to translate Unicode characters into the ASCII range before employing any dataverse-specific character quotations. If `unidecode()` returns an empty string, the name `"__not_representable_<X>"` is used, where `<X>` is the length of the original string.
It should be noted that the results of [...]
This commit ensures that `mangle_path` is tested with "printable" Unicode characters, e.g. `ä`, that will be converted into ASCII characters by `unidecode()`.
Why is [...] done, rather than only the fallback on the hex codes? I cannot see from the test diff alone how it would look. I need to handcraft a test dataset and try.
Good question. The answer is that we aimed at a human-readable representation, and using the hex codes would probably be confusing. I think it is a good idea, though. If we did that, we would have to decide whether we want to distinguish hex-code file names that are generated because [...]

We could also leave the interpretation to the user, who might know which file names are "genuine" dataset file names and which file names are just a hex-code representation of names that are mapped to empty strings by [...]

All in all, the simplest approach might be to use hex codes if the [...]

I will change the code.
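To make the hex-code alternative discussed here concrete, the following is a minimal sketch, not the code in this PR. The helper name `name_or_hex` and the choice of hex-encoding the UTF-8 bytes are assumptions for illustration only; the actual hex-code format used by the project may differ.

```python
def name_or_hex(original: str, transliterated: str) -> str:
    """Hypothetical helper: prefer the transliterated name, else hex codes.

    `transliterated` is assumed to be the ASCII result of a prior
    transliteration step; it is empty when nothing could be represented.
    """
    if transliterated:
        return transliterated
    # Assumption: represent the original name by the hex codes of its
    # UTF-8 bytes, so that distinct names stay distinct.
    return original.encode("utf-8").hex()


print(name_or_hex("ä", "a"))
print(name_or_hex("\U000F0000", ""))
```

Unlike a placeholder based only on the length of the original string, a hex-based fallback keeps two different unrepresentable names of the same length apart, at the cost of being less readable.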
Fixes #232
This PR uses `Unidecode` to translate Unicode characters into the ASCII range before employing any dataverse-specific character quotations.

If `unidecode()` returns an empty string, the name `"__not_representable_<X>"` is used, where `<X>` is the length of the original string.
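The behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper name `to_ascii_name` is hypothetical, and the `except ImportError` branch is only a stand-in so that the sketch runs without the Unidecode package installed (it strips diacritics and is not equivalent to `unidecode()`). The dataverse-specific quoting that follows this step is not shown.

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party package "Unidecode"
except ImportError:
    # Stand-in for illustration only: decomposes characters and drops
    # everything that is not ASCII. NOT equivalent to unidecode().
    def unidecode(text: str) -> str:
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def to_ascii_name(name: str) -> str:
    """Hypothetical helper: transliterate, or fall back to a placeholder."""
    ascii_name = unidecode(name)
    if ascii_name == "":
        # Nothing could be represented in ASCII: use the placeholder
        # name, where <X> is the length of the original string.
        return f"__not_representable_{len(name)}"
    return ascii_name


print(to_ascii_name("ä"))
print(to_ascii_name("\U000F0000"))  # a private-use code point
```

Note that with this placeholder scheme, two different unrepresentable names of the same length end up with the same result.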