Summary
Python code that handles invalid UTF-8 raises UnicodeDecodeError, which callers can catch. Tongues transpiles the explicit raise UnicodeDecodeError(...) calls correctly, but platform string/encoding operations in the target language throw different native exceptions that bypass the transpiled error handling.
Reproduction
Python:
try:
result = parse(invalid_utf8_input)
except (ParseError, UnicodeDecodeError):
# handled
In JS, the equivalent code catches ParseError and the transpiled UnicodeDecodeError class, but Node's TextDecoder throws a native TypeError: The encoded data was not valid for encoding utf-8 which isn't an instance of the transpiled UnicodeDecodeError.
Impact
7 test failures across JS, Perl, and Ruby backends in the Parable transpilation. All involve test inputs containing intentionally invalid UTF-8 byte sequences (from bash's multibyte character tests).
Possible fixes
- Wrap platform encoding operations: when transpiling
bytes.decode("utf-8") or equivalent, emit a try/catch that converts the native encoding error to the transpiled UnicodeDecodeError
- Normalize in the catch site: when transpiling
except UnicodeDecodeError, also catch the platform's native encoding error type (JS TypeError, Ruby Encoding::InvalidByteSequenceError, Perl Encode::...)
Discovered via ldayton/Parable#413.
Summary
Python code that handles invalid UTF-8 raises
UnicodeDecodeError, which callers can catch. Tongues transpiles the explicitraise UnicodeDecodeError(...)calls correctly, but platform string/encoding operations in the target language throw different native exceptions that bypass the transpiled error handling.Reproduction
Python:
In JS, the equivalent code catches
ParseErrorand the transpiledUnicodeDecodeErrorclass, but Node'sTextDecoderthrows a nativeTypeError: The encoded data was not valid for encoding utf-8which isn't an instance of the transpiledUnicodeDecodeError.Impact
7 test failures across JS, Perl, and Ruby backends in the Parable transpilation. All involve test inputs containing intentionally invalid UTF-8 byte sequences (from bash's multibyte character tests).
Possible fixes
bytes.decode("utf-8")or equivalent, emit a try/catch that converts the native encoding error to the transpiledUnicodeDecodeErrorexcept UnicodeDecodeError, also catch the platform's native encoding error type (JSTypeError, RubyEncoding::InvalidByteSequenceError, PerlEncode::...)Discovered via ldayton/Parable#413.