Add support to interpret/display packet content as UTF-8 #1190
Comments
For ASCII (and other single-byte character encodings), there can be a one-to-one correspondence between offsets into the packet and positions in the display. For multi-byte character encodings, a decision has to be made as to how to display a character that's split between rows in the text display. The best thing to do is probably to display it at the location of the first byte, and perhaps to display the next character, which does not begin at the beginning of the next row, with some filler characters before it, corresponding to the bytes in that row that are part of the character that begins in the previous row. For variable-length multi-byte character encodings, such as UTF-8, there's not likely to be a correspondence between offsets in the packet and positions in the display. At best, what could be done is to display characters adjacent to one another, display characters that are split across rows at the location of the first byte, and show the aforementioned filler characters.
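As a rough illustration of that layout, here is a minimal Python sketch, not a proposed implementation. The function name `utf8_text_column`, the `_` filler, and the default row width of 16 bytes are all assumptions made for the example. Each character is appended to the row that holds its first byte, and any of its bytes that spill into the next row are shown there as filler.

```python
def utf8_text_column(data: bytes, width: int = 16, filler: str = "_") -> list[str]:
    """Return the text-column string for each row of `width` packet bytes.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if width < 4:
        # A UTF-8 character is at most 4 bytes, so with width >= 4 it can
        # span at most two rows.
        raise ValueError("width must be at least 4")
    rows = [""] * ((len(data) + width - 1) // width)
    i = 0
    while i < len(data):
        glyph, n = ".", 1  # default: a byte that starts no valid character
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            # First successful decode is exactly one character.
            glyph, n = (ch if ch.isprintable() else "."), length
            break
        row = i // width
        rows[row] += glyph  # shown in the row of its first byte
        spill = i + n - (row + 1) * width
        if spill > 0:
            # Bytes of this character that fall in the next row become filler.
            rows[row + 1] += filler * spill
        i += n
    return rows


# "é" is two bytes; with 4-byte rows it starts in row 0 and spills one byte.
print(utf8_text_column(b"abc" + "é".encode("utf-8") + b"de", width=4))
```

Because characters are packed adjacent to one another, a column position in a row does not correspond to a byte offset, which matches the limitation described above.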
Sequences of bytes that are valid UTF-8 characters but that are not printable characters should be displayed as ".", just as bytes that are not printable ASCII characters are displayed in the ASCII display. Any sequence of bytes that is not part of a valid UTF-8 character should probably also be displayed as a sequence of "."s.
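One way to sketch that substitution rule in Python is with a custom decode error handler. The handler name `dot_per_byte` and the function `to_display` are hypothetical names chosen for this example; `codecs.register_error` and `str.isprintable` are standard Python. Invalid bytes become one "." each, and a valid but non-printable character becomes a single ".".

```python
import codecs


def _dots_for_invalid(exc: UnicodeError):
    """Error handler: replace each undecodable byte with one '.'."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    # exc.start..exc.end delimits the bytes the decoder rejected.
    return "." * (exc.end - exc.start), exc.end


# "dot_per_byte" is an assumed name, registered only for this sketch.
codecs.register_error("dot_per_byte", _dots_for_invalid)


def to_display(data: bytes) -> str:
    """Decode as UTF-8, showing invalid bytes and non-printables as '.'."""
    text = data.decode("utf-8", errors="dot_per_byte")
    return "".join(c if c.isprintable() else "." for c in text)


print(to_display(b"hi\x00\xff" + "é".encode("utf-8")))
```

Note that "printable" here follows Python's definition in `str.isprintable`, which is an assumption; a real tool would need to pick its own definition.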
What would be the way to know where UTF-8 strings start and end in the packet data? UTF-8 bytes, whether perfectly valid or not, could be preceded or followed by pure binary bytes that could interfere with UTF-8 reading. As far as I understand, the only way to do it reliably would be to know the packet structure when doing a hex dump.
There's a straightforward way to identify whether or not a sequence of bytes is valid UTF-8; https://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c is an example.
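For comparison, a validity check can be sketched in a few lines of Python by leaning on the language's strict UTF-8 decoder. This is not a port of the linked C file, and `first_invalid_offset` is a hypothetical name used only here. It returns the offset of the first byte that cannot be decoded, or `None` if the whole buffer is valid.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first invalid UTF-8 byte, or None if valid.

    Relies on Python's strict decoder, which rejects overlong encodings,
    surrogate code points, and truncated sequences.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(first_invalid_offset(b"abc"))         # valid input
print(first_invalid_offset(b"ab\xffcd"))    # invalid byte at offset 2
```

A check like this only says whether bytes are well-formed UTF-8. It does not say whether they were meant to be text, so binary data that happens to be valid would still pass.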
It would be great to support displaying the content of a packet as UTF-8 in addition to ASCII.