-
-
Notifications
You must be signed in to change notification settings - Fork 461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode character support in screen tab names #1642
base: main
Are you sure you want to change the base?
Conversation
Fails CI:
|
Noted. It should have been |
IDK, but nowhere else do we check for |
Any objections to pull in 4ccfc38 to |
I have just one question with the commit: Is it okay for the constants to be defined in BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂 |
Fine. |
|
5f64daf
to
fe32490
Compare
XUtils.h
Outdated
typedef struct WCharEncoderState_ { | ||
size_t pos; | ||
void* buf; | ||
mbstate_t mbState; | ||
} WCharEncoderState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also carry the buffer size, so the internal EncodeWChar
routines can perform bound checking.
Also note, that the CGroup code deliberately uses two distinct functions (one for counting the size, another for actually writing) and that these functions return success (character properly counted/written to buffer) or failure (OOB write). Similar could be done here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I choose to make the size counting and the actual writing the same routine for a reason: I expect the "rules" for filtering characters be tweakable in the future. If there are additional characters besides !iswprint
that need to be filtered out (replaced with U+FFFD), people only need to modify one routine for that.
I'm no sure about bound checking yet. I can add it into the struct, but it would be for assertions only, as I always expect the buffer allocated would just fit. (That is, no truncation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: I've changed my mind about the bound checking here. Considering the standard library functions mbsrtowcs(3)
and wcsrtombs(3)
both allow limits to the destination buffer size, I can add it to the WCharEncoderState
as a feature.
I still don't recommend it though. The string buffer would not be null-terminated when the truncation occurs, and I won't fix that (for multi-byte encodings other than UTF-8, it is allowed that wcrtomb(buf, L'\0', ps)
returns a length greater than 1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what "Failure to write to the buffer" could be used for. If the EncodeWChar
function sees that the buffer is full and not NUL-terminated, it should return false
(AKA failure), which all callees could propagate accordingly. Similar to what the CGROUP compression code¹ does …
¹Actually that code over-allocates by one byte and forces the NUL-termination in that over-allocated byte. Cf. linux/CGroupUtils.c
in functions CGroup_filterName
and CGroup_filterContainer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenBE Nope. After a bit of consideration, this length check cannot be done in EncodeWChar
like in the CGROUP code. I wrote a comment on why instead.
It's a limitation of the wcrtomb(3)
interface. When we can detect wcrtomb
writing out of bounds we might be too late (access violation already happens). wcrtomb
doesn't give us the "length check before writing" we need. Trying to workaround that would make code unnecessarily complicated (wcrtomb
called one more time - three times in total).
Also, we cannot put the '\0'
byte in arbitrary position. Unlike in the CGROUP code, we can only put '\0'
on character boundaries here. So overall there's nothing I can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenBE Update: This is what I mean by being "unnecessarily expensive".
static bool String_encodeWChar(WCharEncoderState* ps, wchar_t wc) {
assert(!ps->buf || ps->pos < ps->size);
char tempBuf[MB_LEN_MAX];
size_t len = wcrtomb(tempBuf, wc, &ps->mbState);
if (len == (size_t)-1) {
return false;
}
assert(len > 0);
if (ps->buf) {
if (len > ps->size - ps->pos) {
// When this happens, ps->buf will not be null-terminated and
// ps->mbState becomes undefined state (cannot be reused).
// Thus we have no way to recover from here.
return false;
}
char* buf = ps->buf;
memcpy(&buf[ps->pos], tempBuf, len);
}
ps->pos += len;
return true;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. And with the implementation used in the CGroup code, a false
returned from the buffer write callback would immediately terminate processing. That's basically the same semantics used there too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BenBE That means an additional memcpy
call, slowing things down slightly. For checking an error that should never happen unless there's a bug in code. I don't really like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That memcpy
is from L1 cache to a buffer that's likely in L1 too. As mentioned elsewhere: If in doubt, I prefer correctness over performance.
And if you want to avoid this, over-allocate by MB_LEN_MAX + 1
and you can always write to ps->buf
, if you force the NUL terminator on error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if you want to avoid this, over-allocate by
MB_LEN_MAX + 1
and you can always write tops->buf
, if you force the NUL terminator on error.
Over-allocation to solve the (actually non-existent) problem of under-allocation. Sorry, but I'm irritated by this logic.
The whole point of adding the bound check by your suggestion was to correct the case of under-allocation (allocated buffer size is less than required). While I argued first that the problem literally won't exist, I agree on the point that we can assert and detect the error, at least as a precaution. The buffer should be allocated with the right size already, thus any code that attempts to correct or recover from any potential error is useless. I might just make the error fatal instead and call it a day.
if (ps->buf && len <= ps->size - ps->pos) {
fail();
}
As for why there is no NUL termination for the string when odds happen here: We are using a multibyte encoder wcrtomb
that is capable of handling any encoding, not just UTF-8. When interpreting the standard behavior of wcrtomb
in the strict sense, it is not necessary for a null wide character to be encoded as 1 byte. The standard wcrtomb
will output shift sequences (if any) that restore the shift state to the initial state before outputting the null byte itself. Since that would require extra buffer space (up to MB_CUR_MAX
bytes), any truncation of a multibyte string would have to be decided beforehand, and it won't be an escape hatch for any buffer size problem.
@@ -372,6 +372,7 @@ AC_CHECK_FUNCS([ \ | |||
sched_getscheduler \ | |||
sched_setscheduler \ | |||
strchrnul \ | |||
strnlen \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like this, when we already have plenty of code, that requires this function without prior check …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick grep -r strnlen
shows no code in htop has been using strnlen
before this PR.
3d0bfaa
to
72b854a
Compare
I would like to share a feature that I have done as part of a code refactoring process.
These changes are still work in progress, but I think it's good to file it and gather early feedback.
This feature makes the screen tab names display with the correct terminal widths when they contain Unicode characters (in UTF-8 encoding). It replaces the naive
strlen
calculation with a new function that calculates the display terminal width of a name string.The PR introduces the following functions:
String_makePrintable
, a sanitization function with an API similar toxStrndup
. It reads the string, converts all unprintable characters to replacement code points (U+FFFD), and allocates and re-encodes the string.EncodePrintableString
, an internal routine forString_makePrintable
. This routine is designed with room to supportRichString
buffersString_mbswidth
, a function that calculates the terminal width of a string, or calculate the byte length of a substring if it is clipped by a terminal width limit.This PR is related to #1513.