Unicode character support in screen tab names #1642

Explorer09 · 2025-03-18T11:05:24Z

I would like to share a feature that I have done as part of a code refactoring process.

These changes are still work in progress, but I think it's good to file it and gather early feedback.

This feature makes the screen tab names display with the correct terminal widths when they contain Unicode characters (in UTF-8 encoding). It replaces the naive strlen calculation with a new function that calculates the display terminal width of a name string.

The PR introduces the following functions:

String_makePrintable, a sanitization function with an API similar toxStrndup. It reads the string, converts all unprintable characters to replacement code points (U+FFFD), and allocates and re-encodes the string.
EncodePrintableString, an internal routine for String_makePrintable. This routine is designed with room to support RichString buffers
String_mbswidth, a function that calculates the terminal width of a string, or calculate the byte length of a substring if it is clipped by a terminal width limit.

This PR is related to #1513.

XUtils.c

fasterit · 2025-03-18T12:28:47Z

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Explorer09 · 2025-03-18T13:01:04Z

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Noted. It should have been #ifdef.

BenBE · 2025-03-18T13:52:33Z

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Noted. It should have been #ifdef.

IDK, but nowhere else do we check for strnlen, so why here all of a sudden?

BenBE · 2025-03-18T14:00:38Z

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

Explorer09 · 2025-03-18T14:48:57Z

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂

BenBE · 2025-03-18T16:12:49Z

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

CRT.h seems to be fine. ScreenManager.h may be another place to put them, but actually I'm not too sure, where's best …

BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂

Fine.

Explorer09 · 2025-03-18T17:32:21Z

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

CRT.h seems to be fine. ScreenManager.h may be another place to put them, but actually I'm not too sure, where's best …

Action.c does not include ScreenManager.h. If I define the constants in ScreenManager.h, then Action.c would need another include line. Hence I avoided that in the first place.

XUtils.c

BenBE · 2025-03-18T23:08:41Z

XUtils.h

+typedef struct WCharEncoderState_ {
+   size_t pos;
+   void* buf;
+   mbstate_t mbState;
+} WCharEncoderState;


This should also carry the buffer size, so the internal EncodeWChar routines can perform bound checking.

Also note, that the CGroup code deliberately uses two distinct functions (one for counting the size, another for actually writing) and that these functions return success (character properly counted/written to buffer) or failure (OOB write). Similar could be done here too.

I choose to make the size counting and the actual writing the same routine for a reason: I expect the "rules" for filtering characters be tweakable in the future. If there are additional characters besides !iswprint that need to be filtered out (replaced with U+FFFD), people only need to modify one routine for that.

I'm no sure about bound checking yet. I can add it into the struct, but it would be for assertions only, as I always expect the buffer allocated would just fit. (That is, no truncation.)

Update: I've changed my mind about the bound checking here. Considering the standard library functions mbsrtowcs(3) and wcsrtombs(3) both allow limits to the destination buffer size, I can add it to the WCharEncoderState as a feature.

I still don't recommend it though. The string buffer would not be null-terminated when the truncation occurs, and I won't fix that (for multi-byte encodings other than UTF-8, it is allowed that wcrtomb(buf, L'\0', ps) returns a length greater than 1).

That's what "Failure to write to the buffer" could be used for. If the EncodeWChar function sees that the buffer is full and not NUL-terminated, it should return false (AKA failure), which all callees could propagate accordingly. Similar to what the CGROUP compression code¹ does …

¹Actually that code over-allocates by one byte and forces the NUL-termination in that over-allocated byte. Cf. linux/CGroupUtils.c in functions CGroup_filterName and CGroup_filterContainer.

@BenBE Nope. After a bit of consideration, this length check cannot be done in EncodeWChar like in the CGROUP code. I wrote a comment on why instead.

It's a limitation of the wcrtomb(3) interface. When we can detect wcrtomb writing out of bounds we might be too late (access violation already happens). wcrtomb doesn't give us the "length check before writing" we need. Trying to workaround that would make code unnecessarily complicated (wcrtomb called one more time - three times in total).

Also, we cannot put the '\0' byte in arbitrary position. Unlike in the CGROUP code, we can only put '\0' on character boundaries here. So overall there's nothing I can do.

@BenBE Update: This is what I mean by being "unnecessarily expensive".

static bool String_encodeWChar(WCharEncoderState* ps, wchar_t wc) { assert(!ps->buf || ps->pos < ps->size); char tempBuf[MB_LEN_MAX]; size_t len = wcrtomb(tempBuf, wc, &ps->mbState); if (len == (size_t)-1) { return false; } assert(len > 0); if (ps->buf) { if (len > ps->size - ps->pos) { // When this happens, ps->buf will not be null-terminated and // ps->mbState becomes undefined state (cannot be reused). // Thus we have no way to recover from here. return false; } char* buf = ps->buf; memcpy(&buf[ps->pos], tempBuf, len); } ps->pos += len; return true; }

Yes. And with the implementation used in the CGroup code, a false returned from the buffer write callback would immediately terminate processing. That's basically the same semantics used there too.

@BenBE That means an additional memcpy call, slowing things down slightly. For checking an error that should never happen unless there's a bug in code. I don't really like this.

That memcpy is from L1 cache to a buffer that's likely in L1 too. As mentioned elsewhere: If in doubt, I prefer correctness over performance.

And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.

And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.

Over-allocation to solve the (actually non-existent) problem of under-allocation. Sorry, but I'm irritated by this logic.

The whole point of adding the bound check by your suggestion was to correct the case of under-allocation (allocated buffer size is less than required). While I argued first that the problem literally won't exist, I agree on the point that we can assert and detect the error, at least as a precaution. The buffer should be allocated with the right size already, thus any code that attempts to correct or recover from any potential error is useless. I might just make the error fatal instead and call it a day.

if (ps->buf && len <= ps->size - ps->pos) { fail(); }

As for why there is no NUL termination for the string when odds happen here: We are using a multibyte encoder wcrtomb that is capable of handling any encoding, not just UTF-8. When interpreting the standard behavior of wcrtomb in the strict sense, it is not necessary for a null wide character to be encoded as 1 byte. The standard wcrtomb will output shift sequences (if any) that restore the shift state to the initial state before outputting the null byte itself. Since that would require extra buffer space (up to MB_CUR_MAX bytes), any truncation of a multibyte string would have to be decided beforehand, and it won't be an escape hatch for any buffer size problem.

BenBE · 2025-03-18T23:10:12Z

configure.ac

@@ -372,6 +372,7 @@ AC_CHECK_FUNCS([ \
    sched_getscheduler \
    sched_setscheduler \
    strchrnul \
+    strnlen \


I'm not sure I like this, when we already have plenty of code, that requires this function without prior check …

A quick grep -r strnlen shows no code in htop has been using strnlen before this PR.

XUtils.c

BenBE reviewed Mar 18, 2025

View reviewed changes

XUtils.c Outdated Show resolved Hide resolved

XUtils.c Outdated Show resolved Hide resolved

XUtils.c Show resolved Hide resolved

XUtils.c Outdated Show resolved Hide resolved

XUtils.c Outdated Show resolved Hide resolved

BenBE added enhancement Extension or improvement to existing feature code quality ♻️ Code quality enhancement labels Mar 18, 2025

Explorer09 force-pushed the screen-tab branch 4 times, most recently from 5f64daf to fe32490 Compare March 18, 2025 20:52

BenBE reviewed Mar 18, 2025

View reviewed changes

Explorer09 force-pushed the screen-tab branch from fe32490 to f39bf1d Compare March 19, 2025 07:38

BenBE reviewed Mar 19, 2025

View reviewed changes

XUtils.c Outdated Show resolved Hide resolved

Explorer09 force-pushed the screen-tab branch 3 times, most recently from 3d0bfaa to 72b854a Compare March 21, 2025 18:27

Unicode character support in screen tab names

2ca232d

Explorer09 force-pushed the screen-tab branch from 72b854a to 2ca232d Compare March 21, 2025 19:59

Explorer09 mentioned this pull request Mar 22, 2025

Wide character support in meter text #1648

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode character support in screen tab names #1642

Unicode character support in screen tab names #1642

Explorer09 commented Mar 18, 2025 •

edited

Loading

fasterit commented Mar 18, 2025

Explorer09 commented Mar 18, 2025

BenBE commented Mar 18, 2025

BenBE commented Mar 18, 2025

Explorer09 commented Mar 18, 2025

BenBE commented Mar 18, 2025

Explorer09 commented Mar 18, 2025

BenBE Mar 18, 2025

Explorer09 Mar 19, 2025

Explorer09 Mar 19, 2025

BenBE Mar 19, 2025

Explorer09 Mar 19, 2025 •

edited

Loading

Explorer09 Mar 20, 2025

BenBE Mar 20, 2025

Explorer09 Mar 20, 2025

BenBE Mar 20, 2025

Explorer09 Mar 20, 2025

BenBE Mar 18, 2025

Explorer09 Mar 19, 2025

Unicode character support in screen tab names #1642

Are you sure you want to change the base?

Unicode character support in screen tab names #1642

Conversation

Explorer09 commented Mar 18, 2025 • edited Loading

fasterit commented Mar 18, 2025

Explorer09 commented Mar 18, 2025

BenBE commented Mar 18, 2025

BenBE commented Mar 18, 2025

Explorer09 commented Mar 18, 2025

BenBE commented Mar 18, 2025

Explorer09 commented Mar 18, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Explorer09 Mar 19, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Explorer09 commented Mar 18, 2025 •

edited

Loading

Explorer09 Mar 19, 2025 •

edited

Loading