Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode character support in screen tab names #1642

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Explorer09
Copy link
Contributor

@Explorer09 Explorer09 commented Mar 18, 2025

I would like to share a feature that I have done as part of a code refactoring process.

These changes are still work in progress, but I think it's good to file it and gather early feedback.

This feature makes the screen tab names display with the correct terminal widths when they contain Unicode characters (in UTF-8 encoding). It replaces the naive strlen calculation with a new function that calculates the display terminal width of a name string.

The PR introduces the following functions:

  • String_makePrintable, a sanitization function with an API similar toxStrndup. It reads the string, converts all unprintable characters to replacement code points (U+FFFD), and allocates and re-encodes the string.
  • EncodePrintableString, an internal routine for String_makePrintable. This routine is designed with room to support RichString buffers
  • String_mbswidth, a function that calculates the terminal width of a string, or calculate the byte length of a substring if it is clipped by a terminal width limit.

This PR is related to #1513.

@fasterit
Copy link
Member

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

@BenBE BenBE added enhancement Extension or improvement to existing feature code quality ♻️ Code quality enhancement labels Mar 18, 2025
@Explorer09
Copy link
Contributor Author

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Noted. It should have been #ifdef.

@BenBE
Copy link
Member

BenBE commented Mar 18, 2025

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Noted. It should have been #ifdef.

IDK, but nowhere else do we check for strnlen, so why here all of a sudden?

@BenBE
Copy link
Member

BenBE commented Mar 18, 2025

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

@Explorer09
Copy link
Contributor Author

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂

@BenBE
Copy link
Member

BenBE commented Mar 18, 2025

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

CRT.h seems to be fine. ScreenManager.h may be another place to put them, but actually I'm not too sure, where's best …

BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂

Fine.

@Explorer09
Copy link
Contributor Author

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

CRT.h seems to be fine. ScreenManager.h may be another place to put them, but actually I'm not too sure, where's best …

Action.c does not include ScreenManager.h. If I define the constants in ScreenManager.h, then Action.c would need another include line. Hence I avoided that in the first place.

@Explorer09 Explorer09 force-pushed the screen-tab branch 4 times, most recently from 5f64daf to fe32490 Compare March 18, 2025 20:52
XUtils.h Outdated
Comment on lines 29 to 34
typedef struct WCharEncoderState_ {
size_t pos;
void* buf;
mbstate_t mbState;
} WCharEncoderState;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also carry the buffer size, so the internal EncodeWChar routines can perform bound checking.

Also note, that the CGroup code deliberately uses two distinct functions (one for counting the size, another for actually writing) and that these functions return success (character properly counted/written to buffer) or failure (OOB write). Similar could be done here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I choose to make the size counting and the actual writing the same routine for a reason: I expect the "rules" for filtering characters be tweakable in the future. If there are additional characters besides !iswprint that need to be filtered out (replaced with U+FFFD), people only need to modify one routine for that.

I'm no sure about bound checking yet. I can add it into the struct, but it would be for assertions only, as I always expect the buffer allocated would just fit. (That is, no truncation.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've changed my mind about the bound checking here. Considering the standard library functions mbsrtowcs(3) and wcsrtombs(3) both allow limits to the destination buffer size, I can add it to the WCharEncoderState as a feature.

I still don't recommend it though. The string buffer would not be null-terminated when the truncation occurs, and I won't fix that (for multi-byte encodings other than UTF-8, it is allowed that wcrtomb(buf, L'\0', ps) returns a length greater than 1).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what "Failure to write to the buffer" could be used for. If the EncodeWChar function sees that the buffer is full and not NUL-terminated, it should return false (AKA failure), which all callees could propagate accordingly. Similar to what the CGROUP compression code¹ does …

¹Actually that code over-allocates by one byte and forces the NUL-termination in that over-allocated byte. Cf. linux/CGroupUtils.c in functions CGroup_filterName and CGroup_filterContainer.

Copy link
Contributor Author

@Explorer09 Explorer09 Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE Nope. After a bit of consideration, this length check cannot be done in EncodeWChar like in the CGROUP code. I wrote a comment on why instead.

It's a limitation of the wcrtomb(3) interface. When we can detect wcrtomb writing out of bounds we might be too late (access violation already happens). wcrtomb doesn't give us the "length check before writing" we need. Trying to workaround that would make code unnecessarily complicated (wcrtomb called one more time - three times in total).

Also, we cannot put the '\0' byte in arbitrary position. Unlike in the CGROUP code, we can only put '\0' on character boundaries here. So overall there's nothing I can do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE Update: This is what I mean by being "unnecessarily expensive".

static bool String_encodeWChar(WCharEncoderState* ps, wchar_t wc) {
   assert(!ps->buf || ps->pos < ps->size);
   
   char tempBuf[MB_LEN_MAX];
   size_t len = wcrtomb(tempBuf, wc, &ps->mbState);
   if (len == (size_t)-1) {
      return false;
   }
   assert(len > 0);
   if (ps->buf) {
      if (len > ps->size - ps->pos) {
         // When this happens, ps->buf will not be null-terminated and
         // ps->mbState becomes undefined state (cannot be reused).
         // Thus we have no way to recover from here.
         return false;
      }
      char* buf = ps->buf;
      memcpy(&buf[ps->pos], tempBuf, len);
   }

   ps->pos += len;
   return true;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And with the implementation used in the CGroup code, a false returned from the buffer write callback would immediately terminate processing. That's basically the same semantics used there too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE That means an additional memcpy call, slowing things down slightly. For checking an error that should never happen unless there's a bug in code. I don't really like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That memcpy is from L1 cache to a buffer that's likely in L1 too. As mentioned elsewhere: If in doubt, I prefer correctness over performance.

And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.

Over-allocation to solve the (actually non-existent) problem of under-allocation. Sorry, but I'm irritated by this logic.

The whole point of adding the bound check by your suggestion was to correct the case of under-allocation (allocated buffer size is less than required). While I argued first that the problem literally won't exist, I agree on the point that we can assert and detect the error, at least as a precaution. The buffer should be allocated with the right size already, thus any code that attempts to correct or recover from any potential error is useless. I might just make the error fatal instead and call it a day.

if (ps->buf && len <= ps->size - ps->pos) {
   fail();
}

As for why there is no NUL termination for the string when odds happen here: We are using a multibyte encoder wcrtomb that is capable of handling any encoding, not just UTF-8. When interpreting the standard behavior of wcrtomb in the strict sense, it is not necessary for a null wide character to be encoded as 1 byte. The standard wcrtomb will output shift sequences (if any) that restore the shift state to the initial state before outputting the null byte itself. Since that would require extra buffer space (up to MB_CUR_MAX bytes), any truncation of a multibyte string would have to be decided beforehand, and it won't be an escape hatch for any buffer size problem.

@@ -372,6 +372,7 @@ AC_CHECK_FUNCS([ \
sched_getscheduler \
sched_setscheduler \
strchrnul \
strnlen \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I like this, when we already have plenty of code, that requires this function without prior check …

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick grep -r strnlen shows no code in htop has been using strnlen before this PR.

@Explorer09 Explorer09 force-pushed the screen-tab branch 3 times, most recently from 3d0bfaa to 72b854a Compare March 21, 2025 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code quality ♻️ Code quality enhancement enhancement Extension or improvement to existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants