Skip to content

Conversation

@robertbastian
Copy link
Member

No description provided.

@robertbastian robertbastian marked this pull request as ready for review December 5, 2025 15:44
@robertbastian robertbastian requested review from a team, Manishearth and sffc as code owners December 5, 2025 15:44
expand!(
(
icu::properties::props::CanonicalCombiningClass,
PropertyEnumCanonicalCombiningClassV1,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried doing something like icu::properties::props::CanonicalCombiningClass::DataMarker here, but unfortunately that implementation is not allowed as the associated type is from a different crate

@sffc
Copy link
Member

sffc commented Dec 5, 2025

/gemini review

It seems fine so I'm happy to stamp it, but I didn't take the time to learn the macro :). I think @Manishearth knows this code better than me

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a large refactoring to validate Unicode property names against the data files. It introduces validation logic in the data providers and updates property name definitions to match the standard. The changes improve code quality, robustness, and type safety. I have one suggestion for improving robustness in the data generation code.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good.

I found this PR a bit tedious to review: I didn't know where the actual important changes were vs changes downstream of that. It would be nice to have some context in PR bodies or a comment to help review.

make_binary_property! {
name: "Alnum";
short_name: "Alnum";
name: "alnum";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: these don't follow the property naming convention, but that might be because they're compat properties defined by the regex spec.

https://www.unicode.org/reports/tr18/#Compatibility_Properties

nit: please document the source to use for these names on the macro. There are multiple.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what's in icuexportdata

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't trust those names, icuexportdata is a hodgepodge. We should follow and check against something concrete.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: I'm not saying we should download the spec or UCD and check against it, I'm saying our code source of truth should be something that is not "this makes the tests pass", and if the tests fail we can see if things need to be fixed in icuexportdata.

Otherwise I don't really see the point of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't trust those names, icuexportdata is a hodgepodge

afaict icuexportdata uses the names that ICU4C uses. If not, that's a bug, but from the changes I made here it seems that they're all correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well neither of those defines short names. It would take significant extra work to chase down all the standards and compatible libraries of these names. You have already approved this PR, so I assume this is not blocking?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this isn't blocking, but I also don't consider this PR to really be an improvement without that. Otherwise we're just shuffling things around, since Unicode is not very consistent about these.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this isn't blocking, but I also don't consider this PR to really be an improvement without that. Otherwise we're just shuffling things around, since Unicode is not very consistent about these.

Copy link
Member Author

@robertbastian robertbastian Dec 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've seen Unicode is very consistent, I haven't seen alternative spellings anywhere expect for here.

This actually fixes new_for_ecma262, because that actually documents it uses the names from https://tc39.es/ecma262/#table-binary-unicode-properties, but some of them were not correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like icuexportdata to be something we consider trustworthy.



$prop_name:literal)),+,) => {
macro_rules! expand {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: please document this macro (specifically, the convert stuff). It's way harder to review changes to a complex macro when you don't know what it's supposed to do.

(Yes, I know it's previously undocumented)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trust? there's no data diff

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's fine. I would prefer if macros gained documentation over time especially when they have weird input syntax.

let bpt_trie = CodePointTrie::try_from(bpt_data).map_err(|e| {
DataError::custom("Could not parse CodePointTrie TOML").with_display_context(&e)
})?;
let bidi_m_cpinvlist = self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: this diff mostly looks like a cleanup, but it also switches get_binary_prop_for_code_point_set/etc over to accepting multiple names

.ok_or_else(|| DataErrorKind::MarkerNotFound.into_error())
.ok_or_else(|| DataErrorKind::MarkerNotFound.into_error())?;

if name != data.long_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: this is where the actual check happens

}

impl super::uprops_serde::binary::BinaryProperty {
pub(crate) fn build_inversion_list(&self) -> CodePointInversionList<'static> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: good refactor

(PropertyBinaryXdigitV1, "xdigit"),
(PropertyBinaryXidContinueV1, "XIDC"),
(PropertyBinaryXidStartV1, "XIDS")
(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: can we rustfmt::skip this to avoid wrapping?

alternatively perhaps the syntax of expand can be turned into something like PropType => PropMarker. Or we can have it imply icu::properties::props::. Or something.

@robertbastian
Copy link
Member Author

I found this PR a bit tedious to review: I didn't know where the actual important changes were vs changes downstream of that. It would be nice to have some context in PR bodies or a comment to help review.

I should have called this out I guess. If you agree with the changes in icu_properties, you can basically ignore the changes in datagen. They are correct (still generate the same data), and they triggered the changes in icu_properties.

@robertbastian
Copy link
Member Author

Merging because I want to pick this up for the patch release

@robertbastian robertbastian merged commit 9d734c2 into unicode-org:main Dec 8, 2025
32 checks passed
@robertbastian robertbastian deleted the names branch December 8, 2025 14:15
robertbastian added a commit to robertbastian/icu4x that referenced this pull request Dec 8, 2025
robertbastian added a commit that referenced this pull request Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants