
Port sort key code from ICU4C (#2689) #6537


Merged

44 commits merged into unicode-org:main on Jun 10, 2025

Conversation

rs-sac
Contributor

@rs-sac rs-sac commented May 6, 2025

I mostly translated the C++ code straight across, but I made some minor changes to match the mindset of the Rust codebase and general Rust practice, to satisfy Clippy, and to reduce copy-and-paste.

A comment on the issue suggested handling the identical level by appending the UTF-8 of the NFD-normalized input. That's not what ICU4C does, but it's a whole lot simpler, so that's what I did.

At the end, some simple tests are included.

Fixes #2689

@rs-sac rs-sac requested review from hsivonen, echeran and a team as code owners May 6, 2025 16:21
@CLAassistant

CLAassistant commented May 6, 2025

CLA assistant check
All committers have signed the CLA.

@gemini-code-assist gemini-code-assist bot left a comment

Hello @rs-sac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request ports the sort key code from ICU4C to Rust, adding a write_sort_key API to the Collator. The code was largely translated directly from C++, with minor adjustments to align with Rust conventions and address Clippy recommendations. The author implemented a simpler approach for handling the identical level by appending UTF-8, differing from ICU4C's method. The author also added some simple tests.

Highlights

  • Porting Sort Key Code: The primary focus is porting the sort key generation algorithm from ICU4C's C++ implementation to Rust, enhancing the icu_collator crate.
  • API Addition: A new write_sort_key API is added to the Collator struct, enabling the generation of sort keys.
  • UTF-8 Normalization: The implementation handles identical level comparisons by appending UTF-8 NFD normalization, a simplification compared to ICU4C.
  • Write Trait: A new Write trait is introduced to allow writing to buffer-like objects, since this crate does not have access to std.

Changelog

Click here to see the changelog
  • CHANGELOG.md
    • Added a changelog entry for the icu_collator crate, noting the addition of the write_sort_key API.
  • components/collator/src/comparison.rs
    • Added extern crate alloc and use alloc::vec::Vec to enable dynamic memory allocation.
    • Introduced constants for special sort key bytes, level separators, and compression terminators.
    • Defined an enum Level for specifying comparison levels.
    • Added the write_sort_key function to write sort key bytes up to the collator's strength.
    • Added the write_sort_key_up_to_quaternary function to write sort key bytes up to the collator's strength, with options to write the case level and stop writing levels based on a provided function.
    • Added the Write trait and implementations for Vec<u8> and SmallVec<[u8; N]>.
    • Added the SortKeyLevel struct to manage sort key levels.
    • Refactored compare_impl to use helper functions for tailoring and numeric primary handling.
    • Added helper functions tailoring_or_root, numeric_primary, collation_elements, and variable_top to improve code organization and readability.
  • components/collator/src/elements.rs
    • Added NO_CE_QUATERNARY constant.
    • Added ignorable function to NonPrimary struct.
  • components/collator/src/options.rs
    • No functional changes, only a minor comment update for clarity.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request ports the sort key code from ICU4C to Rust, which is a significant enhancement for the icu_collator crate. The author has made a commendable effort to translate the C++ code while adapting it to Rust conventions and addressing Clippy's suggestions. The addition of tests and a line-by-line audit further contribute to the reliability of the code. Overall, this is a valuable contribution.

Summary of Findings

  • Missing compressibleBytes and minLevel: The author intentionally left out compressibleBytes and minLevel from the C++ code. It's important to confirm whether these omissions are acceptable and won't affect the functionality or performance of the ported code. Further investigation may be required to determine if these components are necessary for the Rust implementation.
  • Potential performance bottlenecks: The code contains several loops and macro usages that could potentially lead to performance bottlenecks. While the author has made an effort to optimize the code, it's important to carefully analyze these sections and identify any areas where further optimization may be necessary. Consider using profiling tools to measure the performance of the code and identify any hotspots.
  • Inconsistent naming: There are inconsistencies in the naming of variables and constants throughout the code. For example, some constants use snake_case while others use SCREAMING_SNAKE_CASE. It's important to establish a consistent naming convention and apply it throughout the code to improve readability and maintainability.

Merge Readiness

The pull request introduces a significant new feature and includes tests, which is a positive sign. However, the compressibleBytes and minLevel omissions need to be addressed, and the potential performance bottlenecks should be investigated. I am unable to approve this pull request, and recommend that the pull request not be merged until the high severity issues are addressed. Other reviewers should also review and approve this code before merging.

@rs-sac
Contributor Author

rs-sac commented May 6, 2025

I see that most of the test failures boil down to my calling new_nfd outside of a #[cfg(feature = "compiled_data")]. If you could advise me on what I did wrong (call something else, add the feature attribute there, do something else), I'll endeavor to fix it.

@rs-sac rs-sac requested a review from Manishearth as a code owner May 6, 2025 19:36
@Manishearth Manishearth removed their request for review May 7, 2025 00:29
@Manishearth
Member

(I don't tend to review collator stuff)

@hsivonen hsivonen added the C-collator Component: Collation, normalization label May 7, 2025
@hsivonen
Member

hsivonen commented May 7, 2025

Thank you for the PR!

I acknowledge that this is my area to review, but I couldn't get to it today.

@sffc
Member

sffc commented May 9, 2025

Thanks for the contribution! A few things:

  1. I received confirmation that your Corporate CLA was received. You can select the appropriate option for yourself on cla-assistant.io.
  2. This PR was opened just hours before Identical prefix optimization for the collator #6496 landed, so this PR will need to be merged/rebased.
  3. About the data provider CI errors: if you call any function that is #[cfg(feature = "compiled_data")], then your function also needs to be #[cfg(feature = "compiled_data")], and it should have a corresponding _with_buffer_provider and _unstable. We usually have an internal function that takes the resolved objects and then we write these three functions as thin wrappers on top of the internal function.
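
A rough sketch of that wrapper pattern, with placeholder names rather than the actual icu_collator types and signatures:

struct ResolvedData; // stand-in for the already-resolved data payloads
struct DataError; // stand-in for the data provider's error type
trait AnyProvider {} // stand-in for the real data provider traits

struct SortKeyThing {
    data: ResolvedData,
}

impl SortKeyThing {
    // Internal constructor: takes the resolved objects, no provider involved.
    fn try_new_internal(data: ResolvedData) -> Result<Self, DataError> {
        Ok(Self { data })
    }

    // Compiled-data constructor; only exists when baked data is compiled in.
    #[cfg(feature = "compiled_data")]
    pub fn try_new() -> Result<Self, DataError> {
        Self::try_new_internal(ResolvedData /* baked singleton */)
    }

    // Thin wrapper that resolves the data through a buffer provider.
    pub fn try_new_with_buffer_provider(_provider: &impl AnyProvider) -> Result<Self, DataError> {
        Self::try_new_internal(ResolvedData /* deserialized via the provider */)
    }

    // Thin wrapper that resolves the data through an unstable (typed) provider.
    pub fn try_new_unstable(_provider: &impl AnyProvider) -> Result<Self, DataError> {
        Self::try_new_internal(ResolvedData /* loaded via the provider */)
    }
}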

@Manishearth
Member

3. About the data provider CI errors: if you call any function that is #[cfg(feature = "compiled_data")], then your function also needs to be #[cfg(feature = "compiled_data")], and it should have a corresponding _with_buffer_provider and _unstable. We usually have an internal function that takes the resolved objects and then we write these three functions as thin wrappers on top of the internal function.

I think in this case this is not the advice to follow: this code is deep within the implementation; it needs to use the decomposition data that is available within the Collator/CollatorBorrowed structs, and if it needs more then these structs should load more during data loading.

I don't have time to really go deeper into how it looks, but other people here should be able to.

@sffc
Member

sffc commented May 9, 2025

I assume it's code like this which is causing the error:

impl CollatorBorrowed<'_> {
    /// Write the sort key bytes up to the collator's strength.
    ///
    /// For identical strength, the UTF-8 NFD normalization is appended for breaking ties.
    /// The key ends with a `TERMINATOR_BYTE`.
    pub fn write_sort_key<S>(&self, s: &str, sink: &mut S)
    where
        S: Write,
    {
        self.write_sort_key_up_to_quaternary(s.chars(), sink, |_| true);

        if self.options.strength() == Strength::Identical {
            let nfd = icu_normalizer::DecomposingNormalizerBorrowed::new_nfd();
            let s = nfd.normalize(s);
            sink.write_byte(LEVEL_SEPARATOR_BYTE);
            sink.write(s.as_bytes());
        }

        sink.write(&[TERMINATOR_BYTE]);
    }
}

If you need data beyond what is in the Collator, then you might need to load that data in the Collator constructor, or else create a new struct.

However, I think the data you need might already be there. CollatorBorrowed already contains a field decompositions: &'a DecompositionData<'a>, which you can use to create a Decomposition (see line 594) and presumably also a DecomposingNormalizerBorrowed.

@rs-sac
Contributor Author

rs-sac commented May 9, 2025

Thanks for the pointers. I think I have fixed the problem, though I await output from the official tests.

@rs-sac
Contributor Author

rs-sac commented May 9, 2025

Rather than add to the temporary diplomat exclusion file, I added my stuff to the more documented allow list.

I'm not an expert on diplomat, but I'm pretty sure it doesn't make sense to expose the internal use constructor I created. OTOH, ideally the sort key code would be exposed, but AFAICT as a non-expert, that can't happen due to using traits.

(Since I don't know what reviewers are necessary, I didn't add or change any reviewers, even though this web site says I did. It's completely making that up.)

@hsivonen
Member

However, I think the data you need might already be there. CollatorBorrowed already contains a field decompositions: &'a DecompositionData<'a>, which you can use to create a Decomposition (see line 594) and presumably also a DecomposingNormalizerBorrowed.

The data is indeed already there. The collator already creates Decomposition using the data it already has.

It looks like a pub (but probably doc-hidden) constructor for creating a DecomposingNormalizerBorrowed from the data that the collator already holds is needed (plugging None and the correct u8 constants for NFD in the fields that need input not held by the collator).

@hsivonen
Member

It looks like a pub (but probably doc-hidden) constructor for creating a DecomposingNormalizerBorrowed from the data that the collator already holds is needed (plugging None and the correct u8 constants for NFD in the fields that need input not held by the collator).

Ah, nevermind, it's already in the changeset.

Member

@hsivonen hsivonen left a comment

Generally looks great, but I have some change requests inline.

I think the biggest concern is planning correctly around the compressibility table that ICU4X currently doesn't have.

I intend to discuss that aspect with other ICU4X developers on Thursday.

@hsivonen hsivonen added the discuss-priority Discuss at the next ICU4X meeting label May 14, 2025
@hsivonen
Member

A couple of things that aren't really review comments but more general discussion points:

  1. Should the sink trait be called Write, because traits roughly like that are customarily called Write, or should the trait be called something else so that it's obvious from the unqualified name that it's not one of the other Writes?
  2. Should the methods on the trait return Result even though when implemented for Vec and SmallVec, the implementation would always return Ok(())? Is it valuable to be able to write fallible sinks when the internal SmallVecs can panic on OOM anyway? In the case of the normalizer, it's less likely to get internal SmallVec OOM, and since the sink trait can propagate OOM, I ended up propagating OOM in my WIP Gecko patch.

@rs-sac
Contributor Author

rs-sac commented May 14, 2025

  1. Should the sink trait be called Write, because traits roughly like that are customarily called Write, or should the trait be called something else so that it's obvious from the unqualified name that it's not one of the other Writes?

As a newbie, I don't have a strong opinion. I took the name suggested in the original issue.

  2. Should the methods on the trait return Result even though when implemented for Vec and SmallVec, the implementation would always return Ok(())? Is it valuable to be able to write fallible sinks when the internal SmallVecs can panic on OOM anyway?

I had the same thoughts. It appeared to me that while the C++ interface supports errors, they were all related to OOM, which is just a crash in Rust (lame IMO, but nobody asked my opinion on that point). Since I didn't see any plausible scenario wherein an error would be returned, I skipped the error business, thinking that if I'd overlooked anything, it would be pointed out in review.

Thanks for the rest of the comments. It will take me a little while to work through them all.

@hsivonen
Member

From the meeting today:

  • @hsivonen Overall, high quality. I found two cases of what appear to be minor typos. There are three things I'd like input on. The most substantive one is that this will need 256 bits of data exported from ICU4C into a new kind of data struct that we don't currently have. If we follow the ICU4X philosophy properly, then it would mean that this method would need to be on a different object, implying things like CollationKeyGenerator, CollationKeyGeneratorBorrowed. What should we do in this case?
  • @sffc 256 bits, not even bytes?
  • @hsivonen Yes. ICU4C uses bytes, but we can use bits.
  • @robertbastian I think we should add a new data marker, update the unstable constructor, and fall back to hardcoded data in the buffer constructor
  • @hsivonen We should ask @markusicu to verify whether the algorithm still works if we say is_compressible is always false. And we should also verify what happens if you generate collation keys with the compressible data and without the compressible data. I'm not sure how the compression works, so I can't say. I think we should have a singleton data key for the compressibility of the bytes, and that's a data struct that contains an array of four u64s. If we fail to load that data struct, do we fail the construction, or do we fill in all zeros? If we really want ICU4X 2.x to ingest 2.0 data that doesn't have this, and construct successfully, we have to say to fill in all zeros. But I think it's better to say that if the data key doesn't resolve, you don't get to produce collation keys, which is probably the better way forward.
  • @sffc Ideally we would just include these 256 bits in the main data struct. We don't have a black and white rule on when we split things into different data structs and different types, but generally if something needs data that is very small (less than ~1 KB, certainly less than ~100 B) and closely related to the functionality of the main type, it can just go onto the main type.
  • @hsivonen We could make a SpecialPrimariesV2 with the extra field.
  • @robertbastian we're allowed to add a new marker, and remove the old marker from the unstable constructor. our plan for custom baked data and buffer data is to keep the old key around, but given that we only released a week ago, it's probably fine to not do that
  • @hsivonen What zerovec type should I use? Do we have a fixed-length array?
  • @sffc You can use a VarZeroCow<(...)> if you want
  • @robertbastian I'd just use a ZeroVec<u8>
  • @hsivonen I'll use ZeroVec<u8>
  • @sffc Did we decide what to actually do?
  • @hsivonen I'll add a field compressible_bytes in SpecialPrimaries struct, and I update the marker name from V1 to V2.
  • @sffc Yeah, we should be strict with our semver policy, and the exception we're making is only in our "new code old data" policy, which I'm comfortable making.
  • @robertbastian initialize compressible_bytes with the CLDR 47 data in datagen if icuexportdata doesn't contain the data. then in the future add it to icuexportdata
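
As an illustration of the bitmask idea in the notes above (not the actual SpecialPrimaries layout, which stores the bits in a ZeroVec<u8> in the data struct), a 256-bit table indexed by the primary lead byte could be queried like this; all-zero bits reproduce the "is_compressible is always false" fallback:

struct CompressibleBytes {
    bits: [u8; 32], // one bit per possible lead byte: 256 bits total
}

impl CompressibleBytes {
    fn is_compressible(&self, lead_byte: u8) -> bool {
        let byte = self.bits[usize::from(lead_byte >> 3)];
        (byte >> (lead_byte & 7)) & 1 == 1
    }
}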

@rs-sac
Contributor Author

rs-sac commented Jun 3, 2025

I have incorporated a few of the new review comments.

A couple of the comments conflict with the direction I have received from my previous reviewer. Since that puts me, an outsider, in the middle of two more senior reviewers who disagree, I have left those things alone pending further discussion/consensus/direction.

@Manishearth
Member

Manishearth commented Jun 5, 2025

Discussion from today's ICU4X WG meeting

Meeting notes
  • @sffc (introduces the issue)
  • @robertbastian There are two problems. First, should we use a sink or return an iterator? Second, if we use a sink, what is the error type?
  • @Manishearth What are the error cases?
  • @robertbastian They are sink errors.
  • @robertbastian I think the reason core::fmt::Write exists is because returning an iterator of chars is an overhead. Returning an iterator of bytes should not be an overhead compared to writing to a byte sink.
  • @Manishearth I think it isn't about overhead so much as API choice. You're right that there's no difference, but std::io::Write also exists. So it's more about: what do people want here? Here's a question. We're treating this as a collation sink. A property it has is that the Write trait can append onto something existing. Is that something we care about here?
  • @robertbastian I think sort keys are homomorphic, right?
  • @Manishearth So you might end up writing to an existing string? Because if not, we could return a T: FromIterator<u8>.
  • @sffc Normally we return Writeable types. Should we do that here?
  • @robertbastian In Normalizer we use the sink model.
  • @robertbastian Can you clarify how we could return a FromIterator?
  • @Manishearth Instead of returning an Iterator over bytes, you return a T where T: FromIterator<u8>, and you specify the T based on context.
  • @robertbastian That seems bad because then you can't use extend
  • @Manishearth Yeah. I was feeling out the design space.
  • @Manishearth We return Writeable for the purpose of, you need to do a bunch of calculation. Separating out the writing task from the collection task. I'm not sure that's a problem here. I don't think we have much of an API benefit here.
  • @robertbastian - we don't do this two-step thing with intermediate writeable types in normalizer: https://docs.rs/icu/latest/icu/normalizer/struct.ComposingNormalizerBorrowed.html#method.normalize_to
  • @robertbastian I don't like duplicating a trait just because we're no_std, and besides, std::io::Write isn't the right trait because std::io::Error is not the right type.
  • @Manishearth I think it's barely the right trait.
  • @sffc Is there a reason why in Normalizer we take a sink instead of returning a Writeable?
  • @robertbastian I didn't notice this until this week.
  • @Manishearth There's another wrinkle, which is that for &mut [u8], you need to return the filled bytes. I think the resolution is that the fn can return a usize of the number of bytes written.
  • @robertbastian - at that point I'd prefer a second method call for this
  • @sffc Would it help if we added writeable::WriteableBytes?
  • @Manishearth Writeable is a replacement for Display. Something that looks like Writeable doesn't help us here. If we had WriteableBytes, we'd need WriteBytes.
  • @sffc Another thing we could do is return a WriteableSortKey that doesn't implement Writeable but has its own associated functions that are similar in shape, and implements IntoIterator<u8>
  • @robertbastian What would the associated functions be?
  • @sffc It would be into_iter() and write_to() which still takes the sink trait, but then users can pick which one works better
  • @Manishearth I think this is solving a different problem. WriteableBytes is orthogonal. into_iter() and write_to() are equally powerful APIs; there is no need to have both. The benefit of Writeable doesn't really translate to this world.
  • @robertbastian The problem with adding a sink trait is that we need to provide all the impls for standard library types, which is annoying. Also, we cannot implement it for 3P types, so callers would often need newtypes and write wrapper impls
  • @sffc If we returned an iterator, it looks like extend can be used to append to a vec: https://doc.rust-lang.org/std/vec/struct.Vec.html#impl-Extend%3C%26T%3E-for-Vec%3CT,+A%3E
  • @Manishearth I think they're moving std::io::Write to core
  • @robertbastian It's still a ways off

Currently in the PR:

impl CollatorBorrowed<'_> {
    pub fn write_sort_key<S>(&self, s: &str, sink: &mut S) -> Result<(), core::fmt::Error>
    where
        S: CollationKeySink,
}

proposal:

impl CollatorBorrowed<'a> {
 pub fn sort_key<'b>(&self, s: &'b str) -> impl Iterator<Item = u8> + 'a + 'b
}
  • @Manishearth I'm not sure how easy it would be to transform this function into a state machine. Looking at the impl, it seems nontrivial. It seems like a whole lot of work.
  • @robertbastian I think there are only 3 or 4 places where it writes the sink.
  • @Manishearth It's in a loop behind like 5 if blocks.
  • @sffc It seems harmless to have write_sort_key, and we could add sort_key in the future
  • @Manishearth The unicode segmentation crate has segmentation as a state machine, and it is a pain to maintain.

Moving to the second question about the shape of the trait:

  • @Manishearth If the trait lives in its own crate, it seems fine

  • @robertbastian what do we implement this on? if we only do Vec and ShortVec, this is actually infallible. Do we actually need &mut [u8] or can we address this with the iterator API later?

  • @echeran I'm thinking at a high level about the iterator versus the sink. A sort key is: each character has primary, secondary, and tertiary weights. It puts the primaries in a row, then the secondaries in a row, and so forth. An iterator seems like you could stop reading when you no longer need the elements, but you still need to visit all the code points.

  • @robertbastian Yeah, it would require a custom sink impl

  • @sffc That's how Writeable comparison works

  • @Manishearth We have explicit APIs for doing this type of comparison. You shouldn't use this API to build a comparison.

  • @echeran What's the use case for comparing a string to a sort key?

  • @sffc That seems like the main use case actually. You have a database and you need to see where the string fits in.

  • @echeran Why not just get the sort key of the new string?

  • @sffc So you can exit early.

  • @sffc Only the Iterator API lets you exit early.

  • @robertbastian No, with the sink API, you can return an error to exit early. Which is motivation to have a custom error.

  • @Manishearth Whenever I have some sort of pluggable callbacky API I'm always frustrated when I can't propagate my errors through it.

    • If you're doing anything more than "write to container" (like comparing against a string) you will need custom errors
    • Even if you're doing "write to container" sometimes the container is a JavaScript string and you need to propagate custom errors.
    • Or even just a String, where you want to catch utf8 errors.
  • @sffc Is there a reason to not prefer CollationKeySink?

  • @robertbastian If there's some other API in icu_collator in the future that needs it

  • @sffc How about CollationByteSink?

  • @robertbastian What should we implement this on? Just Vec?

  • @sffc Why not mut slice?

  • @robertbastian Because then we'd need to return a usize so you know how many bytes were written

  • @sffc It seems like it's good hygiene to return the usize anyway

  • @robertbastian Do you like the proposal 2 above?

  • @sffc Yeah, S::Success is cool. I think it could even be used to return an Ordering.

  • @Manishearth The type Success needs a place to be stored. I think we should make that a follow-up issue.

  • @sffc Yeah, OK. I think we should return usize as a stopgap.

  • @robertbastian I think that's a suboptimal API. Are we stabilizing this or putting it behind an unstable feature?

  • @sffc The priority should be landing the PR. It's fine if it's behind an unstable feature.

  • @robertbastian normalize_to doesn't return a usize and doesn't allow writing to anything non-allocating

  • @sffc It's an interesting observation. I don't think it's fully transferable: writing a string to &mut [u8] has funny safety invariants.

  • @robertbastian Writeable doesn't return the usize.

  • @Manishearth Since we control the trait (unlike Write) we can even provide our own impl on &mut (&mut [u8], usize) for convenience.

  • @sffc Yes, I like that.

  • @robertbastian OK. Do we still need the error?

  • @sffc We still want it to short-circuit.

  • @Manishearth And if you're writing to something completely different like a String or a JSString.

  • @sffc For the name of the trait, can we keep it CollationKeySink, and if we need it for something else in the future, we can rename it WriteBytes and add the type alias

impl CollatorBorrowed<'a> {
    pub fn write_sort_key_to<S>(&self, s: &str, sink: &mut S) -> Result<(), S::Error>
    where
        S: CollationKeySink,
}

impl CollationKeySink for Vec<u8> { type Error = core::convert::Infallible; }
impl CollationKeySink for ShortVec<u8> { type Error = core::convert::Infallible; }
impl CollationKeySink for VecDeque<u8> { type Error = core::convert::Infallible; }
// maybe later
// impl<'a> CollationKeySink for &'a mut (&'a mut [u8], usize) { type Error = TooSmall; }
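
A self-contained toy version of that shape, showing why Error = Infallible works for Vec<u8> and what a caller sees (illustrative only; the trait in the PR also grew State and Output associated types, discussed further below):

use core::convert::Infallible;

trait CollationKeySink {
    type Error;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Self::Error>;
}

impl CollationKeySink for Vec<u8> {
    type Error = Infallible;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Infallible> {
        self.extend_from_slice(bytes); // appending to a Vec cannot fail
        Ok(())
    }
}

fn main() {
    let mut key: Vec<u8> = Vec::new();
    key.write(&[0x2a, 0x01, 0x05]).unwrap(); // Infallible: unwrap never panics
    assert_eq!(key, [0x2a, 0x01, 0x05]);
}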

Conclusion:

  • Use the API shape written above, including names of things
  • Methods have _to suffix

LGTM: @sffc @robertbastian @Manishearth

@rs-sac Interested in your thoughts on this. We can land the PR with these changes (assuming you agree with the plan), but if you'd prefer for us to make them we're happy to oblige as well. Your work so far is greatly appreciated!

@rs-sac
Contributor Author

rs-sac commented Jun 6, 2025

The suggestions all seem fine to me. I have attempted to address them in a new set of patches.

There was a little impedance mismatch when it came to calling the normalizer, which I don't think any of us noticed would happen: the normalizer has a similar sink, but it is always fallible (returning the here-undesired core::fmt::Error) because that code doesn't parameterize the error type via an associated type. Anyway, have a look at my attempt to handle the problem semi-reasonably.

@hsivonen
Member

hsivonen commented Jun 6, 2025

Sorry about my unavailability for yesterday's meeting.

The reason why the normalizer uses a sink instead of returning a collectable iterator is being able to pass through already-normalized slices. (icu_normalizer writing to a String is faster than collecting a unicode-normalization-returned iterator into a String.)

When the input is &str or &[u8] and the Strength is Identical, this passthrough from the normalizer is relevant to collation key generation, too.

And, as noted in the meeting (if I'm reading right), since the bytes from the lower strength levels between Primary and Identical are first buffered, it makes sense to copy them to the caller-visible sink via slices.

Member

@robertbastian robertbastian left a comment

Great! Just two small things

// which is impossible.
//
// This shape is so it fits neatly into Result::map_err.
fn error(err: core::fmt::Error) -> Self::Error;
Member

where is this used?

Contributor Author

let mut adapter = SinkAdapter::new(sink, &mut state);
normalize(nfd, &mut adapter).map_err(S::error)?;
//                                   ^^^^^^^^

Member

I'd prefer if we didn't add this to the trait just because SinkAdapter needs it. It again puts core::fmt::Write on the API. This also doesn't return the error that the sink returns (that gets discarded in line 2355), but a new one.

The adapter should store the error returned by the sink and return it later:

let mut adapter = SinkAdapter::new(sink, &mut state); 
let _ignored = normalize(nfd, &mut adapter);
adapter.finalize()?;
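
A self-contained sketch of that "stash the error, surface it later" idea, using a toy sink trait and hypothetical names rather than the code in this PR:

use core::fmt::{self, Write};

trait ByteSink {
    // toy stand-in for the PR's sink trait
    type Error;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Self::Error>;
}

struct SinkAdapter<'a, S: ByteSink> {
    sink: &'a mut S,
    error: Option<S::Error>, // the sink's real error, stashed here
}

impl<'a, S: ByteSink> SinkAdapter<'a, S> {
    fn new(sink: &'a mut S) -> Self {
        Self { sink, error: None }
    }

    // Called after normalization; surfaces the stored sink error, if any.
    fn finish(self) -> Result<(), S::Error> {
        match self.error {
            Some(e) => Err(e),
            None => Ok(()),
        }
    }
}

impl<S: ByteSink> Write for SinkAdapter<'_, S> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        if self.error.is_some() {
            return Err(fmt::Error); // already failed; stop accepting input
        }
        match self.sink.write(s.as_bytes()) {
            Ok(()) => Ok(()),
            Err(e) => {
                self.error = Some(e); // remember the real error for finish()
                Err(fmt::Error) // tell the normalizer to bail out
            }
        }
    }
}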

Contributor Author

Ok, I used the name "finish" instead of "finalize" since I had already used it elsewhere.

@echeran
Contributor

echeran commented Jun 6, 2025

The conclusion of the discussion after I left looks good. I asked the nearest collation expert I know whether it makes sense to compare a string to a sort key, and here's what I learned from that discussion. I think the ICU4X meeting's conclusion aligns with what I learned, so I'm just relaying this information to provide more context for any future discussions:

  • The usual way to compare strings in the collator is to compare the Collation Elements of the respective strings. You can't compare strings directly; you have to first fetch the Collation Elements of the strings.
  • Performing comparison follows the collation algorithm: from the Collation Elements, you compare all the primary weights first, then you compare all the secondary weights, then tertiary weights, etc. You exit early when you get to a difference.
  • The point of sort keys is to allow for an optimization based on a time-space tradeoff: if you perform comparisons on the same strings often enough, it might become worth it to create a version of the Collation Elements that can be compared bytewise (faster). That new representation is the sort key, which of course has to be stored somewhere (more space).
  • DBs are one obvious place where the comparison of strings may happen over and over again, but that's not necessarily the only place where this optimization tradeoff would make sense.
  • There is no early exit when constructing a sort key. Constructing a sort key requires you to visit the whole string because you have to visit all of the primary weights (and serialize them) before visiting all of the secondary weights, and then the tertiary weights, and so on. You have to process the entire string to construct that accurately. When you think about how sort keys get used, which is serializing to a persistent storage for reuse, it makes sense that you need to process the entire string before storing it.
  • ICU will compress the bytes that it creates for a sort key. It does this by having knowledge of how the Collation Element weights are constructed so that the compression still preserves the outcome.
  • Comparing a string to a sort key doesn't make sense. Instead, you would naturally convert the string into a sort key and then do the easy bytewise comparison against the other sort key. Trying to compare a string to a sort key without converting the string into its own sort key implies duplicate work: you would need to reconstitute Collation Elements from the compressed bytes of the sort key somehow and compare them to the Collation Elements of the string. But the sort key already is the product of the work to get the Collation Elements and rearrange the constituent weights for comparison purposes, and I'm not sure it's even possible to reconstitute the Collation Elements from the compressed bytes of the sort key. So in the end, if you "enter sort key space", you stay there and commit to it. Presumably, you would only have a sort key in the first place because of a previous decision to commit to the space-time tradeoff in which sort keys are useful, as justified by a use case that already exists in that scenario.
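
To make the layout concrete, here is a schematic with made-up weight bytes; the separator and terminator values below are assumptions for illustration (the PR defines named constants for them):

fn main() {
    // Made-up per-character weights for a three-character string.
    let primaries = [0x2a, 0x2c, 0x2e];
    let secondaries = [0x05, 0x05, 0x05];
    let tertiaries = [0x05, 0x05, 0x05];

    // Assumed byte values, for illustration only.
    const LEVEL_SEPARATOR_BYTE: u8 = 0x01;
    const TERMINATOR_BYTE: u8 = 0x00;

    // All primaries first, then a separator, then the next level, and so on:
    // the whole string must be processed before the key is complete.
    let mut key = Vec::new();
    key.extend_from_slice(&primaries);
    key.push(LEVEL_SEPARATOR_BYTE);
    key.extend_from_slice(&secondaries);
    key.push(LEVEL_SEPARATOR_BYTE);
    key.extend_from_slice(&tertiaries);
    key.push(TERMINATOR_BYTE);

    // Stored keys can then be compared bytewise, over and over, without
    // re-running the collation algorithm on the source strings.
    let other_key = key.clone();
    assert!(key <= other_key);
}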

Member

@sffc sffc left a comment

Thank you for your great work!

Comment on lines +2164 to +2171
/// The type of error the sink may return.
type Error;

/// An intermediate state object used by the sink, which must implement [`Default`].
type State;

/// A result value indicating the final state of the sink (e.g. a number of bytes written).
type Output;
Member

Observation: I'm fine with the three-associated-types solution, although I think it is achievable by implementing the trait on a type that itself contains a slot for intermediate state, so adding the associated types for State and Output doesn't in itself unlock new functionality.

Contributor Author

You're not wrong, and the issue approaches personal taste. My taste is to often avoid tuples (which were also mentioned as a possibility) because the numbered fields have no identifiers saying what they are. Another new type is an option, though someone might object "why do I need to wrap my slice in this other thing to write to it?" And even though the slice impl is relatively simple, I didn't want to require that the state and output types be the same.

Member

Yeah, the "tuple of mutable slice and usize" pattern is a common one, and that's basically the only main case for these two associated types, so I may try and write a PR that goes in that direction to see if people want to discuss it further, but I don't want to have that discussion block this feature.

@Manishearth
Member

Are we ready to merge this? We have two approvals and @hsivonen approved a while ago too. (I'm reluctant to hit the button here since I haven't been actively involved in this change)

@rs-sac thank you so much for bearing with us as we figured this out, and thanks for all of the work!

@sffc
Member

sffc commented Jun 10, 2025

I think it's mergeable; I'll let @hsivonen click the button as the primary owner of this component.

@hsivonen hsivonen merged commit 2f5134f into unicode-org:main Jun 10, 2025
29 checks passed
@hsivonen
Member

Merged. Thank you for the code and thank you for your patience with the review comments.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Jun 17, 2025
@sffc
Member

sffc commented Aug 13, 2025

For posterity, some belated notes from 2025-05-15 ICU4X WG about the Write trait:

  • @hsivonen There's a new trait called Write. Unfortunately there isn't a trait of exactly the right shape in core. So we need to mint a new trait. Should we call it Write so that it signals that it is similar to std::io::Write, or should we call it something like CollationKeySink? And should the methods return a Result like the core/std ones do? Is it useful to allow for fallible sink implementations? In the Normalizer use case, we use regular core::fmt::Write which is fallible.
  • @sffc I prefer CollationKeySink. If it were named Write, I feel like it should go into a crate like core_bytes_write which I'm not really interested in making. It seems fine for this to be collator-specific.
  • @sffc I think it should be fallible like all the other Write traits. Maybe you're serializing into a fixed-length field in a database and you want to know when you run out of room.
  • @sffc Maybe use core::fmt::Error as the error type instead of creating a new one?
  • @hsivonen Yeah that aligns with how we handle strings internally (?)

Labels
C-collator Component: Collation, normalization

Linked issue: Consider exposing sort keys (#2689)

7 participants