Port sort key code from ICU4C (#2689) #6537
Hello @rs-sac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request ports the sort key code from ICU4C to Rust, adding a `write_sort_key` API to the `Collator`. The code was largely translated directly from C++, with minor adjustments to align with Rust conventions and address Clippy recommendations. The author implemented a simpler approach for handling the identical level by appending UTF-8, differing from ICU4C's method. The author also added some simple tests.
Highlights
- Porting Sort Key Code: The primary focus is porting the sort key generation algorithm from ICU4C's C++ implementation to Rust, enhancing the `icu_collator` crate.
- API Addition: A new `write_sort_key` API is added to the `Collator` struct, enabling the generation of sort keys.
- UTF-8 Normalization: The implementation handles identical level comparisons by appending UTF-8 NFD normalization, a simplification compared to ICU4C.
- Write Trait: A new `Write` trait is introduced to allow writing to buffer-like objects, since this crate does not have access to `std`.
Changelog
- CHANGELOG.md
  - Added a changelog entry for the `icu_collator` crate, noting the addition of the `write_sort_key` API.
- components/collator/src/comparison.rs
  - Added `extern crate alloc` and `use alloc::vec::Vec` to enable dynamic memory allocation.
  - Introduced constants for special sort key bytes, level separators, and compression terminators.
  - Defined an enum `Level` for specifying comparison levels.
  - Added the `write_sort_key` function to write sort key bytes up to the collator's strength.
  - Added the `write_sort_key_up_to_quaternary` function to write sort key bytes up to the collator's strength, with options to write the case level and stop writing levels based on a provided function.
  - Added the `Write` trait and implementations for `Vec<u8>` and `SmallVec<[u8; N]>`.
  - Added the `SortKeyLevel` struct to manage sort key levels.
  - Refactored `compare_impl` to use helper functions for tailoring and numeric primary handling.
  - Added helper functions `tailoring_or_root`, `numeric_primary`, `collation_elements`, and `variable_top` to improve code organization and readability.
- components/collator/src/elements.rs
  - Added `NO_CE_QUATERNARY` constant.
  - Added `ignorable` function to `NonPrimary` struct.
- components/collator/src/options.rs
  - No functional changes, only a minor comment update for clarity.
From C++ to Rust,
A sort key finds new life,
Strings now find their place.
Code Review
This pull request ports the sort key code from ICU4C to Rust, which is a significant enhancement for the `icu_collator` crate. The author has made a commendable effort to translate the C++ code while adapting it to Rust conventions and addressing Clippy's suggestions. The addition of tests and a line-by-line audit further contribute to the reliability of the code. Overall, this is a valuable contribution.
Summary of Findings
- Missing `compressibleBytes` and `minLevel`: The author intentionally left out `compressibleBytes` and `minLevel` from the C++ code. It's important to confirm whether these omissions are acceptable and won't affect the functionality or performance of the ported code. Further investigation may be required to determine if these components are necessary for the Rust implementation.
- Potential performance bottlenecks: The code contains several loops and macro usages that could potentially lead to performance bottlenecks. While the author has made an effort to optimize the code, it's important to carefully analyze these sections and identify any areas where further optimization may be necessary. Consider using profiling tools to measure the performance of the code and identify any hotspots.
- Inconsistent naming: There are inconsistencies in the naming of variables and constants throughout the code. For example, some constants use snake_case while others use SCREAMING_SNAKE_CASE. It's important to establish a consistent naming convention and apply it throughout the code to improve readability and maintainability.
Merge Readiness
The pull request introduces a significant new feature and includes tests, which is a positive sign. However, the `compressibleBytes` and `minLevel` omissions need to be addressed, and the potential performance bottlenecks should be investigated. I am unable to approve this pull request, and recommend that the pull request not be merged until the high severity issues are addressed. Other reviewers should also review and approve this code before merging.
I see that most of the test failures boil down to my calling |
(I don't tend to review collator stuff) |
Thank you for the PR! I acknowledge that this is my area to review, but I couldn't get to it today. |
Thanks for the contribution! A few things:
|
I think in this case this is not the advice to follow: this code is deep within the implementation; it needs to use the decomposition data that is available within the Collator/CollatorBorrowed structs, and if it needs more then these structs should load more during data loading. I don't have time to really go deeper into how it looks, but other people here should be able to. |
I assume it's code like this which is causing the error:

```rust
impl CollatorBorrowed<'_> {
    /// Write the sort key bytes up to the collator's strength.
    ///
    /// For identical strength, the UTF-8 NFD normalization is appended for breaking ties.
    /// The key ends with a `TERMINATOR_BYTE`.
    pub fn write_sort_key<S>(&self, s: &str, sink: &mut S)
    where
        S: Write,
    {
        self.write_sort_key_up_to_quaternary(s.chars(), sink, |_| true);
        if self.options.strength() == Strength::Identical {
            let nfd = icu_normalizer::DecomposingNormalizerBorrowed::new_nfd();
            let s = nfd.normalize(s);
            sink.write_byte(LEVEL_SEPARATOR_BYTE);
            sink.write(s.as_bytes());
        }
        sink.write(&[TERMINATOR_BYTE]);
    }
}
```

If you need data beyond what is in the Collator, then you might need to load that data in the Collator constructor, or else create a new struct. However, I think the data you need might already be there.
Thanks for the pointers. I think I have fixed the problem, though I await output from the official tests. |
Rather than add to the temporary diplomat exclusion file, I added my stuff to the more documented allow list. I'm not an expert on diplomat, but I'm pretty sure it doesn't make sense to expose the internal use constructor I created. OTOH, ideally the sort key code would be exposed, but AFAICT as a non-expert, that can't happen due to using traits. (Since I don't know what reviewers are necessary, I didn't add or change any reviewers, even though this web site says I did. It's completely making that up.) |
The data is indeed already there. The collator already creates It looks like a |
Ah, nevermind, it's already in the changeset. |
Generally looks great, but I have some change requests inline.
I think the biggest concern is planning correctly around the compressibility table that ICU4X currently doesn't have.
I intend to discuss that aspect with other ICU4X developers on Thursday.
A couple of things that aren't really review comments but more general discussion points:
|
As a newbie, I don't have a strong opinion. I took the name suggested in the original issue.
I had the same thoughts. It appeared to me that while the C++ interface supports errors, they were all related to OOM, which is just a crash in Rust (lame IMO, but nobody asked my opinion on that point). Since I didn't see any plausible scenario wherein an error would be returned, I skipped the error business, thinking that if I'd overlooked anything, it would be pointed out in review. Thanks for the rest of the comments. It will take me a little while to work through them all. |
From the meeting today:
|
I have incorporated a few of the new review comments. A couple of the comments conflict with the direction I have received from my previous reviewer. Since that puts me, an outsider, in the middle of two more senior reviewers who disagree, I have left those things alone pending further discussion/consensus/direction. |
Discussion from today's ICU4X WG meeting (meeting notes)

Currently in the PR:

```rust
impl CollatorBorrowed<'_> {
    pub fn write_sort_key<S>(&self, s: &str, sink: &mut S) -> Result<(), core::fmt::Error>
    where
        S: CollationKeySink,
}
```

Proposal:

```rust
impl<'a> CollatorBorrowed<'a> {
    pub fn sort_key<'b>(&self, s: &'b str) -> impl Iterator<Item = u8> + 'a + 'b
}
```

Moving to the second question about the shape of the trait:

```rust
impl<'a> CollatorBorrowed<'a> {
    pub fn write_sort_key_to<S>(&self, s: &str, sink: &mut S) -> Result<(), S::Error>
    where
        S: CollationKeySink,
}
impl CollationKeySink for Vec<u8> { type Error = core::convert::Infallible; }
impl CollationKeySink for ShortVec<u8> { type Error = core::convert::Infallible; }
impl CollationKeySink for VecDeque<u8> { type Error = core::convert::Infallible; }
// maybe later
// impl<'a> CollationKeySink for &'a mut (&'a mut [u8], usize) { type Error = TooSmall; }
```

Conclusion:

LGTM: @sffc @robertbastian @Manishearth

@rs-sac Interested in your thoughts on this. We can land the PR with these changes (assuming you agree with the plan), but if you'd prefer for us to make them we're happy to oblige as well. Your work so far is greatly appreciated!
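As an editorial illustration of the sink shape discussed above, here is a minimal std-only sketch of a trait with an associated `Error` type: infallible for growable containers, fallible for a fixed buffer. The `write` method, `SliceSink`, `TooSmall`, and `write_placeholder_key` are assumptions made for this sketch and are not taken from the PR; the bytes written are placeholders, not real collation output.

```rust
use core::convert::Infallible;
use std::collections::VecDeque;

/// Hypothetical sink trait modeled on the meeting-notes proposal.
pub trait CollationKeySink {
    /// Error the sink can report; `Infallible` for growable containers.
    type Error;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Self::Error>;
}

impl CollationKeySink for Vec<u8> {
    type Error = Infallible;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Infallible> {
        self.extend_from_slice(bytes);
        Ok(())
    }
}

impl CollationKeySink for VecDeque<u8> {
    type Error = Infallible;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Infallible> {
        self.extend(bytes.iter().copied());
        Ok(())
    }
}

/// Hypothetical error for a fixed-size buffer that is too small.
#[derive(Debug, PartialEq, Eq)]
pub struct TooSmall;

/// Hypothetical fixed-capacity sink: a buffer plus the bytes written so far.
pub struct SliceSink<'a> {
    pub buf: &'a mut [u8],
    pub len: usize,
}

impl CollationKeySink for SliceSink<'_> {
    type Error = TooSmall;
    fn write(&mut self, bytes: &[u8]) -> Result<(), TooSmall> {
        let end = self.len.checked_add(bytes.len()).ok_or(TooSmall)?;
        // `get_mut` returns `None` when the range runs past the buffer.
        let dst = self.buf.get_mut(self.len..end).ok_or(TooSmall)?;
        dst.copy_from_slice(bytes);
        self.len = end;
        Ok(())
    }
}

/// Stand-in for `write_sort_key_to`: writes fixed placeholder bytes,
/// not a real sort key, and propagates the sink's own error type.
pub fn write_placeholder_key<S: CollationKeySink>(sink: &mut S) -> Result<(), S::Error> {
    sink.write(&[0x2A, 0x2C])?;
    sink.write(&[0x00])
}
```

With this shape, a caller writing into a `Vec<u8>` gets `Result<(), Infallible>`, while a caller writing into the fixed buffer has to handle `TooSmall`.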
The suggestions all seem fine to me. I have attempted to address them in a new set of patches. There was a little impedance mismatch when it came to calling the normalizer, which I don't think any of us noticed would happen, because the normalizer has a similar sink which is always fallible (and which returns the here-undesired |
Sorry about my unavailability for yesterday's meeting. The reason why the normalizer uses a sink instead of returning a collectable iterator is being able to pass through already-normalized slices. ( When the input is And, as noted in the meeting (if I'm reading right), since the bytes from the lower strength levels between |
Great! Just two small things
```rust
// which is impossible.
//
// This shape is so it fits neatly into Result::map_err.
fn error(err: core::fmt::Error) -> Self::Error;
```
where is this used?
```rust
let mut adapter = SinkAdapter::new(sink, &mut state);
normalize(nfd, &mut adapter).map_err(S::error)?;
//                                   ^^^^^^^^
```
I'd prefer if we didn't add this to the trait just because `SinkAdapter` needs it. It again puts `core::fmt::Write` on the API. This also doesn't return the error that the sink returns (that gets discarded in line 2355), but a new one.

The adapter should store the error returned by the sink and return it later:

```rust
let mut adapter = SinkAdapter::new(sink, &mut state);
let _ignored = normalize(nfd, &mut adapter);
adapter.finalize()?;
```
Ok, I used the name "finish" instead of "finalize" since I had already used it elsewhere.
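For readers following this thread, the "store the error and return it later" pattern might look roughly like the std-only sketch below. `ByteSink`, `Limited`, and `Full` are hypothetical names invented for the illustration, and this adapter takes only the sink (the PR's `SinkAdapter::new` also takes a state argument, which is omitted here). The adapter implements `core::fmt::Write`, so a normalizer-style writer can drive it, but the sink's own error is what `finish` returns.

```rust
use core::fmt;

/// Hypothetical byte sink with its own error type.
pub trait ByteSink {
    type Error;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Self::Error>;
}

/// Adapts a `ByteSink` to `core::fmt::Write`, remembering the sink's first error.
pub struct SinkAdapter<'a, S: ByteSink> {
    sink: &'a mut S,
    error: Option<S::Error>,
}

impl<'a, S: ByteSink> SinkAdapter<'a, S> {
    pub fn new(sink: &'a mut S) -> Self {
        Self { sink, error: None }
    }

    /// Returns the error the sink itself reported, if any.
    pub fn finish(self) -> Result<(), S::Error> {
        match self.error {
            Some(e) => Err(e),
            None => Ok(()),
        }
    }
}

impl<S: ByteSink> fmt::Write for SinkAdapter<'_, S> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        // Once the sink has failed, stop forwarding to it.
        if self.error.is_some() {
            return Err(fmt::Error);
        }
        if let Err(e) = self.sink.write(s.as_bytes()) {
            // Keep the real error; `fmt::Error` only signals "stop" to the writer.
            self.error = Some(e);
            return Err(fmt::Error);
        }
        Ok(())
    }
}

/// Hypothetical sink that accepts at most `limit` bytes in total.
pub struct Limited {
    pub data: Vec<u8>,
    pub limit: usize,
}

/// Hypothetical error reported when `Limited` is full.
#[derive(Debug, PartialEq, Eq)]
pub struct Full;

impl ByteSink for Limited {
    type Error = Full;
    fn write(&mut self, bytes: &[u8]) -> Result<(), Full> {
        if self.data.len() + bytes.len() > self.limit {
            return Err(Full);
        }
        self.data.extend_from_slice(bytes);
        Ok(())
    }
}
```

The caller ignores the `fmt::Error` from the writing step and relies on `finish()` alone, which keeps `core::fmt::Error` out of the sink trait's public surface.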
The conclusion of the discussion after I left looks good. I asked the nearest collation expert I know about whether it makes sense to compare a string to a sort key, and here's what I learned from that discussion. I think the ICU4X meeting discussion conclusion aligns with what I learned, so I'm just relaying this information to provide more context for any future discussions going forward:
|
Thank you for your great work!
```rust
/// The type of error the sink may return.
type Error;

/// An intermediate state object used by the sink, which must implement [`Default`].
type State;

/// A result value indicating the final state of the sink (e.g. a number of bytes written).
type Output;
```
Observation: I'm fine with the three-associated-types solution, although I think it is achievable by implementing the trait on a type that itself contains a slot for intermediate state, so adding the associated types for State and Output doesn't in itself unlock new functionality.
You're not wrong, and the issue approaches personal taste. My taste is to often avoid tuples (which were also mentioned as a possibility) because the numbered fields have no identifiers saying what they are. Another new type is an option, though someone might object "why do I need to wrap my slice in this other thing to write to it?" And even though the slice impl is relatively simple, I didn't want to require that the state and output types be the same.
Yeah, the "tuple of mutable slice and usize" pattern is a common one, and that's basically the only main case for these two associated types, so I may try and write a PR that goes in that direction to see if people want to discuss it further, but I don't want to have that discussion block this feature.
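To make the trade-off in this thread concrete, here is a rough std-only sketch of the three-associated-types shape applied to a plain byte slice, where the state is the write position and the output is the final length. The trait name `KeySink`, the method signatures, `TooSmall`, and `write_key` are all assumptions for illustration; the actual trait in the PR may be shaped differently.

```rust
/// Hypothetical sink trait with separate error, state, and output types.
pub trait KeySink {
    type Error;
    /// Intermediate state, created with `Default` by the caller.
    type State: Default;
    /// Value returned once writing is complete.
    type Output;

    fn write(&mut self, state: &mut Self::State, bytes: &[u8]) -> Result<(), Self::Error>;
    fn finish(&mut self, state: Self::State) -> Result<Self::Output, Self::Error>;
}

/// Hypothetical error for a slice that is too small.
#[derive(Debug, PartialEq, Eq)]
pub struct TooSmall;

// Implemented directly on the slice, so callers need no wrapper type.
impl KeySink for [u8] {
    type Error = TooSmall;
    type State = usize; // bytes written so far
    type Output = usize; // final key length

    fn write(&mut self, state: &mut usize, bytes: &[u8]) -> Result<(), TooSmall> {
        let end = (*state).checked_add(bytes.len()).ok_or(TooSmall)?;
        self.get_mut(*state..end)
            .ok_or(TooSmall)?
            .copy_from_slice(bytes);
        *state = end;
        Ok(())
    }

    fn finish(&mut self, state: usize) -> Result<usize, TooSmall> {
        Ok(state)
    }
}

/// Stand-in driver: writes the given parts in order and returns the sink's output.
pub fn write_key<S: KeySink + ?Sized>(
    sink: &mut S,
    parts: &[&[u8]],
) -> Result<S::Output, S::Error> {
    let mut state: S::State = Default::default();
    for part in parts {
        sink.write(&mut state, part)?;
    }
    sink.finish(state)
}
```

The alternative raised above would drop `State` and `Output` and instead implement the trait on a type that carries the position itself, such as a struct or a `(&mut [u8], usize)` tuple.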
I think it's mergeable; I'll let @hsivonen click the button as the primary owner of this component. |
Merged. Thank you for the code and thank you for your patience with the review comments. |
For posterity, some belated notes from 2025-05-15 ICU4X WG about the Write trait:
|
I mostly translated the C++ code straight across, but I made some minor changes to match the mindset of the Rust codebase, to match Rust practice, to satisfy Clippy, and to reduce copy-and-paste.
A comment on the issue suggested handling the identical level by appending UTF-8. That's not what ICU4C does, but it's a whole lot simpler, so that's what I did.
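A std-only sketch of that identical-level layout, for illustration: the lower-level key bytes are followed by a level separator, the UTF-8 bytes of the already NFD-normalized text, and a terminator. The constant values (1 and 0) are assumed here to match ICU4C, and `finish_key` is a hypothetical helper that expects the caller to have done the normalization; it is not the PR's code.

```rust
// Assumed values, chosen to mirror ICU4C's constants of the same names.
const LEVEL_SEPARATOR_BYTE: u8 = 1;
const TERMINATOR_BYTE: u8 = 0;

/// Hypothetical helper: completes a sort key whose lower levels are already written.
///
/// `nfd_text` is `Some` only for identical strength and must already be
/// NFD-normalized by the caller; this function does no normalization.
pub fn finish_key(mut lower_levels: Vec<u8>, nfd_text: Option<&str>) -> Vec<u8> {
    if let Some(text) = nfd_text {
        lower_levels.push(LEVEL_SEPARATOR_BYTE);
        // Tie-breaker level: the raw UTF-8 bytes of the normalized text.
        lower_levels.extend_from_slice(text.as_bytes());
    }
    lower_levels.push(TERMINATOR_BYTE);
    lower_levels
}
```

Two strings whose lower-level bytes are equal then still produce different keys, and comparing the keys bytewise orders them by their normalized UTF-8.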
At the end, some simple tests are included.
Fixes #2689