8365675: Add String Unicode Case-Folding Support #27628

xuemingshen-oracle · 2025-10-03T19:56:22Z

Summary

Case folding is a key operation for case-insensitive matching (e.g., string equality, regex matching), where the goal is to eliminate case distinctions without applying locale or language specific conversions.

Currently, the JDK does not expose a direct API for Unicode-compliant case folding. Developers instead rely on methods such as:

String.equalsIgnoreCase(String)

Unicode-aware, locale-independent.
Implementation uses Character.toLowerCase(Character.toUpperCase(int)) per code point.
Limited: does not support 1:M mapping defined in Unicode case folding.

Character.toLowerCase(int) / Character.toUpperCase(int)

Locale-independent, single code point only.
No support for 1:M mappings.

String.toLowerCase(Locale.ROOT) / String.toUpperCase(Locale.ROOT)

Based on Unicode SpecialCasing.txt, supports 1:M mappings.
Intended primarily for presentation/display, not structural case-insensitive matching.
Requires converting the full string before comparison, which is less efficient.

1:M mapping example, U+00DF (ß)

"ß".toUpperCase(Locale.ROOT) → "SS"
Case folding produces "ss", matching Unicode caseless comparison rules.

jshell> "\u00df".equalsIgnoreCase("ss")
$22 ==> false

jshell> "\u00df".toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT).equals("ss")
$24 ==> true
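
The same behavior can be reproduced outside jshell with released JDK APIs only. The following is a minimal sketch; the class name `ExistingApiLimits` and both helper names are illustrative and are not part of this PR.

```java
import java.util.Locale;

public class ExistingApiLimits {

    // Per-code-point normalization, as described above for equalsIgnoreCase.
    static int simpleNormalize(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // The round-trip workaround from the jshell session:
    // upper-case, then lower-case, then compare the resulting strings.
    static boolean roundTripEquals(String a, String b) {
        String left = a.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        String right = b.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        return left.equals(right);
    }

    public static void main(String[] args) {
        // U+00DF has no single-code-point upper-case form, so it maps to itself.
        System.out.println(simpleNormalize(0x00DF) == 0x00DF);   // true
        // 1:1 mapping cannot match one code point against two.
        System.out.println("\u00df".equalsIgnoreCase("ss"));     // false
        // The 1:M aware round trip matches, at the cost of allocating new strings.
        System.out.println(roundTripEquals("\u00df", "ss"));     // true
    }
}
```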

Motivation & Direction

Add Unicode-standard-compliant caseless comparison methods to the String class, enabling reliable and efficient Unicode-aware case-insensitive matching.

Unicode-compliant full case folding.
Simpler, stable and more efficient case-less matching without workarounds.
Brings Java's string comparison handling in line with other programming languages/libraries.

This PR proposes to introduce the following comparison methods in the String class:

boolean equalsFoldCase(String anotherString)
int compareToFoldCase(String anotherString)
Comparator<String> UNICODE_CASEFOLD_ORDER

These methods are intended to be the preferred choice when Unicode-compliant case-less matching is required.

Note: An early draft also proposed a String.toCaseFold() method returning a new case-folded string.
However, during review this was considered error-prone, as the resulting string could easily be mistaken for a general transformation like toLowerCase() and then passed into APIs where case-folding semantics are not appropriate.

The New API

See CSR https://bugs.openjdk.org/browse/JDK-8369017

Usage Examples

Sharp s (U+00DF) case-folds to "ss"

    "straße".equalsIgnoreCase("strasse");             // false
    "straße".compareToIgnoreCase("strasse");          // != 0
    "straße".equalsFoldCase("strasse");               // true

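A comparator with case-folding semantics also changes how sorted collections treat such strings. Because the proposed `UNICODE_CASEFOLD_ORDER` is not available in released JDKs, the sketch below substitutes a hypothetical stand-in, `APPROX_CASEFOLD_ORDER`, built on the upper-then-lower round trip. It only approximates full case folding, allocates new strings on every comparison, and is not how this PR implements the comparison.

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.TreeSet;

public class FoldOrderSketch {

    // Hypothetical approximation of case folding using existing APIs only.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Illustrative stand-in for the proposed String.UNICODE_CASEFOLD_ORDER.
    static final Comparator<String> APPROX_CASEFOLD_ORDER =
            Comparator.comparing(FoldOrderSketch::approxFold);

    public static void main(String[] args) {
        TreeSet<String> folded = new TreeSet<>(APPROX_CASEFOLD_ORDER);
        folded.add("stra\u00dfe");
        System.out.println(folded.contains("STRASSE"));   // true
        System.out.println(folded.add("Strasse"));        // false, treated as a duplicate

        TreeSet<String> simple = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        simple.add("stra\u00dfe");
        System.out.println(simple.contains("STRASSE"));   // false, 1:1 mapping only
    }
}
```
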
Performance

The JMH microbenchmark StringCompareToIgnoreCase has been updated to compare performance of compareToFoldCase with the existing compareToIgnoreCase().

Benchmark                                         Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.asciiGreekLower         avgt   15  20.195 ± 0.300  ns/op
StringCompareToIgnoreCase.asciiGreekLowerCF       avgt   15  11.051 ± 0.254  ns/op
StringCompareToIgnoreCase.asciiGreekUpperLower    avgt   15   6.035 ± 0.047  ns/op
StringCompareToIgnoreCase.asciiGreekUpperLowerCF  avgt   15  14.786 ± 0.382  ns/op
StringCompareToIgnoreCase.asciiLower              avgt   15  17.688 ± 1.396  ns/op
StringCompareToIgnoreCase.asciiLowerCF            avgt   15  44.552 ± 0.155  ns/op
StringCompareToIgnoreCase.asciiUpperLower         avgt   15  13.069 ± 0.487  ns/op
StringCompareToIgnoreCase.asciiUpperLowerCF       avgt   15  58.684 ± 0.274  ns/op
StringCompareToIgnoreCase.greekLower              avgt   15  20.642 ± 0.082  ns/op
StringCompareToIgnoreCase.greekLowerCF            avgt   15   7.255 ± 0.271  ns/op
StringCompareToIgnoreCase.greekUpperLower         avgt   15   5.737 ± 0.013  ns/op
StringCompareToIgnoreCase.greekUpperLowerCF       avgt   15  11.100 ± 1.147  ns/op
StringCompareToIgnoreCase.lower                   avgt   15  20.192 ± 0.044  ns/op
StringCompareToIgnoreCase.lowerrCF                avgt   15  11.257 ± 0.259  ns/op
StringCompareToIgnoreCase.supLower                avgt   15  54.801 ± 0.415  ns/op
StringCompareToIgnoreCase.supLowerCF              avgt   15  15.207 ± 0.418  ns/op
StringCompareToIgnoreCase.supUpperLower           avgt   15  14.431 ± 0.188  ns/op
StringCompareToIgnoreCase.supUpperLowerCF         avgt   15  19.149 ± 0.985  ns/op
StringCompareToIgnoreCase.upperLower              avgt   15   5.650 ± 0.051  ns/op
StringCompareToIgnoreCase.upperLowerCF            avgt   15  14.338 ± 0.352  ns/op
StringCompareToIgnoreCase.utf16SubLower           avgt   15  14.774 ± 0.200  ns/op
StringCompareToIgnoreCase.utf16SubLowerCF         avgt   15   2.669 ± 0.041  ns/op
StringCompareToIgnoreCase.utf16SupUpperLower      avgt   15  16.250 ± 0.099  ns/op
StringCompareToIgnoreCase.utf16SupUpperLowerCF    avgt   15  11.524 ± 0.327  ns/op

Refs

Unicode Standard 5.18.4 Caseless Matching
Unicode® Standard Annex #44: 5.6 Case and Case Mapping
Unicode Technical Standard #18: Unicode Regular Expressions RL1.5: Simple Loose Matches
Unicode SpecialCasing.txt
Unicode CaseFolding.txt

Other Languages

Python str.casefold()

The str.casefold() method in Python returns a casefolded version of a string. Casefolding is a more aggressive form of lowercasing, designed to remove all case distinctions in a string, particularly for the purpose of caseless string comparisons.

Perl’s fc()

Returns the casefolded version of EXPR. This is the internal function implementing the \F escape in double-quoted strings.
Casefolding is the process of mapping strings to a form where case differences are erased; comparing two strings in their casefolded form is effectively a way of asking if two strings are equal, regardless of case.
Perl only implements the full form of casefolding, but you can access the simple folds using "casefold()" in Unicode::UCD and "prop_invmap()" in Unicode::UCD.

ICU4J UCharacter.foldCase (Java)

Purpose: Provides extensions to the standard Java Character class, including support for more Unicode properties and handling of supplementary characters (code points beyond U+FFFF).
Method Signature (String based): public static String foldCase(String str, int options)
Method Signature (CharSequence & Appendable based): public static A foldCase(CharSequence src, A dest, int options, Edits edits)
Key Features:
Case Folding: Converts a string to its case-folded equivalent.
Locale Independent: Case folding in UCharacter.foldCase is generally not dependent on locale settings.
Context Insensitive: The mapping of a character is not affected by surrounding characters.
Turkic Option: An option exists to include or exclude special mappings for Turkish/Azerbaijani text.
Result Length: The resulting string can be longer or shorter than the original.
Edits Recording: Allows for recording of edits for index mapping, styled text, and getting only changes.

u_strFoldCase (C/C++)

A lower-level C API function for case folding a string.
Case Folding Options: Similar options as UCharacter.foldCase for controlling case folding behavior.
Availability: Found in the ustring.h and unistr.h headers in the ICU4C library.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change requires CSR request JDK-8369017 to be approved
Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issues

JDK-8365675: Add String Unicode Case-Folding Support (Enhancement - P3)
JDK-8369017: Add String Unicode Case-Folding Support (CSR)

Reviewers

Magnus Ihse Bursie (@magicus - Reviewer) 🔄 Re-review required (review applies to 1abb0228)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27628/head:pull/27628
$ git checkout pull/27628

Update a local copy of the PR:
$ git checkout pull/27628
$ git pull https://git.openjdk.org/jdk.git pull/27628/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27628

View PR using the GUI difftool:
$ git pr show -t 27628

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27628.diff

Using Webrev

Link to Webrev Comment

to update api

bridgekeeper · 2025-10-03T19:58:13Z

👋 Welcome back sherman! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-10-03T19:59:24Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk · 2025-10-03T20:00:24Z

@xuemingshen-oracle The following labels will be automatically applied to this pull request:

build
core-libs
i18n

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-10-03T20:09:19Z

Webrevs

01: Full - Incremental (9d9997dc)
00: Full (1abb0228)

magicus

Build changes look fine.

/reviewers 2

openjdk · 2025-10-06T12:36:41Z

@magicus
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

naotoj · 2025-10-07T18:38:59Z

While working on the Unicode 17 upgrade, I noticed that they changed the example from "MASSE"/"Maße" to "FUSS"/"Fuß" (https://www.unicode.org/L2/L2025/25085.htm#183-A59), so you might want to switch them as well.

RogerRiggs

The API looks good.

Is the performance comparable to equalsIgnoreCase?

RogerRiggs · 2025-10-07T22:10:50Z

src/java.base/share/classes/java/lang/StringLatin1.java

+        char[] folded1 = null;
+        char[] folded2 = null;
+        int k1 = 0, k2 = 0, fk1 = 0, fk2 = 0;
+        while ((k1 < len1 || folded1 != null && fk1 < folded1.length) &&


Many suggestions come to mind here on the algorithm, to optimize performance.
For example, many strings will have identical prefixes. Using Arrays.mismatch could quickly skip over the identical prefix.
Consider using code points (or a long, packing 4 chars) for the folded replacements, to avoid having to step through chars in char arrays. CaseFolding.foldIfDefined could return the full expansion as a long.
It may be profitable to use Arrays.mismatch again after expanded characters are determined to be equal.

Take another look at the data structure that stores the mappings and performs the foldIfDefined lookup, to increase lookup performance.
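
The prefix-skipping idea can be sketched as follows. This is an illustration only, not the code in this PR: it works on `char[]` rather than the internal `byte[]` representation, handles 1:1 mappings only (so it requires equal lengths), ignores supplementary characters, and the names `PrefixSkipSketch`, `norm` and `equalsSimpleIgnoreCase` are hypothetical.

```java
import java.util.Arrays;

public class PrefixSkipSketch {

    // Simple 1:1 normalization of a single UTF-16 unit.
    static char norm(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // Case-insensitive equality that lets Arrays.mismatch skip identical runs,
    // and calls it again after each position that matched only after normalization.
    static boolean equalsSimpleIgnoreCase(char[] a, char[] b) {
        if (a.length != b.length) {
            return false;               // valid only because mappings here are 1:1
        }
        int i = 0;
        while (i < a.length) {
            // Relative index of the first differing unit, or -1 if the rest is identical.
            int rel = Arrays.mismatch(a, i, a.length, b, i, b.length);
            if (rel < 0) {
                return true;
            }
            i += rel;
            if (norm(a[i]) != norm(b[i])) {
                return false;
            }
            i++;                        // matched after normalization; resume the fast scan
        }
        return true;
    }

    public static void main(String[] args) {
        char[] x = "Hello, World".toCharArray();
        char[] y = "hello, world".toCharArray();
        System.out.println(equalsSimpleIgnoreCase(x, y));   // true
    }
}
```

Skipping an identical prefix stays valid under full folding as well, since identical input always folds to identical output; only the equal-length shortcut is specific to the 1:1 case.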

RogerRiggs · 2025-10-07T22:18:58Z

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template

+
+     private static class CaseFoldingEntry {
+        final int cp;
+        final char[] folding;


Consider storing the folding as an int or long directly to avoid the overhead of small char arrays.
Arrange to be able to compare the whole replacement with another codePoint, etc.

I misunderstood the algorithm when comparing folded characters against non-folded sequences.
I still think a fast path for single character replacements will lower memory costs and improve performance.
The case of single-codepoint to single-codepoint dominates the case folding mappings.
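
One way to picture the packed representation is shown below. It is a sketch under two stated assumptions: no folded sequence exceeds four UTF-16 units, and no folded sequence contains U+0000, so an all-zero slot can mean "unused". The class and method names are hypothetical and do not appear in the PR.

```java
public class PackedFoldingSketch {

    // Packs up to four UTF-16 units into one long, 16 bits per unit,
    // with the first unit ending up in the highest occupied slot.
    static long pack(String folding) {
        if (folding.length() > 4) {
            throw new IllegalArgumentException("at most 4 UTF-16 units fit in a long");
        }
        long packed = 0;
        for (int i = 0; i < folding.length(); i++) {
            packed = (packed << 16) | folding.charAt(i);
        }
        return packed;
    }

    // Restores the sequence; zero slots are treated as unused (assumption above).
    static String unpack(long packed) {
        StringBuilder sb = new StringBuilder(4);
        for (int shift = 48; shift >= 0; shift -= 16) {
            char c = (char) (packed >>> shift);
            if (c != 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long sharpS = pack("ss");                            // full folding of U+00DF
        System.out.println(Long.toHexString(sharpS));        // 730073
        System.out.println(unpack(sharpS));                  // ss
        // Two packed foldings compare with a single primitive comparison.
        System.out.println(sharpS == pack("ss"));            // true
    }
}
```

With this layout a single-code-point folding is simply a long whose upper 48 bits are zero, which gives the fast path for the dominant 1:1 case a cheap test.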

RogerRiggs · 2025-10-07T22:20:19Z

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template

+            return depth;
+        }
+
+        private void add(CaseFoldingEntry entry) {


CDS can map whole objects/data structures into the heap; consider how to make this data structure so it can be mapped and not re-computed each startup.

liach

Given this patch obviously has so many performance optimization opportunities, I recommend handling those in subsequent RFEs so that we can review this purely from a specification point of view.

liach · 2025-10-08T13:57:17Z

make/modules/java.base/gensrc/GensrcCharacterData.gmk

 ################################################################################

+
+GENSRC_STRINGCASEFOLDING := $(SUPPORT_OUTPUTDIR)/gensrc/java.base/jdk/internal/java/lang/CaseFolding.java


Can we target the package jdk.internal.lang instead of jdk.internal.java.lang? I think the previous one is the convention set forth by stable values.

RogerRiggs · 2025-10-08T14:35:43Z

Given this patch obviously has so many performance optimization opportunities, I recommend handling those in subsequent RFEs so that we can review this purely from a specification point of view.

There is adequate time before RDP1 (Dec 4, 2025) to improve performance, but the feature should not be included in JDK 26 unless the performance is comparable to the existing compareToIgnoreCase and equalsIgnoreCase.

ecki · 2025-10-08T16:31:53Z

Great progress, thanks. Did you also consider a startsWith/containsCaseFold? I already miss the case-ignoring variants of those. Or maybe provide an API to implement them on the case-folded intermediate buffers? If the API footprint on String gets too big, maybe as a CaseFoldString.contains() helper?

liach · 2025-10-08T18:20:47Z

Did you also consider a startsWith/containsCaseFold? I already miss the case-ignoring variants of those.

I think for this purpose, we should rather introduce an API to case fold a string - we can use these operations on the case-fold-normalized strings.
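
The "fold first, then use the ordinary operations" approach can be illustrated with existing APIs. The sketch below uses a hypothetical `approxFold` helper based on the upper-then-lower round trip, which only approximates full case folding; it is not a proposed API, and it allocates a new string for each operand.

```java
import java.util.Locale;

public class FoldedOpsSketch {

    // Hypothetical approximation of a case-folded form using existing APIs only.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Ordinary String operations applied to the folded forms.
    static boolean startsWithFolded(String text, String prefix) {
        return approxFold(text).startsWith(approxFold(prefix));
    }

    static boolean containsFolded(String text, String part) {
        return approxFold(text).contains(approxFold(part));
    }

    public static void main(String[] args) {
        System.out.println(startsWithFolded("Stra\u00dfe 12", "STRASS"));            // true
        System.out.println(containsFolded("Hauptstra\u00dfe", "SSE"));               // true
        // The existing case-ignoring region match uses 1:1 mappings and misses this.
        System.out.println("Stra\u00dfe 12".regionMatches(true, 0, "STRASS", 0, 6)); // false
    }
}
```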

RogerRiggs · 2025-10-08T18:38:07Z

The new APIs mentioned would be more effective, leveraging the underlying implementation without needing to create new Strings. Earlier discussions of folding support raised a concern about tempting developers into a more ambiguous situation in which folded and unfolded strings coexist and can be confused.

liach · 2025-10-08T18:55:29Z

I don't think it's a good idea to have an explosion of case-folding variants of string operations if we are adding a case-folding overload for every operation. In that case, the confusion of case folding applicability would be less of a problem compared to the API bloat.

ecki · 2025-10-09T16:08:25Z

src/java.base/share/classes/jdk/internal/lang/CaseFolding.java.template

+import static java.util.Map.entry;
+
+/**
+ * Utility class for {@code String.toCaseFold()} that handles Unicode case folding


Maybe make it clear this is a planned api (or not refer to it?)

xuemingshen-oracle added 2 commits October 3, 2025 10:47

8365675: Add String Unicode Case-Folding Support

e1a93af

to update api

Merge branch 'master' of https://git.openjdk.org/jdk into JDK-8365675

1abb022

openjdk bot added the csr (Pull request needs approved CSR before integration), build, core-libs, and i18n labels Oct 3, 2025

xuemingshen-oracle marked this pull request as ready for review October 3, 2025 20:04

openjdk bot added the rfr Pull request is ready for review label Oct 3, 2025

magicus approved these changes Oct 6, 2025

RogerRiggs reviewed Oct 7, 2025

minor api doc updates

9d9997d

liach reviewed Oct 8, 2025

ecki reviewed Oct 9, 2025
