fix(text): handle code points > U+FFFF in levenshteinDistance #6014

Open
wants to merge 3 commits into
base: main

lionel-rowe
Contributor

Fixes #6013
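
For context, a minimal sketch of why code points above U+FFFF need special handling. This is illustrative only and is not taken from the PR diff: it shows that such a character occupies two UTF-16 code units, so anything that measures or indexes a string by code unit sees it as two separate elements.

```typescript
// Illustrative only: U+1F4A9 (💩) lies above U+FFFF, so UTF-16 stores it
// as a surrogate pair (two code units).
const emoji = "💩";

// `.length` counts UTF-16 code units.
const codeUnitLength = emoji.length;
// String iteration (and therefore spread) walks code points.
const codePointLength = [...emoji].length;

// Indexing by code unit yields a lone surrogate, not the whole character.
const firstUnit = emoji[0];
const firstCodePoint = [...emoji][0];

console.log(codeUnitLength, codePointLength); // 2 1
console.log(firstUnit === emoji, firstCodePoint === emoji); // false true
```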

@lionel-rowe
Contributor Author

lionel-rowe commented Sep 18, 2024

Benchmarks:

import { levenshteinDistance as current } from "jsr:@std/[email protected]/levenshtein-distance";
import { levenshteinDistance as next } from "https://raw.githubusercontent.com/lionel-rowe/std/levenshtein-unicode/text/levenshtein_distance.ts";

for (const [name, fn] of Object.entries({ current, next })) {
  Deno.bench(`${name} (ASCII)`, () => {
    fn("a".repeat(10), "b".repeat(10));
    fn("a".repeat(100), "b".repeat(100));

    fn("a".repeat(10), "");
    fn("a".repeat(100), "");
    fn("", "b".repeat(10));
    fn("", "b".repeat(100));

    fn(
      "a".repeat(100) + "b".repeat(100) + "a".repeat(100),
      "b".repeat(100) + "a".repeat(100) + "b".repeat(100),
    );
  });
}

for (const [name, fn] of Object.entries({ current, next })) {
  // will give wrong result with `current`, but just testing perf here
  Deno.bench(`${name} (with emoji)`, () => {
    fn(
      "a".repeat(100) + "💩".repeat(100) + "a".repeat(100),
      "💩".repeat(100) + "a".repeat(100) + "💩".repeat(100),
    );
  });
}

for (const len of [1e0, 1e1, 1e2, 1e3, 1e4]) {
  for (const [name, fn] of Object.entries({ current, next })) {
    Deno.bench(`${name} (string length ${len.toLocaleString("en-US")})`, () => {
      fn("a".repeat(len), "b".repeat(len));
    });
  }
}

Performance is almost identical on my machine. A typical run:

benchmark                        time/iter (avg)        iter/s      (min … max)           p75      p99     p995
-------------------------------- ----------------------------- --------------------- --------------------------
current (ASCII)                          34.4 µs        29,030 ( 24.3 µs …   1.4 ms)  31.1 µs 128.5 µs 184.8 µs
next (ASCII)                             35.8 µs        27,930 ( 24.4 µs … 893.6 µs)  31.8 µs 134.5 µs 197.2 µs
current (with emoji)                     33.0 µs        30,280 ( 22.5 µs … 682.6 µs)  29.4 µs 129.3 µs 183.6 µs
next (with emoji)                        36.8 µs        27,200 ( 24.2 µs … 656.6 µs)  39.8 µs 110.0 µs 183.1 µs
current (string length 1)                67.9 ns    14,720,000 ( 49.7 ns … 194.9 ns)  75.8 ns 142.3 ns 155.8 ns
next (string length 1)                   69.1 ns    14,470,000 ( 52.6 ns … 206.7 ns)  73.9 ns 139.5 ns 161.2 ns
current (string length 10)              391.7 ns     2,553,000 (310.8 ns … 587.3 ns) 421.5 ns 580.7 ns 587.3 ns
next (string length 10)                 386.4 ns     2,588,000 (318.8 ns … 641.8 ns) 404.6 ns 610.2 ns 641.8 ns
current (string length 100)               6.0 µs       167,000 (  5.7 µs …   6.5 µs)   6.1 µs   6.5 µs   6.5 µs
next (string length 100)                  6.1 µs       163,500 (  5.6 µs …   7.2 µs)   6.3 µs   7.2 µs   7.2 µs
current (string length 1,000)           222.4 µs         4,496 (185.2 µs …   1.4 ms) 222.7 µs 397.6 µs 569.9 µs
next (string length 1,000)              212.3 µs         4,711 (174.1 µs …   1.1 ms) 214.1 µs 413.4 µs 525.8 µs
current (string length 10,000)           18.1 ms          55.3 ( 16.8 ms …  19.9 ms)  18.6 ms  19.9 ms  19.9 ms
next (string length 10,000)              17.4 ms          57.5 ( 16.1 ms …  19.0 ms)  17.7 ms  19.0 ms  19.0 ms


codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@a550998). Learn more about missing BASE report.
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6014   +/-   ##
=======================================
  Coverage        ?   96.29%           
=======================================
  Files           ?      494           
  Lines           ?    39541           
  Branches        ?     5837           
=======================================
  Hits            ?    38076           
  Misses          ?     1423           
  Partials        ?       42           


@lionel-rowe
Contributor Author

The previous fix was still buggy: it's tough to keep the various increments and length measurements in sync, since sometimes you want the code point length and sometimes the code unit length. So I just switched to using [...str] arrays of code points, which surprisingly still has no measurable impact on perf (updated benchmarks above).
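
A rough sketch of the [...str] approach described above. This is not the code from this PR: `codePointLevenshtein` is a hypothetical name, and the body is a plain two-row Wagner-Fischer implementation chosen for brevity. The point is only that once both inputs are spread into arrays, every index and length refers to code points, so there is nothing to keep in sync.

```typescript
// A sketch, not the @std/text implementation: a two-row Wagner-Fischer
// Levenshtein that compares code points instead of UTF-16 code units.
function codePointLevenshtein(a: string, b: string): number {
  // Spreading a string iterates by code point, so "💩" becomes one element.
  const s = [...a];
  const t = [...b];
  if (s.length === 0) return t.length;
  if (t.length === 0) return s.length;

  // prev[j] holds the distance between the first i-1 elements of s
  // and the first j elements of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  let curr = new Array<number>(t.length + 1).fill(0);

  for (let i = 1; i <= s.length; i++) {
    curr[0] = i;
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution
      );
    }
    // Reuse the two rows instead of allocating a full matrix.
    [prev, curr] = [curr, prev];
  }
  return prev[t.length];
}

console.log(codePointLevenshtein("💩", "x")); // 1
console.log(codePointLevenshtein("kitten", "sitting")); // 3
console.log(codePointLevenshtein("a💩b", "ab")); // 1
```

A textbook implementation that walks UTF-16 code units would treat "💩" as two elements and report 2 for the first call, since it sees one substitution plus one deletion.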

Successfully merging this pull request may close these issues.

levenshteinDistance doesn't correctly handle code points over U+FFFF
2 participants