Replacing fancy-regex for faster encode #460
Open
+1,155 −37
I noticed that fancy-regex has been mentioned as a major slowdown in the tokenizer's encode/decode path. I saw the same thing on another project that also uses fancy-regex for tokenizer training. I ported my custom C implementation, which specifically parses the cl100k pattern, to Rust, along with a demo fuzz tester. I also temporarily added some options to 'lib.rs' to switch between the fancy-regex backend and the custom one I provided.
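For context, here is a rough sketch of what a hand-rolled splitter in the spirit of the cl100k pattern can look like. This is not the code in this PR: the contraction alternatives ('s, 'll, 've, 're), the optional leading byte before letter runs, and the newline/trailing-whitespace rules are all omitted, and the function name is made up.

```rust
/// Illustrative only: splits `text` into pieces roughly following a subset of
/// the cl100k pattern alternatives (letter runs, 1-3 digit runs, whitespace
/// runs, punctuation runs). The real pattern has more cases than this.
fn split_cl100k(text: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut pieces = Vec::new();
    let mut i = 0;

    while i < chars.len() {
        let (begin, start) = chars[i];
        let mut j = i;

        if start.is_alphabetic() {
            // `\p{L}+` — a run of letters (the optional leading
            // non-letter/non-digit character is not handled here).
            while j < chars.len() && chars[j].1.is_alphabetic() {
                j += 1;
            }
        } else if start.is_numeric() {
            // `\p{N}{1,3}` — at most three digits per piece.
            while j < chars.len() && chars[j].1.is_numeric() && j - i < 3 {
                j += 1;
            }
        } else if start.is_whitespace() {
            // `\s+` — a whitespace run (newline and lookahead rules omitted).
            while j < chars.len() && chars[j].1.is_whitespace() {
                j += 1;
            }
        } else {
            // ` ?[^\s\p{L}\p{N}]+` — a punctuation/symbol run.
            while j < chars.len()
                && !chars[j].1.is_whitespace()
                && !chars[j].1.is_alphabetic()
                && !chars[j].1.is_numeric()
            {
                j += 1;
            }
        }

        let end = chars.get(j).map_or(text.len(), |&(b, _)| b);
        pieces.push(&text[begin..end]);
        i = j;
    }
    pieces
}
```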
Changes
Reproducing
Running the benchmark for 100 iterations on a demo 1 MB file (see the timing sketch below)
Running the fuzzer for 50,000 steps, each generating a random 2028-character input text (see the comparison-loop sketch below)
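A minimal timing harness along these lines reproduces the first measurement. The file name is a placeholder, and the call site uses the `split_cl100k` sketch above rather than the PR's actual backend toggle in 'lib.rs'.

```rust
use std::fs;
use std::time::Instant;

fn main() {
    // Placeholder path for the ~1 MB demo file; adjust as needed.
    let text = fs::read_to_string("demo_1mb.txt").expect("demo file");
    let iters: u32 = 100;

    let start = Instant::now();
    let mut total_pieces = 0usize;
    for _ in 0..iters {
        // Swap this call between the fancy-regex path and the custom splitter
        // (here the `split_cl100k` sketch from above) to compare the backends.
        total_pieces += split_cl100k(&text).len();
    }
    let elapsed = start.elapsed();
    println!(
        "{iters} iterations, {total_pieces} pieces, {elapsed:?} total ({:?} per iteration)",
        elapsed / iters
    );
}
```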
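The fuzz check is a differential loop against the fancy-regex output as the oracle. The sketch below (using the fancy-regex and rand 0.8 crates) only illustrates the structure: with the PR's full custom backend the mismatch count should be zero, whereas the simplified `split_cl100k` sketch above will disagree on many inputs, and the real fuzzer's character distribution may differ from the uniform random characters used here.

```rust
use fancy_regex::Regex;
use rand::Rng;

fn main() {
    // The cl100k_base split pattern (the same one fed to fancy-regex);
    // its matches are used as the reference answer.
    let pat = Regex::new(
        r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+",
    )
    .unwrap();
    let mut rng = rand::thread_rng();
    let mut mismatches = 0u32;

    for _ in 0..50_000u32 {
        // Random input of the length mentioned above, built from random
        // Unicode scalar values.
        let input: String = (0..2028).map(|_| rng.gen::<char>()).collect();

        let oracle: Vec<&str> = pat
            .find_iter(&input)
            .map(|m| m.expect("regex error").as_str())
            .collect();
        // Replace this call with the PR's custom backend to run the real check.
        let candidate = split_cl100k(&input);

        if oracle != candidate {
            mismatches += 1;
        }
    }
    println!("mismatching inputs: {mismatches} / 50000");
}
```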
Notes