@MadMax129

I noticed that fancy-regex was mentioned as a major slowdown in the tokenizer's encode/decode path. The same was true in another project that uses fancy-regex for tokenizer training. I ported my custom C implementation, which parses the cl100k pattern specifically, to Rust, along with a demo fuzz tester. I also temporarily added some options to 'lib.rs' to switch between the fancy-regex backend and the custom one for testing.

Changes

  • Replaced the use of fancy-regex "find_iter" with my custom text parser for the cl100k pattern
  • cl100k.rs contains the custom parser, written with a focus on efficiency, few runtime allocations, and extensibility to other OpenAI regex patterns
  • Added a fuzzing/benchmark module (cl100k_fuzz.rs) that can be used to verify against the existing fancy-regex implementation
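
To illustrate the general idea of replacing a regex "find_iter" loop with a hand-written scanner, here is a minimal sketch. It is not the code in cl100k.rs and does not implement the full cl100k pattern: it only handles simplified versions of the letter-run and 1-to-3-digit branches and emits every other character on its own. The name `split_simple` is hypothetical, and `char::is_alphabetic` / `char::is_numeric` are used as approximations of `\p{L}` / `\p{N}`.

```rust
/// Hypothetical, simplified splitter: NOT the full cl100k pattern.
/// Returns slices borrowed from `text`, so no per-piece String is allocated.
fn split_simple(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut iter = text.char_indices().peekable();

    while let Some((start, first)) = iter.next() {
        // `end` is the byte offset just past the last char of the piece.
        let mut end = start + first.len_utf8();

        if first.is_alphabetic() {
            // Letter run: consume as many alphabetic chars as possible.
            while let Some(&(i, c)) = iter.peek() {
                if !c.is_alphabetic() {
                    break;
                }
                end = i + c.len_utf8();
                iter.next();
            }
        } else if first.is_numeric() {
            // Number run: at most three numeric chars per piece.
            let mut count = 1;
            while let Some(&(i, c)) = iter.peek() {
                if count == 3 || !c.is_numeric() {
                    break;
                }
                end = i + c.len_utf8();
                iter.next();
                count += 1;
            }
        }
        // Any other character is emitted as a one-char piece in this sketch.
        pieces.push(&text[start..end]);
    }
    pieces
}
```

Because every piece is a slice of the input, concatenating the pieces always reproduces the original text, which is a cheap invariant to check in a fuzzer.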

Reproducing

Running the benchmark for 100 iterations on a demo 1MB file:

cargo run --release --bin cl100k_fuzz -- file big.txt 100

Fancy total:   36.75904642s
Custom total:  18.261776294s
Average/iter:  fancy=0.367590s custom=0.182618s
Custom speedup vs Fancy: 2.013x
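
For reference, the average and speedup lines follow directly from the two totals: the average is the total divided by the iteration count, and the speedup is the fancy-regex total divided by the custom total. A minimal sketch of that arithmetic, with hypothetical helper names (`average_secs`, `speedup`) that are not taken from cl100k_fuzz.rs:

```rust
use std::time::Duration;

/// Average time per iteration, in seconds.
fn average_secs(total: Duration, iterations: u32) -> f64 {
    total.as_secs_f64() / f64::from(iterations)
}

/// How many times faster `candidate` is than `baseline`.
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}
```

Plugging in the totals above (36.75904642s and 18.261776294s over 100 iterations) gives the reported averages and a ratio of roughly 2.013.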

Running the fuzzer for 50000 cases, each on a randomly generated input text of length 2028:

cargo run --release --bin cl100k_fuzz -- bpe 50000 2028

Completed 50000/50000 cases
Finished 50000 cases with no mismatches.
Total fancy_regex encode time: 12.791220842s
Total custom parser encode time: 8.021205081s
Average per case: fancy_regex=0.000256s, custom_parser=0.000160s
Custom speedup vs Fancy: 1.595x
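
The differential check behind a fuzzer like this can be sketched as follows. This is a minimal, dependency-free illustration and not the contents of cl100k_fuzz.rs: every name here (`XorShift`, `random_text`, `first_mismatch`, `split_reference`, `split_custom`) is hypothetical, the two splitters are trivial stand-ins for the fancy-regex and custom backends, and the small input alphabet is an assumption.

```rust
/// Tiny xorshift64 PRNG so the sketch needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Builds a random string of `len` chars from a small assumed alphabet.
fn random_text(rng: &mut XorShift, len: usize) -> String {
    const ALPHABET: &[char] = &['a', 'Z', '7', ' ', '\n', '\'', '.', 'é'];
    (0..len)
        .map(|_| ALPHABET[(rng.next_u64() % ALPHABET.len() as u64) as usize])
        .collect()
}

/// Stand-in "reference" splitter: std's split on a single space.
fn split_reference(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}

/// Stand-in "custom" splitter: a hand-written scan meant to match the reference.
fn split_custom(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (i, b) in text.bytes().enumerate() {
        if b == b' ' {
            pieces.push(&text[start..i]);
            start = i + 1;
        }
    }
    pieces.push(&text[start..]);
    pieces
}

/// Runs `cases` random inputs through both splitters and returns the first
/// input on which they disagree, or None if all cases match.
fn first_mismatch(
    reference: fn(&str) -> Vec<&str>,
    candidate: fn(&str) -> Vec<&str>,
    cases: usize,
    len: usize,
    seed: u64,
) -> Option<String> {
    let mut rng = XorShift(seed.max(1)); // xorshift state must be non-zero
    for _ in 0..cases {
        let text = random_text(&mut rng, len);
        let same = reference(&text) == candidate(&text);
        if !same {
            return Some(text);
        }
    }
    None
}
```

Returning the offending input rather than a boolean makes a mismatch easy to turn into a regression test. A fixed seed keeps runs reproducible.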

Notes

  • This only supports the cl100k regex pattern; however, it can be expanded to support o200k as well!
  • Would appreciate feedback on the Rust code/modifications to existing modules.
  • cl100k_fuzz.rs serves as a messy demonstration that the custom regex parser does not diverge from the existing fancy-regex parser
