Replacing fancy-regex for faster encode #460
Open
+1,155 −37
I noticed that fancy-regex has been mentioned as a major slowdown in the tokenizer's encode/decode path. I saw the same thing on another project that also uses fancy-regex for tokenizer training. I ported my custom C implementation, which specifically parses the cl100k pattern, to Rust, along with a demo fuzz tester. I also temporarily added some options to 'lib.rs' to switch between the fancy-regex backend and the custom one I provided.
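For context, here is a rough sketch of what a hand-rolled splitter in the spirit of the cl100k pattern can look like. This is not the code in this PR: the contraction alternatives ('s, 'll, 've, 're), the optional leading byte before letter runs, and the newline/trailing-whitespace rules are all omitted, and the function name is made up.

```rust
/// Illustrative only: splits `text` into pieces roughly following a subset of
/// the cl100k pattern alternatives (letter runs, 1-3 digit runs, whitespace
/// runs, punctuation runs). The real pattern has more cases than this.
fn split_cl100k(text: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut pieces = Vec::new();
    let mut i = 0;

    while i < chars.len() {
        let (begin, start) = chars[i];
        let mut j = i;

        if start.is_alphabetic() {
            // `\p{L}+` — a run of letters (the optional leading
            // non-letter/non-digit character is not handled here).
            while j < chars.len() && chars[j].1.is_alphabetic() {
                j += 1;
            }
        } else if start.is_numeric() {
            // `\p{N}{1,3}` — at most three digits per piece.
            while j < chars.len() && chars[j].1.is_numeric() && j - i < 3 {
                j += 1;
            }
        } else if start.is_whitespace() {
            // `\s+` — a whitespace run (newline and lookahead rules omitted).
            while j < chars.len() && chars[j].1.is_whitespace() {
                j += 1;
            }
        } else {
            // ` ?[^\s\p{L}\p{N}]+` — a punctuation/symbol run.
            while j < chars.len()
                && !chars[j].1.is_whitespace()
                && !chars[j].1.is_alphabetic()
                && !chars[j].1.is_numeric()
            {
                j += 1;
            }
        }

        let end = chars.get(j).map_or(text.len(), |&(b, _)| b);
        pieces.push(&text[begin..end]);
        i = j;
    }
    pieces
}
```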
Changes
Reproducing
Running the benchmark for 100 iterations on a demo 1 MB file (see the timing sketch below)
Running the fuzzer for 50,000 steps, each generating a random 2028-character input text (see the comparison-loop sketch below)
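A minimal timing harness along these lines reproduces the first measurement. The file name is a placeholder, and the call site uses the `split_cl100k` sketch above rather than the PR's actual backend toggle in 'lib.rs'.

```rust
use std::fs;
use std::time::Instant;

fn main() {
    // Placeholder path for the ~1 MB demo file; adjust as needed.
    let text = fs::read_to_string("demo_1mb.txt").expect("demo file");
    let iters: u32 = 100;

    let start = Instant::now();
    let mut total_pieces = 0usize;
    for _ in 0..iters {
        // Swap this call between the fancy-regex path and the custom splitter
        // (here the `split_cl100k` sketch from above) to compare the backends.
        total_pieces += split_cl100k(&text).len();
    }
    let elapsed = start.elapsed();
    println!(
        "{iters} iterations, {total_pieces} pieces, {elapsed:?} total ({:?} per iteration)",
        elapsed / iters
    );
}
```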
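The fuzz check is a differential loop against the fancy-regex output as the oracle. The sketch below (using the fancy-regex and rand 0.8 crates) only illustrates the structure: with the PR's full custom backend the mismatch count should be zero, whereas the simplified `split_cl100k` sketch above will disagree on many inputs, and the real fuzzer's character distribution may differ from the uniform random characters used here.

```rust
use fancy_regex::Regex;
use rand::Rng;

fn main() {
    // The cl100k_base split pattern (the same one fed to fancy-regex);
    // its matches are used as the reference answer.
    let pat = Regex::new(
        r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+",
    )
    .unwrap();
    let mut rng = rand::thread_rng();
    let mut mismatches = 0u32;

    for _ in 0..50_000u32 {
        // Random input of the length mentioned above, built from random
        // Unicode scalar values.
        let input: String = (0..2028).map(|_| rng.gen::<char>()).collect();

        let oracle: Vec<&str> = pat
            .find_iter(&input)
            .map(|m| m.expect("regex error").as_str())
            .collect();
        // Replace this call with the PR's custom backend to run the real check.
        let candidate = split_cl100k(&input);

        if oracle != candidate {
            mismatches += 1;
        }
    }
    println!("mismatching inputs: {mismatches} / 50000");
}
```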
Notes