The BytePairTokenizer class is extremely, extremely slow at tokenizing #2056
Hi @chenying99 - Thanks for reporting the issue. Could you tell me which tokenizer class from a library such as nltk or transformers you are comparing against? Please also provide more sample code here so we can reproduce the issue. I tried the transformers tokenizer (bert-base-uncased) and it works fine: I get 0.0023925304412841797 sec (2.39 ms), which seems like a reasonable time.
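For reference, the measurement was roughly along these lines (a minimal sketch; the input string is just an example, not the exact text used):

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog."

start = time.time()
tokens = tokenizer(text)  # encode a single string
print(f"{time.time() - start} sec")
```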
I am testing the BytePairTokenizer in this project (keras_hub), not the tokenizer from the transformers library. I am training a large language model with this project using the BytePairTokenizer it provides, and tokenization takes up most of the training time.
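A minimal sketch of the slow path, assuming a trained BPE vocabulary and merges file on disk (`vocab.json` and `merges.txt` are placeholder paths for my own files):

```python
import time
import keras_hub

# Load a BPE tokenizer from placeholder vocabulary/merges files.
tokenizer = keras_hub.tokenizers.BytePairTokenizer(
    vocabulary="vocab.json",
    merges="merges.txt",
)

text = "The quick brown fox jumps over the lazy dog. " * 100

start = time.time()
token_ids = tokenizer(text)  # tokenize a single long string
print(f"{time.time() - start} seconds")
```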
Hi @chenying99 - Thanks for reporting the issue. I have raised a PR for this issue.
In actual training of a large language model, a batch of data is fed in at a time rather than a single sample; that is, batch processing is used. Even then, tokenization seems to take longer than the model's execution time. I also tested wrapping the call in tf.function, but it appears to take even longer (possibly due to my improper usage); a sketch of what I tried is below. @mehtamansi29
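Roughly what I tried (a sketch; `tokenizer` is the BytePairTokenizer from above, and the batch contents are placeholders):

```python
import tensorflow as tf

# Wrap the tokenizer call in tf.function to run it as a graph.
@tf.function
def tokenize_batch(batch):
    # Returns a ragged tensor of token ids, one row per input string.
    return tokenizer(batch)

batch = tf.constant(["some training text"] * 64)  # placeholder batch
token_ids = tokenize_batch(batch)
```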
Vocabulary size: 6400
Tokenization time: 3.8366940021514893 seconds