The BytePairTokenizer class is extremely, extremely slow at tokenizing #2056
Hi @chenying99 - Thanks for reporting the issue. Could you tell me which tokenizer class from a library such as nltk or transformers you are comparing against? Please also provide more sample code here so we can reproduce the issue. I tried the transformers tokenizer (bert-base-uncased) and it works fine: I get 0.0023925304412841797 sec (2.39 ms), which seems like a reasonable time.
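For reference, the measurement was roughly along these lines (a minimal sketch; the input string is just an example, not the exact text used):

```python
import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog."

start = time.time()
tokens = tokenizer(text)  # encode a single string
print(f"{time.time() - start} sec")
```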
I am testing the BytePairTokenizer in this project (keras_hub), not the tokenizer from the transformers library. I am training a large language model with this project using the BytePairTokenizer it provides, and tokenization takes up most of the training time.
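A minimal sketch of the slow path, assuming a trained BPE vocabulary and merges file on disk (`vocab.json` and `merges.txt` are placeholder paths for my own files):

```python
import time
import keras_hub

# Load a BPE tokenizer from placeholder vocabulary/merges files.
tokenizer = keras_hub.tokenizers.BytePairTokenizer(
    vocabulary="vocab.json",
    merges="merges.txt",
)

text = "The quick brown fox jumps over the lazy dog. " * 100

start = time.time()
token_ids = tokenizer(text)  # tokenize a single long string
print(f"{time.time() - start} seconds")
```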
Hi @chenying99 - Thanks for reporting the issue. I have raised a PR for this issue.
In actual training of a large language model, a batch of data is fed in at a time rather than a single sample; that is, batch processing is used. Even then, tokenization seems to take longer than the model's execution time. I also tested wrapping the call in tf.function, but it appears to take even longer (possibly due to my improper usage); a sketch of what I tried is below. @mehtamansi29
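Roughly what I tried (a sketch; `tokenizer` is the BytePairTokenizer from above, and the batch contents are placeholders):

```python
import tensorflow as tf

# Wrap the tokenizer call in tf.function to run it as a graph.
@tf.function
def tokenize_batch(batch):
    # Returns a ragged tensor of token ids, one row per input string.
    return tokenizer(batch)

batch = tf.constant(["some training text"] * 64)  # placeholder batch
token_ids = tokenize_batch(batch)
```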
Vocabulary size: 6400
Tokenization time: 3.8366940021514893 seconds