Tokenization is an oft-neglected part of natural language processing. With the recent explosion of interest in language models, it's worth stepping back and really getting into the guts of what tokenization is. This repo is meant to serve as a deep dive into different aspects of tokenization. It's been organized as bite-size chapters for easy navigation, with some code samples and (poorly designed) walkthrough notebooks. This is NOT meant to be a complete reference in itself; it's meant to accompany other excellent resources like HuggingFace's NLP course. The following topics are covered:
- Intro: A quick introduction on tokens and the different tokenization algorithms out there.
- BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model (a sketch of the core training loop appears right after this list).
- 🤗 Tokenizer: The internals of HuggingFace tokenizers! We look at state (what's saved by a tokenizer), data structures (how does it store what it saves), and methods (what functionality do you get). We also implement a minimal <200 line version of the 🤗 Tokenizer in Python for GPT2.
- Challenges with Tokenization: Challenges with integer tokenization, tokenization for non-English languages and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
- Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc.
- PostProcessing and more: A look at special tokens and postprocessing, glitch tokens and why you might want to shrink your tokenizer.
- Galactica: Thinking about tokenizer design by diving into the Galactica paper.
- Chat templates: Some tokenization tips and tricks while dealing with chat-templating for chat models (see the second sketch after this list).
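As a preview of the BPE chapter, here's a minimal sketch of the core training loop: count adjacent symbol pairs across the corpus, merge the most frequent pair, and repeat. This is a toy illustration (the `train_bpe` helper and the tiny corpus are made up for this sketch), not the repo's implementation:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn `num_merges` BPE merge rules from a {word-as-symbol-tuple: frequency} dict."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (ties broken arbitrarily)
        merges.append(best)
        # Apply the merge to every word in the corpus.
        new_words = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] = new_words.get(tuple(out), 0) + freq
        words = new_words
    return merges

# Toy corpus: word frequencies from a pretend dataset.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w", "e", "s", "t"): 6}
print(train_bpe(corpus, 3))  # [('w', 'e'), ('l', 'o'), ('n', 'e')] on this toy corpus
```

Real BPE tokenizers layer pre-tokenization, byte-level alphabets, and deterministic tie-breaking on top of this loop; the BPE chapter goes through those details.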
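And as a taste of the chat-templates chapter, here's a small example of rendering a conversation with `apply_chat_template` (available in `transformers>=4.34`, so the pinned version below works). The Zephyr checkpoint is used purely as an example of a model that ships a chat template:

```python
from transformers import AutoTokenizer

# Any chat model whose tokenizer config includes a chat template works here;
# Zephyr is just one example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
]

# tokenize=False returns the rendered string instead of token IDs;
# add_generation_prompt=True appends the header for the assistant's reply.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```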
To run the notebooks in the repo, you only need two libraries: `transformers` and `tiktoken`:

```bash
pip install transformers tiktoken
```

Code has been tested with `transformers==4.35.0` and `tiktoken==0.5.1`.
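As a quick sanity check of the setup, you can tokenize the same string with both libraries. GPT-2's tokenizer is available in both, and the two should produce the same IDs for plain text:

```python
import tiktoken
from transformers import AutoTokenizer

text = "Tokenization is an oft-neglected part of NLP."

# 🤗 transformers' GPT-2 tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(hf_tokenizer.tokenize(text))  # BPE tokens, e.g. ['Token', 'ization', ...]
print(hf_tokenizer.encode(text))    # the corresponding token IDs

# tiktoken's GPT-2 encoding; the IDs should match the 🤗 ones above
tt_encoding = tiktoken.get_encoding("gpt2")
print(tt_encoding.encode(text))
```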
A basic understanding of language models and tokenization is a must; the following resources are good starting points:
- A Hackers' Guide to Language Models by Prof. Jeremy Howard.
- What makes LLM tokenizers different from each other? by Jay Alammar.
- ChatGPT has Never Seen a SINGLE Word (Despite Reading Most of The Internet). Meet LLM Tokenizers. by Jay Alammar.
- [Optional] Chapter on tokenizers from The 🤗 NLP Course
If you notice any mistake/bug, or feel you could make an improvement to any section of the repo, please open an issue or make a PR 🙏