
[WIP] support different tokenizers #1318


Draft
wants to merge 1 commit into base: dsv3-model

Conversation

@H-Huang (Member) commented Jun 18, 2025

Requesting feedback since this will change the tokenizer directory for existing users. We need to add support for multiple tokenizers in titan (we currently only have the one used in llama3).


What happens currently:

  • scripts/download_tokenizer.py downloads a single tokenizer.model (the llama3 one), stored under an original/ directory
  • The tokenizer is loaded via tiktoken, so only the llama3 tokenizer is supported

New workflow:

  • Users call scripts/download_tokenizer.py
  • The script saves the tokenizer config files to a directory named assets/tokenizer/<model_name>/
  • Users use the tokenizer by referencing that directory in the .toml configs
  • The Hugging Face tokenizers library is used to load the tokenizer (see the sketch after this list)
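
A minimal sketch of what that loading path could look like, assuming the Hugging Face tokenizers library and the directory layout above (the model directory name is illustrative, not titan's actual implementation):

# Hedged sketch: load a tokenizer saved by scripts/download_tokenizer.py.
# "DeepSeek-V3" stands in for <model_name>; adjust to whatever was downloaded.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("assets/tokenizer/DeepSeek-V3/tokenizer.json")
ids = tokenizer.encode("hello world").ids
print(ids)                    # token ids
print(tokenizer.decode(ids))  # round-trips back to text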

Pros:

  • Supports additional/existing preloaded tokenizers
  • Can remove tiktoken dependency

Cons:

  • Breaks current users depending on original/tokenizer.model
  • Adds a new dependency on the HF tokenizers library

@H-Huang requested a review from @tianyu-l on Jun 18, 2025 16:20
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 18, 2025
@H-Huang requested a review from @wwwjn on Jun 18, 2025 16:20
@@ -0,0 +1,225 @@
#!/usr/bin/env python3
Member Author

Will remove this file, I was just using it to test.

# Derive the local directory name from the HF repo id,
# e.g. "deepseek-ai/DeepSeek-V3" -> "DeepSeek-V3"
model_name = repo_id.split("/")[-1]
model_dir = os.path.join(local_dir, model_name)

# Common tokenizer files to download
Contributor

This might not be comprehensive and may need the user to double-check; e.g., the CLIP tokenizer used in FLUX needs special_tokens_map.json and vocab.json: https://huggingface.co/openai/clip-vit-large-patch14/tree/main
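
One hedged way to handle model-specific file sets is to try a superset of candidate files and skip whatever a repo doesn't ship. A sketch using huggingface_hub (the file list and helper name are illustrative, not this PR's actual implementation):

# Sketch only: download whichever candidate tokenizer files exist in the repo.
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

CANDIDATE_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",  # e.g. needed by the CLIP tokenizer
    "vocab.json",               # e.g. needed by the CLIP tokenizer
    "merges.txt",
]

def download_tokenizer_files(repo_id, local_dir, token=None):
    for filename in CANDIDATE_FILES:
        try:
            hf_hub_download(repo_id=repo_id, filename=filename,
                            local_dir=local_dir, token=token)
        except EntryNotFoundError:
            # Not every repo ships every file; skip the missing ones.
            pass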

Member Author

Thanks, good point! Let me look into how image tokenizers work too

Contributor

CLIP is a text encoder rather than an image encoder, and I feel like the "transformers" repo supports more tokenizers than the "tokenizers" repo
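
As a sketch of that alternative, transformers' AutoTokenizer can load from the same saved directory and covers more tokenizer types (the path is illustrative):

# Sketch: loading via transformers instead of the tokenizers library.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("assets/tokenizer/DeepSeek-V3")  # illustrative path
print(tok("hello world")["input_ids"])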

# Llama 3.1 tokenizer (gated repo, so an HF token is required)
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --hf_token=...

# DeepSeek tokenizer (automatically downloads tokenizer.json and tokenizer_config.json)
python scripts/download_tokenizer.py --repo_id deepseek-ai/DeepSeek-V3
Contributor

Just curious, is hf_token not required?

Labels
CLA Signed This label is managed by the Meta Open Source bot.
4 participants