
[WIP] support different tokenizers #1318


Draft
wants to merge 1 commit into base: dsv3-model

Conversation

@H-Huang (Member) commented Jun 18, 2025

Requesting feedback since this will change the tokenizer directory for existing users. We need to add support for multiple tokenizers in titan (we currently only have the one used in llama3).


What happens currently:

  • scripts/download_tokenizer.py downloads a single tokenizer.model (the llama3 one), stored under an original/ directory
  • The tokenizer is loaded via tiktoken, so only the llama3 tokenizer is supported

New workflow:

  • Users call scripts/download_tokenizer.py
  • The script saves the tokenizer config files to a directory named assets/tokenizer/<model_name>/
  • Users use the tokenizer by referencing that directory in the .toml configs
  • The Hugging Face tokenizers library is used to load the tokenizer (see the sketch after this list)
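
A minimal sketch of what that loading path could look like, assuming the Hugging Face tokenizers library and the directory layout above (the model directory name is illustrative, not titan's actual implementation):

# Hedged sketch: load a tokenizer saved by scripts/download_tokenizer.py.
# "DeepSeek-V3" stands in for <model_name>; adjust to whatever was downloaded.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("assets/tokenizer/DeepSeek-V3/tokenizer.json")
ids = tokenizer.encode("hello world").ids
print(ids)                    # token ids
print(tokenizer.decode(ids))  # round-trips back to text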

Pros:

  • Supports additional/existing preloaded tokenizers
  • Can remove tiktoken dependency

Cons:

  • Breaks current users depending on original/tokenizer.model
  • Adds a new dependency on the HF tokenizers library

@H-Huang requested a review from @tianyu-l on Jun 18, 2025 16:20
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jun 18, 2025
@H-Huang requested a review from @wwwjn on Jun 18, 2025 16:20
@@ -0,0 +1,225 @@
#!/usr/bin/env python3
Member Author

Will remove this file, I was just using it to test.

# Derive the local directory name from the HF repo id,
# e.g. "deepseek-ai/DeepSeek-V3" -> "DeepSeek-V3"
model_name = repo_id.split("/")[-1]
model_dir = os.path.join(local_dir, model_name)

# Common tokenizer files to download
Contributor

This might not be comprehensive and may need the user to double-check; e.g., the CLIP tokenizer used in FLUX needs special_tokens_map.json and vocab.json: https://huggingface.co/openai/clip-vit-large-patch14/tree/main
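
One hedged way to handle model-specific file sets is to try a superset of candidate files and skip whatever a repo doesn't ship. A sketch using huggingface_hub (the file list and helper name are illustrative, not this PR's actual implementation):

# Sketch only: download whichever candidate tokenizer files exist in the repo.
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import EntryNotFoundError

CANDIDATE_FILES = [
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",  # e.g. needed by the CLIP tokenizer
    "vocab.json",               # e.g. needed by the CLIP tokenizer
    "merges.txt",
]

def download_tokenizer_files(repo_id, local_dir, token=None):
    for filename in CANDIDATE_FILES:
        try:
            hf_hub_download(repo_id=repo_id, filename=filename,
                            local_dir=local_dir, token=token)
        except EntryNotFoundError:
            # Not every repo ships every file; skip the missing ones.
            pass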

Member Author

Thanks, good point! Let me look into how image tokenizers work too

Contributor

CLIP is a text encoder rather than an image encoder, and I feel like the "transformers" repo supports more tokenizers than the "tokenizers" repo
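
As a sketch of that alternative, transformers' AutoTokenizer can load from the same saved directory and covers more tokenizer types (the path is illustrative):

# Sketch: loading via transformers instead of the tokenizers library.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("assets/tokenizer/DeepSeek-V3")  # illustrative path
print(tok("hello world")["input_ids"])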

# Llama 3.1 tokenizer (gated repo, so an HF token is required)
python scripts/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3.1-8B --hf_token=...

# DeepSeek tokenizer (automatically downloads tokenizer.json and tokenizer_config.json)
python scripts/download_tokenizer.py --repo_id deepseek-ai/DeepSeek-V3
Contributor

Just curious, is hf_token not required?

Labels
CLA Signed This label is managed by the Meta Open Source bot.
4 participants