Hi,
I have run into this issue multiple times, so I assume I am doing something wrong: decoding an encoded string does not round-trip back to the original text, and instead I get byte-level artifacts (see the reproduction below).
Versions:
- tokenizers==0.21.4
- transformers==4.55.4
Training script:
from pathlib import Path

from tokenizers import SentencePieceBPETokenizer, normalizers, pre_tokenizers
from transformers import PreTrainedTokenizerFast

from read import get_texts_iter_for_tokenizer


def main():
    output_dir = Path("hf_tokenizer")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Iterator over the training texts
    texts = get_texts_iter_for_tokenizer()

    # Build the SentencePiece-style BPE tokenizer
    tokenizer = SentencePieceBPETokenizer()

    # Add normalization and pre-tokenization
    tokenizer.normalizer = normalizers.Sequence([normalizers.NFD()])
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

    # Special tokens to register during training
    special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

    # Train from the iterator (REMEMBER: it's training on the test set...)
    tokenizer.train_from_iterator(texts, special_tokens=special_tokens, show_progress=True)

    # Wrap in a transformers fast tokenizer and save it
    fast_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token="<unk>",
        pad_token="<pad>",
        cls_token="<cls>",
        sep_token="<sep>",
        mask_token="<mask>",
    )
    fast_tokenizer.save_pretrained(str(output_dir))


if __name__ == "__main__":
    main()
Script to reproduce the bug:
from transformers import PreTrainedTokenizerFast
hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("hf_tokenizer")
# Test
print(hf_tokenizer.tokenize("⁊ĩ rẽ dñi u̾sum"))
# ['âģĬ', 'i', 'Ìĥ', 'Ġre', 'Ìĥ', 'Ġdn', 'Ìĥ', 'i', 'Ġu', '̾', 'sum']
print(hf_tokenizer.decode(hf_tokenizer.encode("⁊ĩ rẽ dñi u̾sum")))
# âģĬiÌĥĠreÌĥĠdnÌĥiĠu̾sum
I assume I am doing something wrong around pre-processing / post-processing?
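In case it helps narrow things down, my guess is that the decode step needs a decoder matching the ByteLevel pre-tokenizer, since the byte-level markers (Ġ, Ì, ...) show up verbatim in the decoded output. A minimal sketch of the workaround I have in mind, assuming tokenizers.decoders.ByteLevel is the right counterpart (not sure this is the intended way to configure it):

from tokenizers import decoders
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("hf_tokenizer")

# Hypothetical fix: pair the ByteLevel pre-tokenizer with a ByteLevel decoder,
# so byte-level markers are mapped back to real characters on decode.
hf_tokenizer.backend_tokenizer.decoder = decoders.ByteLevel()

text = "⁊ĩ rẽ dñi u̾sum"
print(hf_tokenizer.decode(hf_tokenizer.encode(text)))
# Note: because of the NFD normalizer, the decoded text would be in NFD form,
# so it may not be byte-identical to the input even if it renders the same.

If that is the right direction, should the decoder also be set (or saved) in the training script, before save_pretrained?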