
SentencePieceBPE + Unicode NFD preprocessing leads to noise? #1851

@PonteIneptique

Description


Hi,
I have run into this issue multiple times, so I assume I am doing something wrong.

Versions:

  • tokenizers==0.21.4
  • transformers==4.55.4

Training script

from transformers import PreTrainedTokenizerFast
from pathlib import Path
from read import get_texts_iter_for_tokenizer
from tokenizers import SentencePieceBPETokenizer, normalizers, pre_tokenizers

def main():
    output_dir = Path("hf_tokenizer")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Get an iterator over the training texts
    texts = get_texts_iter_for_tokenizer()

    # Train SentencePiece model
    tokenizer = SentencePieceBPETokenizer()

    # Adding normalization and pre_tokenizer
    tokenizer.normalizer = normalizers.Sequence([normalizers.NFD()])
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

    # Special tokens to include in the vocabulary
    special_tokens = ["<unk>", "<pad>", "<cls>", "<sep>", "<mask>"]

    # Training from iterator REMEMBER it's training on test set...
    tokenizer.train_from_iterator(texts, special_tokens=special_tokens, show_progress=True)

    fast_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        unk_token="<unk>",
        pad_token="<pad>",
        cls_token="<cls>",
        sep_token="<sep>",
        mask_token="<mask>"
    )
    fast_tokenizer.save_pretrained(str(output_dir))

if __name__ == "__main__":
    main()

Script to reproduce the bug:

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("hf_tokenizer")

# Test
print(hf_tokenizer.tokenize("⁊ĩ rẽ dñi u̾sum"))
# ['âģĬ', 'i', 'Ìĥ', 'Ġre', 'Ìĥ', 'Ġdn', 'Ìĥ', 'i', 'Ġu', '̾', 'sum']
print(hf_tokenizer.decode(hf_tokenizer.encode("⁊ĩ rẽ dñi u̾sum")))
# âģĬiÌĥĠreÌĥĠdnÌĥiĠu̾sum
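Note on where the "noise" comes from: the NFD normalizer decomposes precomposed letters like `ẽ` into a base letter plus a combining mark, and the ByteLevel pre-tokenizer then renders each UTF-8 byte of that combining mark as a printable symbol — `Ìĥ` appears to be exactly the byte-level form of the combining tilde's two UTF-8 bytes (0xCC 0x83). A stdlib-only sketch of that decomposition:

```python
import unicodedata

text = "rẽ"
nfd = unicodedata.normalize("NFD", text)
# NFD splits "ẽ" into the base letter "e" plus a combining tilde (U+0303)
print([hex(ord(c)) for c in nfd])    # ['0x72', '0x65', '0x303']
# The combining tilde occupies two bytes in UTF-8, which ByteLevel
# encodes as two separate printable symbols
print("\u0303".encode("utf-8"))      # b'\xcc\x83'
```

So the tokens themselves are not corrupted — they are the byte-level alphabet's representation of the decomposed text.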

I assume I am doing something wrong around preprocessing / postprocessing?
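One likely cause (an assumption, not confirmed in this thread): `SentencePieceBPETokenizer` ships with a Metaspace decoder, so after swapping in a ByteLevel pre-tokenizer, `decode()` never maps the byte-level symbols (`Ġ`, `Ì`, `ĥ`, …) back to real bytes. Setting a matching `decoders.ByteLevel()` should make the round trip clean. A minimal sketch on a toy corpus (the single-sentence training data is only there to make it runnable):

```python
from tokenizers import SentencePieceBPETokenizer, decoders, normalizers, pre_tokenizers

tokenizer = SentencePieceBPETokenizer()
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD()])
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
# Pair the ByteLevel pre-tokenizer with a ByteLevel decoder so that
# decode() converts the byte-level symbols back into UTF-8 bytes
tokenizer.decoder = decoders.ByteLevel()

# Toy corpus, just to build an alphabet covering the test string
tokenizer.train_from_iterator(["⁊ĩ rẽ dñi u̾sum"], special_tokens=["<unk>"],
                              show_progress=False)

text = "⁊ĩ rẽ dñi u̾sum"
decoded = tokenizer.decode(tokenizer.encode(text).ids)
# decoded should now be the NFD form of the input (possibly with a
# leading space from add_prefix_space), not mojibake
print(decoded)
```

Note that the decoded text will still be in NFD form; if you need the original precomposed characters back, re-apply NFC with `unicodedata.normalize("NFC", ...)` after decoding.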
