Skip to content

Fix type notation of merges in BPE Python binding #1766

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Coqueue
Copy link

@Coqueue Coqueue commented Apr 21, 2025

Hi, this change is to correct type notations of merges in Python binding of BPE as the counterpart in Rust implementation expects a file name string or Vec<(String, String)> instead of a mapping from a tuple of integers to another.

This is a second PR with the same change as PR#1685, which got closed with replies to questions from an reviewer that got ignored. I'll include the reasoning for this changes below.

Taking the Python SentencePieceBPETokenizer class as an example, the merges annotated as Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]] is only used to initialize the BPE class:

if vocab is not None and merges is not None:
        tokenizer = Tokenizer(BPE(vocab, merges, dropout=dropout, unk_token=unk_token, fuse_unk=fuse_unk))

where the BPE class is generated from the Rust implementation mentioned in the original post.

BPE class also contains below comment:

class BPE(Model):
    """
    An implementation of the BPE (Byte-Pair Encoding) algorithm

    Args:
        vocab (:obj:`Dict[str, int]`, `optional`):
            A dictionary of string keys and their ids :obj:`{"am": 0,...}`

        merges (:obj:`List[Tuple[str, str]]`, `optional`):
            A list of pairs of tokens (:obj:`Tuple[str, str]`) :obj:`[("a", "b"),...]`

This comment also gives a different type annotation for merges.

I encountered an issue initializing SentencePieceBPETokenizer with the given type annotation, but succeeded following the type annotation given in the comment of BPE class.

The type annotation of merges being Dict[Tuple[int, int], Tuple[int, int]]]] should actually be for MergesMap in Rust, as an alias of HashMap<Pair, (u32, u32)>, which is built internally in Rust from merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant