Fix type notation of merges in BPE Python binding #1766
Hi, this change corrects the type annotation of `merges` in the Python binding of BPE. The Rust counterpart expects either a file name string or a `Vec<(String, String)>`, not a mapping from one tuple of integers to another.
This is a second PR with the same change as PR #1685, which was closed after my replies to a reviewer's questions went unanswered. I'll include the reasoning for this change below.
Taking the Python `SentencePieceBPETokenizer` class as an example, `merges`, annotated as `Optional[Union[str, Dict[Tuple[int, int], Tuple[int, int]]]]`, is only used to initialize the `BPE` class, which is generated from the Rust implementation mentioned in the original post. The `BPE` class also contains a comment that gives a different type annotation for `merges`.
I encountered an issue initializing `SentencePieceBPETokenizer` with the given type annotation, but succeeded when following the type annotation given in the comment of the `BPE` class.
The type annotation `Dict[Tuple[int, int], Tuple[int, int]]` for `merges` should actually describe `MergesMap` in Rust, an alias of `HashMap<Pair, (u32, u32)>`, which is built internally in Rust from `merges`.
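To make the distinction concrete, here is a minimal pure-Python sketch of how a structure shaped like `MergesMap` could be derived from `merges` given as a list of string pairs. It is an illustration only: `build_merges_map` is a hypothetical helper, not part of the `tokenizers` API, and it assumes each value holds the merge rank and the ID of the merged token. It also ignores details such as a continuing-subword prefix.

```python
from typing import Dict, List, Tuple

# What the caller passes in: token strings, not token IDs.
Vocab = Dict[str, int]
Merges = List[Tuple[str, str]]

# Shape of the internal map (mirrors HashMap<Pair, (u32, u32)>).
MergesMap = Dict[Tuple[int, int], Tuple[int, int]]


def build_merges_map(vocab: Vocab, merges: Merges) -> MergesMap:
    """Hypothetical helper: derive an ID-keyed merge map from string merges."""
    merges_map: MergesMap = {}
    for rank, (left, right) in enumerate(merges):
        # Key: the IDs of the two tokens being merged.
        pair = (vocab[left], vocab[right])
        # Value (assumed layout): merge rank and ID of the merged token.
        merges_map[pair] = (rank, vocab[left + right])
    return merges_map


vocab = {"a": 0, "b": 1, "c": 2, "ab": 3, "abc": 4}
merges = [("a", "b"), ("ab", "c")]
merges_map = build_merges_map(vocab, merges)
print(merges_map)
```

The point of the sketch is that the integer-keyed dictionary is a derived, internal structure. The user-facing `merges` argument carries pairs of token strings (or a file name), which is what the corrected annotation reflects.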