Description
Hello, I've noticed a bug in the implementation of the ByteLevel post-processor.
The documentation of this post-processor only lists trim_offsets as a configurable parameter: https://huggingface.co/docs/tokenizers/en/api/post-processors#tokenizers.processors.ByteLevel
However, the Rust source code shows that the post-processor also uses an add_prefix_space parameter under the hood. The default value for this parameter is true:
process_offsets(encoding, self.add_prefix_space);
This creates a problem when the ByteLevel pre-tokenizer is configured with add_prefix_space=false (in this case, the parameter is correctly listed in the documentation). In this scenario, the pre-tokenizer and post-processor are misaligned, because the post-processor still uses the default add_prefix_space=true. This can lead to incorrect offset mappings when the post-processor is used with trim_offsets=true and the input text starts with whitespace.
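To make the mismatch concrete, here is a rough pure-Python sketch of how I understand the offset trimming to behave. It is a simplified model, not the library's code: the function name trim_offsets_model is hypothetical, the tokens and offsets are an assumed tokenization of " hello world", and the only piece taken from the description above is the special case that keeps a single leading space on the first token when add_prefix_space is true.

```python
# Simplified, hypothetical model of offset trimming in a byte-level
# post-processor. "Ġ" is the byte-level stand-in for a space character.
SPACE = "Ġ"


def trim_offsets_model(tokens, offsets, add_prefix_space):
    """Return offsets with leading/trailing byte-level spaces trimmed."""
    trimmed = []
    for i, (token, (start, end)) in enumerate(zip(tokens, offsets)):
        leading = len(token) - len(token.lstrip(SPACE))
        trailing = len(token) - len(token.rstrip(SPACE))
        is_first = i == 0 or start == 0
        # Assumption: with add_prefix_space=True, one leading space on the
        # first token is treated as "added by the pre-tokenizer" and kept.
        if is_first and add_prefix_space and leading == 1:
            leading = 0
        if leading > 0:
            start = min(start + leading, end)
        if trailing > 0 and end >= trailing:
            end = max(end - trailing, start)
        trimmed.append((start, end))
    return trimmed


# Assumed output of a pre-tokenizer configured with add_prefix_space=False:
# the leading space is part of the user's text, not an inserted prefix.
text = " hello world"
tokens = ["Ġhello", "Ġworld"]
offsets = [(0, 6), (6, 12)]

# Post-processor left at its default (add_prefix_space=True): misaligned.
misaligned = trim_offsets_model(tokens, offsets, add_prefix_space=True)
# Post-processor matching the pre-tokenizer (add_prefix_space=False).
aligned = trim_offsets_model(tokens, offsets, add_prefix_space=False)

print(misaligned, repr(text[misaligned[0][0]:misaligned[0][1]]))
print(aligned, repr(text[aligned[0][0]:aligned[0][1]]))
```

In this model, the first token's offsets still cover the leading whitespace when the post-processor assumes add_prefix_space=true, while they are trimmed as expected once the two settings agree.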
The easiest fix for this is to allow users to configure the add_prefix_space parameter of the ByteLevel post-processor and describe this option accordingly in the documentation.