Description
Hello, I've noticed a bug in the implementation of the ByteLevel post-processor.
The documentation of this post-processor only lists trim_offsets as a configurable parameter: https://huggingface.co/docs/tokenizers/en/api/post-processors#tokenizers.processors.ByteLevel
However, the Rust source code shows that the post-processor also uses an add_prefix_space parameter under the hood. The default value for this param is true:
    process_offsets(encoding, self.add_prefix_space);
This creates a problem when the ByteLevel pre-tokenizer is configured with add_prefix_space=false (for the pre-tokenizer, this parameter is correctly listed in the documentation). In this scenario, the pre-tokenizer and post-processor are misaligned, because the post-processor still uses its default add_prefix_space=true. This can produce incorrect offset mappings when the post-processor is used with trim_offsets=true and the input text starts with whitespace.
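To illustrate the misalignment, here is a minimal pure-Python sketch of the trimming step. It is a simplified model, not the library's code: the function name `trim_leading_offsets` is hypothetical, only leading spaces are handled, and it assumes the rule is roughly "keep a single leading space on the first token when add_prefix_space is true, since it is presumed to have been inserted by the pre-tokenizer". The tokens and offsets are written by hand for the input " hello world", as a pre-tokenizer with add_prefix_space=false might produce them.

```python
SPACE = "\u0120"  # "Ġ", the byte-level rendering of a space character


def trim_leading_offsets(tokens, offsets, add_prefix_space):
    """Simplified model of offset trimming (assumption, not the real implementation)."""
    trimmed = []
    for i, (token, (start, end)) in enumerate(zip(tokens, offsets)):
        # Count leading byte-level spaces in the token.
        leading = len(token) - len(token.lstrip(SPACE))
        is_first = i == 0 or start == 0
        if is_first and add_prefix_space and leading == 1:
            # Assumed rule: a single leading space on the first token is treated
            # as one the pre-tokenizer added, so it is not trimmed.
            leading = 0
        trimmed.append((min(start + leading, end), end))
    return trimmed


text = " hello world"  # the input really does start with a space
tokens = [SPACE + "hello", SPACE + "world"]
offsets = [(0, 6), (6, 12)]

# Post-processor left at its default (add_prefix_space=true): the real leading
# space is mistaken for an inserted prefix and stays inside the first span.
mismatched = trim_leading_offsets(tokens, offsets, add_prefix_space=True)

# Post-processor aligned with the pre-tokenizer (add_prefix_space=false).
aligned = trim_leading_offsets(tokens, offsets, add_prefix_space=False)

print(mismatched, repr(text[mismatched[0][0]:mismatched[0][1]]))
print(aligned, repr(text[aligned[0][0]:aligned[0][1]]))
```

Under these assumptions, the mismatched configuration leaves the first token's span covering the leading space, while the aligned one trims it, which is the behavior one would expect from trim_offsets=true.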
The easiest fix for this is to allow users to configure the add_prefix_space parameter of the ByteLevel post-processor and describe this option accordingly in the documentation.