Bug with add_prefix_space parameter for ByteLevel post-processor #1819

@megatron6000

Description

Hello, I've noticed a bug in the implementation of the ByteLevel post-processor.

The documentation of this post-processor only lists trim_offsets as a configurable parameter: https://huggingface.co/docs/tokenizers/en/api/post-processors#tokenizers.processors.ByteLevel

However, the Rust source code shows that the post-processor also uses an add_prefix_space parameter under the hood. The default value for this param is true:

process_offsets(encoding, self.add_prefix_space);

This creates a problem when the ByteLevel pre-tokenizer is configured with add_prefix_space=false (for the pre-tokenizer, this parameter is correctly listed in the documentation). In this scenario, the pre-tokenizer and post-processor are misaligned, because the post-processor still uses the default add_prefix_space=true. This can lead to incorrect offset mappings when the post-processor is used with trim_offsets=true and the input text starts with a whitespace character.
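
To illustrate the mismatch, here is a minimal pure-Python sketch of the offset trimming. It is a simplified model, not the library's code: the function name `trim_offsets` is hypothetical, and it assumes that the post-processor leaves a single leading space on the first token untouched when add_prefix_space is true (on the theory that the pre-tokenizer added it), and trims it otherwise.

```python
# Simplified model of offset trimming (an assumption, not the tokenizers source).
BYTE_LEVEL_SPACE = "\u0120"  # "Ġ", the byte-level rendering of " "


def _is_space(ch: str) -> bool:
    return ch == BYTE_LEVEL_SPACE or ch.isspace()


def trim_offsets(token: str, offsets: tuple, index: int, add_prefix_space: bool) -> tuple:
    """Hypothetical helper: shrink (start, end) so it excludes surrounding whitespace."""
    start, end = offsets

    # Count whitespace characters at the start and end of the token.
    leading = 0
    for ch in token:
        if not _is_space(ch):
            break
        leading += 1
    trailing = 0
    for ch in reversed(token):
        if not _is_space(ch):
            break
        trailing += 1

    if leading > 0:
        is_first = index == 0 or start == 0
        # Assumed behavior: a single leading space on the first token is taken
        # to be the prefix space added by the pre-tokenizer, so it is kept.
        if is_first and add_prefix_space and leading == 1:
            leading = 0
        start = min(start + leading, end)
    if trailing > 0 and end >= trailing:
        end = max(end - trailing, start)
    return (start, end)


# Input text " hello" with the pre-tokenizer set to add_prefix_space=false:
# the first token is "Ġhello" and covers characters 0..6 of the input.
misaligned = trim_offsets("\u0120hello", (0, 6), 0, add_prefix_space=True)
aligned = trim_offsets("\u0120hello", (0, 6), 0, add_prefix_space=False)
print(misaligned, aligned)
```

Under these assumptions, the default add_prefix_space=true keeps the real leading space inside the offsets of the first token, while add_prefix_space=false trims it as trim_offsets=true is meant to do. Tokens after the first are trimmed the same way in both cases.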

The easiest fix for this is to allow users to configure the add_prefix_space parameter of the ByteLevel post-processor and describe this option accordingly in the documentation.
