Bug with `add_prefix_space` parameter for ByteLevel post-processor #1819

Hello, I've noticed a bug in the implementation of the `ByteLevel` **post-processor**.

The documentation of this post-processor only lists `trim_offsets` as a configurable parameter: https://huggingface.co/docs/tokenizers/en/api/post-processors#tokenizers.processors.ByteLevel

However, the Rust source code shows that the post-processor also uses an `add_prefix_space` parameter under the hood. The default value for this parameter is `true`: https://github.com/huggingface/tokenizers/blob/dd4fc3df1a8a7cd135eecca2158db018d85f94f1/tokenizers/src/pre_tokenizers/byte_level.rs#L187

This creates a problem when the `ByteLevel` **pre-tokenizer** is configured with `add_prefix_space=false` ([in this case the parameter is correctly listed in the documentation](https://huggingface.co/docs/tokenizers/en/api/pre-tokenizers#tokenizers.pre_tokenizers.ByteLevel)). In this scenario, the pre-tokenizer and post-processor are misaligned, because the post-processor still uses the default `add_prefix_space=true`. This can lead to incorrect offset mappings when the post-processor is used with `trim_offsets=true` and the input text starts with whitespace.
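To illustrate the mismatch, here is a small pure-Python model of how I understand the leading-space trimming to behave for the first token. It is a simplified sketch based on my reading of the linked Rust source, not the library's actual code: `trim_first_offset` is a hypothetical helper, and the token `"Ġhello"` with offsets `(0, 6)` is an assumed pre-tokenizer output for the input `" hello"` with `add_prefix_space=false`.

```python
def trim_first_offset(token, offsets, add_prefix_space):
    """Simplified model of leading-space offset trimming for the FIRST token.

    Hypothetical helper for illustration only; it mirrors the rule as I read
    it in the Rust source and ignores trailing spaces and later tokens.
    """
    start, end = offsets
    # "Ġ" is the byte-level stand-in for a space character.
    leading = len(token) - len(token.lstrip("Ġ"))
    if add_prefix_space and leading == 1:
        # The post-processor assumes this single space was inserted by the
        # pre-tokenizer, so it does not trim it from the offsets.
        leading = 0
    return (min(start + leading, end), end)


text = " hello"
# Assumed first token when the pre-tokenizer runs with add_prefix_space=False:
# the leading space comes from the user's text, not from the pre-tokenizer.
token, offsets = "Ġhello", (0, 6)

# What the post-processor does with its default add_prefix_space=True.
with_default = trim_first_offset(token, offsets, add_prefix_space=True)
# What it would do if it matched the pre-tokenizer setting.
with_matching = trim_first_offset(token, offsets, add_prefix_space=False)

print(with_default, repr(text[with_default[0]:with_default[1]]))
print(with_matching, repr(text[with_matching[0]:with_matching[1]]))
```

Under these assumptions, the default setting leaves the real leading space inside the first token's span, while the matching setting trims it as `trim_offsets=true` is supposed to.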

The easiest fix for this is to allow users to configure the `add_prefix_space` parameter of the `ByteLevel` post-processor and describe this option accordingly in the documentation.
