Add option for TransformerWordEmbeddings to respect whitespace offsets #3646
+169 −1
Closes #3400
This PR adds a new parameter to TransformerWordEmbeddings that optionally embeds the "raw text" of sentences, meaning that whitespace information is preserved in the embeddings.
This is best illustrated with this snippet, in which the sentences "I love Berlin." and "I love Berlin ." normally get the same embeddings, but with the parameter set get different embeddings, reflecting the inserted whitespace:

Theoretically, accurately reflecting whitespace should give us better performance.
However, in my limited experiments I am finding the opposite to be the case. The reason seems to be twofold:
1. The vocabulary for tokens that do not follow a whitespace is impoverished. For instance, the above transformer model has an entry for "▁diplomacy" (the word "diplomacy" following a whitespace) but no entry for "diplomacy", causing the word to be split into "diploma" and "cy" in a sentence like "number of public-diplomacy officers". This is suboptimal.
2. The regular implementation always ensures that the tokenization of the dataset is used exactly, including "unrealistic" tokenization. For instance, in the UD dataset I used for testing, the word "gotta" ("I gotta have this") is tokenized as "got" and "ta". Our standard implementation uses these tokens as subtokens, while the whitespace-preserving option will subtokenize this as a single subtoken "gotta", which results in the two UD tokens getting the same embedding. This will impact evaluation numbers but has no impact on practical applications, since most tokenizers do not split mid-word.

My feeling is that these two issues do more harm than the advantage gained by accurately reflecting whitespace. More testing is needed, though, to confirm.
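Both effects can be reproduced with a toy greedy longest-match subword tokenizer in the SentencePiece style, where "▁" marks a piece that follows whitespace. The vocabulary here is invented purely to illustrate the two issues; it is not the actual model's vocabulary.

```python
# Hypothetical vocabulary: "▁diplomacy" exists, bare "diplomacy" does not.
VOCAB = {"▁diplomacy", "diploma", "cy", "▁got", "▁ta", "▁gotta"}

def subtokenize(word, follows_whitespace=True):
    """Greedily split a word into the longest matching vocabulary pieces."""
    text = ("▁" if follows_whitespace else "") + word
    pieces = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in VOCAB:
                pieces.append(text[:end])
                text = text[end:]
                break
        else:
            # Unknown character: emit it as-is (stand-in for an <unk> piece).
            pieces.append(text[0])
            text = text[1:]
    return pieces

# Issue 1: "diplomacy" after a whitespace is one piece, but glued to the
# previous token (as in "public-diplomacy") it falls apart:
assert subtokenize("diplomacy", follows_whitespace=True) == ["▁diplomacy"]
assert subtokenize("diplomacy", follows_whitespace=False) == ["diploma", "cy"]

# Issue 2: subtokenizing the dataset tokens "got" and "ta" separately keeps
# them distinct, while the raw word "gotta" becomes a single piece, so both
# UD tokens would share one embedding:
assert subtokenize("got") + subtokenize("ta") == ["▁got", "▁ta"]
assert subtokenize("gotta") == ["▁gotta"]
```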