
Add option for TransformerWordEmbeddings to respect whitespace offsets #3646


Open
alanakbik wants to merge 4 commits into master

Conversation

alanakbik (Collaborator)

Closes #3400

This PR adds a new parameter to TransformerWordEmbeddings that optionally lets you embed the "raw text" of sentences, meaning that whitespace information is preserved in the embeddings.

This is best illustrated with the snippet below, in which the sentences "I love Berlin." and "I love Berlin ." normally get the same embeddings, but with the parameter set they get different embeddings, reflecting the inserted whitespace:

import flair
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

flair.device = "cpu"
embeddings = TransformerWordEmbeddings("microsoft/deberta-v3-xsmall")

# two example sentences, difference is whether the period is offset with a whitespace
sentence_1 = Sentence("I love Berlin.")
sentence_2 = Sentence("I love Berlin .")

# We embed both and find the embeddings are the same
embeddings.embed([sentence_1, sentence_2])
print(sentence_1[2].embedding[:5])
print(sentence_2[2].embedding[:5])

# repeat this experiment, but this time use raw text (= preserve the whitespace information)
embeddings = TransformerWordEmbeddings("microsoft/deberta-v3-xsmall", use_raw_text_as_input=True)

# two example sentences, difference is whether the period is offset with a whitespace
sentence_1 = Sentence("I love Berlin.")
sentence_2 = Sentence("I love Berlin .")

# Now the two sentences have different embeddings
embeddings.embed([sentence_1, sentence_2])
print(sentence_1[2].embedding[:5])
print(sentence_2[2].embedding[:5])
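
For a sharper check than eyeballing the printed values, the two vectors can also be compared directly; a minimal sketch reusing sentence_1 and sentence_2 from the snippet above:

import torch

# expected True after the first half of the snippet (default settings),
# and False after the second half (use_raw_text_as_input=True)
print(torch.equal(sentence_1[2].embedding, sentence_2[2].embedding))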

Theoretically, accurately reflecting whitespace should give us better performance.

However, in my limited experiments I am finding the opposite to be the case. The reason seems to be twofold:

  1. The vocabulary for tokens that do not follow a whitespace is impoverished. For instance, the above transformer model has an entry for "▁diplomacy" (the word "diplomacy" following a whitespace) but no entry for "diplomacy" on its own, causing the word to be split into "diploma" and "cy" in a sentence like "number of public-diplomacy officers" (see the tokenizer sketch after this list). This is suboptimal.

  2. The regular implementation always ensures that the tokenization of the dataset is used exactly, including "unrealistic" tokenization. For instance, in the UD dataset I used for testing, the word "gotta" ("I gotta have this") is tokenized as "got" and "ta". Our standard implementation uses these tokens as subtokens, while the whitespace-preserving option subtokenizes this as the single subtoken "gotta", which results in the two UD tokens getting the same embedding. This affects evaluation numbers but has no impact on practical applications, since most tokenizers do not split mid-word.
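
Both effects can be inspected directly with the underlying Hugging Face tokenizer. A minimal sketch, assuming the same microsoft/deberta-v3-xsmall model as above; the exact splits depend on its SentencePiece vocabulary, so the comments describe what is reported in this PR rather than guaranteed output:

from transformers import AutoTokenizer  # requires: pip install transformers sentencepiece

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-xsmall")

# issue 1: "diplomacy" preceded by a whitespace should map to the single piece "▁diplomacy",
# while the hyphenated occurrence has no preceding whitespace and is reported to fall
# apart into smaller pieces such as "diploma" and "cy"
print(tokenizer.tokenize("public diplomacy"))
print(tokenizer.tokenize("number of public-diplomacy officers"))

# issue 2: on raw text, "gotta" is reported to stay a single subtoken, whereas the UD gold
# tokenization splits it into "got" + "ta", so both gold tokens share that one embedding
print(tokenizer.tokenize("I gotta have this"))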

My feeling is that these two issues do more harm than the benefit of accurately reflecting whitespace. More testing is needed to confirm this, though.

@alanakbik alanakbik changed the title Gh 3400 whitespace offsets Add option for TransformerWordEmbeddings to respect whitespace offsets Mar 24, 2025
Successfully merging this pull request may close these issues.

[Bug]: Whitespace offsets not properly utilized in TransformerEmbeddings