Add option for TransformerWordEmbeddings to respect whitespace offsets #3646
+169 −1
Closes #3400
This PR adds a new parameter to TransformerWordEmbeddings that optionally embeds the "raw text" of sentences, meaning that whitespace information is preserved in the embeddings.
This is best illustrated with this snippet, in which the sentences "I love Berlin." and "I love Berlin ." normally get the same embeddings, but with the parameter set get different embeddings, reflecting the inserted whitespace:

Theoretically, accurately reflecting whitespace should give us better performance.
However, in my limited experiments I am finding the opposite to be the case. The reason seems to be twofold:
1. The vocabulary for tokens that do not follow a whitespace is impoverished. For instance, the above transformer model has an entry for "▁diplomacy" (the word "diplomacy" following a whitespace) but no entry for "diplomacy", causing the word to be split into "diploma" and "cy" in a sentence like "number of public-diplomacy officers". This is suboptimal.
2. The regular implementation always ensures that the tokenization of the dataset is used exactly, including "unrealistic" tokenization. For instance, in the UD dataset I used for testing, the word "gotta" ("I gotta have this") is tokenized as "got" and "ta". Our standard implementation uses these tokens as subtokens, while the whitespace-preserving option will subtokenize this as a single subtoken "gotta", which results in the two UD tokens getting the same embedding. This will impact evaluation numbers but has no impact on practical applications, since most tokenizers do not split mid-word.

My feeling is that these two issues do more harm than the advantage gained by accurately reflecting whitespace. More testing is needed, though, to confirm.
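Both effects can be reproduced with a toy greedy longest-match subword tokenizer in the SentencePiece style, where "▁" marks a piece that follows whitespace. The vocabulary here is invented purely to illustrate the two issues; it is not the actual model's vocabulary.

```python
# Hypothetical vocabulary: "▁diplomacy" exists, bare "diplomacy" does not.
VOCAB = {"▁diplomacy", "diploma", "cy", "▁got", "▁ta", "▁gotta"}

def subtokenize(word, follows_whitespace=True):
    """Greedily split a word into the longest matching vocabulary pieces."""
    text = ("▁" if follows_whitespace else "") + word
    pieces = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in VOCAB:
                pieces.append(text[:end])
                text = text[end:]
                break
        else:
            # Unknown character: emit it as-is (stand-in for an <unk> piece).
            pieces.append(text[0])
            text = text[1:]
    return pieces

# Issue 1: "diplomacy" after a whitespace is one piece, but glued to the
# previous token (as in "public-diplomacy") it falls apart:
assert subtokenize("diplomacy", follows_whitespace=True) == ["▁diplomacy"]
assert subtokenize("diplomacy", follows_whitespace=False) == ["diploma", "cy"]

# Issue 2: subtokenizing the dataset tokens "got" and "ta" separately keeps
# them distinct, while the raw word "gotta" becomes a single piece, so both
# UD tokens would share one embedding:
assert subtokenize("got") + subtokenize("ta") == ["▁got", "▁ta"]
assert subtokenize("gotta") == ["▁gotta"]
```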