Paper, Tags: #nlp, #embeddings
This paper demonstrates a general semi-supervised method for adding pre-trained context embeddings from bidirectional language models to augment token representations in sequence tagging models.
Sequence tagging: assign a categorical label to each member of a sequence of observed values. Current (2017) state-of-the-art sequence tagging models include a bidirectional RNN that encodes the token sequence into a context-sensitive representation; the parameters of this biRNN are learned only on labeled data.
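For concreteness, a toy instance of the task (the sentence and BIO-style NER labels below are illustrative, not from the paper):

```python
# One categorical label per observed token (BIO-style NER tags in this toy example).
tokens = ["Barack", "Obama", "visited", "Paris", "yesterday"]
labels = ["B-PER", "I-PER", "O", "B-LOC", "O"]
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```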
Our approach doesn't require additional labeled data. A neural language model (LM), pre-trained on a large unlabeled corpus, computes an encoding of the context at each position in the sequence (the LM embedding), which is then used in the supervised sequence tagging model.
- The context-sensitive representation captured in the LM embeddings is useful in the supervised sequence tagging setting.
- Using both forward and backward LM embeddings boosts performance over a forward-only LM.
- Domain-specific pre-training isn't necessary: an LM trained on the news domain still helps when applied to scientific papers.
- Pre-train word embeddings and a language model on large, unlabeled corpora
- Compute a word embedding and an LM embedding for each token in the input sequence
- Use both word embeddings and LM embeddings in the sequence tagging model (the model receives two inputs per token)
- Word embedding model
- The character representation captures morphological information and is computed with either a CNN or an RNN
- Token embeddings, obtained via a lookup table initialized with pre-trained word embeddings (see the sketch below)
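A minimal sketch of such a token representation: a character-level CNN concatenated with a token-embedding lookup initialized from pre-trained vectors. All names and dimensions are hypothetical, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TokenRepresentation(nn.Module):
    """Concatenate a character-CNN feature with a pre-trained token-embedding lookup."""

    def __init__(self, num_chars, char_dim, char_filters, vocab_size, word_dim,
                 pretrained_vectors=None):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # 1-D convolution over characters, max-pooled into one vector per token.
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        if pretrained_vectors is not None:
            # Initialize the lookup table from pre-trained word embeddings.
            self.word_emb.weight.data.copy_(pretrained_vectors)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, t, c = char_ids.shape
        chars = self.char_emb(char_ids.reshape(b * t, c)).transpose(1, 2)  # (b*t, char_dim, c)
        char_feat = self.char_cnn(chars).max(dim=-1).values.reshape(b, t, -1)
        # Final per-token representation: [token embedding ; character feature]
        return torch.cat([self.word_emb(word_ids), char_feat], dim=-1)
```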
- Recurrent language model / encoding of context: multiple layers of biRNNs
The LM computes the probability of a token sequence. A token representation (from a CNN over characters, or from token embeddings) is passed through multiple layers of LSTMs to embed the history (t_1, ..., t_k) into a fixed-dimensional vector, the forward LM embedding, which is used to predict the probability of the next token t_{k+1}. A backward LM embedding analogously predicts the previous token given the future context.
Both embeddings are pre-trained separately and then shallowly concatenated to form the bidirectional LM embeddings.
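A minimal sketch of forming the bidirectional LM embedding, assuming two separately pre-trained, frozen LMs that return per-position top-layer states (`forward_lm` and `backward_lm` are hypothetical callables):

```python
import torch

def bidirectional_lm_embeddings(forward_lm, backward_lm, token_ids):
    """Shallowly concatenate the states of a forward and a backward LM.

    The forward LM's state at position k summarizes tokens t_1..t_k (it is
    trained to predict t_{k+1}); the backward LM's state at position k
    summarizes t_k..t_N (trained to predict t_{k-1}). Both were pre-trained
    separately and are kept frozen here.
    """
    with torch.no_grad():                        # the LM parameters are not updated
        h_fwd = forward_lm(token_ids)            # (batch, seq_len, lm_dim)
        h_bwd = backward_lm(token_ids)           # (batch, seq_len, lm_dim)
    return torch.cat([h_fwd, h_bwd], dim=-1)     # (batch, seq_len, 2 * lm_dim)
```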
TagLM uses the LM embeddings as additional inputs to the sequence tagging model.
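A PyTorch-flavored sketch of this two-input tagger: word-level representations and frozen LM embeddings are concatenated and fed to a biRNN encoder. This is a simplification, not the paper's exact model, which also adds a CRF output layer:

```python
import torch
import torch.nn as nn

class TagLMSketch(nn.Module):
    """Tagger that consumes word-level representations plus frozen LM embeddings."""

    def __init__(self, word_dim, lm_dim, hidden_dim, num_tags):
        super().__init__()
        # biRNN encoder over the concatenated [word representation ; LM embedding] input.
        self.encoder = nn.LSTM(word_dim + lm_dim, hidden_dim,
                               num_layers=2, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_reprs, lm_embs):
        # word_reprs: (batch, seq_len, word_dim), trained jointly with the tagger
        # lm_embs:    (batch, seq_len, lm_dim), pre-computed by the biLM and kept fixed
        x = torch.cat([word_reprs, lm_embs], dim=-1)
        h, _ = self.encoder(x)
        return self.proj(h)  # per-token tag scores (the paper adds a CRF layer on top)
```

Only the tagger's parameters are trained on the labeled data; the LM embeddings stay fixed, which is why no additional labeled data is needed.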
- The LM embeddings yield an average absolute improvement of 1.06 F1 (CoNLL 2003 NER) and 1.37 F1 (CoNLL 2000 chunking)
- The improvements obtained by adding LM embeddings are larger than the improvements previously obtained by adding other forms of transfer or joint learning
- Adding backward LM embeddings consistently outperforms forward-only LM embeddings
- LM size matters: replacing the LM with a larger one improves F1 by about as much as adding the backward LM
- LM embeddings can improve the performance of a sequence tagger even when the task data comes from a different domain than the LM's training data