Skip to content

Latest commit

 

History

History
79 lines (43 loc) · 4.64 KB

1807.2020.md

File metadata and controls

79 lines (43 loc) · 4.64 KB

A Named Entity Recognition Shootout for German, Riedl and Padó, 2018

Paper, Tags: #nlp, #architectures, #ner, #datasets

We ask how to practically build a model for German NER that performs at the state of the art for both contemporary and historical texts (big data vs. small data). The two best-performing model families are pitted against each other (linear-chain CRFs vs. BiLSTM) to observe tradeoffs.

BiLSTM outperforms CRF with large datasets, and performs worse with small datasets. BiLSTM profits substantially from transfer learning.

CRFs

These models profit from hand-crafted features and can easily incorporate language- and domain-specific knowledge from dictionaries or gazetteers.

They require only moderate amounts of training data.

They form the basis of two widely user name entity recognizers:

  • StanfordNER
  • GermaNER

BiLSTMs

These models identify informative features directly from the data, presented as word and/or character embedding.

They require considerably annotated data to learn proper representations.

Our implementation only uses word and character embeddings. We train character embeddings while training but use pre-trained word embeddings. To alleviate issues with oov words, we use character- and subword-based word embeddings computed with fastText.

Our contribution

This paper investigates the question of whether it is generally a better idea to choose different mode families for different settings, or whether one model family can be optimized to perform well across settings, on a set of German corpora including two large, contemporary corpora and two small historical corpora. We compare both sets of models, performing 3 experiments. Our results:

  • BiLSTM performs best on contemporary corpora, within and across domains
  • BiLSTM performs worse than CRF for the smallest historical corpus due to lack of data
  • By applying transfer learning, RNN outperforms CRFs for all corpora.

Datasets

Contemporary German

We use the CoNLL 2003 shared task dataset, which consists of 220k tokens of annotated newspaper documents. We also use GermEval 2014 shared task dataset, 450k tokens of Wikipedia articles.

Historical German

We extract historical texts from the Europeana collection of historical newspapers.

  • Our first corpus are newspapers from LFT, with 87k tokens from 1926.
  • Our second corpus is a collection of Austrian newspaper texts from ONB, with 35k tokens between 1710 and 1873.

These corpora are considerably smaller than the above ones, contain a different language variety (19th century Austrian German), and include a high rate of OCR errors.

Experiments

Experiment 1. Contemporary German

  • BiLSTM
    • Pre-trained embeddings computed on Wikipedia
    • Computed embeddings from 1.5 billion tokens of historic German from Europeana

On the CoNLL dataset, GermaNER outperforms the currently best-performing RNN-based system. BiLSTM models can outperform CRF models when there's sufficient training data to profit from distributed representations.

Experiment 2. Cross-Corpus performance

In BiLSTMs, learnt models can be more text type soecific, due to the high capacity of the models. This experiment shows how well the models do when trained on one corpus and tested on another, including historical corpora.

GermaNER outperforms StanfordNER, and the CRFs significantly outperform the BiLSTM models on ONB and performs comparable on the larger LFT dataset. The type of embeddings used by BiLSTM plays a minor role for the historical corpora.

BiLSTM models run into trouble when faced with very small training datasets, while CRF-based methods are more robust.

Experiment 3. Transfer Learning

Start training on one corpus and at some point switch to another corpus. We train on large contemporary 'source' corpora until convergence and then train additional 15 epochs on the target corpus from the omain on which we evaluate.

There is a significan improvement for the CoNLL dataset but performance drops for GermEval. Combining contemporary sources with historic target corpora yields to consistent benefits. The GermEval corpus is more appropriate as a source corpus.

Transfer learning is beneficial for BiLSTMs, especially when training data for the target domain is scarce. Such improvements were not visible on CRFs.

Conclusion

Combining BiLSTMs with a CRF as top layer outperforms CRFs with hand-coded features consistently when enough data is available. Even though RNNs struggle with small datasets, transfer learning is a simple and effective remedy to achieve state-of-the-art performance even for such datasets. Modern RNNs yield the best performance.