Data Sources

TAJSchaaf edited this page Aug 14, 2025 · 5 revisions

This project uses two distinct datasets to test the accuracy of each NLP model. Each dataset provides a gold standard (GS) for lemmatisation and part-of-speech (POS) tagging.

Early medieval prose from the Late Latin Charter Treebank

Data: ~25,000 tokens from the Late Latin Charter Treebank dev file

Early medieval glosses from GAMS Gloss-Vibe

Data: 202 glosses (666 tokens) from the Venerable Bede’s De Temporum Ratione
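
Testing a model against a gold standard comes down to comparing its output with the GS annotation token by token. The following is a minimal sketch of that comparison, not the project's actual evaluation code: the `accuracy` function and the three-token example are hypothetical, and it assumes the model output and the gold standard are already aligned one token to one token.

```python
def accuracy(predicted, gold):
    """Return the share of tokens whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token for token")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical (lemma, POS) pairs for a three-token phrase; not project data.
gold = [("in", "ADP"), ("nomen", "NOUN"), ("dominus", "NOUN")]
pred = [("in", "ADP"), ("nomen", "NOUN"), ("domino", "NOUN")]

# Score lemmatisation and POS tagging separately, as the two tasks can differ.
lemma_acc = accuracy([lemma for lemma, _ in pred], [lemma for lemma, _ in gold])
pos_acc = accuracy([pos for _, pos in pred], [pos for _, pos in gold])
```

Scoring the two tasks separately keeps a lemmatisation error from hiding behind a correct POS tag, and vice versa.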