Interactive visualizations accompanying the dissertation are available at: lindabeutel.github.io/Naming-analysis
This repository provides a Python-based research infrastructure for the semi-automatic collection, normalization, categorization, and analysis of naming variants in Middle High German texts.
The tool supports the systematic investigation of proper names, antonomasia, epithets, and figure-specific collocations based on TEI-encoded texts. Automatic detection is rule-based and consistently complemented by manual validation.
The program serves as a methodological research tool for structured data collection and analysis within literary studies.
Status: BETA release
Developed within the dissertation project:
"Naming poetics of Middle High German epic. An analysis of the use of
names, antonomasia, and epithets in selected texts of the
German-speaking Middle Ages."
(Benennungspoetiken mittelhochdeutscher Epik -- Eine Analyse der
Verwendung von Namen, Antonomasien und Epitheta in ausgewählten Texten
des deutschsprachigen Mittelalters, working title)
PhD project since 2022
University of Salzburg
Department of German Studies
The dissertation itself is primarily literary-hermeneutic; this tool functions as a structured research infrastructure supporting data collection and analysis.
Parts of the dataset were supplemented by existing annotations, particularly work conducted by Nora Ketschik (MHG4SNA), created within the CRETA/CretAnno environment.
The present infrastructure constitutes an independent tool that extends, systematizes, and further processes previously automated annotations for the specific research focus on naming practices.
The CretAnno project is currently no longer active.
A naming variant refers to any lexical form used to refer to a character within the narrative context. The term includes both proper names and antonomasia.
In the German data model of this project, the term "Bezeichnung" is used synonymously with naming variant.
Antonomasia is understood as the replacement of a proper name by a characterizing descriptive expression (cf. Schirren 2009; Drews 2013).
Epithets are understood as attributive additions to a designation (cf. Gondos 2013).
The system operates across different language layers:
- CLI interface: English
- Data structure (Excel, JSON): German
- Analytical outputs (CSV, visualizations): English
This distinction emerged historically during development and has been retained for consistency.
The repository contains:
- curated JSON files
- a single Excel template (template_excel.xlsx)
- normalization resources
- a naming dictionary
Not included:
- TEI files from the Middle High German Conceptual Database (MHDBDB)
TEI files should preferably be obtained from the MHDBDB, as the collection logic is tailored to its TEI structure.
Depending on the specific encoding, adjustments to verse counting
may be necessary. For example, in works such as Parzival, verse
numbering restarts within each book (e.g., every 29 verses the n attribute
of the elements matched by .//tei:l[@n] begins again at 1). In such cases, a more robust
identification via xml:id should be used to ensure consistent verse
tracking across structural divisions.
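A minimal sketch of the xml:id-based approach, using only the standard library. The TEI fragment, the xml:id values, and the function name iter_verses are hypothetical; real MHDBDB files may structure their identifiers differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature TEI fragment: @n restarts in the second division.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div n="1"><l n="1" xml:id="v_1_1">a</l><l n="2" xml:id="v_1_2">b</l></div>
<div n="2"><l n="1" xml:id="v_2_1">c</l></div>
</body></text></TEI>"""


def iter_verses(root):
    """Yield (running_index, xml_id, n, text) for every tei:l in document order."""
    for index, line in enumerate(root.iterfind(".//tei:l", TEI_NS), start=1):
        text = "".join(line.itertext()).strip()
        yield index, line.get(XML_ID), line.get("n"), text


verses = list(iter_verses(ET.fromstring(SAMPLE)))
for verse in verses:
    print(verse)
```

In this fragment the n values repeat across divisions, while the xml:id values and the running index stay unique, so either can serve as a stable verse key.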
The datasets contained in the data/ directory are sufficient to reproduce all analytical and visualization outputs. No TEI source files are required for analysis execution. The dataset continues to change as the research progresses.
run.py
├── controller.py
├── Foundation Layer
│   ├── project_setup.py
│   ├── config.py
│   └── project_types.py
├── Infrastructure Layer
│   ├── tei_utils.py
│   ├── io_utils.py
│   └── shared.py
├── Collection Layer
│   ├── collection.py
│   ├── validation.py
│   └── savers.py
├── Analysis Layer
│   └── analysis.py
└── Export Layer
    └── exporter.py
- Initialize a work project for a specific text
- Select a TEI file
- Create or load an Excel working file
- Perform verse-wise collection (naming variants, collocations, categorization)
- Persist results in JSON files
- Conduct analytical processing
- Export results (CSV, HTML visualizations)
Recommended: Python 3.13 in a virtual environment.
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python run.py
The program was developed and tested exclusively under Windows.
Execution should be performed from the project root directory.
- Download a TEI file (preferably from the MHDBDB).
- Run run.py.
- Enter the name of the work.
- Optionally extend the naming dictionary.
- Create a new Excel file from the provided template.
- Select the TEI file.
- Start "Collect new data."
- Select collection modes.
For each work, a progress_<work>.json file is created.
This file records the last processed verse per module.
The progress file can be manually edited to reset verse counters.
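As an alternative to editing the file by hand, a counter could be reset with a few lines of Python. This is a sketch only: the internal layout of the progress file is not documented here, so a flat mapping from module name to last processed verse is assumed, and the module names and helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_progress(path: Path) -> dict:
    """Return the stored progress, or an empty mapping if no file exists yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_progress(path: Path, progress: dict) -> None:
    """Write the progress mapping back to disk as readable JSON."""
    text = json.dumps(progress, ensure_ascii=False, indent=2)
    path.write_text(text, encoding="utf-8")


def reset_module(path: Path, module: str, verse: int = 0) -> dict:
    """Set the counter of one module back to `verse` and persist the change."""
    progress = load_progress(path)
    progress[module] = verse
    save_progress(path, progress)
    return progress


def demo() -> dict:
    """Run the helpers against a throwaway file (assumed layout, see above)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "progress_parzival.json"
        save_progress(path, {"naming_variants": 120, "collocations": 95})
        return reset_module(path, "collocations")


print(demo())
```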
During each iteration, existing entries in result JSON files
(missing_naming_variants.json, categorization.json,
collocations.json) are checked to prevent duplicates.
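The duplicate check can be pictured roughly as follows. The sketch assumes that each result file holds a list of records and that a record is identified by its verse and naming variant; the field names "vers" and "bezeichnung" and both helper functions are hypothetical.

```python
def entry_key(entry, fields=("vers", "bezeichnung")):
    """Build a hashable identity for an entry from the chosen fields."""
    return tuple(entry.get(field) for field in fields)


def add_if_new(existing, candidate, fields=("vers", "bezeichnung")):
    """Append `candidate` unless an entry with the same key is already present."""
    seen = {entry_key(entry, fields) for entry in existing}
    if entry_key(candidate, fields) in seen:
        return False
    existing.append(candidate)
    return True


# Hypothetical records, as they might appear in a result file.
entries = [{"vers": 12, "bezeichnung": "der rote ritter"}]
first = add_if_new(entries, {"vers": 12, "bezeichnung": "der rote ritter"})
second = add_if_new(entries, {"vers": 13, "bezeichnung": "der rote ritter"})
print(first, second, len(entries))
```

Rebuilding the set of keys on every call keeps the sketch short; for long collection runs the set would more sensibly be built once per loop.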
Important: Manual edits must be saved before restarting a collection loop; otherwise, they may be overwritten.
Automatic detection relies on:
- regex-based word-boundary matching
- normalization (lowercasing, graphemic harmonization)
- comparison against a naming dictionary
- string-based plausibility checks (substring and token-set comparisons)
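The components listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's implementation: the dictionary contents and the function names are hypothetical, and normalization is reduced to lowercasing.

```python
import re

# Hypothetical dictionary: normalized naming variant -> figure.
NAMING_DICTIONARY = {
    "parzival": "Parzival",
    "der rote ritter": "Parzival",
    "gawan": "Gawan",
}


def build_pattern(variants):
    """Compile one word-boundary pattern; longer variants are tried first."""
    ordered = sorted(variants, key=len, reverse=True)
    alternatives = "|".join(re.escape(variant) for variant in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def detect(verse, dictionary):
    """Return (matched variant, figure) pairs found in a lowercased verse."""
    pattern = build_pattern(dictionary)
    matches = pattern.finditer(verse.lower())
    return [(m.group(0), dictionary[m.group(0)]) for m in matches]


def is_plausible(candidate, known):
    """Substring or token-set overlap between a candidate and a known variant."""
    c, k = candidate.lower(), known.lower()
    if c in k or k in c:
        return True
    return bool(set(c.split()) & set(k.split()))


print(detect("Dô sprach Parzival zuo Gawan", NAMING_DICTIONARY))
print(is_plausible("ritter rot", "der rote ritter"))
```

The word boundaries keep a form such as "Parzivals" from matching "parzival". The token-set test is deliberately permissive (a shared article would already count as overlap), which is one reason manual confirmation remains necessary.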
Normalization is primarily informed by:
- Mittelhochdeutsches Handwörterbuch by Matthias Lexer
- frequent variants documented in the MHDBDB
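To make the idea of graphemic harmonization concrete, a normalization step might look like the sketch below. The rules shown (removing diacritics, mapping ſ, j and y) are hypothetical placeholders chosen for illustration; the actual rules of this project derive from the resources named above and are not reproduced here.

```python
import unicodedata

# Hypothetical harmonization rules, for illustration only.
GRAPHEME_MAP = str.maketrans({"ſ": "s", "j": "i", "y": "i"})


def normalize(form):
    """Lowercase a form, drop combining diacritics, then harmonize graphemes."""
    lowered = form.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Assumption: all diacritics are dropped, which also merges umlauts
    # with their base vowels (ü -> u).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(GRAPHEME_MAP)


for form in ("Gâwân", "Parzivâl", "Kyot"):
    print(form, "->", normalize(form))
```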
All automatically detected entries require manual confirmation.
The workflow is entirely rule-based and dictionary-driven.
No machine learning models are implemented or trained.
The project does not aim at computational innovation but at providing a transparent, research-oriented infrastructure for philological analysis.
The analysis module includes:
- Wordlists
- Keyword analysis (inspired by AntConc; Anthony 2024)
- Figure profiles
- Collocation analysis
- Plotly-based visualizations
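For orientation, keyword analysis in corpus tools is often based on a log-likelihood keyness score that compares word frequencies in a target text with those in a reference text. The sketch below shows the two-term form of that statistic; whether the analysis module uses this exact measure is an assumption, and the function names and token lists are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Keyness of a word with frequency a in a target corpus of size c
    and frequency b in a reference corpus of size d."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a:
        score += a * math.log(a / expected_target)
    if b:
        score += b * math.log(b / expected_reference)
    return 2 * score


def keywords(target_tokens, reference_tokens):
    """Rank words that are relatively more frequent in the target corpus."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(target.values()), sum(reference.values())
    scores = {
        word: log_likelihood(target[word], reference[word], c, d)
        for word in target
        # The score itself is direction-blind, so keep overused words only.
        if target[word] / c > reference[word] / d
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


result = keywords(
    ["helt", "helt", "helt", "ritter"],
    ["ritter", "ritter", "ritter", "helt"],
)
print(result)
```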
While this repository contains curated Middle High German data, the underlying workflow is in principle language-independent and adaptable to other TEI-based corpora.
Beutel-Thurow, L. (2026). Naming-analysis (Version v0.1.0-beta) [Computer software]. https://doi.org/10.5281/zenodo.18770138
Code, JSON files, the Excel template, and generated visualizations are licensed under:
CC BY-NC-SA 4.0
TEI files from the MHDBDB are not part of this repository and remain subject to their respective license (also CC BY-NC-SA 4.0).
The project aims to support:
- transparency of data processing
- reproducibility of analysis
- disclosure of normalization and matching logic
- versioning via Zenodo
Planned for the main release:
- implementation of a centralized validation layer
Anthony, Laurence. 2024. AntConc (Version 4.3.1).
Drews, Lydia. 2013. "Antonomasie."
Gondos, Lisa. 2013. "Epitheton."
Institut für maschinelle Sprachverarbeitung. 2022. "Tools und Demos |
CRETA."
Ketschik, Nora. 2023. MHG4SNA.
Lexer, Matthias. Mittelhochdeutsches Handwörterbuch.
Middle High German Conceptual Database (MHDBDB). University of
Salzburg.
Schirren, Thomas. 2009. "Tropes in Classical Rhetoric."
Schmid, Helmut. 2019. "Deep Learning-Based Morphological Taggers and
Lemmatizers."