This repository contains the source code of a coreference resolver trained on Hungarian data.
We use Poetry for dependency management and packaging. Install it, clone this repository, then run
```
cd bert_coref_hu
poetry install
```
We briefly describe below how we fine-tuned a pre-trained BERT model for coreference resolution.
To see the full argument lists of our scripts, run
```
python3 <script_name>.py -h
```
Our original data was contained in TSV-style documents with the following fields:
- form
- koref
- anas
- lemma
- xpostag
- upostag
- feats
- id
- deprel
- head
- cons
Each row corresponded to a single token, and sentences were separated by blank lines. The values of the `koref` field were gold-standard coreference labels. The rest of the linguistic annotation was generated with the e-magyar text processing system.
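For illustration, here is a minimal sketch of how such a file could be read into per-sentence token lists, assuming UTF-8 TSV input with one token per row and blank lines between sentences; the `read_sentences` helper is hypothetical, not part of this repository:

```python
from typing import Dict, Iterator, List

# Field order as listed above.
FIELDS = ["form", "koref", "anas", "lemma", "xpostag",
          "upostag", "feats", "id", "deprel", "head", "cons"]

def read_sentences(path: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {field: value} token dicts."""
    sentence: List[Dict[str, str]] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks a sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
            else:
                sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:  # flush a final sentence without a trailing blank line
        yield sentence
```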
The following scripts were applied to adjust the data to our task:
- `src/bert_coref/group_sentences.py`: splits the input into parts, each of which meets the maximum sequence length constraint of a specific model (in our case, huBERT). The resulting subsequences are separated by double blank lines.
- `src/bert_coref/relabel.py`: relabels the input based on POS-based filter rules.
- `src/bert_coref/conll2jsonl.py`: converts the TSV-style input to jsonlines (one line per document, documents separated by double blank lines in the input). jsonlines was the data format that we used to fine-tune huBERT. Specify the `--create-triplets` flag to relabel the data for fine-tuning with the triplet loss objective. A sketch of a possible output record is shown after this list.
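Each line of the resulting jsonlines file is a standalone JSON document. The record below is only a hypothetical illustration: apart from the `labels` key, which the clustering module described below expects, the field names and values are assumptions rather than the script's actual schema:

```python
import json

# Hypothetical record; only "labels" is known to be required downstream.
record = {
    "tokens": ["Anna", "felhívta", "Marit", "."],  # assumed field name
    "labels": [1, 0, 2, 0],  # coreference label per token
}
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```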
The training script is `src/bert_coref/triplet_train.py`. Note that it uses WandB for logging, which means that running it requires a WandB account.
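For orientation, WandB logging in a training loop typically looks like the sketch below; the project name, config values, and metric names are made up, and this is not the repository's actual training code:

```python
import wandb

# Requires a prior `wandb login`; project and metric names are hypothetical.
run = wandb.init(project="bert-coref-hu", config={"epochs": 2, "margin": 0.7})
for epoch in range(2):
    train_loss = 1.0 / (epoch + 1)  # placeholder for the real epoch loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})
run.finish()
```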
The triplet loss implementation can be found in `src/bert_coref/losses.py`.
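For intuition, the triplet objective pulls an anchor token embedding towards a positive (coreferent) token and pushes it away from a negative one by at least a margin. The following is a generic PyTorch sketch of this loss, not the repository's implementation, which may differ in details such as the distance function:

```python
import torch

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.7) -> torch.Tensor:
    """L = mean(max(d(a, p) - d(a, n) + margin, 0)) over the batch."""
    d_pos = torch.norm(anchor - positive, dim=-1)  # distance to coreferent token
    d_neg = torch.norm(anchor - negative, dim=-1)  # distance to other token
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```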
Fine-tuned contextualized embeddings can be used as the inputs of a clustering algorithm, so that clustering the token embeddings amounts to coreference resolution. This can be done by running `src/bert_coref/clustering/cluster.py`. This module requires jsonlines-style input (see above) and a fine-tuned encoder model. The key `labels` for gold-standard coreference labels must be provided in the input, but these labels will not be used; for inference, it is sufficient to specify arbitrary values. The output is a TSV-style file with a new `koref_cluster` field added to the annotation. This field contains the inferred coreference labels.
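The clustering step can be pictured as follows. This sketch uses scikit-learn's `AgglomerativeClustering` over random stand-in embeddings purely as an example; the algorithm and parameters actually used by `cluster.py` may differ (run it with `-h` for the real options):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in for one fine-tuned embedding per token of a document.
embeddings = np.random.rand(12, 768)

# Tokens whose embeddings lie close together end up in the same cluster;
# the cluster ids play the role of the inferred labels (koref_cluster).
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.7)
cluster_ids = clusterer.fit_predict(embeddings)
print(cluster_ids)  # one cluster id per token
```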
Clustering outputs can be evaluated against gold-standard labels with the `src/bert_coref/clustering/evaluate_clustering.py` script.
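Conceptually, such an evaluation compares two partitions of the same tokens while ignoring the actual label values. Below is a minimal sketch with scikit-learn's adjusted Rand index; the metrics reported by the script itself may be different:

```python
from sklearn.metrics import adjusted_rand_score

gold = [1, 1, 2, 0, 2]       # gold coreference labels (koref)
predicted = [0, 0, 1, 2, 1]  # inferred labels (koref_cluster)

# 1.0 means the two partitions agree perfectly; label names do not matter.
print(adjusted_rand_score(gold, predicted))  # -> 1.0 for this pair
```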
You can use the `src/bert_coref/clustering/plot_coref.py` script to visualize the clustering input.
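A common way to visualize such high-dimensional input is a 2D projection, as in the hypothetical t-SNE sketch below; `plot_coref.py` may use a different technique:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(50, 768)   # stand-in token embeddings
labels = np.random.randint(0, 5, 50)   # coreference labels used for coloring

points = TSNE(n_components=2, perplexity=10).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()
```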
Find our model here.

The fine-tuning hyperparameters were as follows:
- initial learning rate: 0.0000137873
- minimal learning rate: 0.000003778
- AdamW weight decay: 0.000085443
- batch size: 32
- epochs: 2
- margin: 0.7
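As an illustration, these values could be plugged into a setup like the one below. This is only a sketch: the stand-in model, the step count, and the linear decay from the initial to the minimal learning rate are assumptions, not necessarily what `triplet_train.py` does:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the fine-tuned encoder
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1.37873e-5, weight_decay=8.5443e-5)

steps = 1000  # hypothetical number of training steps
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0,
    end_factor=3.778e-6 / 1.37873e-5,  # decay towards the minimal LR
    total_iters=steps)
```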
If you use our tool, please cite our paper:
Vadász Noémi, Nyéki Bence (2023): Koreferenciafeloldás magyar szövegeken BERT-tel. In: XIX. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2023). Szeged: Szegedi Tudományegyetem. 119–131.
```
@inproceedings{bert_coref_hu,
    author = {Vadász, Noémi and Nyéki, Bence},
    title = {Koreferenciafeloldás magyar szövegeken {BERT}-tel},
    booktitle = {{XIX}. {M}agyar {S}zámítógépes {N}yelvészeti {K}onferencia ({MSZNY} 2023)},
    editor = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika},
    pages = {119--131},
    publisher = {Szegedi Tudományegyetem, TTIK, Informatikai Intézet},
    address = {Szeged},
    year = {2023}
}
```
GNU General Public License v3.0