This repository contains the source code of a coreference resolver trained on Hungarian data.
We use Poetry for dependency management and packaging. Install it, clone this repository, then run
```
cd bert_coref_hu
poetry install
```
We briefly describe below how we fine-tuned a pre-trained BERT model for coreference resolution.
To see the full argument lists of our scripts, run
```
python3 <script_name>.py -h
```
Our original data was contained in TSV-style documents with the following fields:
- form
- koref
- anas
- lemma
- xpostag
- upostag
- feats
- id
- deprel
- head
- cons
Each row corresponded to a single token, and sentences were separated by blank lines. The values of the `koref` field were gold-standard coreference labels. The rest of the linguistic annotation was generated with the e-magyar text processing system.
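For illustration, here is a minimal sketch of how such a file could be read into per-sentence token lists, assuming UTF-8 TSV input with one token per row and blank lines between sentences; the `read_sentences` helper is hypothetical, not part of this repository:

```python
from typing import Dict, Iterator, List

# Field order as listed above.
FIELDS = ["form", "koref", "anas", "lemma", "xpostag",
          "upostag", "feats", "id", "deprel", "head", "cons"]

def read_sentences(path: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {field: value} token dicts."""
    sentence: List[Dict[str, str]] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks a sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
            else:
                sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:  # flush a final sentence without a trailing blank line
        yield sentence
```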
The following scripts were applied to adjust the data to our task:
- `src/bert_coref/group_sentences.py`: splits the input into parts, each of which meets the maximum sequence length constraint of a specific model (in our case, huBERT). The resulting subsequences are separated by double blank lines.
- `src/bert_coref/relabel.py`: relabels the input based on POS-based filter rules.
- `src/bert_coref/conll2jsonl.py`: converts the TSV-style input to jsonlines (one line per document, documents separated by double blank lines in the input). jsonlines was the data format that we used to fine-tune huBERT. Specify the `--create-triplets` flag to relabel the data for fine-tuning with the triplet loss objective. A sketch of a possible output record is shown after this list.
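Each line of the resulting jsonlines file is a standalone JSON document. The record below is only a hypothetical illustration: apart from the `labels` key, which the clustering module described below expects, the field names and values are assumptions rather than the script's actual schema:

```python
import json

# Hypothetical record; only "labels" is known to be required downstream.
record = {
    "tokens": ["Anna", "felhívta", "Marit", "."],  # assumed field name
    "labels": [1, 0, 2, 0],  # coreference label per token
}
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```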
The training script is `src/bert_coref/triplet_train.py`. Note that it uses WandB for logging, which means that running it requires a WandB account.
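For orientation, WandB logging in a training loop typically looks like the sketch below; the project name, config values, and metric names are made up, and this is not the repository's actual training code:

```python
import wandb

# Requires a prior `wandb login`; project and metric names are hypothetical.
run = wandb.init(project="bert-coref-hu", config={"epochs": 2, "margin": 0.7})
for epoch in range(2):
    train_loss = 1.0 / (epoch + 1)  # placeholder for the real epoch loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})
run.finish()
```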
The triplet loss implementation can be found in `src/bert_coref/losses.py`.
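For intuition, the triplet objective pulls an anchor token embedding towards a positive (coreferent) token and pushes it away from a negative one by at least a margin. The following is a generic PyTorch sketch of this loss, not the repository's implementation, which may differ in details such as the distance function:

```python
import torch

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negative: torch.Tensor, margin: float = 0.7) -> torch.Tensor:
    """L = mean(max(d(a, p) - d(a, n) + margin, 0)) over the batch."""
    d_pos = torch.norm(anchor - positive, dim=-1)  # distance to coreferent token
    d_neg = torch.norm(anchor - negative, dim=-1)  # distance to other token
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```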
Fine-tuned contextualized embeddings can be used as the inputs of a clustering algorithm, so that clustering the token embeddings amounts to coreference resolution. This can be done by running `src/bert_coref/clustering/cluster.py`. This module requires jsonlines-style input (see above) and a fine-tuned encoder model. The key `labels` for gold-standard coreference labels must be provided in the input, but these labels will not be used; for inference, it is sufficient to specify arbitrary values. The output is a TSV-style file with a new `koref_cluster` field added to the annotation. This field contains the inferred coreference labels.
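The clustering step can be pictured as follows. This sketch uses scikit-learn's `AgglomerativeClustering` over random stand-in embeddings purely as an example; the algorithm and parameters actually used by `cluster.py` may differ (run it with `-h` for the real options):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in for one fine-tuned embedding per token of a document.
embeddings = np.random.rand(12, 768)

# Tokens whose embeddings lie close together end up in the same cluster;
# the cluster ids play the role of the inferred labels (koref_cluster).
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.7)
cluster_ids = clusterer.fit_predict(embeddings)
print(cluster_ids)  # one cluster id per token
```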
Clustering outputs can be evaluated against gold-standard labels with the `src/bert_coref/clustering/evaluate_clustering.py` script.
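Conceptually, such an evaluation compares two partitions of the same tokens while ignoring the actual label values. Below is a minimal sketch with scikit-learn's adjusted Rand index; the metrics reported by the script itself may be different:

```python
from sklearn.metrics import adjusted_rand_score

gold = [1, 1, 2, 0, 2]       # gold coreference labels (koref)
predicted = [0, 0, 1, 2, 1]  # inferred labels (koref_cluster)

# 1.0 means the two partitions agree perfectly; label names do not matter.
print(adjusted_rand_score(gold, predicted))  # -> 1.0 for this pair
```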
You can use the `src/bert_coref/clustering/plot_coref.py` script to visualize the clustering input.
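A common way to visualize such high-dimensional input is a 2D projection, as in the hypothetical t-SNE sketch below; `plot_coref.py` may use a different technique:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(50, 768)   # stand-in token embeddings
labels = np.random.randint(0, 5, 50)   # coreference labels used for coloring

points = TSNE(n_components=2, perplexity=10).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.show()
```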
Find our model here.

The fine-tuning hyperparameters were as follows:
- initial learning rate: 0.0000137873
- minimal learning rate: 0.000003778
- AdamW weight decay: 0.000085443
- batch size: 32
- epochs: 2
- margin: 0.7
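As an illustration, these values could be plugged into a setup like the one below. This is only a sketch: the stand-in model, the step count, and the linear decay from the initial to the minimal learning rate are assumptions, not necessarily what `triplet_train.py` does:

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the fine-tuned encoder
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1.37873e-5, weight_decay=8.5443e-5)

steps = 1000  # hypothetical number of training steps
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0,
    end_factor=3.778e-6 / 1.37873e-5,  # decay towards the minimal LR
    total_iters=steps)
```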
If you use our tool, please cite our paper:
Vadász Noémi, Nyéki Bence (2023): Koreferenciafeloldás magyar szövegeken BERT-tel. In: XIX. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2023). Szeged: Szegedi Tudományegyetem. 119–131.
```
@inproceedings{bert_coref_hu,
    author = {Vadász, Noémi and Nyéki, Bence},
    title = {Koreferenciafeloldás magyar szövegeken {BERT}-tel},
    booktitle = {{XIX}. {M}agyar {S}zámítógépes {N}yelvészeti {K}onferencia ({MSZNY} 2023)},
    editor = {Berend, Gábor and Gosztolya, Gábor and Vincze, Veronika},
    pages = {119--131},
    publisher = {Szegedi Tudományegyetem, TTIK, Informatikai Intézet},
    address = {Szeged},
    year = {2023}
}
```
GNU General Public License v3.0