Code for the paper XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation. Please visit https://sapienzanlp.github.io/xl-wsd/ for more info and to download the data. Pretrained models are available at the bottom of this page.
First, set up the Python environment. Make sure that Anaconda is already installed.
```bash
git clone https://github.com/SapienzaNLP/xl-wsd-code.git
conda create --name xl-wsd-code python=3.7
conda activate xl-wsd-code
cd xl-wsd-code
pip install -r requirements.txt
conda install pytorch==1.5.0 torchtext==0.6.0 cudatoolkit=10.1 -c pytorch
```
Then, download and install WordNet:
```bash
cd /tmp
wget http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz
tar xvzf WordNet-3.0.tar.gz
sudo mv WordNet-3.0 /opt/
rm WordNet-3.0.tar.gz
```
In case you do not want to move WordNet to the /opt/ directory, set the WORDNET_PATH variable in src/datasets/__init__.py to the full path of your WordNet-3.0 directory.
Then, download the XL-WSD data:
```bash
wget --header="Host: doc-04-b8-docs.googleusercontent.com" --header="User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Safari/537.36" --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-GB,en-US;q=0.9,en;q=0.8" --header="Referer: https://drive.google.com/" --header="Cookie: AUTH_bjalsfn9vp89mfmro6spe8un3che13a6_nonce=nlgjg7mf6jn9e" --header="Connection: keep-alive" "https://doc-04-b8-docs.googleusercontent.com/docs/securesc/qpect75hpbjc0ojmotm96i6g1v6ev8i1/ssjeem6krjiq45h1lb2t3k0t4uh11fea/1617902700000/13518213284567006193/13518213284567006193/19YTL-Uq95hjiFZfgwEpXRgcYGCR_PQY0?e=download&authuser=1&nonce=nlgjg7mf6jn9e&user=13518213284567006193&hash=j6hh86p5arl35lijpmf4oak2hnhvrr20" -c -O 'xl-wsd-data.tar.gz'
tar xvzf xl-wsd-data.tar.gz
```
If wget does not work, download the data from https://drive.google.com/file/d/19YTL-Uq95hjiFZfgwEpXRgcYGCR_PQY0/view?usp=sharing and extract it manually.
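Alternatively, if you prefer a command-line download without the session-specific headers above, the gdown utility (not part of this repository; installable with pip) can usually fetch the same file by its Google Drive id:
```bash
pip install gdown
gdown "https://drive.google.com/uc?id=19YTL-Uq95hjiFZfgwEpXRgcYGCR_PQY0" -O xl-wsd-data.tar.gz
tar xvzf xl-wsd-data.tar.gz
```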
Open config/config_en_semcor_wngt.train.yaml with your favourite text editor and edit the paths for inventory_dir, test_data_root and train_data_root.
train_data_root is a dictionary from a language code, e.g., en, to a list of paths.
Each path can be a dataset in the standard format of this framework.
test_names is also a dictionary from language code to a list of names of test-set folders that can be found in test_data_root.
For example, considering the following directories that are in the evaluation_datasets dir,
```bash
ls xl-wsd-dataset/evaluation_datasets
test-en
test-en-coarse
dev-en
...
test-zh
dev-zh
```
one can set the test_names variable in the config as:
```yaml
en:
  - test-en
  - test-en-coarse
it:
  - test-it
zh:
  - test-zh
```
Evaluation on each test set is performed at the end of each training epoch.
The dev_name variable in the config, instead, is a pair (language, devset name).
It can be set as follows:
```yaml
dev_name:
  - en
  - dev-en
```
outpath is the path to a directory where a new folder for the newly trained model can be created and where the checkpoints and information about the model will be stored.
encoder_name may be any transformer model supported by AllenNLP 1.0.
model_name may be any name you would like to give to the model.
All the other options can also be changed and are pretty self-explanatory.
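Putting these fields together, a training configuration might look like the following sketch. This is only an illustration: the paths, the directory names under train_data_root and the model_name are placeholders to adapt to your setup, not the values shipped with the data.
```yaml
# Illustrative sketch of config/config_en_semcor_wngt.train.yaml; all paths are placeholders.
inventory_dir: /path/to/xl-wsd-dataset/inventories          # sense inventory directory
test_data_root: /path/to/xl-wsd-dataset/evaluation_datasets
train_data_root:
  en:
    - /path/to/xl-wsd-dataset/training_datasets/semcor      # placeholder training corpora
    - /path/to/xl-wsd-dataset/training_datasets/wngt
test_names:
  en:
    - test-en
    - test-en-coarse
dev_name:
  - en
  - dev-en
outpath: /path/to/experiments
encoder_name: xlm-roberta-large    # any transformer supported by AllenNLP 1.0
model_name: xlmr-large-semcor-wngt
```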
To start the training, run:
```bash
cd xl-wsd-code
PYTHONPATH=. python src/training/wsd_trainer.py --config config/config_en_semcor_wngt.train.yaml
```
The wsd_trainer.py script also takes the following parameters as input:
--dryrun | --no-wandb-log which disables the logging to wandb.
--no_checkpoint which disables the saving of checkpoints.
--reload_checkpoint which allows the program to reload weights that were previously saved in the same directory, i.e., outpath/checkpoints.
--cpu to run the script on CPU.
Some other parameters are also allowed and, if specified, override those in the config:
--weight_decay sets the weight decay.
--learning_rate sets the learning rate.
--gradient_clipping sets the gradient clipping threshold.
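For example, the following run (an illustrative combination of the flags above; the learning-rate value is arbitrary) trains without wandb logging, without saving checkpoints, and with an overridden learning rate:
```bash
PYTHONPATH=. python src/training/wsd_trainer.py \
    --config config/config_en_semcor_wngt.train.yaml \
    --no-wandb-log --no_checkpoint \
    --learning_rate 1e-5
```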
During training, a checkpoints folder is created within outpath and checkpoints are saved therein.
At the end of training, the best model found (the one with the lowest loss on the dev set) is reloaded
and tested on the test sets defined in test_names; in outpath/evaluation/ a file with the predictions
is created for each test set. The files contain a row for each test-set instance
with the id and the predicted synset separated by a space.
To evaluate a saved model, it is enough to run the following command:
```bash
PYTHONPATH=. python src/evaluation/evaluate_model.py --config config/config_en_semcor_wngt.test.yaml
```
where config_en_semcor_wngt.test.yaml is a configuration file similar to config_en_semcor_wngt.train.yaml with all the test sets one is interested in specified in the test_names field.
The best.th set of weights within outpath/checkpoints/ will be evaluated and the results for each
dataset printed to the console. The predictions will be saved in outpath/evaluation.
evaluate_model.py also takes the following parameters, which, if specified, override those in the config file:
--checkpoint_path containing the path to a specific checkpoint.
--output_path to specify a different path in which to store the predictions.
--pos a list containing all the POS tags on which one wants to perform a separate evaluation. POS tags allowed are {n,v,r,a} where n = noun, v = verb, r = adverb and a = adjective.
--verbose to make the script print more info.
--debug to print debug files.
--cpu to run the evaluation on CPU rather than on GPU.
For example,
```bash
PYTHONPATH=. python src/evaluation/evaluate_model.py --config config/config_en_semcor_wngt.test.yaml --pos n
```
would print the results on the test sets computed on nominal instances only, while
```bash
PYTHONPATH=. python src/evaluation/evaluate_model.py --config config/config_en_semcor_wngt.test.yaml --pos n v
```
would print results computed separately on nouns and verbs.
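A specific set of weights can also be evaluated directly via --checkpoint_path, optionally redirecting the predictions with --output_path (the paths below are placeholders for your own outpath):
```bash
PYTHONPATH=. python src/evaluation/evaluate_model.py \
    --config config/config_en_semcor_wngt.test.yaml \
    --checkpoint_path /path/to/outpath/checkpoints/best.th \
    --output_path /path/to/outpath/evaluation
```
The following pretrained models are available for download: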
| Encoder | Training Data | Link |
|---|---|---|
| XLM-Roberta Large | SemCor+WNGT | link |
| XLM-Roberta Base | SemCor+WNGT | link |
| Multilingual BERT | SemCor+WNGT | link |
| ALL | SemCor+WNGT | link |
This project is released under the CC-BY-NC 4.0 license (see LICENSE). If you use the code in this repo, please link it.
The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union's Horizon 2020 research and innovation programme.
The authors gratefully acknowledge the support of the ERC Consolidator Grant FoTran No. 771113 under the European Union's Horizon 2020 research and innovation programme.
The authors gratefully acknowledge the support of the ELEXIS project No. 731015 under the European Union's Horizon 2020 research and innovation programme.
The authors also thank the CSC - IT Center for Science (Finland) for the computational resources.