I am trying to train my own entity disambiguation model. I used this script for preprocessing:
#!/bin/bash
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
DATASET_PATH=$1
MODEL_PATH=$2
echo "Processing ${DATASET_PATH}"
cd ../fairseq
# BPE preprocessing.
for SPLIT in train dev; do
for LANG in "source" "target"; do
  python -m examples.roberta.multiprocessing_bpe_encoder \
    --encoder-json "$MODEL_PATH/encoder.json" \
    --vocab-bpe "$MODEL_PATH/vocab.bpe" \
    --inputs "$DATASET_PATH/$SPLIT.$LANG" \
    --outputs "$DATASET_PATH/$SPLIT.bpe.$LANG" \
    --workers 60 \
    --keep-empty;
done
done
cd ..
# Binarize the dataset.
fairseq-preprocess --joined-dictionary \
--source-lang "source" --target-lang "target" \
--trainpref "$DATASET_PATH/train.bpe" \
--validpref "$DATASET_PATH/dev.bpe" \
--destdir "$DATASET_PATH/bin" \
--workers 60
I created a dataset where train.source contains lines like "I am going to the [START_ENT] States [END_ENT]." and train.target contains the corresponding gold entity title on the matching line, e.g. "United States".
Since the encoder is not reported anywhere, I used the GPT-2 encoder.json and vocab.bpe, but I am open to suggestions here; I might have gone wrong there.
I used the --joined-dictionary option because otherwise training failed with an error stating that source and target must share a joined dictionary.
Without --source-lang and --target-lang I also get an error, which is why I added those.
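To make the data layout concrete, here is a minimal sketch of how the line-aligned source/target files are built (the directory name and sentences are toy examples I made up, not from the GENRE repo; real training data would come from an annotated corpus such as AIDA-CoNLL):

```python
from pathlib import Path

# Hypothetical toy examples: (annotated context, gold KB entity title).
examples = [
    ("I am going to the [START_ENT] States [END_ENT] .", "United States"),
    ("She was born in [START_ENT] Paris [END_ENT] .", "Paris"),
]

data_dir = Path("entity_disambiguation_data")
data_dir.mkdir(exist_ok=True)

with open(data_dir / "train.source", "w") as src, \
     open(data_dir / "train.target", "w") as tgt:
    for sentence, entity in examples:
        src.write(sentence + "\n")  # context with the mention marked
        tgt.write(entity + "\n")    # gold entity title for that mention

# The two files must stay line-aligned: line i of train.source
# corresponds to line i of train.target.
n_src = len((data_dir / "train.source").read_text().splitlines())
n_tgt = len((data_dir / "train.target").read_text().splitlines())
assert n_src == n_tgt
```

The alignment invariant at the end is what fairseq-preprocess relies on when it binarizes the two files as a parallel corpus.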
After that, I used this bash script for training:
#!/bin/bash
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
NAME=entity_disambiguation_v0  # experiment name (assumed; $NAME was otherwise unset below)
fairseq-train /home/ubuntu/kemal/entity_disambiguation_data/v0/bin/ \
--save-dir models/$NAME \
--tensorboard-logdir tensorboard_logs/$NAME \
--restore-file models/bart.large/model.pt \
--arch bart_large \
--task translation \
--criterion label_smoothed_cross_entropy \
--source-lang source \
--target-lang target \
--truncate-source \
--label-smoothing 0.1 \
--max-tokens 1024 \
--update-freq 1 \
--max-update 200000 \
--required-batch-size-multiple 1 \
--dropout 0.1 \
--attention-dropout 0.1 \
--relu-dropout 0.0 \
--weight-decay 0.01 \
--optimizer adam \
--adam-betas "(0.9, 0.999)" \
--adam-eps 1e-08 \
--clip-norm 0.1 \
--lr-scheduler polynomial_decay \
--lr 3e-05 \
--total-num-update 200000 \
--warmup-updates 500 \
--ddp-backend no_c10d \
--num-workers 20 \
--reset-meters \
--reset-optimizer \
--layernorm-embedding \
--share-decoder-input-output-embed \
--share-all-embeddings \
--skip-invalid-size-inputs-valid-test \
--log-format json \
--log-interval 10 \
--patience 200
Training then started. However, what I want to ask is:
Are these configurations correct for entity disambiguation, especially the preprocessing part?
When I checked the checkpoint.pt files, I can see there is a data.pkl inside, which I think I will use later for inference, as specified here: https://github.com/facebookresearch/GENRE/tree/main/examples_genre. However, I cannot find an actual model.pt in checkpoint.pt. When I open the checkpoint file, it is an archive containing data.pkl and a data folder with additional binary files, but no model.pt, so I cannot import the model and use it. How can I solve this problem?
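For reference, the archive layout I see when unpacking checkpoint.pt can be reproduced with a small stdlib-only sketch (the file name and contents below are fabricated for illustration; since PyTorch 1.6, torch.save writes exactly this kind of zip archive, with a data.pkl pickle plus a data/ folder of raw tensor storages, so the whole checkpoint.pt file is itself what torch.load expects):

```python
import pickle
import zipfile

# Build a tiny stand-in for checkpoint.pt mimicking PyTorch's
# zip-based serialization format: data.pkl plus a data/ folder.
fake_ckpt = "checkpoint_fake.pt"
with zipfile.ZipFile(fake_ckpt, "w") as zf:
    zf.writestr("archive/data.pkl", pickle.dumps({"model": "state_dict"}))
    zf.writestr("archive/data/0", b"\x00" * 8)  # stand-in tensor blob

def is_torch_zip_checkpoint(path):
    """True if `path` looks like a PyTorch zip-format checkpoint."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as zf:
        return any(name.endswith("data.pkl") for name in zf.namelist())

print(is_torch_zip_checkpoint(fake_ckpt))  # True
```

In other words, there is no nested model.pt to extract: the checkpoint file is loaded whole, e.g. via torch.load("checkpoint.pt") or by pointing fairseq/GENRE at the checkpoint path directly.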
The evaluation metric is perplexity, but I want to know the KB accuracy scores. How can I change it to that?
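What I ultimately want to measure is something like this exact-match accuracy over generated entity titles (a rough sketch with made-up predictions; I assume the predictions would come from decoding the dev set, e.g. with fairseq-generate):

```python
def kb_accuracy(predictions, golds):
    """Fraction of mentions whose predicted entity title exactly
    matches the gold title, after whitespace normalization."""
    assert len(predictions) == len(golds)
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

# Toy example: 2 of 3 predictions match the gold KB titles.
preds = ["United States", "Paris", "London"]
golds = ["United States", "Paris", "Londinium"]
print(kb_accuracy(preds, golds))  # ~0.67
```

This is a post-hoc script over decoded outputs rather than a fairseq training flag; perplexity would still be what fairseq reports during training.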