❗ The code and experiments in this repository were developed at the beginning of 2022 as part of a master's thesis and may therefore be outdated.
Extraction of contact entities from address blocks and imprints with Extractive Question Answering.
The goal is to convert an input like
Dr. Max Mustermann
Hauptstraße 123
97070 Würzburg
to the following output:
entities = {
    "city": "Würzburg",
    "email": "",
    "fax": "",
    "firstName": "Max",
    "lastName": "Mustermann",
    "mobile": "",
    "organization": "",
    "phone": "",
    "position": "",
    "street": "Hauptstraße 123",
    "title": "Dr.",
    "website": "",
    "zip": "97070"
}
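One way to picture the approach: each entity name is posed as a question against the address text, and the extracted answer spans are collected into the dictionary above. The following is only a minimal sketch. It assumes a hypothetical `answer_fn(question, context)` callable standing in for the trained model; `ENTITY_NAMES`, `extract_entities`, and the toy `lookup` function are illustrative names and not part of this repository.

```python
from typing import Callable, Dict

# Entity names taken from the example output above; each one doubles as a question.
ENTITY_NAMES = [
    "city", "email", "fax", "firstName", "lastName", "mobile", "organization",
    "phone", "position", "street", "title", "website", "zip",
]


def extract_entities(context: str, answer_fn: Callable[[str, str], str]) -> Dict[str, str]:
    """Ask one question per entity name and collect the answers.

    `answer_fn` is a hypothetical stand-in for the trained QA model: it takes
    (question, context) and returns the answer text, or "" if there is none.
    """
    entities = {}
    for name in ENTITY_NAMES:
        answer = answer_fn(name, context)
        # Extractive QA: keep only answers that are literal spans of the input.
        entities[name] = answer if answer and answer in context else ""
    return entities


# Toy stand-in so the sketch runs without a trained model (assumption, not real inference).
def lookup(question: str, context: str) -> str:
    known = {
        "title": "Dr.", "firstName": "Max", "lastName": "Mustermann",
        "street": "Hauptstraße 123", "zip": "97070", "city": "Würzburg",
    }
    return known.get(question, "")


address = "Dr. Max Mustermann\nHauptstraße 123\n97070 Würzburg"
entities = extract_entities(address, lookup)
```

In practice `lookup` would be replaced by a call into the fine-tuned model; entities without an answer stay as empty strings, matching the output shown above.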
- Create a dataset
- Train a model
- Use the model for predictions
For data protection reasons, no dataset is included in this repository. You need to create your own dataset in the SQuAD format, see https://huggingface.co/datasets/squad.
Create the dataset in the jsonl format, where each line holds one record. A record looks like this (spread over several lines here for readability):
{
  "id": "123",
  "title": "mustermanns address",
  "context": "Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.",
  "fixed": "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg",
  "question": "firstName",
  "answers": {
    "answer_start": [4],
    "text": ["Max"]
  }
}
Questions with no answers should look like this:
{
  "id": "123",
  "title": "mustermanns address",
  "context": "Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.",
  "fixed": "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg",
  "question": "phone",
  "answers": {
    "answer_start": [-1],
    "text": ["EMPTY"]
  }
}
Split the dataset into training, validation, and test sets and save them in a directory named crawl, email, or expected, like this:
├── data
│   ├── crawl
│   │   ├── crawl-test.jsonl
│   │   ├── crawl-train.jsonl
│   │   ├── crawl-val.jsonl
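Records in the format described above could be generated along the following lines. This is a sketch under assumptions, not the code used for the thesis: `make_record` is a hypothetical helper, the ids are illustrative, and `answer_start` is computed as a character offset into `context`, as in SQuAD. If the training script expects offsets into `fixed` instead, search in `fixed`.

```python
import json


def make_record(record_id, title, context, fixed, question, answer):
    """Build one record with the fields shown in the examples above.

    Assumption: `answer_start` is a character offset into `context` (SQuAD
    convention). Swap `context` for `fixed` below if offsets should refer to
    the extracted address block instead.
    """
    start = context.find(answer) if answer else -1
    if start == -1:
        # Unanswerable question: same placeholder values as in the example above.
        answers = {"answer_start": [-1], "text": ["EMPTY"]}
    else:
        answers = {"answer_start": [start], "text": [answer]}
    return {
        "id": record_id,
        "title": title,
        "context": context,
        "fixed": fixed,
        "question": question,
        "answers": answers,
    }


fixed = "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg"
context = "Meine Adresse ist folgende: \n\n" + fixed + " \n Schicken Sie mir bitte die Rechnung zu."

records = [
    make_record("123", "mustermanns address", context, fixed, "firstName", "Max"),
    make_record("124", "mustermanns address", context, fixed, "phone", ""),
]

# One JSON object per line; ensure_ascii=False keeps umlauts readable.
lines = [json.dumps(record, ensure_ascii=False) for record in records]
```

`json.dumps` escapes the line breaks inside `context`, so every record stays on a single line of the jsonl file.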
If you allow unanswerable questions, as in SQuAD v2.0, append -na to the directory name and to the file name prefix, like this:
├── data
│   ├── crawl-na
│   │   ├── crawl-na-test.jsonl
│   │   ├── crawl-na-train.jsonl
│   │   ├── crawl-na-val.jsonl
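The directory layout above, with or without the -na suffix, can be produced with a small helper such as the one below. It is a sketch only: `write_splits`, its parameters, and the 80/10/10 split are assumptions chosen for illustration, since the repository does not prescribe split sizes.

```python
import json
import random
from pathlib import Path


def write_splits(records, data_dir, name, no_answers=False,
                 seed=0, val_frac=0.1, test_frac=0.1):
    """Shuffle records and write <name>[-na]-{train,val,test}.jsonl files.

    All parameter names and default fractions are illustrative assumptions.
    """
    dataset = f"{name}-na" if no_answers else name
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    splits = {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }

    out_dir = Path(data_dir) / dataset
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for split, rows in splits.items():
        path = out_dir / f"{dataset}-{split}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in rows:
                # One JSON object per line, as the jsonl format requires.
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths[split] = path
    return paths
```

For example, `write_splits(records, "data", "crawl", no_answers=True)` would write `data/crawl-na/crawl-na-train.jsonl` and its val and test siblings. Note that this helper splits per record; since every address yields one record per question, it may be preferable to group records by context before splitting so that the same address does not appear in both the training and the test set.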
Example command for training and evaluating a dataset inside the crawl-na directory:
python app/qa-pipeline.py \
--batch_size 4 \
--checkpoint xlm-roberta-base \
--dataset_name crawl \
--dataset_path="../data/" \
--deactivate_map_caching \
--doc_stride 128 \
--epochs 3 \
--gpu_device 0 \
--learning_rate 0.00002 \
--max_answer_length 30 \
--max_length 384 \
--n_best_size 20 \
--n_jobs 8 \
--no_answers \
--overwrite_output_dir;