This project explores question generation in German. We train several Transformer-based models built on the T5 architecture and open source two of them on the Hugging Face model hub.
The task is to generate a question from a textual input in which the answer is highlighted with <hl> tokens and to which the prefix generate question: is prepended.
Example input:
generate question: Der Monk Sour Drink ist somit eine aromatische Überraschung, die sowohl <hl>im Sommer wie auch zu Silvester<hl> funktioniert.
Generated question:
Zu welchen Gelegenheiten passt der Monk Sour gut?
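The snippet below is a minimal inference sketch using the Hugging Face transformers library. The bare model name german-qg-t5-quad is an assumption; the actual hub identifier may carry a user or organization prefix.

```python
# Minimal inference sketch (assumes `transformers` is installed and that the
# model is available on the Hugging Face hub under this identifier).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "german-qg-t5-quad"  # adjust to the exact hub id (possibly with a namespace prefix)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Input follows the format described above: task prefix + context with the
# answer span wrapped in <hl> tokens.
text = (
    "generate question: Der Monk Sour Drink ist somit eine aromatische "
    "Überraschung, die sowohl <hl>im Sommer wie auch zu Silvester<hl> funktioniert."
)

inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# e.g. "Zu welchen Gelegenheiten passt der Monk Sour gut?"
```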
We open source two trained models: german-qg-t5-quad and german-qg-t5-drink600.
german-qg-t5-quad is based on valhalla/t5-base-qg-hl, which was trained on the SQuAD dataset to generate English questions. We further fine-tuned it on the GermanQuAD dataset, which contains 13,722 question/answer pairs.
It achieves a BLEU-4 score of 11.30 on the GermanQuAD test set (n=2204).
For comparison, we also fine-tuned the original t5-base model in the same way; it only achieved a BLEU-4 score of 10.12.
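To illustrate the input format used for fine-tuning, the sketch below converts GermanQuAD examples (SQuAD-style fields context, question, answers) into highlight-formatted input/target pairs. This is an assumed preprocessing step rather than our exact training script; deepset/germanquad refers to the dataset copy on the Hugging Face hub.

```python
# Assumed preprocessing sketch: turn SQuAD-style QA pairs into
# question-generation examples with <hl>-highlighted answers.
from datasets import load_dataset

def to_qg_example(example):
    context = example["context"]
    answer = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    end = start + len(answer)
    # Wrap the answer span in <hl> tokens and prepend the task prefix.
    highlighted = context[:start] + "<hl>" + answer + "<hl>" + context[end:]
    return {
        "input_text": "generate question: " + highlighted,
        "target_text": example["question"],
    }

train = load_dataset("deepset/germanquad", split="train").map(to_qg_example)
print(train[0]["input_text"][:120])
print(train[0]["target_text"])
```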
german-qg-t5-drink600 is based on german-qg-t5-quad, but further fine-tuned on a dataset of 603 German question/answer pairs that we annotated on drink recipes from Mixology ("drink600"). We have not yet open sourced this dataset, since we do not own the copyright on the source material.
It achieves a BLEU-4 score of 29.80 on the drink600 test set (n=120) and 11.30 on the GermanQuAD test set. Thus, fine-tuning on drink600 did not affect performance on GermanQuAD.
In comparison, german-qg-t5-quad achieves a BLEU-4 score of 10.76 on the drink600 test set.
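For reference, BLEU-4 (the default 4-gram BLEU) can be computed with sacrebleu as in the sketch below. This is only an illustration of the metric; the exact tokenization and evaluation script behind the reported scores may differ.

```python
# Illustrative BLEU-4 computation with sacrebleu (default n-gram order is 4).
import sacrebleu

predictions = ["Zu welchen Gelegenheiten passt der Monk Sour gut?"]
references = [["Zu welchen Gelegenheiten passt der Monk Sour gut?"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU-4: {bleu.score:.2f}")
```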
Both the idea and parts of the code are inspired by this repository by Suraj Patil and by work from Hugging Face. The GermanQuAD dataset, as well as the annotation tool we used, was created by deepset.