Note: The English translations are professional translations of the original source texts. The "back" translations translate those professional English translations back into the original source language.
The input parallel corpus (.jsonl) must have the following four fields: id, lang, text, en_tr.
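A minimal sketch of how the input requirement could be checked while loading, assuming one JSON object per line. The function name `read_parallel_corpus` and the choice to skip blank lines are illustrative assumptions, not part of the spec.

```python
import json

# The four fields every input row must carry.
REQUIRED_FIELDS = ("id", "lang", "text", "en_tr")


def read_parallel_corpus(lines):
    """Parse an iterable of JSONL lines into a list of row dicts.

    Raises ValueError if a row lacks any required field.
    """
    rows = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are tolerated and skipped
        row = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        rows.append(row)
    return rows
```

Because an open file object is itself an iterable of lines, it can be passed directly, e.g. `read_parallel_corpus(open(path, encoding="utf-8"))`.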
The output corpus (.jsonl) will have entries with the following fields:
tr_id: Unique identifier for the translation. Use a UUID.
pair_id: The parallel text pair's identifier (the id field in the input parallel corpus)
src_lang: Language of the source text
tr_lang: Language of the translation text
model: The model used for this translation
src_text: Text to be translated
ref_text: Reference translation text
tr_text: The machine-translated text
Each row of the input corpus (i.e., each text-translation pair) produces two entries in the output:
- Original Text --> English: The original text is the source text and the professional English translation is the reference text.
- English Translation --> Source language: The professional English translation is the source text and the original text is the reference text.
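The two directions above could be sketched as follows. This is only an illustration under stated assumptions: `make_entries` is a hypothetical helper, `translate` is a hypothetical callable standing in for whatever model call is used, and the language code `"en"` for English is assumed rather than specified.

```python
import uuid


def make_entries(row, model, translate):
    """Return the two output entries for one input row.

    `translate(text, src_lang, tr_lang)` is a hypothetical callable that
    returns the machine translation; it is not defined by the spec.
    """
    directions = [
        # (src_lang, tr_lang, src_text, ref_text)
        (row["lang"], "en", row["text"], row["en_tr"]),  # original -> English
        ("en", row["lang"], row["en_tr"], row["text"]),  # English -> original ("back")
    ]
    return [
        {
            "tr_id": str(uuid.uuid4()),  # fresh UUID per translation
            "pair_id": row["id"],
            "src_lang": src_lang,
            "tr_lang": tr_lang,
            "model": model,
            "src_text": src_text,
            "ref_text": ref_text,
            "tr_text": translate(src_text, src_lang, tr_lang),
        }
        for src_lang, tr_lang, src_text, ref_text in directions
    ]
```

Both entries share the same pair_id, so the forward and back translations of a row can be joined later, while each gets its own tr_id.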
Depends on #11