As a user, I want a script that takes a parallel text corpus and produces a text corpus of (1) English translations and (2) "back" translations using a specified HuggingFace model, so that I can build a corpus of machine translations for evaluation.

Note: The English translations are machine translations of the original source texts. The "back" translations are machine translations of the professional English translations back into the original source language.

The input parallel corpus (`.jsonl`) must have the following four fields: `id`, `lang`, `text`, `en_tr`.
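As a rough illustration of the input requirement, the sketch below reads JSONL lines and rejects any row that lacks one of the four fields. The names `REQUIRED_FIELDS` and `parse_parallel_rows` are hypothetical and not part of any existing script. Raising an error on a malformed row, rather than skipping it, is likewise an assumption.

```python
import json

# The four fields the issue requires in every input row.
REQUIRED_FIELDS = ("id", "lang", "text", "en_tr")


def parse_parallel_rows(lines):
    """Yield one dict per non-empty JSONL line, checking required fields.

    `lines` can be an open file handle or any iterable of strings.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            # Assumption: fail fast instead of silently dropping the row.
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield row
```

Passing an open file handle (for example `open(path, encoding="utf-8")`) should work because the function only iterates over lines.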

The output corpus (`.jsonl`) will have entries with the following fields:
- `tr_id`: Unique identifier for the translation. Use a UUID.
- `pair_id`: The parallel text pair's identifier (`id` field in parallel text corpus)
- `src_lang`: Language of the source text
- `tr_lang`: Language of the translation text
- `model`: The model used for this translation
- `src_text`: Text to be translated
- `ref_text`: Reference translation text
- `tr_text`: The machine translated text

For each row (i.e., text-translation pair) in the input corpus, two entries will be produced in the output:
1. Original Text --> English: The original text is the source text, and the professional English translation serves as the reference text.
2. English Translation --> Source language: The professional English translation is the source text, and the original text serves as the reference text.
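The mapping from one input row to two output entries could be sketched as follows. This is only an illustration under stated assumptions: `build_entries` is a hypothetical name, `translate` is a placeholder callable `(text, src_lang, tgt_lang) -> str` that would wrap the chosen HuggingFace model, and `"en"` is an assumed language code for English.

```python
import uuid


def build_entries(row, translate, model_name):
    """Return the two output entries (forward and back) for one input row.

    `translate` is a hypothetical callable: (text, src_lang, tgt_lang) -> str.
    """
    # 1. Original text --> English; the professional translation is the reference.
    forward = {
        "tr_id": str(uuid.uuid4()),
        "pair_id": row["id"],
        "src_lang": row["lang"],
        "tr_lang": "en",  # assumed code for English
        "model": model_name,
        "src_text": row["text"],
        "ref_text": row["en_tr"],
        "tr_text": translate(row["text"], row["lang"], "en"),
    }
    # 2. Professional English translation --> source language;
    #    the original text is the reference.
    back = {
        "tr_id": str(uuid.uuid4()),
        "pair_id": row["id"],
        "src_lang": "en",
        "tr_lang": row["lang"],
        "model": model_name,
        "src_text": row["en_tr"],
        "ref_text": row["text"],
        "tr_text": translate(row["en_tr"], "en", row["lang"]),
    }
    return [forward, back]
```

Keeping the model behind a plain callable means the entry-building logic can be checked with a stub before any model is downloaded.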

Depends on #11 

As a user, I want a script that takes a parallel text corpus and produces a text corpus of (1) English translations and (2)"back" translations using a specified HuggingFace model, so that I can build a corpus of machine translations for evaluation. #12