As a user, I want a script that takes a parallel text corpus and produces a text corpus of (1) English translations and (2) "back" translations using a specified HuggingFace model, so that I can build a corpus of machine translations for evaluation. #12

@laurejt

Description

Note: The English translations are machine translations of the original source texts. The "back" translations are machine translations of the professional English translations back into the original source language.

The input parallel corpus (.jsonl) must have the following four fields: id, lang, text, en_tr.
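
A minimal sketch of how the input requirement could be checked while reading the corpus. The function name `read_parallel_corpus` and the decision to raise on a missing field are assumptions for illustration, not part of this issue.

```python
import json

# The four fields the issue requires in every input row.
REQUIRED_FIELDS = ("id", "lang", "text", "en_tr")


def read_parallel_corpus(lines):
    """Yield one dict per non-empty JSONL line, checking required fields.

    `lines` is any iterable of strings, e.g. an open file handle.
    Hypothetical helper: raising ValueError on a bad row is an assumption.
    """
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in row]
        if missing:
            raise ValueError(f"line {line_no}: missing field(s) {missing}")
        yield row
```

Because the helper accepts any iterable of strings, it can be called as `read_parallel_corpus(open(path, encoding="utf-8"))` or with an in-memory list of lines.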

The output corpus (.jsonl) will have entries with the following fields:

  • tr_id: Unique identifier for the translation. Use UUID.
  • pair_id: The parallel text pair's identifier (id field in parallel text corpus)
  • src_lang: Language of the source text
  • tr_lang: Language of the translation text
  • model: The model used for this translation
  • src_text: Text to be translated
  • ref_text: Reference translation text
  • tr_text: The machine translated text

For each text-translation pair in the input corpus (i.e., row), two entries in the output will be produced.

  1. Original Text --> English: The professional English translation will serve as the reference text.
  2. English Translation --> Source language: In this case the source text will be the English translation and the original text will be the reference text.
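
The two-entries-per-row rule above could be sketched as follows. This is an illustration under stated assumptions: `make_entries` and the `translate(text, src_lang, tr_lang)` callable are hypothetical stand-ins for whatever wraps the HuggingFace model, `"en"` is an assumed language code for English, and UUID version 4 is an assumed choice (the issue only says "Use UUID").

```python
import uuid


def make_entries(row, model_name, translate):
    """Build the two output entries for one input row.

    `translate` is a hypothetical callable (text, src_lang, tr_lang) -> str;
    the real script would back it with the specified HuggingFace model.
    """
    # (src_lang, tr_lang, src_text, ref_text) for each direction.
    directions = [
        # 1. Original text -> English; professional translation is the reference.
        (row["lang"], "en", row["text"], row["en_tr"]),
        # 2. English translation -> source language; original text is the reference.
        ("en", row["lang"], row["en_tr"], row["text"]),
    ]
    entries = []
    for src_lang, tr_lang, src_text, ref_text in directions:
        entries.append({
            "tr_id": str(uuid.uuid4()),  # assumed: random (version 4) UUID
            "pair_id": row["id"],
            "src_lang": src_lang,
            "tr_lang": tr_lang,
            "model": model_name,
            "src_text": src_text,
            "ref_text": ref_text,
            "tr_text": translate(src_text, src_lang, tr_lang),
        })
    return entries
```

Passing the translator in as a callable keeps the pairing logic testable with a stub, without loading a model. Each returned dict can be written to the output file with `json.dumps(entry, ensure_ascii=False)`, one per line.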

Depends on #11
