diff --git a/docs/data-design.md b/docs/data-design.md
new file mode 100644
index 0000000..0d37fcb
--- /dev/null
+++ b/docs/data-design.md
@@ -0,0 +1,88 @@
+# Data Design Document
+
+*This is a living document that records the current design of the data
+produced by the workflows of this project.*
+
+## Parallel Text Corpus
+
+Parallel text corpora are the initial form of the data for Phase 1. A parallel text corpus is a JSONL
+file whose records pair non-English music-theoretical texts with their *professional* English translations.
+
+Phase 1 uses two corpora: a sentence-level corpus built from Notion annotation data and a paragraph-level corpus built from side-by-side translations of five Music Theory Online (MTO) articles.
+
+### Required Fields
+
+Every parallel text corpus has the following four fields, regardless of its contents:
+
+| Field Name | Type | Description |
+| --- | --- | --- |
+| id | int | Unique identifier for the parallel text pair |
+| lang | str | Language of the original text |
+| text | str | Original text |
+| en_tr | str | Professional English translation of the text |
+
+In Phase 1, the languages of the original text are limited to Chinese (zh), Japanese (ja), and Spanish (es).
+
+### Optional Fields
+
+Sentence- and paragraph-level corpora have additional optional fields that provide information specific to each corpus type.
+
+#### Parallel Sentence Corpus Fields
+
+A parallel sentence corpus contains the following additional information extracted from the Notion annotations:
+
+| Field Name | Type | Description |
+| --- | --- | --- |
+| cite | str | Citation for the original sentence |
+| tr_cite | str | Citation for the translated sentence |
+| term | str | The music-theoretical term associated with the sentence |
+
+
+
+
+#### Parallel Paragraph Corpus Fields
+
+A parallel paragraph corpus includes the following additional information about the associated Music Theory Online (MTO) articles:
+
+| Field Name | Type | Description |
+| --- | --- | --- |
+| doc_id | str | MTO article ID (e.g., 30.4.12) |
+| par_id | str | MTO paragraph ID (e.g., 1.20) |
+
+## Machine Translation Corpus
+
+A machine translation corpus is a JSONL file whose records correspond to individual translations by a specific translation model. For Phase 1, a machine translation corpus contains (1) English translations of the original texts of a parallel text corpus and (2) translations of the professional English translations back into the language of the original text (i.e., back translations).
+
+### Fields
+
+At this time there are no optional fields for a machine translation corpus.
+
+| Field Name | Type | Description |
+| --- | --- | --- |
+| tr_id | str | Unique identifier (UUID) for the machine translation |
+| pair_id | str | Parallel text record ID |
+| model | str | Model used for translation |
+| src_lang | str | Language of the source text |
+| tr_lang | str | Language of the translation |
+| src_text | str | Source text |
+| ref_text | str | Reference translation |
+| tr_text | str | Machine-translated text |
+
+The records for the English translation and back translation of a parallel text record therefore have the following values:
+
+| Field | English Translation | Back Translation |
+| --- | :-: | :-: |
+| src_lang | `lang` of parallel text record | en |
+| tr_lang | en | `lang` of parallel text record |
+| src_text | `text` of parallel text record | `en_tr` of parallel text record |
+| ref_text | `en_tr` of parallel text record | `text` of parallel text record |
+
+## Machine Translation Metric Scores
+
+To evaluate the machine translations generated during Phase 1, machine translation metrics are computed for each translation. These scores are saved as a CSV file with the following columns:
+
+| Field Name | Type | Description |
+| --- | --- | --- |
+| tr_id | str | Machine translation record ID |
+| chrf | str | chrF score for the translation |
+| comet | str | COMET score for the translation |
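
The field mapping in the Machine Translation Corpus section may be easier to review with a concrete sketch. The Python below shows one way a parallel text record could be expanded into its English-translation and back-translation records. It is only an illustration under stated assumptions: `make_mt_records`, the model name, the sample sentence pair, and the placeholder machine outputs are all hypothetical and not part of this change, and converting the integer `id` to a string `pair_id` is a guess based on the declared field types.

```python
import json
import uuid


def make_mt_records(pair, model, en_text, back_text):
    """Build the two machine translation records for one parallel text record.

    Hypothetical helper: `en_text` and `back_text` stand in for whatever the
    translation model returned for each direction.
    """
    # Assumption: pair_id is the parallel record's integer id stored as a string.
    common = {"pair_id": str(pair["id"]), "model": model}
    english = {
        "tr_id": str(uuid.uuid4()),
        **common,
        "src_lang": pair["lang"],   # original language -> English
        "tr_lang": "en",
        "src_text": pair["text"],
        "ref_text": pair["en_tr"],  # professional translation is the reference
        "tr_text": en_text,
    }
    back = {
        "tr_id": str(uuid.uuid4()),
        **common,
        "src_lang": "en",           # English -> original language
        "tr_lang": pair["lang"],
        "src_text": pair["en_tr"],
        "ref_text": pair["text"],   # original text is the reference
        "tr_text": back_text,
    }
    return english, back


# Invented sample record, for illustration only.
pair = {
    "id": 1,
    "lang": "es",
    "text": "La tónica es el primer grado de la escala.",
    "en_tr": "The tonic is the first degree of the scale.",
}

records = make_mt_records(
    pair,
    model="example-model",  # hypothetical model name
    en_text="<machine English translation>",
    back_text="<machine back translation>",
)

# One JSON object per line, as in a JSONL file.
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

Both records share the same `pair_id`, so a translation can be traced back to its parallel text record, while each gets its own `tr_id` for joining against the metric scores CSV.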