To more easily normalize Yorùbá Wikipedia articles, create a partially diacritized dataset that keeps only the diacritic marks below the letters (the under-dots).
The dataset can be used in the following ways:
1. Train a model on partially diacritized text, i.e. sentences with correct under-dots as input and the corresponding fully diacritized sentences as output. I believe this will give better accuracy than what we already have. If it gives very high accuracy, we can then consider step 2.
2. Train a model that maps non-diacritized text to partially diacritized text, then pass its output to the model from step 1, i.e. [non-diacritized text] ====> [partially diacritized text] ====> [fully diacritized text]
Motivation:
From my observation of how Yorùbá text is written, most people, especially young people, do not know the tonal marks (high, mid, and low) above the vowels. Many do, however, know how (and want to be able) to distinguish letters with and without the under-dot, e.g. E vs Ẹ, O vs Ọ, and S vs Ṣ, especially now that Google Gboard is available on Android phones.
Fully diacritized ==> used for [partial, fully] diacritized training pairs. This can be based on the current ADR dataset, enhanced with Kola and Timilehin's fully diacritized contributions. The rule for creating the partial set will include decomposing the fully diacritized text so that just the accents (not the under-dots) are removed.
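The decomposition rule could look roughly like the sketch below. It is only an illustration, not the project's actual preprocessing: the function names `to_partial` and `to_plain` are hypothetical, and it assumes the tone marks are the combining grave (U+0300), acute (U+0301), and macron (U+0304), while the under-marks are the combining dot below (U+0323) and vertical line below (U+0329).

```python
import unicodedata

# Assumption: tone marks are the combining grave, acute, and macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}
# Assumption: under-marks are the combining dot below and vertical line below.
UNDER_MARKS = {"\u0323", "\u0329"}


def _strip(text, marks):
    """Decompose to NFD, drop the given combining marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in marks)
    return unicodedata.normalize("NFC", kept)


def to_partial(text):
    """Remove tone marks but keep under-dots (fully -> partially diacritized)."""
    return _strip(text, TONE_MARKS)


def to_plain(text):
    """Remove tone marks and under-dots (fully -> non-diacritized)."""
    return _strip(text, TONE_MARKS | UNDER_MARKS)


if __name__ == "__main__":
    full = "ọ̀rọ̀ Yorùbá"
    # One fully diacritized sentence yields the data for both proposed stages:
    # (plain, partial) for stage 2 and (partial, full) for stage 1.
    print((to_plain(full), to_partial(full), full))
```

Because the under-dot has a lower canonical combining class than the tone marks, NFD always places it directly after the base letter, so removing the tone marks and recomposing with NFC gives back the precomposed Ẹ, Ọ, and Ṣ forms.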