Resources for conservation, development, and documentation of low-resource (human) languages.
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
NLP pipelines for Tagalog using spaCy
CogNet: a large-scale, high-quality cognate database covering 338 languages, 1.07 million words, and 8.1 million cognates
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
Exploring the Limits of Low-Resource Neural Machine Translation
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
Curated list of publicly available parallel corpora for Indian languages
SemEval2024-task 11: Bridging the Gap in Text-Based Emotion Detection
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
The EveryVoice TTS Toolkit - Text To Speech for your language
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
A pipeline to isolate and transcribe one language in mixed-language speech