Available parallel data for training machine translation models for Indic languages: Hindi, Bengali, Gujarati, Gondi, Kannada, Manipuri, Marathi, Malayalam, Oriya, Punjabi, Sanskrit, Tamil, Telugu.
- Samanantar Corpus
- As-En PMIndia Corpus
- As-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row asm-eng.
- Samanantar Corpus
- Bn-En BUET Parallel Corpus: 2.75 million Bengali-English sentence pairs (EMNLP 2020)
- Bn-En Project Anuvaad
- Bn-En Indian Parallel Corpora
- CVIT-IIITH PIB Multilingual Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- Bn-En Indian-Language Dataset
- Bn-En Asian Language Treebank (ALT) Parallel Corpus
- Bn-En PMIndia Corpus
- Bn-En OPUS: Set source as `en` and target as `bn`
- Bn-En SUPARA 0.8M: Requires an IEEE DataPort Subscription
- Bn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ben-eng.
- Samanantar Corpus
- Gu-En WikiTitles Parallel Corpus: wikititles-v1.gu-en.tsv.gz
- Gu-En Project Anuvaad
- Gu-En Tsardia
- CVIT-IIITH PIB Multilingual Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- Gu-En Shahparth123
- Gu-En PMIndia Corpus
- Gu-En Bible Corpus
- Gu-En OPUS: Set source as `en` and target as `gu`
- Gu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row guj-eng.
- Samanantar Corpus
- Hi-En IITB Parallel Corpus: v3.0 released!
- Hi-En Project Anuvaad
- Hi-En Indian Parallel Corpora
- CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- Hi-En Asian Language Treebank (ALT) Parallel Corpus
- Hi-En PMIndia Corpus
- Hi-En Bible Corpus
- Hi-En Wiki Matrix Comparable Corpus
- Hi-En OPUS: Set source as `en` and target as `hi`. [Some of the corpora are part of the IITB Parallel Corpus.]
- Hi-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row hin-eng.
- IIITH Code-Mix Hi-En Corpus
- Hi-En Flickr 8k: Multimodal Dataset
- Hi-San parallel corpus: Hindi-Sanskrit monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.
- Samanantar Corpus
- Kn-En Project Anuvaad
- Kn-En PMIndia Corpus
- Kn-En Bible Corpus
- Kn-En OPUS: Set source as `en` and target as `kn`
- Kn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row kan-eng.
- Samanantar Corpus
- Mr-En Project Anuvaad
- CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- Mr-En PMIndia Corpus
- Mr-En Bible Corpus
- Mr-En OPUS: Set source as `en` and target as `mr`
- Mr-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mar-eng.
- Samanantar Corpus
- Ml-En Project Anuvaad
- Indian Parallel Corpora
- CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- Ml-En Indian-Language Dataset
- Ml-En English_Malayalam_ParallelCorpora
- Ml-En PMIndia Corpus
- Ml-En Bible Corpus
- Ml-En OPUS: Set source as `en` and target as `ml`
- Ml-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mal-eng.
- Samanantar Corpus
- Or-En MTEnglish2Odia
- Or-En OdiEnCorp 2.0
- Or-En OdiEnCorp 1.0
- Or-En IndoWordnet Parallel Corpus
- CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
- Or-En PMIndia Corpus
- Or-En OPUS: Set source as `en` and target as `or`
- Or-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ori-eng.
- Samanantar Corpus
- Pu-En Project Anuvaad
- Pu-En Punjabi-English Corpus
- Pu-En PMIndia Corpus
- Pu-En OPUS: Set source as `en` and target as `pa`
- Pu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row pan-eng.
- San-Hi parallel corpus: Sanskrit-Hindi monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.
- Samanantar Corpus
- Ta-En Project Anuvaad
- Ta-En Indian Parallel Corpora
- Ta-En National Language Process Center
- Ta-En EnTam
- CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
- Ta-En Indian-Language Dataset
- Ta-En Multiple Dataset Links
- Ta-En PMIndia Corpus
- Ta-En Parallel Corpus
- Ta-En OPUS: Set source as `en` and target as `ta`
- Ta-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tam-eng.
- Samanantar Corpus
- Te-En Project Anuvaad
- Te-En Indian Parallel Corpora
- CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
- CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
- Te-En Indian-Language Dataset
- Te-En PMIndia Corpus
- Te-En Bible Corpus
- Te-En OPUS: Set source as `en` and target as `te`
- Te-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tel-eng.
- PMIndia Parallel Corpus Creation: Code for creating a parallel corpus from pmindia.gov.in. [Paper Link]
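
Several of the corpora above ship as tab-separated files (for example `wikititles-v1.gu-en.tsv.gz`), and the OPUS and Tatoeba Challenge entries use two-letter and three-letter language codes respectively. The following is a minimal sketch of a loader, not any corpus's official tooling. It assumes one sentence pair per line in two tab-separated columns; the names `OPUS_TO_TATOEBA`, `tatoeba_row`, and `read_pairs` are hypothetical helpers introduced here for illustration, and the column order should be checked against each corpus's own documentation.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterator, Tuple

# Two-letter codes (as used for the OPUS targets and entry labels above)
# mapped to the three-letter codes in the Tatoeba Challenge row names.
OPUS_TO_TATOEBA = {
    "as": "asm", "bn": "ben", "gu": "guj", "hi": "hin", "kn": "kan",
    "mr": "mar", "ml": "mal", "or": "ori", "pa": "pan", "ta": "tam",
    "te": "tel",
}


def tatoeba_row(lang: str) -> str:
    """Return the Tatoeba Challenge row name, e.g. 'bn' -> 'ben-eng'."""
    return f"{OPUS_TO_TATOEBA[lang]}-eng"


def read_pairs(path) -> Iterator[Tuple[str, str]]:
    """Yield (first_column, second_column) pairs from a .tsv or .tsv.gz file.

    Assumption: two tab-separated columns per line. Rows with fewer than
    two columns, or with an empty side after stripping, are skipped.
    """
    path = Path(path)
    # Pick gzip.open for compressed files, the built-in open otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue
            first, second = row[0].strip(), row[1].strip()
            if first and second:
                yield first, second
```

For instance, `list(read_pairs("wikititles-v1.gu-en.tsv.gz"))` would collect the pairs from a downloaded copy of that file, and `tatoeba_row("gu")` gives the row name `guj-eng` to look up in the Tatoeba Challenge table. Skipping malformed rows rather than raising is a design choice that suits noisy crawled data; stricter pipelines may prefer to count or log the dropped lines.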