Foysal87/Bangla-NLP-Dataset

🇧🇩 Bangla NLP Dataset

A comprehensive, URL-validated collection of Bangla / Bengali NLP datasets, models, tools, and corpora — for researchers, students, and developers.

A verified collection of resources for Bangla-language NLP research, education, and applied work.


🔄 Our sbnltk dataset is in LFS mode — clone the repository to download data.

🚀 All deep-learning-era datasets are linked below; we'll keep adding new releases.
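
Regarding the Git LFS note above: a clone made without Git LFS installed leaves small text pointer stubs in place of the real data files. The sketch below checks whether a checked-out file is still such a stub, relying on the Git LFS pointer format (a short text file whose first line is `version https://git-lfs.github.com/spec/v1`). The function name and the 1 KiB size cut-off are illustrative assumptions, not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` looks like an un-fetched Git LFS pointer stub."""
    p = Path(path)
    # Pointer stubs are tiny; the 1 KiB cut-off is an assumed safety margin.
    if not p.is_file() or p.stat().st_size > 1024:
        return False
    with p.open("rb") as fh:
        return fh.read(len(LFS_HEADER)) == LFS_HEADER
```

If this returns True for a dataset file, running `git lfs install` followed by `git lfs pull` inside the clone should fetch the actual data.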

📖 About

This repository hosts the datasets used by the Bangla NLP toolkit sbnltk and serves as a URL-validated catalogue of publicly available Bangla NLP resources contributed by the worldwide Bangla research community.

Validation note: Every link in this document was tested during 2025–2026. Resources that previously appeared here under fabricated GitHub paths (e.g. github.com/poetry-bangla/corpus, github.com/medical-bangla/medical-translation) have been removed. If you find a dead link, please open an issue.

🎯 sbnltk Dataset List (DUMP & HUMAN Evaluated)

Dataset Description Link
Number List Bangla number list 📥 Download
Root Word List Bangla root word list 📥 Download
Word List Bangla word list (highest → lowest occurrence) 📥 Download
Wiki Dump Bangla wiki dump words 📥 Download
POS Tag Static Bangla POS-tag static dataset (single word) 📥 Download
NER Static Bangla NER static dataset (single word) 📥 Download
Stop Words Bangla stop-word list 📥 Download
Dump POS Tag Bangla dump POS-tag 📥 Download
Question Classification Bangla dump question classification dataset 📥 Download
Sentiment Analysis Bangla dump sentiment analysis 📥 Download
Translation Dataset Google translation dataset 📥 Download
NER Enhanced Existing NER dataset (modified + Date entity) 📥 Download
News Articles News article dataset 📥 Download
POS Converted POS-tag converted data 📥 Download
POS Human Evaluated POS-tag human-evaluated data 📥 Download
NER Dump (Both) Dump NER (active + passive) 📥 Download
NER Dump (Active) Dump NER (active only) 📥 Download
Extractive Summarization Extractive text summarization 🔗 GitHub
Abstractive Summarization Abstractive summarization (newspaper) 📥 Drive · 📊 Kaggle
Text Classification News article classification 📥 Drive · 📊 Kaggle
Keywords Classification Topic-keyword classification 📥 Drive · 📊 Kaggle

🤖 Pre-trained Language Models

BERT-style Encoders

| Model | Description | Params | Link |
| --- | --- | --- | --- |
| BanglaBERT | ELECTRA discriminator, SOTA Bangla NLU (BUET CSE NLP) | 110M | 🤗 HF · 🔗 GitHub |
| BanglaBERT (Small) | Lightweight variant | 13M | 🤗 HF |
| BanglaBERT (Large) | Large variant, top scores on BLUB | 335M | 🤗 HF |
| BanglishBERT | Bilingual (Bangla + English) | 110M | 🤗 HF |
| Bangla BERT Base (sagorsarker) | Popular community BERT | 110M | 🤗 HF |
| mBERT-Bengali-NER | Multilingual BERT fine-tuned for NER | | 🤗 HF |
| mBERT-Bengali-TyDiQA-QA | mBERT fine-tuned for QA | | 🤗 HF |
| sahajBERT | ALBERT-based collaborative training | 18M | 🤗 HF |
| MuRIL | Google multilingual (17 Indian) | 236M | 🤗 HF |
| IndicBERT | AI4Bharat (12 Indian) | | 🤗 HF |

Generative / Seq2Seq Models

| Model | Description | Params | Link |
| --- | --- | --- | --- |
| BanglaT5 | T5-style seq2seq (BUET) | 247M | 🤗 HF |
| BanglaT5 (small) | Small T5 variant | 60M | 🤗 HF |
| BanglaT5 NMT bn↔en | Translation seq2seq | | 🤗 bn→en · 🤗 en→bn |
| BanglaT5-Paraphrase | Paraphrase seq2seq | | 🤗 HF |
| BanglaByT5 | Byte-level T5 | small | 📄 arXiv 2505.17102 |
| GPT-2 Bengali | Flax-community GPT-2 | 117M | 🤗 HF |

Bangla LLMs (2025)

| Model | Description | Params | Link |
| --- | --- | --- | --- |
| TigerLLM-1B-it | Bangla instruction-tuned LLM | 1B | 🤗 HF |
| TigerLLM-9B-it | Larger variant, beats GPT-3.5 on Bangla | 9B | 🤗 HF |
| TituLLMs (1B / 3B) | Family of Bangla LLMs with benchmarks | 1B / 3B | 📄 arXiv 2502.11187 |
| TigerLLM Paper | ACL 2025 short paper | | 📄 arXiv 2503.10995 · 📄 ACL 2025 |
| BanglaLLaMA-3-8B-BnWiki-Instruct | Llama-3 fine-tuned on Bn Wiki | 8B | 🤗 HF |
| Bangla LLaMA (saiful9379) | LoRA-tuned LLaMA | | 🔗 GitHub |

Speech Models

| Model | Description | Performance | Link |
| --- | --- | --- | --- |
| Wav2Vec2-Bengali (300M) | Self-supervised ASR | 17.8 % WER | 🤗 HF |
| Wav2Vec2-XLSR Bengali | XLSR fine-tune | | 🤗 HF |
| BanglaConformer | Conformer ASR by Bengali.AI | | 🤗 HF |
| BanglaASR | Whisper fine-tuned for Bengali | 14.73 % WER | 🤗 HF · 🔗 GitHub |
| Whisper (multilingual) | OpenAI base model, Bn supported | various sizes | 🤗 HF |
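
The WER figures in the table are word error rates: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR output, divided by the number of reference words. A minimal pure-Python sketch follows. It assumes plain whitespace tokenization, which may differ from the text normalization each model's authors applied, so it will not necessarily reproduce the reported numbers; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev holds the previous row of the edit-distance table:
    # prev[j] = distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "আমি বাংলায় গান গাই"
hypothesis = "আমি বাংলা গান"
print(f"WER = {word_error_rate(reference, hypothesis):.2f}")
```

Because insertions count as errors, WER can exceed 100 % when the hypothesis is much longer than the reference.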

Word & Sentence Embeddings

Resource Description Link
Bangla FastText (sagorsarker) 20 M-token wiki-trained skipgram + CBOW 🤗 HF
Bangla Word2Vec (sagorsarker) 100-d Wikipedia embeddings 🤗 HF
fastText 157-language Bengali Facebook 300-d Wiki + CC 🌐 fastText
Spark NLP bengali_cc_300d Production embedding 🔗 Spark NLP
BanglaEmbed Cross-lingual distilled sentence embeddings 📄 arXiv 2411.15270

📚 Latest 2024–2026 Datasets

These are the most relevant new releases. Please cite the original authors when you use them.

| Dataset | Task | Size / Notes | Link |
| --- | --- | --- | --- |
| Bangla-Instruct | Instruction-tuning | 342 K instruction–response pairs | 🤗 HF |
| Bangla-TextBook | LM pretraining | 9.9 M tokens, 163 NCTB textbooks | 🤗 HF |
| BanglaSTEM | Technical-domain MT | 5 K Bn-En STEM sentence pairs | 📄 arXiv 2511.03498 |
| NCTB-QA | Educational QA | 87 805 QA pairs (grade 1–10) | 📄 arXiv 2603.05462 |
| BanglaQuAD | Open-domain QA | 30 808 QA pairs | 📄 arXiv 2410.10229 |
| ANCHOLIK-NER | Regional-dialect NER | 17 405 sentences, 5 regions | 📄 arXiv 2502.11198 |
| BanNERD (NAACL 2025) | NER, 10 classes / 29 domains | 85 K sentences, 991 K tokens | 🔗 GitHub |
| ONUBAD | Dialect→Standard MT | Chittagong / Sylhet / Barisal | 🔗 ScienceDirect |
| BanglaDial | Dialect text corpus | 60 729 entries × 11 dialects | 🔗 PMC |
| BIDWESH | Regional hate speech | Multi-region | 📄 arXiv 2507.16183 |
| BanglaTLit | Romanized→Bn back-transliteration | 42.7 K + 245.7 K pretrain | 📄 ACL 2024 |
| BanglishRev | E-commerce code-mix reviews | 1.74 M Daraz reviews | 📄 arXiv 2412.13161 |
| BengaliSent140 | Hate vs non-hate fusion | 140 K speeches | 📄 arXiv 2601.20129 · 🔗 IEEE DataPort |
| BLUCK | LLM cultural-knowledge benchmark | 2 366 MCQs / 23 categories | 📄 arXiv 2505.21092 |
| BNLI (refined) | NLI | Curated entail/contra/neutral | 📄 arXiv 2511.08813 |
| MultiBanAbs | Multi-domain abstractive sum. | Multi-corpus | 📄 arXiv 2511.19317 |
| MultiBanFakeDetect | Multimodal fake news | Text + image | 🔗 ScienceDirect |
| BanFakeNews-2.0 | Fake news (2024) | 47 K real + 13 K fake | 📊 Mendeley |
| BanglaHealth | Health-domain paraphrase | 200 K sentences | 🔗 ScienceDirect |
| BanglaCHQ-Summ | Consumer-health-question summary | 2 350 pairs (BLP-2023) | 🔗 GitHub |
| Bangla-MedER | Medical NER | 2 980 texts, 6 entity types | 📊 Mendeley |
| BanglaSarc3 | Sarcasm (ternary) | 12 089 FB comments | 🔗 ScienceDirect |
| VACASPATI | Bangla literature corpus | 11 M sentences / 115 M words | 📄 arXiv 2307.05083 |
| MixSarc | Code-mix sarcasm/humor/offence | Bn-En transliterated | 📄 arXiv 2602.21608 |
| EmoMix-3L | Code-mix emotion | 1 071 Bn-Hi-En instances | 🔗 GitHub |
| Bangla-ToCo | Context-aware toxic | 1 004 FB news comments | 🔗 ScienceDirect |
| BanglaDocAtlas | Document-layout, 8 classes | annotated complex docs | 🔗 IEEE |
| FoodBD | BD cuisine images, 67 categories | 3 523 polygon-annotated meals | 🔗 Springer 2025 |
| DeshiFoodBD | BD traditional food images | 5 425 images / 19 dishes | 🔗 Springer |

📊 Benchmarking and Evaluation

BLUB — Bangla Language Understanding Benchmark

The first comprehensive Bangla NLU benchmark, introduced with BanglaBERT (NAACL 2022).

| Task | Dataset | Metric | Best Model | Score |
| --- | --- | --- | --- | --- |
| Sentiment Classification | SentNoB | Macro-F1 | BanglaBERT | 72.89 |
| Natural Language Inference | XNLI-bn / BNLI | Accuracy | BanglaBERT (Large) | 83.41 |
| Named Entity Recognition | MultiCoNER | Micro-F1 | BanglaBERT (Large) | 79.20 |
| Question Answering | SQuAD-bn / TyDiQA | EM / F1 | BanglaBERT (Large) | 76.10 / 81.50 |

📄 BLUB code & leaderboard: github.com/csebuetnlp/banglabert
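
The metrics in the BLUB table aggregate results differently. Macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as common ones. Exact match (EM) for QA is the fraction of predictions identical to the gold answer. The Python sketch below illustrates both on made-up toy labels. It is a simplified illustration under those assumptions, not the official BLUB evaluation script. Micro-F1 for NER is normally computed over entity spans and is not shown.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def exact_match(gold_answers, predicted_answers):
    """Fraction of predictions equal to the gold answer after stripping whitespace."""
    hits = sum(g.strip() == p.strip() for g, p in zip(gold_answers, predicted_answers))
    return hits / len(gold_answers)


# Toy sentiment labels (hypothetical, for illustration only).
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "pos"]
print(round(macro_f1(gold, pred), 3))
print(exact_match(["ঢাকা", "১৯৭১"], ["ঢাকা", "১৯৫২"]))
```

In the toy example the "neu" class is never predicted, so its F1 of 0 pulls the macro average down even though it covers only one sample.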

BLUCK — Bangla LLM Cultural & Linguistic Benchmark (2025)

2 366 multiple-choice questions across 23 categories covering Bangladeshi culture, history, and Bangla linguistics, designed to probe the cultural knowledge of LLMs. 📄 arXiv 2505.21092

Recent Benchmark Datasets

Dataset Task Size Link
BanglaBook Sentiment 158 065 reviews 🔗 GitHub
SentMix-3L / OffMix-3L Code-mix sentiment / offence ~1 K each 📄 ACL
MultiCoNER (Bangla) Multilingual complex NER task 🔗 multiconer.github.io

📰 News, Corpora & Pretraining Data

Dataset Size Link
Bangla2B+ (BanglaBERT pretraining corpus) 27.5 GB / 110 sites 🔗 GitHub
BanglaLM (data-mining corpus) 14 GB 📄 IEEE
BdNC – Bangladesh National Corpus 40 GB / 3 B+ words 🔗 corpus.bangla.gov.bd
VACASPATI literary corpus 11 M sentences 📄 arXiv 2307.05083
CC-100 Bangla 8.3 GB 🔗 StatMT
OSCAR Bangla 12 GB+ 🔗 OSCAR
AI4Bharat IndicCorp 9 B tokens incl. Bangla 🔗 site
AI4Bharat IndicNLP corpus + catalog meta-resource 🔗 corpus · 🔗 catalog
Bangla Wikipedia Corpus (Kaggle) wiki-text 📊 Kaggle
Wikipedia bnwiki dumps latest dumps 🔗 dumps.wikimedia.org
Leipzig Bengali corpora (2021) 1.65 M sentences 🔗 corpora.uni-leipzig.de
Wiki Articles (Kaggle) wiki snapshot 📊 Kaggle
40k News Articles 40 K 📊 Kaggle
Largest Bangla Newspaper large multi-paper 📊 Kaggle
bdNews24 corpus bdnews24 articles 📊 Kaggle
Bangladesh Protidin Bangladesh Protidin news 📊 Kaggle
csebuetnlp/xlsum XL-Sum (Bangla subset) 🤗 HF
csebuetnlp/dailydialogue_bn Daily-dialogue translated 🤗 HF
goru001/nlp-for-bengali ULMFiT model + Wiki / news data 🔗 GitHub
masiur/Bangla-Corpus Open community corpus 🔗 GitHub

🔄 Machine Translation & Paraphrase

Dataset Description Link
csebuetnlp/BanglaNMT 2.38 M Bn-En pairs (133 MB) 🤗 HF · 🔗 GitHub
AI4Bharat Samanantar 49.6 M sentence pairs across Indic languages 🤗 HF · 🔗 site
SUPara0.8M Balanced En-Bn corpus 🔗 IEEE DataPort
BanglaSTEM 5 K STEM Bn-En pairs 📄 arXiv 2511.03498
WMT24 Bangla seed dataset High-quality manual translation 📄 ACL 2024
BanglaParaphrase 466 K paraphrase pairs (AACL 2022) 🤗 HF · 🔗 GitHub
OPUS Collections Multi-source parallel corpora 🔗 OPUS
Bengali Visual Genome 1.0 29 K image-caption multimodal 🔗 LINDAT
Google Dakshina 12 South-Asian transliteration / parallel 🔗 GitHub
BanglaTLit Romanized → Bangla 📄 ACL 2024
bntranslit Transliteration toolkit 🔗 GitHub
Bengali Dictionary (Minhas Kamal) Dictionary 🔗 GitHub
TED2020 (Bangla) TED multilingual 📥 TSV

🎤 Speech (ASR / TTS / Emotion)

ASR / Recognition

Dataset Description / Size Link
OpenSLR-53 Large Bengali ASR — 196 K utterances / 14.6 GB (Google) 🔗 OpenSLR
OpenSLR (HF mirror) All OpenSLR languages 🤗 HF
OpenSLR-104 Multilingual code-switching 🔗 OpenSLR
Bengali Common Voice (Mozilla) 399 h+ / 19 817 contributors (v9.0) 🔗 Mozilla
Bengali.AI OOD-Speech 1 177 h / 22 645 speakers — largest Bn ASR 🔗 Bengali.AI
FLEURS Bangla Cross-lingual 12 h 🤗 HF
BanglaASR Dataset Fine-tuned ASR 🔗 GitHub
SUST-CSE-Speech / banspeech TTS / ASR corpus 🤗 HF
SKNahin / open-large-bengali-asr-data Large open ASR 🤗 HF
Bangla-Speech-Corpora (Bangla-Language-Processing) Cleaned TTS-ready speech 🔗 GitHub

TTS

Dataset Description Link
OpenSLR-37 High-quality Google Bengali TTS 🔗 OpenSLR
Bengali.AI TTS dataset Studio-quality TTS 🔗 site
bangla-tts (zabir-nabil) Real-time multilingual synthesis 🔗 GitHub

Speech Emotion

Dataset Description Link
SUBESCO 7 K utterances / 7 emotions, gender-balanced 📄 PLOS One
BanglaSER 1 467 utterances / 34 speakers / 5 emotions 🔗 ScienceDirect
KBES Realistic speech-emotion w/ intensity 🔗 ScienceDirect
BANSpEmo Bangla emotional speech 📄 arXiv 2312.14020

Speech-to-Text Toolkits

Tool Notes Link
BanglaSpeech2Text Whisper-FT offline ASR (mp3/mp4/wav) 🔗 GitHub

😊 Sentiment / Emotion / Sarcasm

Dataset Size / Notes Link
BanglaBook 158 K book reviews 🔗 GitHub
SentNoB Noisy social-media sentiment 📊 Kaggle
EmoNoBa 22 698 comments / 6 emotions (AACL 2022) 📊 Kaggle · 📄 ACL
BanglaEmotion (shaoncsecu) Emotion benchmark 🔗 GitHub
MONOVAB Multi-label emotion 📄 paper
BAN-ABSA / BANGLA-ABSA 9 009 aspect-level comments 📄 arXiv 2012.00288 · 📊 Mendeley
BanglaSenti (lexicon) 61 582 polarity words 🔗 GitHub
banglanlp/bangla-sentiment-classification Compiled benchmarking sets 🔗 GitHub
Ayubur sentiment datasets Multiple legacy datasets 🔗 GitHub
BanglaSarc 5 112 sarcasm samples 📊 Kaggle
BanglaSarc3 12 089 ternary-class sarcasm 🔗 ScienceDirect
BnSentMix 20 K Bn-En code-mix sentiment 📄 paper
SentMix-3L / OffMix-3L Bn-En-Hi code-mix 📄 ACL
Drama Review Bengali drama reviews 📊 Figshare
YouTube Sentiment / Emotion YT comments 📊 Kaggle
News Comments Sentiment Bn news comments 📊 Kaggle
News Headline Categories Headline classification 📊 Kaggle
Big News Classification Large news classifier 📊 Kaggle
News Article Classification (IndicNLP) Indic news classifier 📊 Kaggle
Bengali-Banglish Emotion Mixed-script 📊 Mendeley
E-commerce Sentiment + Emotion 78 130 Daraz / Pickaboo reviews 🔗 ScienceDirect
BanglishRev 1.74 M Daraz code-mix reviews 📄 arXiv 2412.13161

🛡️ Hate Speech / Toxic / Cyberbullying

Dataset Size / Notes Link
Bengali-Hate-Speech (rezacsedu) 6 418 / 5 categories 🔗 GitHub · 📊 UCI
Bengali Hate Speech (naurosromim) Annotated hate dataset 📊 Kaggle
BIDWESH Regional-based hate speech 📄 arXiv 2507.16183
BengaliSent140 140 K hate vs non-hate 🔗 IEEE DataPort
ToxLex_bn 1 959 bigrams from 2.2 M FB comments 📊 Mendeley
Multi-Labeled Bengali Toxic 16 073 / 7 labels 🔗 GitHub
Bangla-ToCo 1 004 context-aware toxic 🔗 ScienceDirect
Bangla Multilabel Cyberbully 12 557 (5 classes) 📊 Mendeley
Bangla Social-Media Cyberbullying YT/FB/IG/TikTok 🔗 IEEE DataPort
Code-mixed Chaos (Banglish toxic) 10 234 multi-label 📊 Mendeley
BanglaMedia 7 725 YT comments — 10 topics, 4 sentiments 📊 Mendeley
VITD (BLP-2023 violence) Violence-inciting text — 3 classes 🔗 BLP

🕵️ Fake News & Misinformation

Dataset Size Link
BanFakeNews (LREC 2020) 50 K news 📊 Kaggle · 📄 arXiv 2004.08789
BanFakeNews-2.0 (2024) 47 K real + 13 K fake 📊 Mendeley
MultiBanFakeDetect Multimodal text + image 🔗 ScienceDirect
Rowan1224/FakeNews Code & data 🔗 GitHub
DataCOVID19 14 571 COVID misinformation 🔗 Springer

🏷️ NER / POS / Parsing

Dataset Size / Notes Link
B-NER (IEEE Access) 22 144 sentences 🔗 IEEE
BanNERD (NAACL 2025) 85 K sentences / 991 K tokens / 10 classes / 29 domains 🔗 GitHub
NER-Bangla-Dataset (MISabic) 70 K sentences / 5 types 🔗 GitHub
bnlp-resources NER Train/dev/test splits 🌐 banglanlp.github.io
ANCHOLIK-NER Regional NER, 5 regions 📄 arXiv 2502.11198
celloscope_bangla_ner_dataset 319 K NER 🤗 HF
celloscope_bangla_ner_dataset (small) 6.57 K 🤗 HF
Bangla-MedER Medical NER 2 980 / 6 types 📊 Mendeley
Bangla NER (towhidahmedfoysal) 400 K word-level 📊 Kaggle
POS — 3 K sentences abhishekgupta92 🔗 GitHub
POS — 100 K+ words (towhidahmedfoysal) Word-level POS 📊 Kaggle
UD_Bengali-BRU treebank 14 UPOS tags, UD v2.9+ 🌐 universaldependencies.org

❓ Question Answering

Dataset Size Link
csebuetnlp/squad_bn 118 K train QA 🤗 HF
BanglaRQA 14 889 QA / 3 K passages (EMNLP 2022) 🔗 GitHub
Bengali QA (Mayeesha) SQuAD 2.0–style Bn QA 📊 Kaggle
BanglaQuAD 30 808 QA pairs 📄 arXiv 2410.10229
NCTB-QA 87 805 educational QA 📄 arXiv 2603.05462
doctor_qa_bangla 5.14 K medical QA 🤗 HF
TyDiQA (Bengali subset) Cross-lingual QA 🤗 HF

📝 Text Summarization

Dataset Size / Notes Link
csebuetnlp/xlsum XL-Sum (Bangla subset, BBC) 🤗 HF
BanglaCHQ-Summ 2 350 health-question summary pairs 🔗 GitHub
MultiBanAbs Multi-domain abstractive 📄 arXiv 2511.19317
BNLPC + NCTB EACL 2021 unsupervised abstractive 🔗 GitHub
BANSData (Prithwiraj) News abstractive 📊 Kaggle
Bengali Text Summarization (Hasan Moni) Extractive + abstractive 📊 Kaggle
BUSUM-BNLP Multi-document update summarization 📊 Kaggle
bnSum_gemma7b-it Gemma-7B inst news summary system 🔗 GitHub
Bangla-Text-summarization-Dataset (Abid) Extractive 🔗 GitHub
3 Human-Evaluated articles (BNLPC) Reference summaries 🌐 BNLPC

🖊️ OCR, Handwriting & Document Layout

Dataset Size / Notes Link
NumtaDB 85 K handwritten Bn-digit images 📊 Kaggle
Bengali.AI Handwritten Grapheme Classification Kaggle competition 📊 Kaggle
Ekush Handwritten Bn characters (largest) 🌐 rabby.dev/ekush · 📊 Kaggle
BN-HTRd 788 pages / 150 writers / 108 K words (HTR) 📊 Mendeley
BanglaWriting Multi-purpose offline HTR 📄 paper
Bayanno Multi-purpose handwriting 📊 Mendeley
Bongabdo Bn handwritten script 📄 arXiv 2101.00204
CMATERdb (1 / 2.1.2) 5 K word imgs / 18 K Bn city names 🔗 site
BaDLAD 33 695 layout samples / 6 domains 📄 paper
BanglaDocAtlas 8-class complex Bn document layout 🔗 IEEE
DL Sprint 2.0 (BUET CSE Fest 2023) Layout segmentation 📊 Kaggle
Bangla License Plate 2.5 K 2 519 plate images 🔗 Zenodo
BD-ALPDR 725 high-res LP images 🌐 site
Govt. Bangla OCR service Free Bangla OCR 🌐 ocr.bangla.gov.bd

✋ Sign Language & Multimodal Vision

Dataset Size / Notes Link
BdSLW60 60 sign words / 9 307 video trials 📄 arXiv 2402.08635
KU-BdSL 30 classes / 38 consonants 📊 Mendeley
BAUST Lipi 18 K imgs / 36 alphabets 📄 arXiv 2408.10518
BDSL 49 29 490 imgs / 49 labels 📄 arXiv 2208.06827
BdSL36 4 M+ imgs / 36 cats 📄 paper
Ego-SLD Egocentric Bn sign-language video 🔗 IEEE DataPort
Bengali Visual Genome 1.0 Multimodal MT + captioning 🔗 LINDAT
BAN-Cap English-Bangla image-description 📄 arXiv 2205.14462
BNATURE Bn image captioning 📊 Kaggle
BanglaView 31 783 imgs / 158 K captions 📊 Mendeley
csebuetnlp/illusionVQA Bn-aware VQA 🤗 HF
DeshiFoodBD 5 425 BD-cuisine images / 19 dishes 🔗 Springer
FoodBD 3 523 polygon-annotated meals (2025) 🔗 Springer
BnLiT — Bangla Image-to-Text Natural-language image-text 📊 Kaggle

🗺️ Regional Dialects

Dataset Coverage Link
ANCHOLIK-NER Barishal, Chittagong, Mymensingh, Noakhali, Sylhet 📄 arXiv 2502.11198
ONUBAD Chittagong / Sylhet / Barisal → Standard Bn 🔗 ScienceDirect
BanglaDial 11 dialects / 60 729 entries 🔗 PMC
BIDWESH Regional hate speech 📄 arXiv 2507.16183
BanglaCHQ-Summ + dialect benchmarks Sylheti / Chittagonian 📄 ACL 2025
Sylheti → Standard NMT Sylheti corpus 🔗 ScienceDirect

🧪 NLI / Bias / Misc

Dataset Description Link
csebuetnlp/xnli_bn Bangla NLI translated from XNLI 🤗 HF
BNLI (refined) Curated entail/contra/neutral 📄 arXiv 2511.08813
csebuetnlp/BanglaSocialBias Social-bias evaluation 🤗 HF
csebuetnlp/BanglaContextualBias Contextual-bias evaluation 🤗 HF
csebuetnlp/CrossSum Cross-lingual summarization 🤗 HF
stopwords-iso/stopwords-bn Stopword list 🔗 GitHub
Bangla Plagiarism Dataset 59.9 K 🤗 HF
Banking 14-intents (en + bn + banglish) 16.5 K intent samples 🤗 HF
Massive intent (bn-BD) 16.5 K 🤗 HF
BanglaMusicStylo 2 824 lyrics / 211 lyricists 🔗 IEEE DataPort
Bangla Song Lyrics (genres + artists) Bn song-lyric corpus 📊 Kaggle
Bn Numbers w/ Words Number-name dataset 📊 Kaggle
likhonsheikh/BanglaNLP 120 K parallel news pairs 🤗 HF

🔧 NLP Tools & Libraries

Python Libraries

Library Description Link
BNLP (sagorbrur) Comprehensive Bengali NLP toolkit 🔗 GitHub · pip install bnlp_toolkit
BNLTK Bangla NLP toolkit (tokenize / stem / POS) 🔗 GitHub
sbnltk This repo's toolkit (sentiment / NER / POS / sum) 🔗 GitHub
bangla-stemmer Lightweight Bn stemmer 🔗 PyPI
bnunicode Bijoy → Unicode normalization 🔗 GitHub
Indic NLP Library Multi-Indic processing / transliteration 🔗 GitHub
bntranslit Bengali transliteration 🔗 GitHub
BanglaSpeech2Text Bangla offline ASR 🔗 GitHub
bangla-tts Bangla TTS library 🔗 GitHub
BanglaKit organisation Tools, datasets, resources 🔗 GitHub

OCR / Vision

Tool Notes Link
EasyOCR Built-in Bangla support 🔗 GitHub
Tesseract OCR ben traineddata available 🔗 GitHub
ocr.bangla.gov.bd Govt-hosted Bangla OCR service 🌐 site

📄 Research Papers

Foundational

LLMs & Generation (2024–2025)

QA / Reading Comprehension

NER

Summarization

Speech

Workshops


🇧🇩 Bangladesh Government Resources

Information-technology and Bangla-language resources from the Government of Bangladesh:

Resource বাংলা নাম Link
Bangladesh National Corpus (BdNC) বাংলাদেশ জাতীয় কর্পাস (৪০ GB / ৩B+ শব্দ) 🌐 corpus.bangla.gov.bd
Govt. Bangla OCR সরকারি বাংলা OCR সেবা 🌐 ocr.bangla.gov.bd
Accessible Dictionary প্রতিবন্ধী-বান্ধব বাংলা অভিধান 🌐 accessibledictionary.gov.bd
EBLICT Project তথ্যপ্রযুক্তিতে বাংলা ভাষা সমৃদ্ধকরণ প্রকল্প 🌐 eblict.portal.gov.bd
IPA — Information Processing Authority বাংলা ভাষা প্রক্রিয়াকরণ কর্তৃপক্ষ 🌐 ipa.bangla.gov.bd
Bangladesh Computer Council (BCC) বাংলাদেশ কম্পিউটার কাউন্সিল 🌐 bcc.gov.bd
Bangla Academy বাংলা একাডেমি 🌐 banglaacademy.gov.bd

🔗 Curated Aggregator Lists (start here)

Resource Maintainer Link
bnlp-resources banglanlp 🔗 GitHub · 🌐 site
awesome-bangla banglakit 🔗 GitHub
bangla-corpus sagorbrur 🔗 GitHub
Awesome_Bangla_Datasets sabbirhossainujjal 🔗 GitHub
AI4Bharat Indic NLP catalog AI4Bharat 🔗 GitHub
Mahadih534 — Bangla NLP HF collection Mahadih534 🤗 HF
Mahadih534 — Bangla TTS collection Mahadih534 🤗 HF
Mahadih534 — Bangla LLM finetuning Mahadih534 🤗 HF
csebuetnlp organization BUET CSE NLP 🤗 HF
Sudipta Kar Bangla NLP resources personal 🌐 site
হাতেকলমে Bangla NLP (Rakibul Hassan) educational notebooks 🔗 GitHub · 🌐 book
GitHub topic: bangla-nlp community-tagged 🌐 topic
GitHub topic: bengali-nlp community-tagged 🌐 topic
Awesome Public Datasets (general NLP) community 🔗 GitHub
NLP Datasets Collection community 🔗 GitHub

💡 Motivation & Contribution

Bangla is the 6th-most-spoken language in the world (~270 million speakers) but remains classified as low-resource in NLP. This repository exists to make every Bangla NLP resource one click away, accurately documented, and free of fabricated links.

How to Get Started

  1. For Pre-trained Models: follow the Hugging Face links above and load the models directly with the transformers library.
  2. For Tools: run pip install bnlp_toolkit or pip install bnltk.
  3. For Datasets: follow the individual links and honor each dataset's license (most are CC BY-NC-SA 4.0).
  4. For Research: start with the BLP workshop proceedings and the Latest 2024–2026 Datasets table.
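
As a rough illustration of step 1, the sketch below loads an encoder with the transformers `AutoTokenizer` / `AutoModel` classes and takes the first token's final hidden state as a sentence vector. The model ID `csebuetnlp/banglabert` is an assumption about where BanglaBERT lives on the Hugging Face Hub, and the helper names are hypothetical. It needs `transformers` and `torch` installed and downloads weights on first use.

```python
# Assumed Hub ID for BanglaBERT; replace it with the ID shown on the
# model page you actually follow from the tables above.
MODEL_ID = "csebuetnlp/banglabert"


def load_encoder(model_id=MODEL_ID):
    """Download (on first call) and return a (tokenizer, model) pair."""
    # Imported lazily so the module imports even without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def embed(sentence, tokenizer, model):
    """Return the final-layer hidden state of the first token as a 1-D tensor."""
    import torch

    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]
```

Typical use would be `tokenizer, model = load_encoder()` followed by `embed("আমি বাংলায় গান গাই।", tokenizer, model)`. Some model cards recommend a text-normalization step before tokenizing, so check the card of the model you pick before relying on the outputs.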

Contributing

  • 📝 Submit new datasets via pull request — include a working URL and one-line description.
  • 🐛 Report broken / fabricated links by opening an issue.
  • 🔬 Tag your own paper to share with the community.
  • ✅ Every PR is welcome — please verify URLs return 200 before submitting.
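
One way to check that URLs return 200 before opening a pull request is sketched below, using only the Python standard library. The function names and the User-Agent string are illustrative assumptions. Some servers reject HEAD requests or block automated clients, so a non-200 result is worth confirming by hand.

```python
import re
import urllib.error
import urllib.request

URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def extract_urls(text):
    """Return the unique http(s) URLs in `text`, in order of first appearance."""
    seen = []
    for url in URL_PATTERN.findall(text):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def status_of(url, timeout=10):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def broken_links(text):
    """Map each URL in `text` that did not return 200 to the status observed."""
    result = {}
    for url in extract_urls(text):
        code = status_of(url)
        if code != 200:
            result[url] = code
    return result
```

Running `broken_links(open("README.md", encoding="utf-8").read())` from the repository root would list the links to fix, assuming the README is stored under that filename.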

⭐ If this repository helps your work, please give it a star! ⭐

🤝 Contributions, corrections, and additions are very welcome.

🌟 Thanks to every researcher and developer pushing Bangla NLP forward.


☕ Support This Project

If this resource has been helpful, you can support its maintenance:

Buy Me A Coffee
