A comprehensive, URL-validated collection of Bangla / Bengali NLP datasets, models, tools, and corpora — for researchers, students, and developers.
A verified collection of resources for Bangla NLP research, education, and applied work.
🔄 The sbnltk datasets are stored with Git LFS; clone the repository with Git LFS installed to download the data.
🚀 All deep-learning-era datasets are linked below; we'll keep adding new releases.
This repository contains the datasets used by the Bangla NLP toolkit sbnltk and also serves as a URL-validated catalogue of publicly available Bangla NLP resources contributed by the worldwide Bangla research community.
Validation note: Every link in this document was tested in 2025–2026. Resources that previously appeared here under fabricated GitHub paths (e.g. github.com/poetry-bangla/corpus, github.com/medical-bangla/medical-translation) have been removed. If you find a dead link, please open an issue.
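If the repository was downloaded without Git LFS (for example as a ZIP archive), the data files may be small text pointers rather than the real data. The sketch below is one way to detect that; it assumes the standard Git LFS pointer format, whose first line starts with `version https://git-lfs.github.com/spec/v1`. The function names are illustrative and not part of sbnltk.

```python
from pathlib import Path

# First bytes of every Git LFS pointer file (standard pointer format).
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like an LFS pointer instead of real data."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_SIGNATURE))
    return head == LFS_SIGNATURE


def find_lfs_pointers(root):
    """List files under `root` that are still unresolved LFS pointers."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and is_lfs_pointer(path)
    )
```

If `find_lfs_pointers(".")` returns any paths inside a clone, running `git lfs pull` there should replace the pointers with the actual files.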
🎯 sbnltk Dataset List (DUMP & HUMAN Evaluated)
| Dataset | Description | Link |
| --- | --- | --- |
| Number List | Bangla number list | 📥 Download |
| Root Word List | Bangla root word list | 📥 Download |
| Word List | Bangla word list (highest → lowest occurrence) | 📥 Download |
| Wiki Dump | Bangla wiki dump words | 📥 Download |
| POS Tag Static | Bangla POS-tag static dataset (single word) | 📥 Download |
| NER Static | Bangla NER static dataset (single word) | 📥 Download |
| Stop Words | Bangla stop-word list | 📥 Download |
| Dump POS Tag | Bangla dump POS-tag | 📥 Download |
| Question Classification | Bangla dump question classification dataset | 📥 Download |
| Sentiment Analysis | Bangla dump sentiment analysis | 📥 Download |
| Translation Dataset | Google translation dataset | 📥 Download |
| NER Enhanced | Existing NER dataset (modified + Date entity) | 📥 Download |
| News Articles | News article dataset | 📥 Download |
| POS Converted | POS-tag converted data | 📥 Download |
| POS Human Evaluated | POS-tag human-evaluated data | 📥 Download |
| NER Dump (Both) | Dump NER (active + passive) | 📥 Download |
| NER Dump (Active) | Dump NER (active only) | 📥 Download |
| Extractive Summarization | Extractive text summarization | 🔗 GitHub |
| Abstractive Summarization | Abstractive summarization (newspaper) | 📥 Drive · 📊 Kaggle |
| Text Classification | News article classification | 📥 Drive · 📊 Kaggle |
| Keywords Classification | Topic-keyword classification | 📥 Drive · 📊 Kaggle |
🤖 Pre-trained Language Models
| Model | Description | Params | Link |
| --- | --- | --- | --- |
| BanglaBERT | ELECTRA discriminator, SOTA Bangla NLU (BUET CSE NLP) | 110M | 🤗 HF · 🔗 GitHub |
| BanglaBERT (Small) | Lightweight variant | 13M | 🤗 HF |
| BanglaBERT (Large) | Large variant, top scores on BLUB | 335M | 🤗 HF |
| BanglishBERT | Bilingual (Bangla + English) | 110M | 🤗 HF |
| Bangla BERT Base (sagorsarker) | Popular community BERT | 110M | 🤗 HF |
| mBERT-Bengali-NER | Multilingual BERT fine-tuned for NER | - | 🤗 HF |
| mBERT-Bengali-TyDiQA-QA | mBERT fine-tuned for QA | - | 🤗 HF |
| sahajBERT | ALBERT-based collaborative training | 18M | 🤗 HF |
| MuRIL | Google multilingual (17 Indian) | 236M | 🤗 HF |
| IndicBERT | AI4Bharat (12 Indian) | - | 🤗 HF |
Generative / Seq2Seq Models
| Model | Description | Params | Link |
| --- | --- | --- | --- |
| BanglaT5 | T5-style seq2seq (BUET) | 247M | 🤗 HF |
| BanglaT5 (small) | Small T5 variant | 60M | 🤗 HF |
| BanglaT5 NMT bn↔en | Translation seq2seq | - | 🤗 bn→en · 🤗 en→bn |
| BanglaT5-Paraphrase | Paraphrase seq2seq | - | 🤗 HF |
| BanglaByT5 | Byte-level T5 | small | 📄 arXiv 2505.17102 |
| GPT-2 Bengali | Flax-community GPT-2 | 117M | 🤗 HF |

| Model | Description | Params | Link |
| --- | --- | --- | --- |
| TigerLLM-1B-it | Bangla instruction-tuned LLM | 1B | 🤗 HF |
| TigerLLM-9B-it | Larger variant, beats GPT-3.5 on Bangla | 9B | 🤗 HF |
| TituLLMs (1B / 3B) | Family of Bangla LLMs with benchmarks | 1B / 3B | 📄 arXiv 2502.11187 |
| TigerLLM Paper | ACL 2025 short paper | - | 📄 arXiv 2503.10995 · 📄 ACL 2025 |
| BanglaLLaMA-3-8B-BnWiki-Instruct | Llama-3 fine-tuned on Bn Wiki | 8B | 🤗 HF |
| Bangla LLaMA (saiful9379) | LoRA-tuned LLaMA | - | 🔗 GitHub |

| Model | Description | Performance | Link |
| --- | --- | --- | --- |
| Wav2Vec2-Bengali (300M) | Self-supervised ASR | 17.8 % WER | 🤗 HF |
| Wav2Vec2-XLSR Bengali | XLSR fine-tune | - | 🤗 HF |
| BanglaConformer | Conformer ASR by Bengali.AI | - | 🤗 HF |
| BanglaASR | Whisper fine-tuned for Bengali | 14.73 % WER | 🤗 HF · 🔗 GitHub |
| Whisper (multilingual) | OpenAI base model, Bn supported | various sizes | 🤗 HF |
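The Performance column above reports word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition. It assumes plain whitespace tokenization; the published figures may rely on additional text normalization, so it will not necessarily reproduce them.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("আমি আজ বাজারে যাব", "আমি কাল বাজারে যাব"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.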
Word & Sentence Embeddings
| Resource | Description | Link |
| --- | --- | --- |
| Bangla FastText (sagorsarker) | 20 M-token wiki-trained skipgram + CBOW | 🤗 HF |
| Bangla Word2Vec (sagorsarker) | 100-d Wikipedia embeddings | 🤗 HF |
| fastText 157-language Bengali | Facebook 300-d Wiki + CC | 🌐 fastText |
| Spark NLP bengali_cc_300d | Production embedding | 🔗 Spark NLP |
| BanglaEmbed | Cross-lingual distilled sentence embeddings | 📄 arXiv 2411.15270 |
📚 Latest 2024–2026 Datasets
These are the most relevant new releases — cite the original authors when used.
| Dataset | Task | Size / Notes | Link |
| --- | --- | --- | --- |
| Bangla-Instruct | Instruction-tuning | 342 K instruction–response pairs | 🤗 HF |
| Bangla-TextBook | LM pretraining | 9.9 M tokens, 163 NCTB textbooks | 🤗 HF |
| BanglaSTEM | Technical-domain MT | 5 K Bn-En STEM sentence pairs | 📄 arXiv 2511.03498 |
| NCTB-QA | Educational QA | 87 805 QA pairs (grade 1–10) | 📄 arXiv 2603.05462 |
| BanglaQuAD | Open-domain QA | 30 808 QA pairs | 📄 arXiv 2410.10229 |
| ANCHOLIK-NER | Regional-dialect NER | 17 405 sentences, 5 regions | 📄 arXiv 2502.11198 |
| BanNERD (NAACL 2025) | NER, 10 classes / 29 domains | 85 K sentences, 991 K tokens | 🔗 GitHub |
| ONUBAD | Dialect→Standard MT | Chittagong / Sylhet / Barisal | 🔗 ScienceDirect |
| BanglaDial | Dialect text corpus | 60 729 entries × 11 dialects | 🔗 PMC |
| BIDWESH | Regional hate speech | Multi-region | 📄 arXiv 2507.16183 |
| BanglaTLit | Romanized→Bn back-transliteration | 42.7 K + 245.7 K pretrain | 📄 ACL 2024 |
| BanglishRev | E-commerce code-mix reviews | 1.74 M Daraz reviews | 📄 arXiv 2412.13161 |
| BengaliSent140 | Hate vs non-hate fusion | 140 K speeches | 📄 arXiv 2601.20129 · 🔗 IEEE DataPort |
| BLUCK | LLM cultural-knowledge benchmark | 2 366 MCQs / 23 categories | 📄 arXiv 2505.21092 |
| BNLI (refined) | NLI | Curated entail/contra/neutral | 📄 arXiv 2511.08813 |
| MultiBanAbs | Multi-domain abstractive sum. | Multi-corpus | 📄 arXiv 2511.19317 |
| MultiBanFakeDetect | Multimodal fake news | Text + image | 🔗 ScienceDirect |
| BanFakeNews-2.0 | Fake news (2024) | 47 K real + 13 K fake | 📊 Mendeley |
| BanglaHealth | Health-domain paraphrase | 200 K sentences | 🔗 ScienceDirect |
| BanglaCHQ-Summ | Consumer-health-question summary | 2 350 pairs (BLP-2023) | 🔗 GitHub |
| Bangla-MedER | Medical NER | 2 980 texts, 6 entity types | 📊 Mendeley |
| BanglaSarc3 | Sarcasm (ternary) | 12 089 FB comments | 🔗 ScienceDirect |
| VACASPATI | Bangla literature corpus | 11 M sentences / 115 M words | 📄 arXiv 2307.05083 |
| MixSarc | Code-mix sarcasm/humor/offence | Bn-En transliterated | 📄 arXiv 2602.21608 |
| EmoMix-3L | Code-mix emotion | 1 071 Bn-Hi-En instances | 🔗 GitHub |
| Bangla-ToCo | Context-aware toxic | 1 004 FB news comments | 🔗 ScienceDirect |
| BanglaDocAtlas | Document-layout, 8 classes | annotated complex docs | 🔗 IEEE |
| FoodBD | BD cuisine images, 67 categories | 3 523 polygon-annotated meals | 🔗 Springer 2025 |
| DeshiFoodBD | BD traditional food images | 5 425 images / 19 dishes | 🔗 Springer |
📊 Benchmarking and Evaluation
BLUB — Bangla Language Understanding Benchmark
The first comprehensive Bangla NLU benchmark, introduced with BanglaBERT (NAACL 2022).
| Task | Dataset | Metric | Best Model | Score |
| --- | --- | --- | --- | --- |
| Sentiment Classification | SentNoB | Macro-F1 | BanglaBERT | 72.89 |
| Natural Language Inference | XNLI-bn / BNLI | Accuracy | BanglaBERT (Large) | 83.41 |
| Named Entity Recognition | MultiCoNER | Micro-F1 | BanglaBERT (Large) | 79.20 |
| Question Answering | SQuAD-bn / TyDiQA | EM / F1 | BanglaBERT (Large) | 76.10 / 81.50 |
📄 BLUB code & leaderboard: github.com/csebuetnlp/banglabert
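The BLUB table reports Macro-F1 for sentiment and Micro-F1 for NER. As a rough illustration of how the two averages differ, the sketch below computes both from per-label counts: macro-F1 averages the per-class F1 scores, while micro-F1 pools all true positives, false positives, and false negatives before computing a single F1. It assumes one label per example; the official BLUB evaluation scripts are not reproduced here, and NER Micro-F1 is normally computed over entity spans rather than individual labels.

```python
from collections import Counter


def f1_scores(gold, predicted):
    """Return (macro_f1, micro_f1) for two equal-length label sequences."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            tp[gold_label] += 1
        else:
            fp[predicted_label] += 1  # predicted this class wrongly
            fn[gold_label] += 1       # missed the true class

    def f1(true_pos, false_pos, false_neg):
        denominator = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denominator if denominator else 0.0

    labels = sorted(set(gold) | set(predicted))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
macro_f1, micro_f1 = f1_scores(gold, pred)
print(round(macro_f1, 4), micro_f1)  # 0.7778 0.75
```

In this single-label setting micro-F1 equals plain accuracy, which is why macro-F1 is the more informative figure on class-imbalanced data such as SentNoB.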
BLUCK — Bangla LLM Cultural & Linguistic Benchmark (2025)
2 366 multiple-choice questions across 23 categories covering Bangladesh culture, history, and Bangla linguistics — designed to probe LLM cultural knowledge. 📄 arXiv 2505.21092
Recent Benchmark Datasets
| Dataset | Task | Size | Link |
| --- | --- | --- | --- |
| BanglaBook | Sentiment | 158 065 reviews | 🔗 GitHub |
| SentMix-3L / OffMix-3L | Code-mix sentiment / offence | ~1 K each | 📄 ACL |
| MultiCoNER (Bangla) | Multilingual complex NER | task | 🔗 multiconer.github.io |
📰 News, Corpora & Pretraining Data
| Dataset | Size | Link |
| --- | --- | --- |
| Bangla2B+ (BanglaBERT pretraining corpus) | 27.5 GB / 110 sites | 🔗 GitHub |
| BanglaLM (data-mining corpus) | 14 GB | 📄 IEEE |
| BdNC – Bangladesh National Corpus ✅ | 40 GB / 3 B+ words | 🔗 corpus.bangla.gov.bd |
| VACASPATI literary corpus | 11 M sentences | 📄 arXiv 2307.05083 |
| CC-100 Bangla | 8.3 GB | 🔗 StatMT |
| OSCAR Bangla | 12 GB+ | 🔗 OSCAR |
| AI4Bharat IndicCorp | 9 B tokens incl. Bangla | 🔗 site |
| AI4Bharat IndicNLP corpus + catalog | meta-resource | 🔗 corpus · 🔗 catalog |
| Bangla Wikipedia Corpus (Kaggle) | wiki-text | 📊 Kaggle |
| Wikipedia bnwiki dumps | latest dumps | 🔗 dumps.wikimedia.org |
| Leipzig Bengali corpora (2021) | 1.65 M sentences | 🔗 corpora.uni-leipzig.de |
| Wiki Articles (Kaggle) | wiki snapshot | 📊 Kaggle |
| 40k News Articles | 40 K | 📊 Kaggle |
| Largest Bangla Newspaper | large multi-paper | 📊 Kaggle |
| bdNews24 corpus | bdnews24 articles | 📊 Kaggle |
| Bangladesh Protidin | Bangladesh Protidin news | 📊 Kaggle |
| csebuetnlp/xlsum | XL-Sum (Bangla subset) | 🤗 HF |
| csebuetnlp/dailydialogue_bn | Daily-dialogue translated | 🤗 HF |
| goru001/nlp-for-bengali | ULMFiT model + Wiki / news data | 🔗 GitHub |
| masiur/Bangla-Corpus | Open community corpus | 🔗 GitHub |
🔄 Machine Translation & Paraphrase
| Dataset | Description | Link |
| --- | --- | --- |
| csebuetnlp/BanglaNMT | 2.38 M Bn-En pairs (133 MB) | 🤗 HF · 🔗 GitHub |
| AI4Bharat Samanantar | 49.6 M sentence pairs across Indic languages | 🤗 HF · 🔗 site |
| SUPara0.8M | Balanced En-Bn corpus | 🔗 IEEE DataPort |
| BanglaSTEM | 5 K STEM Bn-En pairs | 📄 arXiv 2511.03498 |
| WMT24 Bangla seed dataset | High-quality manual translation | 📄 ACL 2024 |
| BanglaParaphrase | 466 K paraphrase pairs (AACL 2022) | 🤗 HF · 🔗 GitHub |
| OPUS Collections | Multi-source parallel corpora | 🔗 OPUS |
| Bengali Visual Genome 1.0 | 29 K image-caption multimodal | 🔗 LINDAT |
| Google Dakshina | 12 South-Asian transliteration / parallel | 🔗 GitHub |
| BanglaTLit | Romanized → Bangla | 📄 ACL 2024 |
| bntranslit | Transliteration toolkit | 🔗 GitHub |
| Bengali Dictionary (Minhas Kamal) | Dictionary | 🔗 GitHub |
| TED2020 (Bangla) | TED multilingual | 📥 TSV |
🎤 Speech (ASR / TTS / Emotion)
| Dataset | Description / Size | Link |
| --- | --- | --- |
| OpenSLR-53 | Large Bengali ASR, 196 K utterances / 14.6 GB (Google) | 🔗 OpenSLR |
| OpenSLR (HF mirror) | All OpenSLR languages | 🤗 HF |
| OpenSLR-104 | Multilingual code-switching | 🔗 OpenSLR |
| Bengali Common Voice (Mozilla) | 399 h+ / 19 817 contributors (v9.0) | 🔗 Mozilla |
| Bengali.AI OOD-Speech | 1 177 h / 22 645 speakers, largest Bn ASR | 🔗 Bengali.AI |
| FLEURS Bangla | Cross-lingual 12 h | 🤗 HF |
| BanglaASR Dataset | Fine-tuned ASR | 🔗 GitHub |
| SUST-CSE-Speech / banspeech | TTS / ASR corpus | 🤗 HF |
| SKNahin / open-large-bengali-asr-data | Large open ASR | 🤗 HF |
| Bangla-Speech-Corpora (Bangla-Language-Processing) | Cleaned TTS-ready speech | 🔗 GitHub |

| Dataset | Description | Link |
| --- | --- | --- |
| OpenSLR-37 | High-quality Google Bengali TTS | 🔗 OpenSLR |
| Bengali.AI TTS dataset | Studio-quality TTS | 🔗 site |
| bangla-tts (zabir-nabil) | Real-time multilingual synthesis | 🔗 GitHub |

| Tool | Notes | Link |
| --- | --- | --- |
| BanglaSpeech2Text | Whisper-FT offline ASR (mp3/mp4/wav) | 🔗 GitHub |
😊 Sentiment / Emotion / Sarcasm
| Dataset | Size / Notes | Link |
| --- | --- | --- |
| BanglaBook | 158 K book reviews | 🔗 GitHub |
| SentNoB | Noisy social-media sentiment | 📊 Kaggle |
| EmoNoBa | 22 698 comments / 6 emotions (AACL 2022) | 📊 Kaggle · 📄 ACL |
| BanglaEmotion (shaoncsecu) | Emotion benchmark | 🔗 GitHub |
| MONOVAB | Multi-label emotion | 📄 paper |
| BAN-ABSA / BANGLA-ABSA | 9 009 aspect-level comments | 📄 arXiv 2012.00288 · 📊 Mendeley |
| BanglaSenti (lexicon) | 61 582 polarity words | 🔗 GitHub |
| banglanlp/bangla-sentiment-classification | Compiled benchmarking sets | 🔗 GitHub |
| Ayubur sentiment datasets | Multiple legacy datasets | 🔗 GitHub |
| BanglaSarc | 5 112 sarcasm samples | 📊 Kaggle |
| BanglaSarc3 | 12 089 ternary-class sarcasm | 🔗 ScienceDirect |
| BnSentMix | 20 K Bn-En code-mix sentiment | 📄 paper |
| SentMix-3L / OffMix-3L | Bn-En-Hi code-mix | 📄 ACL |
| Drama Review | Bengali drama reviews | 📊 Figshare |
| YouTube Sentiment / Emotion | YT comments | 📊 Kaggle |
| News Comments Sentiment | Bn news comments | 📊 Kaggle |
| News Headline Categories | Headline classification | 📊 Kaggle |
| Big News Classification | Large news classifier | 📊 Kaggle |
| News Article Classification (IndicNLP) | Indic news classifier | 📊 Kaggle |
| Bengali-Banglish Emotion | Mixed-script | 📊 Mendeley |
| E-commerce Sentiment + Emotion | 78 130 Daraz / Pickaboo reviews | 🔗 ScienceDirect |
| BanglishRev | 1.74 M Daraz code-mix reviews | 📄 arXiv 2412.13161 |
🛡️ Hate Speech / Toxic / Cyberbullying
| Dataset | Size / Notes | Link |
| --- | --- | --- |
| Bengali-Hate-Speech (rezacsedu) | 6 418 / 5 categories | 🔗 GitHub · 📊 UCI |
| Bengali Hate Speech (naurosromim) | Annotated hate dataset | 📊 Kaggle |
| BIDWESH | Regional-based hate speech | 📄 arXiv 2507.16183 |
| BengaliSent140 | 140 K hate vs non-hate | 🔗 IEEE DataPort |
| ToxLex_bn | 1 959 bigrams from 2.2 M FB comments | 📊 Mendeley |
| Multi-Labeled Bengali Toxic | 16 073 / 7 labels | 🔗 GitHub |
| Bangla-ToCo | 1 004 context-aware toxic | 🔗 ScienceDirect |
| Bangla Multilabel Cyberbully | 12 557 (5 classes) | 📊 Mendeley |
| Bangla Social-Media Cyberbullying | YT/FB/IG/TikTok | 🔗 IEEE DataPort |
| Code-mixed Chaos (Banglish toxic) | 10 234 multi-label | 📊 Mendeley |
| BanglaMedia | 7 725 YT comments (10 topics, 4 sentiments) | 📊 Mendeley |
| VITD (BLP-2023 violence) | Violence-inciting text, 3 classes | 🔗 BLP |
🕵️ Fake News & Misinformation
| Dataset | Size / Notes | Link |
| --- | --- | --- |
| B-NER (IEEE Access) | 22 144 sentences | 🔗 IEEE |
| BanNERD (NAACL 2025) | 85 K sentences / 991 K tokens / 10 classes / 29 domains | 🔗 GitHub |
| NER-Bangla-Dataset (MISabic) | 70 K sentences / 5 types | 🔗 GitHub |
| bnlp-resources NER | Train/dev/test splits | 🌐 banglanlp.github.io |
| ANCHOLIK-NER | Regional NER, 5 regions | 📄 arXiv 2502.11198 |
| celloscope_bangla_ner_dataset | 319 K NER | 🤗 HF |
| celloscope_bangla_ner_dataset (small) | 6.57 K | 🤗 HF |
| Bangla-MedER | Medical NER 2 980 / 6 types | 📊 Mendeley |
| Bangla NER (towhidahmedfoysal) | 400 K word-level | 📊 Kaggle |
| POS, 3 K sentences | abhishekgupta92 | 🔗 GitHub |
| POS, 100 K+ words (towhidahmedfoysal) | Word-level POS | 📊 Kaggle |
| UD_Bengali-BRU treebank | 14 UPOS tags, UD v2.9+ | 🌐 universaldependencies.org |

| Dataset | Size | Link |
| --- | --- | --- |
| csebuetnlp/squad_bn | 118 K train QA | 🤗 HF |
| BanglaRQA | 14 889 QA / 3 K passages (EMNLP 2022) | 🔗 GitHub |
| Bengali QA (Mayeesha) | SQuAD 2.0–style Bn QA | 📊 Kaggle |
| BanglaQuAD | 30 808 QA pairs | 📄 arXiv 2410.10229 |
| NCTB-QA | 87 805 educational QA | 📄 arXiv 2603.05462 |
| doctor_qa_bangla | 5.14 K medical QA | 🤗 HF |
| TyDiQA (Bengali subset) | Cross-lingual QA | 🤗 HF |

| Dataset | Size / Notes | Link |
| --- | --- | --- |
| csebuetnlp/xlsum | XL-Sum (Bangla subset, BBC) | 🤗 HF |
| BanglaCHQ-Summ | 2 350 health-question summary pairs | 🔗 GitHub |
| MultiBanAbs | Multi-domain abstractive | 📄 arXiv 2511.19317 |
| BNLPC + NCTB | EACL 2021 unsupervised abstractive | 🔗 GitHub |
| BANSData (Prithwiraj) | News abstractive | 📊 Kaggle |
| Bengali Text Summarization (Hasan Moni) | Extractive + abstractive | 📊 Kaggle |
| BUSUM-BNLP | Multi-document update summarization | 📊 Kaggle |
| bnSum_gemma7b-it | Gemma-7B inst news summary system | 🔗 GitHub |
| Bangla-Text-summarization-Dataset (Abid) | Extractive | 🔗 GitHub |
| 3 Human-Evaluated articles (BNLPC) | Reference summaries | 🌐 BNLPC |
🖊️ OCR, Handwriting & Document Layout
| Dataset | Size / Notes | Link |
| --- | --- | --- |
| NumtaDB | 85 K handwritten Bn-digit images | 📊 Kaggle |
| Bengali.AI Handwritten Grapheme Classification | Kaggle competition | 📊 Kaggle |
| Ekush | Handwritten Bn characters (largest) | 🌐 rabby.dev/ekush · 📊 Kaggle |
| BN-HTRd | 788 pages / 150 writers / 108 K words (HTR) | 📊 Mendeley |
| BanglaWriting | Multi-purpose offline HTR | 📄 paper |
| Bayanno | Multi-purpose handwriting | 📊 Mendeley |
| Bongabdo | Bn handwritten script | 📄 arXiv 2101.00204 |
| CMATERdb (1 / 2.1.2) | 5 K word imgs / 18 K Bn city names | 🔗 site |
| BaDLAD | 33 695 layout samples / 6 domains | 📄 paper |
| BanglaDocAtlas | 8-class complex Bn document layout | 🔗 IEEE |
| DL Sprint 2.0 (BUET CSE Fest 2023) | Layout segmentation | 📊 Kaggle |
| Bangla License Plate 2.5 K | 2 519 plate images | 🔗 Zenodo |
| BD-ALPDR | 725 high-res LP images | 🌐 site |
| Govt. Bangla OCR service | Free Bangla OCR | 🌐 ocr.bangla.gov.bd |
✋ Sign Language & Multimodal Vision
| Dataset | Description | Link |
| --- | --- | --- |
| csebuetnlp/xnli_bn | Bangla NLI translated from XNLI | 🤗 HF |
| BNLI (refined) | Curated entail/contra/neutral | 📄 arXiv 2511.08813 |
| csebuetnlp/BanglaSocialBias | Social-bias evaluation | 🤗 HF |
| csebuetnlp/BanglaContextualBias | Contextual-bias evaluation | 🤗 HF |
| csebuetnlp/CrossSum | Cross-lingual summarization | 🤗 HF |
| stopwords-iso/stopwords-bn | Stopword list | 🔗 GitHub |
| Bangla Plagiarism Dataset | 59.9 K | 🤗 HF |
| Banking 14-intents (en + bn + banglish) | 16.5 K intent samples | 🤗 HF |
| Massive intent (bn-BD) | 16.5 K | 🤗 HF |
| BanglaMusicStylo | 2 824 lyrics / 211 lyricists | 🔗 IEEE DataPort |
| Bangla Song Lyrics (genres + artists) | Bn song-lyric corpus | 📊 Kaggle |
| Bn Numbers w/ Words | Number-name dataset | 📊 Kaggle |
| likhonsheikh/BanglaNLP | 120 K parallel news pairs | 🤗 HF |

| Library | Description | Link |
| --- | --- | --- |
| BNLP (sagorbrur) | Comprehensive Bengali NLP toolkit | 🔗 GitHub · pip install bnlp_toolkit |
| BNLTK | Bangla NLP toolkit (tokenize / stem / POS) | 🔗 GitHub |
| sbnltk | This repo's toolkit (sentiment / NER / POS / sum) | 🔗 GitHub |
| bangla-stemmer | Lightweight Bn stemmer | 🔗 PyPI |
| bnunicode | Bijoy → Unicode normalization | 🔗 GitHub |
| Indic NLP Library | Multi-Indic processing / transliteration | 🔗 GitHub |
| bntranslit | Bengali transliteration | 🔗 GitHub |
| BanglaSpeech2Text | Bangla offline ASR | 🔗 GitHub |
| bangla-tts | Bangla TTS library | 🔗 GitHub |
| BanglaKit organisation | Tools, datasets, resources | 🔗 GitHub |

| Tool | Notes | Link |
| --- | --- | --- |
| EasyOCR | Built-in Bangla support | 🔗 GitHub |
| Tesseract OCR | ben traineddata available | 🔗 GitHub |
| ocr.bangla.gov.bd | Govt-hosted Bangla OCR service | 🌐 site |
LLMs & Generation (2024–2025)
QA / Reading Comprehension
🇧🇩 Bangladesh Government Resources
Bangladesh government information-technology and Bangla language resources:
🔗 Curated Aggregator Lists (start here)
| Resource | Maintainer | Link |
| --- | --- | --- |
| bnlp-resources | banglanlp | 🔗 GitHub · 🌐 site |
| awesome-bangla | banglakit | 🔗 GitHub |
| bangla-corpus | sagorbrur | 🔗 GitHub |
| Awesome_Bangla_Datasets | sabbirhossainujjal | 🔗 GitHub |
| AI4Bharat Indic NLP catalog | AI4Bharat | 🔗 GitHub |
| Mahadih534: Bangla NLP HF collection | Mahadih534 | 🤗 HF |
| Mahadih534: Bangla TTS collection | Mahadih534 | 🤗 HF |
| Mahadih534: Bangla LLM finetuning | Mahadih534 | 🤗 HF |
| csebuetnlp organization | BUET CSE NLP | 🤗 HF |
| Sudipta Kar Bangla NLP resources | personal | 🌐 site |
| হাতেকলমে Bangla NLP (Rakibul Hassan) | educational notebooks | 🔗 GitHub · 🌐 book |
| GitHub topic: bangla-nlp | community-tagged | 🌐 topic |
| GitHub topic: bengali-nlp | community-tagged | 🌐 topic |
| Awesome Public Datasets (general NLP) | community | 🔗 GitHub |
| NLP Datasets Collection | community | 🔗 GitHub |
💡 Motivation & Contribution
Bangla is the 6th-most-spoken language in the world (~270 million speakers, counting first- and second-language users) but remains classified as low-resource in NLP. This repository exists to make every Bangla NLP resource one click away, accurately documented and free of fabricated links.
For Pre-trained Models — visit the HuggingFace links above and load directly with transformers.
For Tools — pip install bnlp_toolkit or pip install bnltk.
For Datasets — follow individual links; honor each dataset's license (most are CC BY-NC-SA 4.0).
For Research — start with the BLP workshop proceedings and the latest 2024–2026 release table.
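As a rough sketch of the "load directly with transformers" step, the snippet below loads an encoder and mean-pools its last hidden state into a sentence vector. The Hub IDs in `MODEL_IDS` are assumptions about which checkpoints the model tables link to, so confirm them on the model pages. The helper names are illustrative, `transformers` and `torch` must be installed, and the first call downloads weights from the Hub.

```python
# Assumed Hugging Face Hub IDs for two encoders from the model tables;
# check the linked model pages before relying on them.
MODEL_IDS = {
    "BanglaBERT": "csebuetnlp/banglabert",
    "Bangla BERT Base": "sagorsarker/bangla-bert-base",
}


def load_encoder(name="BanglaBERT"):
    """Download (or read from cache) the tokenizer and encoder for `name`."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    model_id = MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return a mean-pooled sentence vector (a simple, untuned baseline)."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```

A typical call would be `tokenizer, model = load_encoder()` followed by `embed("আমি বাংলায় গান গাই", tokenizer, model)`. Some model cards, BanglaBERT's among them, recommend normalizing the input text before tokenizing; that step is omitted here.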
📝 Submit new datasets via pull request — include a working URL and one-line description.
🐛 Report broken / fabricated links by opening an issue.
🔬 Tag your own paper to share with the community.
✅ Every PR is welcome — please verify URLs return 200 before submitting.
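One way to check that URLs return 200 before opening a pull request is sketched below, using only the Python standard library. The function names and the `User-Agent` string are illustrative, not part of this repository's tooling. The `opener` parameter exists so the check can be exercised without network access.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the final HTTP status for `url`, or None if no response arrived."""
    try:
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "link-check/0.1"}
        )
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error status (404, 403, ...)
    except (urllib.error.URLError, TimeoutError, ValueError):
        return None  # DNS failure, refused connection, timeout, or malformed URL


def broken_links(urls, opener=urllib.request.urlopen):
    """Return (url, status) pairs for every URL that did not return 200."""
    failures = []
    for url in urls:
        status = url_status(url, opener=opener)
        if status != 200:
            failures.append((url, status))
    return failures
```

Some hosts reject `HEAD` requests (for example with 405) even though the page exists, so a non-200 result is a prompt to check the link by hand rather than proof that it is dead.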
⭐ If this repository helps your work, please give it a star! ⭐
🤝 Contributions, corrections, and additions are very welcome.
🌟 Thanks to every researcher and developer pushing Bangla NLP forward.