OdiaGenAI/Indic_LLM_Resource_Catalog

A Catalog for Indic LLM Resources

This catalog aims to assist researchers seeking Indic LLM resources. It is a collaborative endeavor, and any input that enhances the collection of Indic LLM resources is appreciated. All contributors are acknowledged in the CONTRIBUTOR list.


Table of Contents

Instruction Set

Odia

  • odia_master_data_llama2: This dataset contains 180k Odia instructions translated from open-source instruction sets, plus Odia domain-knowledge instruction sets.
  • odia_context_10k_llama2_set: This dataset contains 10K instructions that span various facets of Odisha's unique identity. The instructions cover a wide array of subjects, ranging from the culinary delights in 'RECIPES,' the historical significance of 'HISTORICAL PLACES,' and 'TEMPLES OF ODISHA,' to the intellectual pursuits in 'ARITHMETIC,' 'HEALTH,' and 'GEOGRAPHY.' It also explores the artistic tapestry of Odisha through 'ART AND CULTURE,' which celebrates renowned figures in 'FAMOUS ODIA POETS/WRITERS', and 'FAMOUS ODIA POLITICAL LEADERS'. Furthermore, it encapsulates 'SPORTS' and the 'GENERAL KNOWLEDGE OF ODISHA,' providing an all-encompassing representation of the state.
  • roleplay_odia: This dataset contains a 1k Odia roleplay instruction set in conversational format.
  • OdiEnCorp_translation_instructions_25k: This dataset contains a 25k English-to-Odia translation instruction set.

Bengali

  • all_combined_bengali_252K: This dataset is a mix of Bengali instruction sets translated from open-source instruction sets: Dolly, Alpaca, ChatDoctor, Roleplay, and GSM.

Hindi

  • hindi_alpaca_dolly_67k_formatted: This dataset is translated from open-source Alpaca_Dolly instruction sets.
  • instruction_set_hindi_1035: The dataset was created using the OliveFarm web application. It covers the domains of Art, Sports (Cricket, Football, Olympics), Politics, History, Cooking, Environment, and Music.
  • roleplay_hindi: The dataset contains a 1k Hindi roleplay instruction set.
  • sentiment_analysis_hindi: This dataset contains a 2.5k Hindi sentiment-analysis instruction set.

Kannada

  • airoboros-3.2_kn: This dataset contains 35.5k Kannada instruction (input, instruction, output) sets.

Punjabi

  • punjabi_alpaca_52K: This dataset contains 52k Punjabi instruction (input, instruction, output) sets translated from Alpaca.

Telugu

  • telgu_alpaca_dolly_67k: This dataset contains 67k Telugu instruction (input, instruction, output) sets translated from Alpaca and Dolly.

Indic

  • aya_dataset: This dataset contains instruction sets in different Indic languages (Hindi, Tamil, Punjabi, Telugu, Marathi, Gujarati, Malayalam, Bengali). Paper
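Several of the datasets above are described as (input, instruction, output) sets, i.e. the Alpaca-style record format. As a minimal sketch of how such a record is typically flattened into a single training prompt (the field names follow the common Alpaca convention and the example record is hypothetical, not drawn from any dataset listed here):

```python
# Sketch: flatten an Alpaca-style (instruction, input, output) record into
# one training prompt. Field names follow the common Alpaca convention;
# individual catalog datasets may use different schemas.

def format_prompt(record: dict) -> str:
    """Build a prompt string from an instruction record."""
    if record.get("input"):  # records with optional context use an Input block
        return (
            "### Instruction:\n" + record["instruction"] + "\n\n"
            "### Input:\n" + record["input"] + "\n\n"
            "### Response:\n" + record["output"]
        )
    # records without context omit the Input block entirely
    return (
        "### Instruction:\n" + record["instruction"] + "\n\n"
        "### Response:\n" + record["output"]
    )

# Hypothetical example record (illustrative only)
example = {
    "instruction": "Translate the following English sentence to Odia.",
    "input": "The weather is nice today.",
    "output": "ଆଜି ପାଗ ଭଲ ଅଛି ।",
}
print(format_prompt(example))
```

Datasets delivered as conversation-format roleplay sets (e.g. roleplay_odia) generally need a chat template instead of this single-turn layout.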

Pre-train Dataset

Indic

  • CulturaX: A multilingual dataset containing monolingual data for several Indic languages (Hindi, Bangla, Tamil, Malayalam, Marathi, Telugu, Kannada, Gujarati, Punjabi, Odia, Assamese, etc.). Paper
  • varta: The dataset contains 41.8 million news articles in 14 Indic languages and English, crawled from DailyHunt, a popular news aggregator in India that pulls high-quality articles from multiple trusted and reputed news publishers.

Foundation LLM

Hindi

  • OpenHathi-7B-Hi-v0.1-Base: This is a 7B-parameter model based on Llama2, trained on Hindi, English, and Hinglish. As per the authors, it is a base model and is not meant to be used as is; they recommend first fine-tuning it on the task(s) of interest. Blog

Tamil

Marathi

  • MahaMarathi-7B-v24.01-Base: This is a domain-adapted, continually pre-trained, and instruction fine-tuned native Marathi large language model (LLM) with 7 billion parameters, based on Llama2+Mistral and trained on a large corpus of Marathi text. As per the authors, it is a base model and is not meant to be used as is; it should first be fine-tuned on downstream tasks.

Odia

  • Qwen_1.5_Odia_7B: This is a pre-trained Odia large language model with 7 billion parameters, based on Qwen 1.5-7B. The model is pre-trained on the Culturex-Odia dataset, a filtered version of the original CulturaX dataset for Odia text. As per the authors, it is a base model and is not meant to be used as is; it should first be fine-tuned on downstream tasks. Blog

Fine-Tuned LLM

Odia

Hindi

Bengali

  • odiagenAI-bengali-base-model-v1: This model is based on Llama-7b and fine-tuned with a 252k Bengali instruction set translated from open-source resources, giving it good Bengali instruction-understanding and response-generation capabilities. Blog

Benchmarking Set

  • Airavata Evaluation Suite: A collection of benchmarks used to evaluate Airavata, a Hindi instruction-tuned model built on top of Sarvam's OpenHathi base model.
  • Indic LLM Benchmark: A collection of LLM benchmark data in Gujarati, Nepali, Malayalam, Hindi, Telugu, Marathi, Kannada, and Bengali.

Citation:

If you find this repository useful, please consider giving ⭐ and citing:

@misc{Indic_LLM_Resources,
  author = {Shantipriya Parida and Sambit Sekhar},
  title = {Indic LLM Resource Catalog},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shantipriyap/Indic_LLM_Resource_Catalog}},
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0
