OdiaGenAI/Indic_LLM_Resource_Catalog

A Catalog for Indic LLM Resources

This catalog aims to assist researchers seeking Indic LLM resources. It is a collaborative endeavor, and any input that enhances the collection of Indic LLM resources is appreciated. All contributors are acknowledged in the CONTRIBUTOR list.


Table of Contents

Instruction Set

Odia

  • odia_master_data_llama2: This dataset contains 180k Odia instructions translated from open-source instruction sets, plus Odia domain-knowledge instruction sets.
  • odia_context_10k_llama2_set: This dataset contains 10K instructions that span various facets of Odisha's unique identity. The instructions cover a wide array of subjects, ranging from the culinary delights in 'RECIPES,' the historical significance of 'HISTORICAL PLACES,' and 'TEMPLES OF ODISHA,' to the intellectual pursuits in 'ARITHMETIC,' 'HEALTH,' and 'GEOGRAPHY.' It also explores the artistic tapestry of Odisha through 'ART AND CULTURE,' which celebrates renowned figures in 'FAMOUS ODIA POETS/WRITERS', and 'FAMOUS ODIA POLITICAL LEADERS'. Furthermore, it encapsulates 'SPORTS' and the 'GENERAL KNOWLEDGE OF ODISHA,' providing an all-encompassing representation of the state.
  • roleplay_odia: This dataset contains a 1k Odia roleplay instruction set in conversational format.
  • OdiEnCorp_translation_instructions_25k: This dataset contains a 25k English-to-Odia translation instruction set.

Bengali

  • all_combined_bengali_252K: This dataset is a mix of Bengali instruction sets translated from open-source instruction sets: Dolly, Alpaca, ChatDoctor, Roleplay, and GSM.

Hindi

  • hindi_alpaca_dolly_67k_formatted: This dataset is translated from open-source Alpaca_Dolly instruction sets.
  • instruction_set_hindi_1035: The dataset was created using the OliveFarm web application. It covers the domains of Art, Sports (Cricket, Football, Olympics), Politics, History, Cooking, Environment, and Music.
  • roleplay_hindi: The dataset contains a 1k Hindi roleplay instruction set.
  • sentiment_analysis_hindi: This dataset contains a 2.5k Hindi sentiment-analysis instruction set.

Kannada

  • airoboros-3.2_kn: This dataset contains 35.5k Kannada instruction (input, instruction, output) sets.

Punjabi

  • punjabi_alpaca_52K: This dataset contains 52k Punjabi instruction (input, instruction, output) sets translated from Alpaca.

Telugu

  • telgu_alpaca_dolly_67k: This dataset contains 67k Telugu instruction (input, instruction, output) sets translated from Alpaca and Dolly.

Indic

  • aya_dataset: This dataset contains instruction sets in different Indic languages (Hindi, Tamil, Punjabi, Telugu, Marathi, Gujarati, Malayalam, Bengali). Paper
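Several of the datasets above are described as (input, instruction, output) sets, i.e. the Alpaca-style record format. As a minimal sketch of how such a record is typically flattened into a single training prompt (the field names follow the common Alpaca convention and the example record is hypothetical, not drawn from any dataset listed here):

```python
# Sketch: flatten an Alpaca-style (instruction, input, output) record into
# one training prompt. Field names follow the common Alpaca convention;
# individual catalog datasets may use different schemas.

def format_prompt(record: dict) -> str:
    """Build a prompt string from an instruction record."""
    if record.get("input"):  # records with optional context use an Input block
        return (
            "### Instruction:\n" + record["instruction"] + "\n\n"
            "### Input:\n" + record["input"] + "\n\n"
            "### Response:\n" + record["output"]
        )
    # records without context omit the Input block entirely
    return (
        "### Instruction:\n" + record["instruction"] + "\n\n"
        "### Response:\n" + record["output"]
    )

# Hypothetical example record (illustrative only)
example = {
    "instruction": "Translate the following English sentence to Odia.",
    "input": "The weather is nice today.",
    "output": "ଆଜି ପାଗ ଭଲ ଅଛି ।",
}
print(format_prompt(example))
```

Datasets delivered as conversation-format roleplay sets (e.g. roleplay_odia) generally need a chat template instead of this single-turn layout.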

Pre-train Dataset

Indic

  • CulturaX: A multilingual dataset containing monolingual data for several Indic languages (Hindi, Bangla, Tamil, Malayalam, Marathi, Telugu, Kannada, Gujarati, Punjabi, Odia, Assamese, etc.). Paper
  • varta: The dataset contains 41.8 million news articles in 14 Indic languages and English, crawled from DailyHunt, a popular news aggregator in India that pulls high-quality articles from multiple trusted and reputed news publishers.

Foundation LLM

Hindi

  • OpenHathi-7B-Hi-v0.1-Base: This is a 7B-parameter model based on Llama2, trained on Hindi, English, and Hinglish. As per the authors, it is a base model and is not meant to be used as is; they recommend first fine-tuning it on the task(s) of interest. Blog

Tamil

Marathi

  • MahaMarathi-7B-v24.01-Base: This is a domain-adapted, continually pre-trained, and instruction fine-tuned native Marathi large language model (LLM) with 7 billion parameters, based on Llama2+Mistral and trained on a large corpus of Marathi text. As per the authors, it is a base model and is not meant to be used as is; it should first be fine-tuned on downstream tasks.

Odia

  • Qwen_1.5_Odia_7B: This is a pre-trained Odia large language model with 7 billion parameters, based on Qwen 1.5-7B. The model is pre-trained on the Culturex-Odia dataset, a filtered version of the original CulturaX dataset for Odia text. As per the authors, it is a base model and is not meant to be used as is; it should first be fine-tuned on downstream tasks. Blog

Fine-Tuned LLM

Odia

Hindi

Bengali

  • odiagenAI-bengali-base-model-v1: This model is based on Llama-7b and fine-tuned with a 252k Bengali instruction set translated from open-source resources, giving it good Bengali instruction-understanding and response-generation capabilities. Blog

Benchmarking Set

  • Airavata Evaluation Suite: A collection of benchmarks used to evaluate Airavata, a Hindi instruction-tuned model built on top of Sarvam's OpenHathi base model.
  • Indic LLM Benchmark: A collection of LLM benchmark data in Gujarati, Nepali, Malayalam, Hindi, Telugu, Marathi, Kannada, and Bengali.

Citation:

If you find this repository useful, please consider giving ⭐ and citing:

@misc{Indic_LLM_Resources,
  author = {Shantipriya Parida and Sambit Sekhar},
  title = {Indic LLM Resource Catalog},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shantipriyap/Indic_LLM_Resource_Catalog}},
}

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0
