This repository contains the code and resources for our paper "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai" submitted to ACL SRW 2024. Our work presents a novel approach to generating synthetic instruction-tuning data for low-resource languages, with a specific focus on Thai.
- Seed-free framework for generating synthetic instruction-tuning data
- Incorporation of three key properties: fluency, diversity, and cultural context
- Data-efficient approach achieving competitive results with only 5,000 instructions
- Comprehensive evaluation across multiple models, datasets, and tasks
- Clone this repository:
  ```bash
  git clone https://github.com/parinzee/seed-free-synthetic-instruct.git
  cd seed-free-synthetic-instruct
  ```
- Create a virtual environment and activate it:
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install the required packages:
  ```bash
  pip install -r requirements.txt
  ```
- First, make a copy of `example-settings.toml` and configure the models (OpenAI, Claude, vLLM, Groq, etc.).
- Set the target language in the settings file.
- Generate data:
  ```bash
  python3 -m clsit.runner --generate /path/to/yaml/config
  ```
To get clean JSONL files ready for training with axolotl:
```bash
python3 -m clsit.runner --clean /path/to/yaml/config
python3 -m clsit.runner --export /path/to/yaml/config
```
The JSONL files will appear in your configured output directory as:
- `train_data.jsonl`
- `val_data.jsonl`
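As a quick sanity check before training, you can read an exported split line by line. The sketch below assumes an instruction/input/output record shape, which is common for axolotl-style datasets but is only an assumption about this repository's actual schema:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record; the field names are an assumption based on common
# axolotl dataset formats, not a guarantee about this repo's export schema.
sample_rows = [
    {"instruction": "อธิบายประเพณีสงกรานต์", "input": "", "output": "สงกรานต์คือ..."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_data.jsonl"
    # JSONL: one JSON object per line; keep Thai text readable with ensure_ascii=False.
    path.write_text(
        "\n".join(json.dumps(r, ensure_ascii=False) for r in sample_rows),
        encoding="utf-8",
    )
    rows = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]

print(len(rows))
```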
See our axolotl configurations for examples of how to use these files for training.
- Use vLLM to host your finetuned model.
- Run prediction:
  ```bash
  cd eval/
  python3 eval_vllm.py --model-name SERVED_VLLM_MODEL_NAME --few-shot 0
  ```
- Calculate scores:
  ```bash
  python3 calculate_scores.py .
  ```
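Conceptually, score calculation boils down to grouping per-example metrics by task and averaging them. The sketch below shows that aggregation with hypothetical field names; it is not the actual output schema of `eval_vllm.py` or `calculate_scores.py`:

```python
from statistics import mean

# Hypothetical per-example results, mirroring the kind of data a prediction
# run might produce (field names are illustrative assumptions).
records = [
    {"task": "thai_culture", "bertscore_f1": 0.81},
    {"task": "thai_culture", "bertscore_f1": 0.79},
    {"task": "general", "bertscore_f1": 0.85},
]

def average_by_task(rows):
    """Group records by task and average the BERTScore F1 per group."""
    by_task = {}
    for row in rows:
        by_task.setdefault(row["task"], []).append(row["bertscore_f1"])
    return {task: round(mean(scores), 4) for task, scores in by_task.items()}

print(average_by_task(records))
```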
- Visualize scores:
  - First, edit `eval/visualize_results.py` and add your model name to the model-name dictionary.
  - Then run:
    ```bash
    python3 visualize_results.py
    ```
Dataset: We release our best-performing dataset publicly on the Hugging Face Hub.
Model: We release our best-performing model on the Hugging Face Hub.
Our best-performing synthetic dataset (F+ C+ D+) achieved results competitive with state-of-the-art Thai LLMs while using only 5,000 instructions. Key findings include:
- Comparable performance to WangchanX and OpenThaiGPT
- Second-highest BERTScore on both Thai Culture and General Test Sets
- Significant improvement over baseline models lacking key properties
For detailed results and analysis, please refer to the paper and the `results/` directory.
COMING SOON
This project is licensed under the MIT License.
We extend our sincere gratitude to Potsawee Manakul for his invaluable assistance during the early stages of this project.
This research has received funding support from the NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation Grant Number B46G670083.
For any questions or concerns, please open an issue in this repository.