This guide explains how to prepare your Hirnu language data for model training.
The data preparation pipeline consists of:
- Data Collection - Gather grammar, vocabulary, and text data
- Preprocessing - Clean and normalize raw data
- Format Conversion - Convert to MLX-compatible JSONL format
- Dataset Splitting - Create train/test/valid splits
- Validation - Verify dataset quality
MLX fine-tuning requires data in JSONL (JSON Lines) format with three splits:
- `train.jsonl` - Training data
- `test.jsonl` - Test data (for evaluation during training)
- `valid.jsonl` - Validation data (for final evaluation)
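Each line in a JSONL file is one independent JSON object. A minimal sketch of reading and writing this layout (these helpers are illustrative, not part of the pipeline):

```python
import json

def write_jsonl(path, records):
    """Write one JSON object per line, as the JSONL format requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

write_jsonl("sample.jsonl", [{"text": "Sample Hirnu text"}])
print(read_jsonl("sample.jsonl"))  # [{'text': 'Sample Hirnu text'}]
```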
The pipeline supports three MLX data formats:

**Chat format:**

```json
{
  "messages": [
    {"role": "system", "content": "System prompt here"},
    {"role": "user", "content": "User message"},
    {"role": "assistant", "content": "Assistant response"}
  ]
}
```

**Completion format:**

```json
{
  "prompt": "Translate to Hirnu: Hello",
  "completion": "Hirnu translation here"
}
```

**Text format:**

```json
{
  "text": "Full text content here"
}
```

Place your raw data in these directories:
```
data/raw/
├── grammar/      # Grammar rules and examples
├── vocabulary/   # Word definitions and translations
└── texts/        # Hirnu texts and stories
```
Place grammar files in `data/raw/grammar/`:
- Grammar rules
- Sentence patterns
- Language structure examples
Example file structure:
```
data/raw/grammar/
├── basic_rules.txt
├── verb_conjugation.txt
└── sentence_structure.txt
```
Place vocabulary files in `data/raw/vocabulary/`:
- Word lists
- English-Hirnu translations
- Definitions
Example file structure:
```
data/raw/vocabulary/
├── common_words.txt
├── verbs.txt
└── nouns.txt
```
Place Hirnu texts in `data/raw/texts/`:
- Stories
- Dialogues
- Example sentences
Example file structure:
```
data/raw/texts/
├── story_01.txt
├── dialogue_01.txt
└── examples.txt
```
Structure your data as English-Hirnu pairs. Example format in a text file:
```
EN: Hello, how are you?
HI: [Hirnu translation]

EN: What is your name?
HI: [Hirnu translation]
```
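A parser for this pair layout could look like the sketch below. It is a hypothetical helper, not the project's actual preprocessor; the real parsing logic belongs in `src/data/preprocessor.py`:

```python
def parse_translation_pairs(text):
    """Parse alternating EN:/HI: lines into prompt/completion records."""
    examples = []
    english = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("EN:"):
            english = line[3:].strip()
        elif line.startswith("HI:") and english is not None:
            examples.append({
                "prompt": f"Translate to Hirnu: {english}",
                "completion": line[3:].strip(),
            })
            english = None  # require a fresh EN: line for the next pair
    return examples

sample = "EN: Hello, how are you?\nHI: [Hirnu translation]"
print(parse_translation_pairs(sample))
```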
Simply include Hirnu text content. The system will use it for language modeling.
Structure as question-answer pairs:
```
Q: What is Hirnu?
A: Hirnu is an ancient Scandinavian language...

Q: How do you say "hello" in Hirnu?
A: [Hirnu translation]
```
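Q/A pairs like these map naturally onto the chat format. A sketch of that mapping is below; the default system prompt shown here is a placeholder assumption, not the one the pipeline actually uses:

```python
def qa_to_chat(question, answer, system_prompt="You are a Hirnu language expert."):
    """Wrap a Q/A pair in the MLX chat format. The system prompt is a placeholder."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }

print(qa_to_chat("What is Hirnu?", "Hirnu is an ancient Scandinavian language..."))
```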
Edit `src/data/preprocessor.py` to customize how your data is processed:

```python
def preprocess_vocabulary_data(self, vocab_dir: Path) -> List[Dict[str, str]]:
    """Process vocabulary data into training examples."""
    examples = []
    # Your custom processing logic here
    # For example, parse your specific file format
    return examples
```

Edit `src/data/converter.py` to customize format conversion:
```python
def to_chat_format(self, example: Dict[str, Any]) -> Dict[str, Any]:
    """Convert example to chat format."""
    # Customize based on your data structure
    messages = [
        {"role": "system", "content": self.chat_template["system"]},
        {"role": "user", "content": example["input"]},
        {"role": "assistant", "content": example["output"]}
    ]
    return {"messages": messages}
```

Run the full pipeline:

```bash
python scripts/prepare_data.py
```

This will:
- Process all raw data
- Convert to MLX format (configured in `configs/data_config.yaml`)
- Create train/test/valid splits (80/10/10 by default)
- Validate output datasets
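The 80/10/10 split can be sketched as follows. This is a simplified stand-in for what `scripts/prepare_data.py` does internally, not its actual implementation; the real ratios come from `configs/data_config.yaml`:

```python
import random

def split_dataset(examples, train=0.8, test=0.1, seed=42):
    """Shuffle and split examples into train/test/valid lists."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # valid gets the remainder
    )

train_set, test_set, valid_set = split_dataset(list(range(100)))
print(len(train_set), len(test_set), len(valid_set))  # 80 10 10
```

Fixing the seed makes the split reproducible across runs, which matters when you re-run preparation after adding data.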
Use a custom config file:

```bash
python scripts/prepare_data.py --config my_config.yaml
```

Validate existing datasets without reprocessing:

```bash
python scripts/prepare_data.py --validate-only
```

Skip the validation step (faster, for development):

```bash
python scripts/prepare_data.py --skip-validation
```

Edit `configs/data_config.yaml` to customize:
```yaml
splits:
  train: 0.8
  test: 0.1
  valid: 0.1
  random_seed: 42

format:
  type: "chat"        # Options: "chat", "completion", "text"
  max_length: 2048

preprocessing:
  lowercase: false
  remove_special_chars: false
  normalize_whitespace: true
  min_text_length: 10
  max_text_length: 4096
```

The validation step checks:
- File existence
- JSON format validity
- Required fields presence
- Data structure compliance
If validation fails, review the error messages and fix the issues in your raw data or preprocessing logic.
1. Add raw data:

   ```bash
   # Add your files to data/raw directories
   cp my_grammar_files/* data/raw/grammar/
   cp my_vocab_files/* data/raw/vocabulary/
   cp my_texts/* data/raw/texts/
   ```

2. Configure format:

   ```bash
   # Edit configs/data_config.yaml
   # Set format.type to "chat", "completion", or "text"
   ```

3. Customize processing (if needed):

   ```bash
   # Edit src/data/preprocessor.py
   # Implement your custom data parsing logic
   ```

4. Run preparation:

   ```bash
   python scripts/prepare_data.py
   ```

5. Verify output:

   ```bash
   # Check the generated files
   head -n 5 data/processed/train.jsonl
   ```
For testing without real data, create sample files:

```bash
# Create sample text
echo "Sample Hirnu text for testing" > data/raw/texts/sample.txt

# Run preparation
python scripts/prepare_data.py
```

After data preparation:
- Review generated datasets in `data/processed/`
- Proceed to training (see TRAINING.md)
- Adjust configuration if needed and re-run preparation
- Verify files exist in the `data/raw/` directories
- Check that file formats are readable
- Review custom preprocessing logic
- Check JSONL format (one JSON object per line)
- Verify required fields are present
- Review error messages for specific issues
- Add more raw data
- Adjust split ratios in configuration
- Consider data augmentation techniques
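One simple augmentation for translation pairs is adding the reverse direction for each pair. This is an illustrative sketch using the completion format; whether reversed pairs help depends on your training goal:

```python
def augment_with_reverse(examples):
    """For each EN->Hirnu pair, also emit the Hirnu->EN direction."""
    augmented = list(examples)
    for ex in examples:
        prompt = ex["prompt"]
        if prompt.startswith("Translate to Hirnu: "):
            source = prompt[len("Translate to Hirnu: "):]
            augmented.append({
                "prompt": f"Translate to English: {ex['completion']}",
                "completion": source,
            })
    return augmented

pairs = [{"prompt": "Translate to Hirnu: Hello", "completion": "..."}]
print(len(augment_with_reverse(pairs)))  # 2
```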