We’ve all used Large Language Models (LLMs) and been amazed by what they can do. I wanted to understand how these models are built, so I created this course.
I’m from Morocco and speak Moroccan Darija. Most LLMs today understand it a little, but they can't hold proper conversations in Darija. So, as a challenge, I decided to train a language model from scratch using my own WhatsApp conversations in Darija.
I've made a YouTube playlist documenting every step. You can watch it at your own pace. If anything is unclear, feel free to open an issue in this repository. I’ll be happy to help!
- `notebooks/`: Jupyter notebooks for each step in the pipeline.
- `slides/`: Presentation slides used in the YouTube series.
- `data/`: Sample data and templates.
- `transformer/`: Scripts for the Transformer and LoRA implementations.
- `minbpe/`: Andrej Karpathy's BPE tokenizer, vendored here since it isn't available as a package.
To get started, make sure Python is installed, then install the required dependencies:

```bash
pip install -r requirements.txt
```
This course covers:
- Extracting data from WhatsApp (a parsing sketch follows below).
- Tokenizing text with the BPE algorithm (tokenizer sketch below).
- Understanding Transformer models.
- Pre-training the model (training-step sketch below).
- Creating a fine-tuning dataset.
- Fine-tuning the model with instruction tuning and LoRA (sketch below).
Each topic has a video in the YouTube playlist and a Jupyter notebook in the `notebooks/` folder.
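To give a taste of the first step, here is a minimal sketch of parsing a WhatsApp chat export into (sender, message) pairs. It assumes the plain-text Android export format `M/D/YY, H:MM PM - Sender: message`; the format varies by platform and locale, and the actual notebook may handle more cases.

```python
import re

# Assumed Android export format: "12/31/23, 9:45 PM - Name: message".
# Adjust the pattern to match your own export; this is illustrative only.
LINE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}/\d{2,4}), "
    r"(?P<time>\d{1,2}:\d{2}\s?(?:AM|PM)?) - "
    r"(?P<sender>[^:]+): (?P<text>.*)$"
)

def parse_chat(path):
    """Return a list of (sender, text) pairs from a WhatsApp export file."""
    messages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            m = LINE_RE.match(line)
            if m:
                messages.append((m["sender"], m["text"]))
            elif messages:
                # Lines that don't match are continuations of the previous message.
                sender, text = messages[-1]
                messages[-1] = (sender, text + "\n" + line)
    return messages

# Example usage (hypothetical file name):
# for sender, text in parse_chat("chat.txt"):
#     print(sender, ":", text)
```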
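For tokenization, the course uses the vendored `minbpe/` package. Here is a minimal sketch of training a byte-level BPE tokenizer with its `BasicTokenizer`; the file path is hypothetical.

```python
from minbpe import BasicTokenizer  # the vendored copy of Karpathy's minbpe

# Train a byte-level BPE tokenizer on the chat text. vocab_size includes the
# 256 base bytes, so 512 means 256 learned merges.
text = open("data/chat.txt", encoding="utf-8").read()  # hypothetical file
tokenizer = BasicTokenizer()
tokenizer.train(text, vocab_size=512)

ids = tokenizer.encode("salam, labas?")          # encode a Darija greeting
assert tokenizer.decode(ids) == "salam, labas?"  # byte-level BPE round-trips
```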
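Pre-training boils down to next-token prediction: at each position the model predicts the following token, and the loss is the cross-entropy between its logits and the sequence shifted left by one. A sketch of a single training step, assuming a PyTorch model that maps token ids to per-position logits (the names here are illustrative, not the repo's exact API):

```python
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One step of next-token prediction on a (batch, seq_len + 1) tensor of token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets = inputs shifted left by one
    logits = model(inputs)                           # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # flatten all positions
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```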
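Finally, LoRA fine-tuning freezes the pre-trained weights and learns a small low-rank update on top of them. A minimal sketch of the idea as a wrapper around `nn.Linear`; the `transformer/` scripts may organize this differently.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(Ax)."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.linear(x) + (x @ self.A.T @ self.B.T) * self.scale

# Example: wrap a layer and train only the LoRA parameters.
layer = LoRALinear(nn.Linear(64, 64))
trainable = [p for p in layer.parameters() if p.requires_grad]  # just A and B
```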
We welcome contributions! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
You can reach me through:
- YouTube – Leave a comment on the videos.
- LinkedIn – Connect with me.
- Email – [email protected].