🤖💬 My Language Models from Scratch


📝 Project Description

Welcome to GPT from Scratch 🤖💬!
The goal of this project is to implement a Transformer model step by step, inspired by the architecture behind GPT (Generative Pre-trained Transformer).

This repository shows how to go from a simple Bigram model ➡️ to a multi-layer Transformer capable of generating text in, for example, French 🇫🇷 and English 🇬🇧.


⚙️ Features

This repository contains two different implementations of language models: a simple Bigram model and a full Transformer model.

🔹 Simple Bigram Model

→ super simple, fast, but no context awareness

  • 🏗️ Architecture: Basic bigram model using only an embedding table of shape (vocab_size × vocab_size)
  • 🎯 Prediction: Each token directly predicts the next one via a lookup table
  • 📚 Dataset: Harry Potter text, character-level encoding
  • ⚙️ Training: 10,000 steps with AdamW optimizer (lr=1e-3)
  • ⚠️ Limitation: No context — each prediction is independent from previous ones
  • Generation: Multinomial sampling over softmax probabilities
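The whole bigram model described above can be sketched in a few lines of PyTorch (a minimal illustration of the idea, not the repository's exact `Bigram.py`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLM(nn.Module):
    """Each token's embedding row directly holds the logits for the next token."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):              # idx: (B, T) token indices
        return self.table(idx)           # logits: (B, T, vocab_size)

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits = self(idx)[:, -1, :]              # only the last token matters
            probs = F.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, num_samples=1)  # multinomial sampling
            idx = torch.cat([idx, nxt], dim=1)
        return idx
```

Note how `generate` never looks further back than one token: that is exactly the "no context" limitation listed above.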

🔹 Transformer Model

→ powerful, context-aware, more expensive to train

  • 🏗️ Architecture: Transformer with multi-head attention + feed-forward networks
  • 🔑 Self-Attention: Key-Query-Value mechanism with causal masking
  • 🧠 Multi-Head Attention: 6 parallel heads (n_head=6)
  • 📏 Positional Encoding: Position embeddings to capture sequential order
  • 🧩 Transformer Blocks: 6 layers (n_layer=6) with residual connections
  • 🧽 Normalization: LayerNorm before each sub-layer
  • 🛡️ Regularization: Dropout (0.2) to reduce overfitting
  • 📏 Extended Context: Block size of 256 tokens vs. 8 in bigram
  • 📚 Dataset: Texts of Victor Hugo
  • Optimizations: GPU/CUDA support, periodic train/val loss evaluation
  • 🤖 Generation: Context-aware text generation
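Collected as code, the hyperparameters listed above might look like this (`n_embd = 384` is an assumed value, not stated in the list; the other numbers come from it):

```python
# Hyperparameters from the feature list above (n_embd is an assumed value)
n_head     = 6      # parallel attention heads
n_layer    = 6      # stacked Transformer blocks
dropout    = 0.2    # regularization against overfitting
block_size = 256    # context window, vs. 8 for the bigram model
n_embd     = 384    # embedding width -- assumption; must be divisible by n_head
head_size  = n_embd // n_head   # per-head dimension
```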

Example Outputs

After just 5,000 iterations 🏋️:

Example of generated text:
L'homme a vie Vient pared »
Et leurs pas, ébranlant les arches colossales, Troublent les morts couchés sous le pavé des salles.
« Oui, nous triomphons ! Venez, sœurs en toutes la foules échoses,
D'où fut notre prend notre tout fincens. Le vent les dérité ! cerf, s'édiffrer leur ma des voix ;
Et le parles mourents sourirs le profondée, Mour !
Mère du bois ils Dieu la vise ent l'air fait des blancs de mains croisées, Triste, tous entière flots que jour passe ;
Il pour leur verra qui son ne ferait cette dans la mière ;
Le jour est tête en jour, ils sont là sour ma nombre,
Ne velous tra. Qu'on noir mon sangla ! Pierme qu'il nous dans les femmes ?
Ils ne s'en vont travailler quinze heures sous dont les tiffles ; Il profond des de l'enfini qui sortes astères,
La femme sous luille avec le noir poit.
Sans le vers main mauglant, filets attenant ; L'horreur bon est comte le vieille ; L'inge nous pare ; le maître ; leurs mes bleaux ; S'il me vol branche l'amour, regarde, et la nuit.
La pauvre montagne homme a Va degrés ! »
Le vol plus à cert la porte fix sont qui le partie : La maine se valle, Pour les bouche pritaint en pleint frilleur,
Et vous êtes l'homme un flot de l'empire à leur bouille !
Qui pourre mon cherveux qui rapportez, dans ce chacun !
Par à peine ces deux enfants, couvres Ainsi qu'un pour toute heure ;
Parvu qu'il elle, frappe elle lible, Temble, à dérans les coiffres qu'un regarde en tremblant son coeur,
Je coupens saint le vert d'enfant mennuit pleur moment.
Son bis non sang qu'on chevaiement plus frappé.
Car vous êtes pous l'ombre de l'amour même ! Vous êtes l'oasis qu'on le luit mour conde et tout fini.
Oui, a regarde et de la petite flamme Son au son aeuil s'aira ces ondeul !
Car dans le borouche, âme en pyréche à la mal !
Couris à la la penit ses cartant pas, L'ondre effroi de sa démon pable à voix !
Si ma triste, S'ai je double qui pleur main et voleur !
Il vit, qui fui voulez : Chantez, ples mortes, Cette foule qui fait ce que mure vous

🔍 Preliminary Results:
With only 5,000 iterations (~4h GPU 💻🔥), the model starts producing French-like words (though not meaningful sentences yet).


⚙️ How it works

  • 📖 Read a text dataset (Victor Hugo or Harry Potter, multilingual)
  • 🔢 Create a mapping between characters ↔ integers
  • 🧩 Build encoders/decoders to switch between text and numbers
  • ✂️ Split the dataset into training (90%) and validation (10%)
  • 📦 Process text into blocks (context windows) and batches
  • 🏗️ Implement a Bigram Language Model
  • 🚀 Train the model using PyTorch (AdamW optimizer)
  • ✨ Generate new text sequences and have fun 😆!!!
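The data-preparation steps above can be sketched as follows (a minimal character-level version; the sample string stands in for a real corpus file):

```python
import torch

text = "To be, or not to be, that is the question."  # stand-in for a real corpus

# Character <-> integer mappings, plus the encoder/decoder pair
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

# 90% training / 10% validation split
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# Sample a batch of context windows; targets are the inputs shifted by one
def get_batch(data, block_size=8, batch_size=4):
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y
```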

🗺️ Schema

Embedding:

Embedding is the keystone of how transformer models "understand" words and their meanings:

Embedding is a way to transform something (text tokens, an image, data...) into a list of numbers that captures its "meaning", not its raw form.

This list of numbers is a vector with many dimensions.

For example, let's take two dimensions, dessertness and sandwichness! For each food we can find coordinates along dessertness and sandwichness and place, for instance, the apple strudel at (0.6, 0.8), because it is a dessert and is a bit packaged like a sandwich:
Embedding_explication

Two pieces of content that mean roughly the same thing will have vectors close to each other. Two unrelated contents will be far apart.

The different dimensions can encode grammar, syntax, or the semantics of words, but they are not chosen by human logic. They are intertwined, with no single dimension mapping to one sense (like dessertness). Through training, the model finds by itself a geometry, an embedding, that captures human creations (text, images...) well. This is why some people say that "we have stopped understanding AI": the dimensions have meaning only for the model (and there are too many of them for a human brain), so we cannot predict an AI's output other than by testing it.

If we apply the same "royalty" transformation vector, we can turn a man into a king, or a woman into a queen; nearby, prince and princess can be found, approximately, with a "child" transformation vector:

E(king) - E(man) + E(woman) ≈ E(queen)

(where E(...) denotes the embedding)
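This arithmetic can be checked with toy vectors (hand-picked 2-D "royalty"/"gender" coordinates, purely illustrative; real embeddings have hundreds of learned dimensions):

```python
import torch

# Toy 2-D embeddings: dimension 0 ~ "royalty", dimension 1 ~ "gender"
E = {
    "man":   torch.tensor([0.0,  1.0]),
    "woman": torch.tensor([0.0, -1.0]),
    "king":  torch.tensor([1.0,  1.0]),
    "queen": torch.tensor([1.0, -1.0]),
}

# E(king) - E(man) + E(woman): swap the gender, keep the royalty
result = E["king"] - E["man"] + E["woman"]

# Nearest word to the resulting vector
nearest = min(E, key=lambda w: torch.norm(E[w] - result))
```

With these toy coordinates the result lands exactly on "queen"; with real learned embeddings the relation only holds approximately.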

Here is a diagram showing the properties of embeddings (addition, subtraction...). It is a geometric effect that the model learns statistically: Embedding_explication_vector

To learn this, the model backpropagates the global error through the whole transformer architecture, including the embedding table!

Here is an example of training : Train embedding

Positional Encoding:

A Transformer treats a sentence as a set of tokens processed in parallel, not as a sequence.
The positional encoder is there to inject the notion of order into the token representations. So the input of the LLM is:

Input_i = Embedding(token_i) + PositionalEncoding(i)

(where i is the position index)

In the original "Attention Is All You Need" paper, positions are encoded with sines and cosines at different frequencies. Here is the calculation:

PE(pos, 2i) = sin(pos / 10000^(2i / d))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d))

With :

  • pos : token position (0, 1, 2, …)
  • i = dimension index
  • d = total dimension of embedding
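These two formulas translate directly into PyTorch (a sketch; the function name is illustrative):

```python
import torch

def positional_encoding(max_len, d):
    """Sinusoidal positional encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = torch.arange(max_len).unsqueeze(1).float()   # (max_len, 1)
    two_i = torch.arange(0, d, 2).float()              # 2i for each sin/cos pair
    freq = 1.0 / (10000 ** (two_i / d))                # one frequency per pair
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(pos * freq)                # even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)                # odd dimensions
    return pe                                          # (max_len, d)
```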

This is useful because these functions are:

  • continuous,
  • bounded (no numerical explosion),
  • periodic,
  • allow us to express a position shift as a simple transformation

For the i-th dimension, the frequency is:

f = 1 / 10000^(2i / d)

  • If i is a low dimension index, f is high!
  • If i is a high dimension index, f is low!

Then:

  • If f is high, the variation is fast and 2 consecutive positions look very different.
  • If f is low, the variation is slow and 2 consecutive positions look very similar.

The same position gap will span just 1 or 2 tokens at a high frequency, but an entire paragraph at a low frequency.

This means that the first dimensions "look at" short-term relations and the last dimensions at long-term relations.
Furthermore, the progression is exponential (not linear), to cover a good range: many dimensions for the short and mid term, and non-redundant long-term dimensions.
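The fast/slow intuition can be checked numerically (d = 64 is an arbitrary choice for illustration):

```python
import math

d = 64  # illustrative embedding width

def freq(i):
    """Frequency of the i-th sin/cos pair: f = 1 / 10000^(2i/d)."""
    return 1.0 / 10000 ** (2 * i / d)

# Low index -> high frequency, high index -> low frequency
fast, slow = freq(0), freq(31)

# Period in positions (2*pi / f): how many tokens one full cycle spans
period_fast = 2 * math.pi / fast   # a handful of tokens
period_slow = 2 * math.pi / slow   # tens of thousands of positions
```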


Sin & Cos schema

We compute both sin and cos to have an angle, which adds a direction: if the value increases, is the position before or after?
It gives a bijective representation of the angle, a unique point on the unit circle!

Even if two positions collide at the same angle on one scale, the collision is absorbed by the multiple scales (short, mid and long term).

The position (PE, Positional Encoding) is then added to the embedding (E, the token's meaning):

Input = E + PE

With:

E(t) = (e0, e1, …, e(d−1)) ∈ R^d

PE(pos) = (p0, p1, …, p(d−1)) ∈ R^d

  • With:
    • p0 = sin(θ0)
    • p1 = cos(θ0)
    • p2 = sin(θ1)
    • p3 = cos(θ1)
    • ...

Input = (e0 + sin(θ0), e1 + cos(θ0), e2 + sin(θ1), e3 + cos(θ1)...)

With this method the position "pollutes" (is mixed into) the embedding, but the transformer learns to work with it!

Multi-Head Attention:

In a Transformer, the core mechanism is attention. The attention mechanism is built around three vectors derived from the input: Q (Query), K (Key), and V (Value). They control how each token (word, sub-word, etc.) focuses on others in the same sequence.

Here you can find the schema of a Transformer Model :
(Follow the red numbers to better locate each part of the schema!) Transformer Schema

Here is a cool schema I found! It is a really clear explanation of the different dimensions for one head: Dimension Schema

What do they mean?

Q = Query → What am I looking for? The question a token asks to find relevant context.

K = Key → What do I contain? A label that represents what kind of information a token holds.

V = Value → What do I offer? The actual information content that can be shared if attended to.

Another way to explain it:

  • Q asks a question: “Who in the sequence can help me?” (For example: are there any adjectives around me?)

  • K provides an identity: “I can help if you need context about X.” (For example: yes, I'm an adjective!)

  • V provides content: “Here’s what I can contribute.” (For example: I can say that something in the sentence is blue.)

For each matrix product (Q × K at step 2, and A × V at step 3): QxK

We apply softmax to the scaled scores QK^T / sqrt(d_k) to obtain A (the scores become probabilities).

AxV
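One attention head with the causal mask could be sketched like this (an illustration of the mechanism, not the repository's exact `Transformer.py`):

```python
import torch
import torch.nn.functional as F

def attention_head(x, Wq, Wk, Wv):
    """x: (B, T, C) token representations. Wq/Wk/Wv: (C, head_size) projections."""
    B, T, C = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv                           # (B, T, head_size)
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)    # QK^T / sqrt(d_k)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))      # causal masking:
    scores = scores.masked_fill(~mask, float("-inf"))          # no peeking ahead
    A = F.softmax(scores, dim=-1)                              # scores -> probabilities
    return A @ v                                               # weighted sum of values
```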

Add & Norm: Add & Norm is there so the model doesn't "forget" the initial input:

  • Add: we add the embedding obtained after multi-head attention to the initial input (a residual connection)
  • Norm: then we normalize the layer. Centering and rescaling the values stabilizes and accelerates convergence.
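A pre-norm residual wrapper (the "LayerNorm before each sub-layer" variant from the features list) might look like this (names are illustrative):

```python
import torch
import torch.nn as nn

class ResidualNorm(nn.Module):
    """Pre-norm residual: x + sublayer(LayerNorm(x))."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.sublayer = sublayer   # e.g. attention or feed-forward

    def forward(self, x):
        # "Add": the original signal always passes through untouched
        return x + self.sublayer(self.ln(x))
```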

Feed Forward:
In a Transformer, the feed-forward layer is just a small neural network applied independently to each token. It transforms the token’s representation through linear and non-linear operations, helping the model capture more complex relationships after attention has redistributed information.
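A sketch of such a feed-forward layer, using the common 4× inner expansion (the expansion factor is an assumption, not stated above):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: expand 4x, apply a non-linearity, project back."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):   # applied independently at every position
        return self.net(x)
```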


📂 Repository structure

├── img/           # For the README.md
│
├── text/          # Training corpora (Victor Hugo, Harry Potter, …)
│
├── Bigram.py      # Bigram model + first experiments  
├── LICENSE
├── README.md
├── Transformer.py # Full Transformer implementation  
---
## 💻 Run it on Your PC  
Clone the repository and install dependencies:  
```bash
git clone https://github.com/Thibault-GAREL/Language_Models.git
cd Language_Models
pip install torch
# Tested on Linux 🐧!
# For macOS 🍎 / Windows 🪟, see the PyTorch install documentation
```

Next, you can run the Bigram model:

```bash
python Bigram.py
```

Or the Transformer model:

```bash
python Transformer.py
```

📖 Inspiration / Sources

This project is based on:

For the illustration:

Code created by me 😎, Thibault GAREL - GitHub
