Welcome to GPT from Scratch 🤖💬 !
The goal of this project is to implement a Transformer model step by step, inspired by the architecture behind GPT (Generative Pre-trained Transformer).
This repository shows how to go from a simple Bigram model ➡️ to a multi-layer Transformer capable of generating text in French 🇫🇷 and English 🇬🇧 for example.
This repository contains two different implementations of language models: a simple Bigram model and a full Transformer model.
**Bigram model** → super simple, fast, but no context awareness
- 🏗️ Architecture: Basic bigram model using only an embedding table (`token → vocab_size`)
- 🎯 Prediction: Each token directly predicts the next one via a lookup table
- 📚 Dataset: Harry Potter text, character-level encoding
- ⚙️ Training: 10,000 steps with the AdamW optimizer (`lr=1e-3`)
- ⚠️ Limitation: No context: each prediction is independent of the previous ones
- ✨ Generation: Multinomial sampling over softmax probabilities
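The lookup-table idea above can be sketched in a few lines of PyTorch. This is a minimal illustration of the bigram principle, not the repository's exact code (names like `BigramLanguageModel` mirror Karpathy's tutorial):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# Minimal bigram sketch: a (vocab_size x vocab_size) embedding table where
# row i holds the logits for the token that follows token i.
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        # idx: (B, T) tensor of token ids -> logits: (B, T, vocab_size)
        return self.token_embedding_table(idx)

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits = self(idx)[:, -1, :]       # only the last token matters
            probs = F.softmax(logits, dim=-1)  # logits -> probabilities
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

model = BigramLanguageModel(vocab_size=65)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=10)
print(out.shape)  # torch.Size([1, 11])
```

Note how `generate` only ever looks at the last token: that is exactly the "no context" limitation.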
**Transformer model** → powerful, context-aware, more expensive to train
- 🏗️ Architecture: Transformer with multi-head attention + feed-forward networks
- 🔑 Self-Attention: Key-Query-Value mechanism with causal masking
- 🧠 Multi-Head Attention: 6 parallel heads (`n_head=6`)
- 📏 Positional Encoding: Position embeddings to capture sequential order
- 🧩 Transformer Blocks: 6 layers (`n_layer=6`) with residual connections
- 🧽 Normalization: LayerNorm before each sub-layer
- 🛡️ Regularization: Dropout (`0.2`) to reduce overfitting
- 📏 Extended Context: Block size of `256` tokens vs. `8` in the bigram model
- 📚 Dataset: Texts by Victor Hugo
- ⚡ Optimizations: GPU/CUDA support, periodic train/val loss evaluation
- 🤖 Generation: Context-aware text generation
After just 5,000 iterations 🏋️:
| Description | Example |
|---|---|
| Generated Text | L'homme a vie Vient pared » Et leurs pas, ébranlant les arches colossales, Troublent les morts couchés sous le pavé des salles. « Oui, nous triomphons ! Venez, sœurs en toutes la foules échoses, D'où fut notre prend notre tout fincens. Le vent les dérité ! cerf, s'édiffrer leur ma des voix ; Et le parles mourents sourirs le profondée, Mour ! Mère du bois ils Dieu la vise ent l'air fait des blancs de mains croisées, Triste, tous entière flots que jour passe ; Il pour leur verra qui son ne ferait cette dans la mière ; Le jour est tête en jour, ils sont là sour ma nombre, Ne velous tra. Qu'on noir mon sangla ! Pierme qu'il nous dans les femmes ? Ils ne s'en vont travailler quinze heures sous dont les tiffles ; Il profond des de l'enfini qui sortes astères, La femme sous luille avec le noir poit. Sans le vers main mauglant, filets attenant ; L'horreur bon est comte le vieille ; L'inge nous pare ; le maître ; leurs mes bleaux ; S'il me vol branche l'amour, regarde, et la nuit. La pauvre montagne homme a Va degrés ! » Le vol plus à cert la porte fix sont qui le partie : La maine se valle, Pour les bouche pritaint en pleint frilleur, Et vous êtes l'homme un flot de l'empire à leur bouille ! Qui pourre mon cherveux qui rapportez, dans ce chacun ! Par à peine ces deux enfants, couvres Ainsi qu'un pour toute heure ; Parvu qu'il elle, frappe elle lible, Temble, à dérans les coiffres qu'un regarde en tremblant son coeur, Je coupens saint le vert d'enfant mennuit pleur moment. Son bis non sang qu'on chevaiement plus frappé. Car vous êtes pous l'ombre de l'amour même ! Vous êtes l'oasis qu'on le luit mour conde et tout fini. Oui, a regarde et de la petite flamme Son au son aeuil s'aira ces ondeul ! Car dans le borouche, âme en pyréche à la mal ! Couris à la la penit ses cartant pas, L'ondre effroi de sa démon pable à voix ! Si ma triste, S'ai je double qui pleur main et voleur ! Il vit, qui fui voulez : Chantez, ples mortes, Cette foule qui fait ce que mure vous |
🔍 Preliminary Results:
With only 5,000 iterations (~4h GPU 💻🔥), the model starts producing French-like words (though not meaningful sentences yet).
- 📖 Read a text dataset (Victor Hugo or Harry Potter, multilingual)
- 🔢 Create a mapping between characters ↔ integers
- 🧩 Build encoders/decoders to switch between text and numbers
- ✂️ Split the dataset into training (90%) and validation (10%)
- 📦 Process text into blocks (context windows) and batches
- 🏗️ Implement a Bigram Language Model
- 🚀 Train the model using PyTorch (`AdamW` optimizer)
- ✨ Generate new text sequences and have fun 😆!!!
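The data-pipeline steps above can be sketched as follows. This is an illustrative mini-example (the tiny `text` string stands in for the real corpus files, and the variable names follow Karpathy's tutorial conventions):

```python
import torch

# Stand-in for the Victor Hugo / Harry Potter corpus
text = "hello world"

# Character <-> integer mappings
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> int
itos = {i: ch for ch, i in stoi.items()}       # int -> char
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join(itos[i] for i in l)

# 90% train / 10% validation split
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# Blocks (context windows) and batches
block_size = 4   # 256 in the Transformer, 8 in the bigram model
batch_size = 2
ix = torch.randint(len(train_data) - block_size, (batch_size,))
x = torch.stack([train_data[i:i + block_size] for i in ix])          # inputs
y = torch.stack([train_data[i + 1:i + block_size + 1] for i in ix])  # targets

print(decode(encode("hello")))  # round-trip: hello
```

The targets `y` are simply the inputs `x` shifted by one position: predicting the next character at every offset of the block.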
Embedding is the keystone of how Transformer models "understand" words and their meaning:
An embedding transforms something (text tokens, an image, data...) into a list of numbers that captures its "meaning", not its raw form.
This list of numbers is a vector with many dimensions.
For example, take two dimensions: dessertness and sandwichness! For each food we can assign coordinates along these two axes and place, for instance, apple strudel at (0.6, 0.8), because it is a dessert and it is packaged a bit like a sandwich:

Two pieces of content that mean roughly the same thing will have vectors close to each other. Two unrelated contents will be far apart.
The dimensions can correspond to grammar, syntax, or word semantics, but they are not chosen by human logic: they are a mix, intertwined, not one dimension per concept (like dessertness). Through training, the model has found by itself a geometry, an embedding, that captures human creations (text, images...) well. This is why some people say that "we have stopped understanding AI": the dimensions only make sense to the model (and there are too many of them for the human brain), so we can't predict an AI's output other than by testing it.
If we apply the same "royalty" transformation vector, we can turn a man into a king, or a woman into a queen; nearby, prince and princess can be reached, approximately, with a "child" transformation vector:
E(king) - E(man) + E(woman) ≈ E(queen)
(where E(...) denotes the embedding)
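As a toy illustration of this vector arithmetic (made-up 2-D coordinates chosen by hand, nothing like real learned embeddings):

```python
import numpy as np

# Hand-picked toy "embeddings": dimension 0 ~ royalty, dimension 1 ~ femininity.
# Real models learn hundreds of entangled dimensions instead.
E = {
    "man":   np.array([1.0, 0.1]),
    "woman": np.array([1.0, 0.9]),
    "king":  np.array([3.0, 0.1]),
    "queen": np.array([3.0, 0.9]),
}

# E(king) - E(man) + E(woman) should land on E(queen)
v = E["king"] - E["man"] + E["woman"]
print(np.allclose(v, E["queen"]))  # True with these toy values
```

Subtracting "man" removes maleness while keeping royalty; adding "woman" puts femininity back, so the result coincides with "queen" in this toy geometry.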
Here is a diagram that shows the properties of embeddings (addition, subtraction...). It is a geometric effect that the model learns statistically :

To learn this, the model backpropagates the global error through the Transformer architecture, including the embedding itself!
Here is an example of training :

A Transformer processes a sentence as a set of tokens in parallel, not as an ordered sequence.
The positional encoder is there to inject the notion of order into the token representations.
So the input of the LLM is:
Input_i = Embedding(token_i) + PositionalEncoding(i)
(with i the position index)
In the original "Attention Is All You Need" paper, positions are encoded with sines and cosines at different frequencies.
Here is the formula:
PE(pos, 2i) = sin(pos / 10000^(2i / d))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d))
With :
- pos : token position (0, 1, 2, …)
- i = dimension index
- d = total dimension of embedding
This is useful, because these functions are:
- continuous,
- bounded (no numerical explosion),
- periodic,
- allow us to express a position shift as a simple transformation
For dimension i, the frequency is:
f = 1 / 10000^(2i / d)
- If i is a low dimension, f is high!
- If i is a high dimension, f is low!
Then:
- If f is high, the variation is fast and two consecutive positions look very different.
- If f is low, the variation is slow and two consecutive positions look very similar.
The same position gap spans, at a high frequency, just 1 or 2 tokens but, at a low frequency, an entire paragraph.
This means the first dimensions "look at" short-term relations and the last dimensions at long-term relations.
Furthermore, the progression is exponential, not linear, to cover a good range: many dimensions for the short and mid term, plus non-redundant long-term dimensions.
We compute both sin and cos to get an angle, and therefore a direction: if the value increases, is the position before or after?
The pair gives a bijective representation of the angle, a unique point on the unit circle!
Even if two positions collide at the same angle on one scale, the collision is absorbed by the multiple scales (short, mid and long term).
The positional encoding (PE) is then added to the embedding (E, the token's meaning):
Input = E + PE
With:
E(t) = (e0, e1, …, e(d−1)) ∈ R^d
PE(pos) = (p0, p1, …, p(d−1)) ∈ R^d
where:
- p0 = sin(θ0)
- p1 = cos(θ0)
- p2 = sin(θ1)
- p3 = cos(θ1)
- ...
Input = (e0 + sin(θ0), e1 + cos(θ0), e2 + sin(θ1), e3 + cos(θ1)...)
With this method the position information "pollutes" the embedding, the two are mixed together, but the Transformer learns to work with it!
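The sinusoidal formulas above can be implemented directly (a NumPy sketch of the "Attention Is All You Need" encoding; the function name is illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1): token positions
    i = np.arange(d // 2)[None, :]          # (1, d/2): one frequency per sin/cos pair
    angles = pos / (10000 ** (2 * i / d))   # low i -> high frequency, high i -> low
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)            # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)            # odd dimensions: cos
    return pe

# Input to the Transformer = token embedding + positional encoding
d = 8
embeddings = np.random.randn(16, d)         # 16 tokens, values illustrative
x = embeddings + positional_encoding(16, d)
print(x.shape)  # (16, 8)
```

Every value of PE stays in [−1, 1] (bounded, no numerical explosion), and at position 0 the encoding is (0, 1, 0, 1, …) since sin(0) = 0 and cos(0) = 1.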
In a Transformer, the core mechanism is attention. The attention mechanism is built around three vectors derived from the input: Q (Query), K (Key), and V (Value). They control how each token (word, sub-word, etc.) focuses on others in the same sequence.
Here you can find the schema of a Transformer Model :
(Follow the red numbers to better locate each part of the schema!)

Here is a cool schema I found! It is a really clear explanation of the different dimensions for one head:

Q = Query → What am I looking for? The question a token asks to find relevant context.
K = Key → What do I contain? A label that represents what kind of information a token holds.
V = Value → What do I offer? The actual information content that can be shared if attended to.
- Q asks a question: "Who in the sequence can help me?" (For example: are there any adjectives around me?)
- K provides an identity: "I can help if you need context about X." (For example: yes, I'm an adjective!)
- V provides content: "Here's what I can contribute." (For example: I can say that something in the sentence is blue)
For each matrix product (Q × Kᵀ for step 2 and A × V for step 3):
We apply a softmax to obtain A, i.e. we turn the scaled scores QKᵀ/√d_k into probabilities.
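Putting the Q, K, V pieces together, here is a sketch of scaled dot-product attention with the causal mask (single head, single sequence; simplified relative to the repo's batched implementation):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, causal=True):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # (T, T) similarity scores
    if causal:
        T = scores.size(-1)
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
        scores = scores.masked_fill(~mask, float('-inf'))  # hide future tokens
    A = F.softmax(scores, dim=-1)                        # scores -> probabilities
    return A @ V                                         # weighted sum of values

T, d_k = 4, 8
Q, K, V = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # torch.Size([4, 8])
```

With the causal mask, the first token can only attend to itself, so its output is exactly its own value vector V[0]; later tokens mix the values of everything before them.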
Add & Norm: the Add & Norm step is here so the model doesn't "forget" the initial input:
- Add: we add the embedding produced by multi-head attention to the initial input (a residual connection)
- Norm: then we normalize the layer (LayerNorm). This centers and rescales the values, which stabilizes and accelerates convergence.
Feed Forward:
In a Transformer, the feed-forward layer is just a small neural network applied independently to each token. It transforms the token’s representation through linear and non-linear operations, helping the model capture more complex relationships after attention has redistributed information.
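A whole block (pre-LayerNorm, attention, residual Add, feed-forward) can be sketched like this. Note this uses PyTorch's built-in `nn.MultiheadAttention` for brevity (the repo builds its own heads), and the causal mask is omitted here; `n_head=6` and `dropout=0.2` match the values quoted above, while `n_embd=48` is an arbitrary choice divisible by 6:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One Transformer block: pre-LayerNorm, self-attention, residuals, FFN."""
    def __init__(self, n_embd, n_head, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)                # Norm before attention
        self.attn = nn.MultiheadAttention(n_embd, n_head,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)                # Norm before feed-forward
        self.ffwd = nn.Sequential(                     # small per-token MLP
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)                      # self-attention: Q = K = V = h
        x = x + a                                      # Add (residual connection)
        x = x + self.ffwd(self.ln2(x))                 # Add (residual connection)
        return x

block = Block(n_embd=48, n_head=6)
x = torch.randn(2, 10, 48)                             # (batch, tokens, embedding dim)
print(block(x).shape)  # torch.Size([2, 10, 48])
```

The feed-forward network sees each token independently: it widens the representation 4×, applies a non-linearity, and projects back, while the residual paths keep the original signal flowing through.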
```
├── img/             # For the README.md
│
├── text/            # Training corpora (Victor Hugo, Harry Potter, …)
│
├── Bigram.py        # Bigram model + first experiments
├── LICENSE
├── README.md
└── Transformer.py   # Full Transformer implementation
```

---
## 💻 Run it on Your PC
Clone the repository and install dependencies:
```bash
git clone https://github.com/Thibault-GAREL/Language_Models.git
cd Language_Models
pip install torch
# Works on Linux 🐧, macOS 🍎 and Windows 🪟
```
Next, you can run the Bigram model:
```bash
python Bigram.py
```
Or the Transformer model:
```bash
python Transformer.py
```

This project is based on:
- 🎥 Andrej Karpathy – Let's build GPT from scratch
- 📄 The scientific paper "Attention is All You Need"
- 🧠 OpenAI’s GPT-2 / GPT-3 and nanoGPT
- The training gif for embedding : Gif site
- A video from 3Blue1Brown : Attention in transformers
Code created by me 😎, Thibault GAREL - Github


