Modern LLM Notebook — 从零实现大语言模型的完整教程

21 Jupyter Notebooks · 手写核心算法 · 从 Tokenizer 到 On-Policy Distillation

Learning Map | Quick Start | Notebook Index | Papers Covered | Real-World Models

"The missing textbook for modern LLMs."

这不是另一份「调用 GPT API」的教程。这是一份从零实现大模型核心组件的实战指南。每个 Part 遵循 直觉理解 -> 手算验证 -> 代码实现 -> 实验观察 的教学循环。你会亲手写出 BPE Tokenizer、Multi-Head Attention、MoE Router、RLHF PPO、Speculative Decoding、VLM Cross-Attention。

Last updated: 2026-05

Learning Map

                        Modern LLM Full Stack
                              │
    ┌─────────────────────────┼─────────────────────────┐
    │                         │                         │
    ▼                         ▼                         ▼
┌──────────┐          ┌──────────────┐          ┌──────────────┐
│ Part 1   │          │  Part 2       │          │  Part 3       │
│ Foundation│ ───────>│  Training     │ ───────>│  Inference    │
│ 01-04    │          │  05-12        │          │  13-15        │
└──────────┘          └──────────────┘          └──────────────┘
  Tokenizer                                          KV Cache
  BPE                     RMSNorm / SwiGLU           FlashAttention
  Embedding               MoE Router                 vLLM / PagedAttn
  Position Encoding       BERT / MLM                 Speculative Dec.
  Mini-GPT                Scaling Laws               Beam Search
                          Data Pipeline
                          LoRA / RLHF / DPO
                              │
                    ┌─────────┴─────────┐
                    ▼                   ▼
             ┌──────────────┐   ┌──────────────┐
             │  Part 4       │   │  Part 5       │
             │  Frontiers    │   │  Production   │
             │  16-18        │   │  19-21        │
             └──────────────┘   └──────────────┘
               Long Context       Evaluation
               CoT / Thinking     Distillation
               VLM (Flamingo)     On-Policy Distillation

每个 Notebook 都是自包含的——可以按需跳转到任何 Part，不依赖前序 Notebook 的运行时状态。

Quick Start

git clone https://github.com/walkinglabs/modern-llm-notebook.git
cd modern-llm-notebook
pip install -r requirements.txt
jupyter notebook notebooks/part1-foundation/01-tokenizer-basics.ipynb

要求: Python 3.9+, PyTorch 2.0+, 16GB RAM。大部分 Notebook 在 CPU 上即可运行，部分训练章节建议使用 GPU。

网页端阅读器 (Optional Web Viewer)

如果你想通过更美观的网页界面阅读这些 Notebook，我们提供了一个基于 React 和 Vite 的网页端阅读器：

# 安装网页端依赖
npm install

# 启动开发服务器（会自动转换 Notebook 并在浏览器打开）
npm run dev

# 或者构建静态页面并预览
npm run build
npm run preview

Notebook Index

Part 1: Foundation — 基础组件

从 Tokenizer 到 Mini-GPT，理解一个 GPT 模型从输入文本到输出 logits 的完整数据流。

#	Notebook	核心内容	手写实现
01	Tokenizer Basics	为什么需要 Tokenizer？字符级/词级分词	`CharTokenizer`, `WordTokenizer`
02	BPE Tokenizer	BPE 训练/编码/解码，merge rules 可视化	`BPETokenizer` 完整实现
03	Embedding & Position	Token Embedding + Sinusoidal Position Encoding	`TokenEmbedding`, t-SNE 可视化
04	Mini-GPT	从零组装一个 GPT 模型	`MultiHeadAttention`, `TransformerBlock`, `MiniGPT`

Part 2: Training — 训练管线

从架构优化到人类对齐，掌握完整的训练管线。

#	Notebook	核心内容	手写实现
05	Architecture Refinements	LLaMA 的改进: RMSNorm, SwiGLU, RoPE, Pre-Norm	`RMSNorm`, `FeedForward_SwiGLU`, `LLaMABlock`
06	Mixture of Experts	MoE 路由机制、top-k 选择、负载均衡	`MoELayer`, Router Gate
07	BERT Encoder	Encoder-only 架构、双向注意力、MLM	`MiniBERT`, 分类头
08	Training & Loss	训练循环、loss 曲线、梯度累积	完整训练循环
09	Scaling Laws	Kaplan -> Chinchilla -> 过度训练, FLOPs 估算	C~~6PD, M~~20P
10	Data Engineering	HTML 清洗、质量过滤、MinHash 去重、数据混合	SHA256/MinHash 去重
11	LoRA	低秩适应、A*B 分解、merge 推理	`LoraLinear`, `apply_lora_to_attention`
12	RLHF Alignment	Reward Model、PPO Clip、DPO	Bradley-Terry loss, PPO clip, DPO loss

Part 3: Inference — 推理优化

掌握 LLM 推理加速的全部核心技术。

#	Notebook	核心内容	手写实现
13	Generation	Greedy, Temperature, Top-K, Top-P, Beam Search	`generate_greedy`, `top_p_filter`, `beam_search`
14	Inference Acceleration	KV Cache, FlashAttention, vLLM/PagedAttention	`AttentionWithKVCache`
15	Speculative Decoding	Draft Model -> Target Model 验证, Medusa	`speculative_accept`

Part 4: Frontiers — 前沿技术

2024-2025 年 LLM 的前沿方向。

#	Notebook	核心内容	手写实现
16	Long Context	RoPE 频率分析、PI、NTK、YaRN	`ExtrapolatableRoPE`, Needle-in-Haystack
17	CoT & Thinking	Chain-of-Thought, Self-Consistency, 思维链训练	`generate_coldstart_data`, RL reward function
18	Vision-Language Models	Patch Embedding, Cross-Attention, Flamingo Gating	`PatchEmbedding`, `FlamingoGatedCrossAttnBlock`

Part 5: Production — 工程落地

评测、压缩、部署——把模型推向生产。

#	Notebook	核心内容	手写实现
19	Evaluation	lm-eval, LLM-as-Judge, 5 种复合评分方法	雷达图、胜率矩阵、RAGAS
20	Distillation	Logit 蒸馏、数据蒸馏、特征蒸馏	温度对软标签的影响
21	On-Policy Distillation	Exposure Bias, Forward/Reverse KL, k1/k2/k3 估计器	OPSD, 21 篇论文分类法

Papers Covered

本教程直接对应以下论文的核心算法，每个都手写了实现或模拟：

论文	Notebook	实现内容
Attention Is All You Need (Vaswani et al., 2017)	04	Multi-Head Attention, Sinusoidal PE
BERT (Devlin et al., 2019)	07	Masked LM, Next Sentence Prediction
LLaMA (Touvron et al., 2023)	05	RMSNorm, SwiGLU, RoPE, Pre-Norm
Scaling Laws (Kaplan et al., 2020)	09	C ~ 6PD, compute-optimal training
Chinchilla (Hoffmann et al., 2022)	09	Data-optimal scaling, over-training
LoRA (Hu et al., 2022)	11	Low-Rank Adaptation, A*B decomposition
RLHF / PPO (Ouyang et al., 2022)	12	Reward Model, PPO clip, KL penalty
DPO (Rafailov et al., 2023)	12	Direct Preference Optimization loss
FlashAttention (Dao et al., 2022)	14	Tiling, SRAM-aware computation
vLLM (Kwon et al., 2023)	14	PagedAttention, memory sharing
Speculative Decoding (Leviathan et al., 2023)	15	Draft-then-verify, acceptance ratio
RoPE (Su et al., 2023)	16	Rotary Position Embedding, frequency analysis
YaRN (Peng et al., 2023)	16	NTK-aware + temperature tuning
Chain-of-Thought (Wei et al., 2022)	17	Few-shot CoT, Self-Consistency
DeepSeek-R1 (DeepSeek, 2025)	17	Thinking model training, RL for reasoning
Flamingo (Alayrac et al., 2022)	18	Cross-attention with tanh gating
LLaVA (Liu et al., 2023)	18	Vision projector, freeze strategy
RAGAS (Es et al., 2023)	19	Faithfulness, relevance metrics
LLM-as-Judge (Zheng et al., 2023)	19	MT-Bench, win rate evaluation
Knowledge Distillation (Hinton et al., 2015)	20	Logit distillation, temperature, dark knowledge
On-Policy Distillation (2024-2025)	21	Exposure bias, OPSD, KL estimation taxonomy

每个 Notebook 内部还引用了更多相关论文。

Real-World Models

教程中的实现直接对应以下真实模型的设计决策：

模型	关联技术	Notebook
GPT-4 / GPT-4o	Decoder-only, RLHF, Speculative Decoding	04, 12, 15
LLaMA 3	RMSNorm, SwiGLU, RoPE, Pre-Norm	05
Mixtral	Sparse MoE, Top-2 Routing	06
DeepSeek-V3 / R1	MoE, Multi-Head Latent Attention, Thinking Models	06, 17
Qwen2.5	GQA, Long Context (YaRN), Data Pipeline	10, 16
Gemini	VLM, Multi-modal fusion	18
Claude	RLHF, Constitutional AI, Thinking	12, 17
Phi-3	Data Quality, Distillation	10, 20

教学特色

手算验证。 每个核心算法先用具体数字手动计算一遍，确保理解每一步的数学含义，再用代码实现。

# 示例 — 来自 06-moe:
Input x = [1.0, 0.5]
Router weights = [[0.8, 0.2], [0.3, 0.7]]
Gate logits = x @ Router = [0.9, 1.7]
Top-2 mask -> Expert 0 and Expert 1 activated
-> 你理解了 Router 到底做了什么，不只是调了个 API。

从零实现。 所有实现仅依赖 PyTorch (torch.nn + torch.nn.functional)，不使用 transformers 等封装库。

实验驱动。 每个模块都有实验环节——改变温度看分布变化、增加专家数看路由模式、调整 RoPE 频率看外推效果。

Contributing

欢迎贡献！详见 CONTRIBUTING.md。

修复 Bug 或过时的 API 调用
改进解释、增加图示
英文翻译
新 Notebook（Mamba, Jamba, Liquid Models 等）

License

This course is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

_{Maintained by walkinglabs · If this helps you learn LLMs, consider giving it a star.}

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.claude		.claude
.github/workflows		.github/workflows
notebooks		notebooks
scripts		scripts
web		web
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
intro.md		intro.md
myst.yml		myst.yml
package.json		package.json
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Modern LLM Notebook — 从零实现大语言模型的完整教程

Learning Map

Quick Start

网页端阅读器 (Optional Web Viewer)

Notebook Index

Part 1: Foundation — 基础组件

Part 2: Training — 训练管线

Part 3: Inference — 推理优化

Part 4: Frontiers — 前沿技术

Part 5: Production — 工程落地

Papers Covered

Real-World Models

教学特色

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Modern LLM Notebook — 从零实现大语言模型的完整教程

Learning Map

Quick Start

网页端阅读器 (Optional Web Viewer)

Notebook Index

Part 1: Foundation — 基础组件

Part 2: Training — 训练管线

Part 3: Inference — 推理优化

Part 4: Frontiers — 前沿技术

Part 5: Production — 工程落地

Papers Covered

Real-World Models

教学特色

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages