diff --git a/PROJECT1_CPU_OPT_REPORT.md b/PROJECT1_CPU_OPT_REPORT.md
new file mode 100644
index 000000000..2b9a3fedc
--- /dev/null
+++ b/PROJECT1_CPU_OPT_REPORT.md
@@ -0,0 +1,86 @@
+# 项目#1 CPU 推理优化笔记（LLAISYS）
+
+## 1. 一页结论
+- 优化对象：`Ops.linear`（CPU 上最耗时，等价 GEMM）。
+- 主线策略：`朴素循环 -> SIMD+OpenMP -> OpenBLAS`。
+- 结果：`f32` 从 `6173.788 ms` 降到 `253.513 ms`，累计约 `24.35x`。
+- 扩展：`f16/bf16` 已接入专用快速路径，不再在最内层循环频繁类型转换。
+
+## 2. 测试口径（统一）
+- 平台：Windows x64, MSVC 19.50。
+- 代码：`llaisys`。
+- Python：`./.venv`。
+- 线程：`OMP_NUM_THREADS=8`。
+- 形状：`x=(512,4096)`, `w=(4096,4096)`, `bias=(4096,)`, `out=(512,4096)`。
+- benchmark：`warmup=1`, `repeat=3`。
+- 说明：Baseline 为历史记录；其余为同机复测。
+
+## 3. 里程碑与改动
+### A. Baseline（优化前）
+- 三重循环逐元素乘加。
+- 无 SIMD，无 BLAS。
+
+### B. SIMD + OpenMP
+- 文件：`src/ops/linear/op.cpp`。
+- `f32` 点积内核：`_mm256_loadu_ps` + `_mm256_fmadd_ps`。
+- 外层 `m` 维并行：`#pragma omp parallel for`。
+- 不支持 AVX2 时自动回退标量路径。
+
+### C. OpenBLAS
+- 文件：`src/ops/linear/op.cpp`。
+- `ENABLE_OPENBLAS` 下，`f32` 调用 `cblas_sgemm` 完成主计算，再加 bias。
+- 文件：`xmake.lua`、`xmake/cpu.lua`。
+- 新增构建开关：`--openblas=y|n`。
+
+### D. `f16/bf16` 专用高性能路径（已完成）
+- 文件：`src/ops/linear/op.cpp`。
+- 新增 `linear_impl_lowp_fast<T>`。
+- `weight/bias` 先批量转 `float`，避免最内层重复 `cast`。
+- OpenBLAS 开启时：`in` 也批量转 `float` 后走 `sgemm`。
+- OpenBLAS 关闭时：每线程复用 `in_row_f`，继续复用 `dot_f32`。
+
+## 4. 性能对比
+### 4.1 `f32` 分阶段结果
+| 阶段 | LLAISYS (ms) | 说明 |
+|---|---:|---|
+| A. Baseline（历史） | 6173.788 | 优化前 |
+| B. SIMD + OpenMP（复测） | 266.430 | `--openblas=n --openmp=y --cpu-avx2=y` |
+| C. OpenBLAS（复测） | 253.513 | `--openblas=y --openmp=y --cpu-avx2=y` |
+
+对应 Torch（同测参考）：
+- B 阶段 Torch：`56.534 ms`
+- C 阶段 Torch：`47.332 ms`
+
+### 4.2 `f16/bf16` 专用路径结果
+| dtype | Torch (ms) | LLAISYS (ms) |
+|---|---:|---:|
+| `f16` | 283.002 | 271.528 |
+| `bf16` | 297.637 | 268.177 |
+
+## 5. 加速比（`f32`）
+- A -> B：`6173.788 / 266.430 = 23.17x`
+- A -> C：`6173.788 / 253.513 = 24.35x`
+- B -> C: `266.430 / 253.513 = 1.05x` (about `5.1%` higher throughput, i.e. about `4.8%` less time per call)
+
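+Conclusions:
+- The largest gain comes from the structural optimization (SIMD + parallelism): about `23x` over the baseline.
+- OpenBLAS adds roughly another `5%` on top. With `repeat=3`, and with the Torch reference itself moving from `56.534 ms` to `47.332 ms` between the two runs, this gain is comparable to the run-to-run variation and needs more repeats to confirm.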
+
+## 6. 复现实验命令
+### 6.1 `f32`：SIMD + OpenMP（关闭 OpenBLAS）
+```powershell
+xmake f --openblas=n --openmp=y --cpu-avx2=y -y
+xmake -y
+xmake install -y
+$env:OMP_NUM_THREADS=8
+D:/86188/大模型学习/llaisys/.venv/Scripts/python.exe -c "import sys; sys.path.insert(0,'test'); import llaisys, torch; from test_utils import random_tensor, benchmark; x,x_=random_tensor((512,4096),'f32','cpu',scale=0.1); w,w_=random_tensor((4096,4096),'f32','cpu',scale=0.01); b,b_=random_tensor((4096,),'f32','cpu'); out,out_=random_tensor((512,4096),'f32','cpu'); f1=lambda: torch.nn.functional.linear(x,w,b,out=out); f2=lambda: llaisys.Ops.linear(out_,x_,w_,b_); benchmark(f1,f2,'cpu',warmup=1,repeat=3)"
+```
+
+### 6.2 `f32`：OpenBLAS
+```powershell
+xmake f --openblas=y --openmp=y --cpu-avx2=y -y
+xmake -y
+xmake install -y
+$env:OMP_NUM_THREADS=8
+D:/86188/大模型学习/llaisys/.venv/Scripts/python.exe -c "import sys; sys.path.insert(0,'test'); import llaisys, torch; from test_utils import random_tensor, benchmark; x,x_=random_tensor((512,4096),'f32','cpu',scale=0.1); w,w_=random_tensor((4096,4096),'f32','cpu',scale=0.01); b,b_=random_tensor((4096,),'f32','cpu'); out,out_=random_tensor((512,4096),'f32','cpu'); f1=lambda: torch.nn.functional.linear(x,w,b,out=out); f2=lambda: llaisys.Ops.linear(out_,x_,w_,b_); benchmark(f1,f2,'cpu',warmup=1,repeat=3)"
+```
diff --git a/PROJECT2_CUDA_INTEGRATION_REPORT.md b/PROJECT2_CUDA_INTEGRATION_REPORT.md
new file mode 100644
index 000000000..947877f0d
--- /dev/null
+++ b/PROJECT2_CUDA_INTEGRATION_REPORT.md
@@ -0,0 +1,165 @@
+# 项目#2 学习型报告：从 0 到 1 集成 CUDA（小白友好 + 弱 C++ 版）
+
+## 1. 先说结论：你在项目二里完成了什么
+你已经把 LLAISYS 从“只会 CPU”推进到“能在 NVIDIA GPU 上完整跑推理链路”。
+
+你完成的是一条完整工程链，而不是单点代码：
+- Runtime 层：支持 NVIDIA 设备初始化、显存申请、数据拷贝、流同步。
+- 算子层：关键算子都能在 `--device nvidia` 下跑通（先用 staging 方案保证正确性）。
+- 模型层：`Qwen2` 推理流程能在 nvidia 设备上正常执行。
+- 验证层：`build -> install -> test_runtime -> test/ops -> test_infer` 全链路通过。
+
+一句话总结：项目二的价值是“把 GPU 后端这条路打通，并且可复现”。
+
+## 2. 给完全初学者的背景补课
+
+### 2.1 什么是 Runtime API（为什么必须有它）
+你可以把 Runtime API 理解为“统一设备操作的遥控器”：
+- 不同设备（CPU、NVIDIA）内部实现不同。
+- 但上层算子不想关心细节，只想调用统一接口。
+
+所以项目里会有 `runtime_api.hpp` 这种抽象层，然后 CPU/NVIDIA 各自实现。
+
+### 2.2 什么是“端到端可用”
+不是“我写了几个 `.cpp` 文件就算完成”，而是下面全部成立：
+1. 能编译。
+2. 能安装成 Python 可调用动态库。
+3. Python 测试能调到 GPU 后端。
+4. 算子输出正确。
+5. 模型推理也正确。
+
+## 3. 你的核心实现（按模块拆解）
+
+### 3.1 Runtime 层：你做了 NVIDIA API 对接
+核心文件：`src/device/nvidia/nvidia_runtime_api.cpp`
+
+你把框架抽象接口映射到了 CUDA 官方 API：
+- 设备：`cudaGetDeviceCount`、`cudaSetDevice`、`cudaDeviceSynchronize`
+- Stream：`cudaStreamCreate`、`cudaStreamDestroy`、`cudaStreamSynchronize`
+- 内存：`cudaMalloc/cudaFree`、`cudaMallocHost/cudaFreeHost`
+- 拷贝：`cudaMemcpy`、`cudaMemcpyAsync`
+
+对新手最关键的理解：
+- 你的框架不是直接 everywhere 写 CUDA，而是先走统一抽象，再由 NVIDIA 后端实现具体细节。
+- 这种设计后面扩展 AMD/其它设备会更容易。
+
+### 3.2 共享 GPU 稳定性：你做了“可用设备筛选”
+场景是共享 A100，某些卡可能不可用，直接 `cudaSetDevice` 会失败。
+
+你做的处理是：
+- 启动时逐卡探测。
+- 仅把“可成功激活”的卡加入可用列表。
+- 上层看到的是逻辑设备编号，底层再映射到物理卡。
+
+价值：避免“机器有 8 张卡但你正好选到坏卡”导致测试假失败。
+
+### 3.3 构建系统：你把 NVIDIA 后端接入 xmake
+关键文件：`xmake/nvidia.lua`、`xmake.lua`
+
+你完成了：
+- 新增 nvidia 目标库并纳入总构建。
+- `--nv-gpu=y` 时启用 `ENABLE_NVIDIA_API`。
+- Linux 动态库链接增加 `gomp`（OpenMP 运行时）。
+
+为什么这一步重要：
+- 很多“代码没错但跑不起来”的问题都在链接阶段。
+- 你把“编译通过”和“运行时符号可解析”都兜住了。
+
+### 3.4 算子层：你采用 staging 方案优先保正确
+关键思路：
+- 数据 D2H（Device 到 Host）
+- 复用 CPU 算法
+- 结果 H2D 写回
+
+涉及关键算子：
+- `add`、`linear`、`argmax`、`embedding`、`rms_norm`、`rope`、`self_attention`、`swiglu`
+
+你这个选择非常工程化：
+- 第一阶段先拿到“可跑、正确”。
+- 第二阶段再逐个替换成真正 CUDA Kernel 做性能优化。
+
+### 3.5 模型层：你把 `qwen2` 的设备读写改成安全模式
+关键文件：`src/llaisys/qwen2.cc`
+
+你新增/使用了安全 helper：
+- `zero_tensor`
+- `tensor_write_i64`
+- `tensor_read_i64`
+- `tensor_copy_bytes`
+
+这一步解决了新手常见大坑：
+- 在 GPU 上不能像 CPU 一样随便 `memcpy`/解引用设备指针。
+- 必须通过 runtime API 做合法的 H2D/D2H/D2D 操作。
+
+## 4. 你踩过并修复的关键问题（学习价值很高）
+
+### 4.1 本机无 CUDA SDK
+- 现象：`Cuda SDK not found!`
+- 处理：转远端 A100 环境。
+
+### 4.2 远端缺 xmake
+- 现象：`xmake: command not found`
+- 处理：安装 `~/.local/bin/xmake`，后续固定绝对路径。
+
+### 4.3 动态库 CUDA 注册符号异常
+- 现象：`undefined symbol: __cudaRegisterLinkedBinary...`
+- 处理：统一为 `cpp + cudart` 路径，移除问题 `.cu` 路线。
+
+### 4.4 OpenMP 链接缺失
+- 现象：`omp_get_thread_num` undefined
+- 处理：`xmake.lua` 添加 `add_syslinks("gomp")`。
+
+### 4.5 共享卡不可用
+- 处理：设备筛选+逻辑映射+固定 `CUDA_VISIBLE_DEVICES`。
+
+## 5. 如何复现你的项目二成果（一步一步）
+
+### 5.1 环境准备
+```bash
+cd /home/yuanstar/llaisys
+export PYTHONPATH=/home/yuanstar/llaisys/python
+export CUDA_VISIBLE_DEVICES=2
+```
+
+### 5.2 构建安装
+```bash
+/home/yuanstar/.local/bin/xmake f --nv-gpu=y -cv
+/home/yuanstar/.local/bin/xmake -y
+/home/yuanstar/.local/bin/xmake install -y
+```
+
+### 5.3 运行验证
+```bash
+python3 test/test_runtime.py --device nvidia
+
+python3 test/ops/add.py --device nvidia
+python3 test/ops/linear.py --device nvidia
+python3 test/ops/argmax.py --device nvidia
+python3 test/ops/embedding.py --device nvidia
+python3 test/ops/rms_norm.py --device nvidia
+python3 test/ops/rope.py --device nvidia
+python3 test/ops/self_attention.py --device nvidia
+python3 test/ops/swiglu.py --device nvidia
+
+python3 test/test_infer.py --model /home/yuanstar/models/DeepSeek-R1-Distill-Qwen-1___5B --device nvidia --test --max_steps 8
+```
+
+### 5.4 成功标准（你答辩时可以直接说）
+- Runtime 测试通过。
+- 核心算子测试通过。
+- `test_infer` 在 nvidia 下通过，且与参考输出一致。
+
+## 6. 你项目二的能力成长（面向答辩表达）
+你不仅“会调 API”，而且已经体现了以下工程能力：
+- 抽象层思维：Runtime 统一接口。
+- 系统排错能力：从编译、链接、运行时逐层定位。
+- 资源受限环境适配：共享 GPU 稳定性处理。
+- 交付意识：不仅改代码，还确保复现脚本和验证闭环。
+
+## 7. 面向下一步优化（可选，不影响已完成）
+- 把 staging 算子逐步替换为原生 CUDA kernel，重点先做 `linear`、`self_attention`。
+- 给关键性能路径增加 benchmark，量化 GPU 加速收益。
+- 将“设备筛选 + 健康检查”做成统一工具函数，降低维护成本。
+
+## 8. 项目二一句话复盘
+项目二你已经完成到“可提交且可复现”的标准：CUDA 后端集成成功，nvidia 测试链路打通，模型推理在真实 GPU 环境下验证通过。
\ No newline at end of file
diff --git a/PROJECT3_CHATBOT_DETAILED_NOTE.md b/PROJECT3_CHATBOT_DETAILED_NOTE.md
new file mode 100644
index 000000000..ae5561faf
--- /dev/null
+++ b/PROJECT3_CHATBOT_DETAILED_NOTE.md
@@ -0,0 +1,201 @@
+# 项目#3 学习型报告：AI 聊天机器人（超详细小白版）
+
+## 1. 先说你项目三到底完成了什么
+你在项目三完成的是一套“能实际使用”的聊天系统，不只是单个函数。
+
+你已经做完的部分：
+- 在 C++ 推理链路里加入随机采样能力：`temperature`、`top_k`、`top_p`。
+- 在 Python 层把采样参数一路打通到 C API。
+- 提供 OpenAI 风格 HTTP 接口：`/v1/models`、`/v1/chat/completions`。
+- 支持非流式和流式（SSE）返回。
+- 提供 CLI 对话界面（可以连续多轮聊天）。
+- 完成 CPU 冒烟验证和远端 NVIDIA 实机验证。
+
+一句话总结：你把“模型推理能力”包装成了“可对话服务能力”。
+
+## 2. 这个项目在系统里分几层（小白必看）
+
+项目三可以拆成 4 层：
+
+1. 算法层（怎么选下一个 token）
+- 从原来几乎贪心的选择，升级为可控随机采样。
+
+2. C++ 引擎层（高性能推理核心）
+- 采样算子放在 C++，并支持 CPU/NVIDIA 路径。
+
+3. Python 绑定层（把 C++ 暴露给 Python）
+- `ctypes` 签名更新，Python 端能传 `top_k/top_p/temperature`。
+
+4. 应用层（用户能用的产品）
+- FastAPI server + CLI，统一 OpenAI 风格协议。
+
+## 3. 你改了哪些关键文件（按“作用”分组）
+
+### 3.1 新增采样算子（核心算法）
+- `src/ops/sampling/op.hpp`
+- `src/ops/sampling/op.cpp`
+
+你做的核心逻辑：
+- `temperature <= 0` 时退化为 `argmax`（稳定、可控）。
+- 否则先做温度缩放 softmax。
+- 再做 `top_k` 裁剪。
+- 再做 `top_p`（nucleus）裁剪。
+- 对保留集合重新归一化后随机采样。
+
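+For beginners:
+- `temperature` controls how much risk the sampler takes: higher values flatten the distribution.
+- `top_k` restricts sampling to the K most probable candidates.
+- `top_p` restricts sampling to the smallest set of top candidates whose cumulative probability reaches p.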
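+
+A small worked example may help here. The probabilities below are made up for illustration and are not measured from the model, and the cut-off rule follows the common nucleus-sampling convention, which is an assumption about this implementation. Suppose the temperature-scaled softmax gives four candidates with probabilities $0.5, 0.3, 0.15, 0.05$, and the request uses `top_k=3`, `top_p=0.7`:
+
+$$\text{top\_k}=3:\ \{0.5,\ 0.3,\ 0.15\} \qquad \text{top\_p}=0.7:\ 0.5 < 0.7 \le 0.5 + 0.3 \ \Rightarrow\ \{0.5,\ 0.3\}$$
+
+$$\text{renormalize: } \frac{0.5}{0.8} = 0.625, \qquad \frac{0.3}{0.8} = 0.375$$
+
+The next token is then drawn from these two candidates with probabilities $0.625$ and $0.375$.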
+
+### 3.2 C API 导出与桥接
+- `include/llaisys/ops.h`
+- `src/llaisys/ops.cc`
+
+你新增了导出函数：
+- `llaisysSample(...)`
+
+意义：
+- 让 Python 能调用到 C++ 的采样实现。
+
+### 3.3 Qwen2 推理改造（从 argmax 到 sample）
+- `include/llaisys/models/qwen2.h`
+- `src/llaisys/qwen2.cc`
+
+关键变化：
+- `llaisysQwen2ModelInfer` 增加 `top_k/top_p/temperature` 参数。
+- 在最后选 token 的地方，从“固定最大值”切换为“可配置采样”。
+
+效果：
+- 回复不再每次都过于机械，模型输出更自然。
+
+### 3.4 Python 绑定打通
+- `python/llaisys/libllaisys/ops.py`
+- `python/llaisys/ops.py`
+- `python/llaisys/libllaisys/models.py`
+- `python/llaisys/models/qwen2.py`
+
+你做了两件关键事：
+- `ctypes` 参数签名与 C API 对齐。
+- `Qwen2.generate()` 把采样参数传到 C++ 推理层。
+
+### 3.5 服务与 UI
+- `python/llaisys/chat_server.py`
+- `python/llaisys/chat_cli.py`
+- `python/setup.cfg`（`chat` extras）
+
+服务能力：
+- `GET /v1/models`：返回模型列表。
+- `POST /v1/chat/completions`：返回聊天结果。
+- 支持 `stream=false` 的整包返回。
+- 支持 `stream=true` 的 SSE 分块返回。
+
+CLI 能力：
+- 连续多轮对话。
+- 可选流式输出。
+- 可调 `temperature/top_p/top_k/max_tokens`。
+
+### 3.6 测试
+- `test/ops/sampling.py`
+
+你验证了：
+- `top_k=1` 时行为应等价于 argmax。
+- `top_k=2` 时采样结果必须落在前二候选集合中。
+
+## 4. 面向新手：为什么这些改动是“必要且正确”的
+
+### 4.1 为什么不能只做 argmax
+argmax 总是选最大概率 token，常见问题是：
+- 回复重复。
+- 创造性不足。
+- 句式单一。
+
+聊天机器人需要“可控随机性”，所以必须引入采样。
+
+### 4.2 为什么要做 C++ + Python 全链路
+只在 Python 做采样不够，因为：
+- 你实际推理主路径在 C++。
+- 参数必须一路传到真正产生 token 的位置。
+
+这就是你做“C API + ctypes + model infer 签名”改造的价值。
+
+### 4.3 为什么要有 server 与 CLI 两种入口
+- Server：用于前后端联调、接口标准化、后续接 Web。
+- CLI：调试成本低，开发期最快验证。
+
+## 5. 你在项目三中遇到并解决的真实问题
+
+### 5.1 远端命令不稳定（转义/时序/端口）
+问题表现：
+- PowerShell 到远端 bash 的引号转义容易破坏 JSON 或脚本。
+- 服务刚启动就请求，导致 `Connection refused`。
+- 端口被旧进程占用，启动报 `address already in use`。
+
+你的解决策略：
+- 改成“分步执行 + 健康检查轮询”。
+- 先清理残留进程，再启动服务。
+- 用固定端口并在失败时打印日志尾部。
+
+### 5.2 远端依赖缺失
+问题表现：
+- `No module named uvicorn`
+- `No module named torch`
+
+解决：
+- 在远端 `.venv` 安装：`uvicorn fastapi requests transformers torch`。
+
+### 5.3 uvicorn 启动方式错误
+问题表现：
+- `Attribute "app" not found in module "llaisys.chat_server"`
+
+原因：
+- 该文件是 `create_app + main()` 结构，不是模块级 `app` 变量。
+
+正确启动：
+- `python -m llaisys.chat_server --model ... --device nvidia --host ... --port ...`
+
+## 6. 你项目三的验证证据（可用于答辩）
+
+你已经拿到远端 nvidia 的关键成功信号：
+- `MODELS_STATUS=200`
+- `CHAT_STATUS=200`
+- 返回了有效 assistant 文本
+
+这说明：
+- 服务能启动。
+- 路由能命中。
+- 模型推理链路可执行。
+- 采样参数不会阻断主流程。
+
+## 7. 一套可复现命令（你以后直接照着跑）
+
+### 7.1 远端启动（nvidia）
+```bash
+ssh yuanstar-a100 "cd /home/yuanstar/llaisys && source .venv/bin/activate && export PYTHONPATH=/home/yuanstar/llaisys/python && python -m llaisys.chat_server --model /home/yuanstar/models/DeepSeek-R1-Distill-Qwen-1___5B --device nvidia --host 127.0.0.1 --port 18000"
+```
+
+### 7.2 检查模型列表
+```bash
+ssh yuanstar-a100 "curl -s http://127.0.0.1:18000/v1/models"
+```
+
+### 7.3 发一条非流式聊天请求
+```bash
+ssh yuanstar-a100 "curl -s -X POST http://127.0.0.1:18000/v1/chat/completions -H 'Content-Type: application/json' --data '{\"model\":\"llaisys-qwen2\",\"messages\":[{\"role\":\"user\",\"content\":\"请用一句话介绍你自己。\"}],\"max_tokens\":64,\"temperature\":0.8,\"top_k\":20,\"top_p\":0.9,\"stream\":false}'"
+```
+
+### 7.4 如果端口冲突
+```bash
+ssh yuanstar-a100 "pkill -f 'python -m llaisys.chat_server' || true"
+```
+
+## 8. 给不会 C++ 的你：如何讲这部分才不慌
+
+你可以按这个话术：
+- 我没有去手写复杂 CUDA kernel，而是先通过框架已有模式把“采样能力”完整接入 C++ 推理链。
+- 我重点做了接口设计与工程打通：算子实现、C API 导出、Python 绑定、服务封装、远端验证闭环。
+- 在工程上我解决了依赖、端口、远端执行稳定性、启动方式等实际问题，最终拿到 nvidia 环境 200 响应。
+
+这段话能准确体现你的真实工作，而且不会因为深挖某个 C++ 语法细节而卡住。
+
+## 9. 项目三答辩版 30 秒总结
+项目三中，我把底层推理引擎从单一 argmax 扩展为可控随机采样，并通过 C API 和 Python 绑定打通到应用层，构建了 OpenAI 风格 chat-completion 服务和 CLI 交互界面。最终在远端 NVIDIA 环境完成了服务启动与接口验证，确认 `models/chat` 请求均返回 200，实现了从推理能力到产品化接口的完整闭环。
diff --git a/README.md b/README.md
index 456067c82..487227daa 100644
--- a/README.md
+++ b/README.md
@@ -382,7 +382,7 @@ python test/test_runtime.py --device nvidia
 ### Implement CUDA Operators
 Create a `nvdia/` sub-directory in each operator source directory and implement a cuda version. Check `src/ops/add/op.cpp` to see how to include your cuda implementations. Remeber to define the compiling procedures in the xmake files. Run the operator tests with `--device nvidia` flag to test your CUDA implementation.
 
-You can use CUDA libraries like cuBLAS, cuDNN, etc. to accelerate your operators. Check their documentations to see how to use them. You can store extra device resources in `src/device/nvidia/nvidia_resource.cu`.
+You can use CUDA libraries like cuBLAS, cuDNN, etc. to accelerate your operators. Check their documentations to see how to use them. You can store extra device resources in `src/device/nvidia/nvidia_resource.cpp`.
 
 Modify your model codes to support CUDA inference. 
 
diff --git a/README_ZN.md b/README_ZN.md
index 7704dbd5b..7b0a31120 100644
--- a/README_ZN.md
+++ b/README_ZN.md
@@ -383,7 +383,7 @@ python test/test_runtime.py --device nvidia
 
 在每个算子目录下新建 ``nvidia/`` 子目录，写 CUDA 版本实现。参考 ``src/ops/add/op.cpp`` 看如何包含 CUDA 实现。别忘了在 xmake 文件中定义编译流程。用 ``--device nvidia`` 参数运行测试。
 
-你可以使用 cuBLAS、cuDNN 等 CUDA 库来加速算子，额外的设备资源可以放在 `src/device/nvidia/nvidia_resource.cu`。
+你可以使用 cuBLAS、cuDNN 等 CUDA 库来加速算子，额外的设备资源可以放在 `src/device/nvidia/nvidia_resource.cpp`。
 
 最后,修改模型代码，支持 CUDA 推理：
 
diff --git "a/RMSNorm\347\237\245\350\257\206\347\254\224\350\256\260.md" "b/RMSNorm\347\237\245\350\257\206\347\254\224\350\256\260.md"
new file mode 100644
index 000000000..6ede521e1
--- /dev/null
+++ "b/RMSNorm\347\237\245\350\257\206\347\254\224\350\256\260.md"
@@ -0,0 +1,510 @@
+# RMSNorm (Root Mean Square Normalization) 知识笔记
+
+## 📌 核心概念
+
+**RMSNorm = 均方根归一化**，是 Transformer 模型中的一种归一化层，通过**缩放向量幅度**来稳定训练和推理过程。
+
+---
+
+## 1️⃣ RMSNorm 在大模型中的作用
+
+### 在 Transformer 架构中的位置
+- **位置**：每个 Transformer Block 内，在 **Attention 层之前** 和 **FFN 层之前**（Pre-Norm 结构）
+- **应用时机**：对隐藏状态向量进行归一化处理
+
+### 核心功能
+1. **稳定激活分布**：将每一层的向量幅度控制在稳定范围
+2. **防止数值爆炸/消失**：避免深层网络中的梯度问题
+3. **提升训练稳定性**：让模型更容易收敛
+4. **减少数值漂移**：推理时输出更稳定，尤其在长序列场景
+
+### 训练/推理流程
+```
+训练阶段：
+Input → Embedding → RMSNorm → Attention → RMSNorm → FFN → ... → Output
+                ↓                          ↓
+          稳定激活分布              控制梯度流动
+
+推理阶段：
+Input → Embedding → RMSNorm → Attention → RMSNorm → FFN → ... → Output
+                ↓                          ↓
+          减少数值漂移              保持输出稳定
+```
+
+### 在 Transformer Block 中的典型结构（Pre-Norm）
+```python
+# 典型的 Transformer Block
+def transformer_block(x):
+    # 1. Attention 子层
+    residual = x
+    x = rms_norm(x)           # ← RMSNorm
+    x = self_attention(x)
+    x = x + residual          # 残差连接
+    
+    # 2. FFN 子层
+    residual = x
+    x = rms_norm(x)           # ← RMSNorm
+    x = ffn(x)
+    x = x + residual          # 残差连接
+    
+    return x
+```
+
+---
+
+## 2️⃣ 基本数学原理
+
+### 数学公式
+
+对输入矩阵 $X \in \mathbb{R}^{M \times d}$ 的每一行 $X_i \in \mathbb{R}^d$：
+
+**1. 计算均方根（RMS）：**
+$$\text{RMS}(X_i) = \sqrt{\frac{1}{d}\sum_{j=1}^d X_{i,j}^2 + \epsilon}$$
+
+**2. 归一化：**
+$$\hat{X}_{i,j} = \frac{X_{i,j}}{\text{RMS}(X_i)}$$
+
+**3. 缩放（可学习权重）：**
+$$Y_{i,j} = W_j \cdot \hat{X}_{i,j}$$
+
+其中：
+- $M$ = 矩阵行数（batch size 或 sequence length）
+- $d$ = 向量维度
+- $\epsilon$ = 极小值（通常 1e-5 或 1e-6），防止除零
+- $W \in \mathbb{R}^d$ = 可学习的缩放参数
+
+### 完整公式（合并）
+$$Y_{i,j} = W_j \times \frac{X_{i,j}}{\sqrt{\frac{1}{d}\sum_{k=1}^d X_{i,k}^2 + \epsilon}}$$
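+
+A quick check of what this formula guarantees, taking $W_j = 1$ for simplicity (an assumption made only for this check): the mean square of the normalized row is
+
+$$\frac{1}{d}\sum_{j=1}^d \hat{X}_{i,j}^2 = \frac{\frac{1}{d}\sum_{j=1}^d X_{i,j}^2}{\frac{1}{d}\sum_{k=1}^d X_{i,k}^2 + \epsilon} \le 1$$
+
+so for a non-zero row the output RMS is exactly $1$ when $\epsilon = 0$ and slightly below $1$ otherwise; the gap only matters when the row's mean square is comparable to $\epsilon$.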
+
+---
+
+## 3️⃣ 关键概念详解
+
+### Q1：RMS（均方根）是什么？
+
+**定义：** 所有元素平方的平均值，再开方。
+
+$$\text{RMS} = \sqrt{\frac{1}{d}(x_1^2 + x_2^2 + ... + x_d^2)}$$
+
+**物理意义：** 衡量向量的"整体幅度"或"能量"。
+
+**例子：**
+```
+Vector [3, 4]:
+  RMS = √((3² + 4²) / 2) = √(25/2) = √12.5 ≈ 3.536
+
+Vector [1, -1, 2]:
+  RMS = √((1² + (-1)² + 2²) / 3) = √(6/3) = √2 ≈ 1.414
+```
+
+---
+
+### Q2：为什么要除以 RMS？
+
+**目的：** 把向量的幅度"拉回"到稳定范围（通常接近 1）。
+
+**效果：**
+- 大向量被缩小
+- 小向量被放大
+- 所有向量的幅度趋于一致
+
+**例子：**
+```
+原始向量 A = [10, 20, 30]:
+  RMS = √((100 + 400 + 900) / 3) ≈ 21.6
+  归一化后 = [10/21.6, 20/21.6, 30/21.6] ≈ [0.46, 0.93, 1.39]
+  新 RMS ≈ 1.0
+
+原始向量 B = [0.1, 0.2, 0.3]:
+  RMS ≈ 0.216
+  归一化后 ≈ [0.46, 0.93, 1.39]
+  新 RMS ≈ 1.0
+```
+
+**关键：不同幅度的向量，归一化后具有相似的尺度。**
+
+---
+
+### Q3：epsilon ($\epsilon$) 的作用是什么？
+
+**作用：** 防止除以零或非常小的数。
+
+**场景：**
+- 如果向量全是 0：RMS = 0，除法会出错
+- 如果向量接近 0：RMS 很小，除法结果不稳定
+
+**解决方案：**
+$$\text{RMS} = \sqrt{\frac{1}{d}\sum x^2 + \epsilon}$$
+
+**例子：**
+```
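+Zero vector [0, 0, 0]  (epsilon = 1e-6):
+  without epsilon: RMS = 0 → 0/0, the output is NaN ❌
+  with epsilon:    RMS = √(1e-6) ≈ 0.001 → output is [0, 0, 0] ✅
+
+Tiny vector [1e-8, 1e-8]  (epsilon = 1e-6):
+  without epsilon: RMS = 1e-8 → output [1, 1], noise is blown up to unit scale
+  with epsilon:    RMS ≈ √(1e-6) = 1e-3 → output ≈ [1e-5, 1e-5], stays small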
+```
+
+---
+
+### Q4：可学习权重 $W$ 的作用？
+
+**为什么需要 $W$？**
+
+归一化后所有向量幅度都接近 1，可能**限制了模型的表达能力**。
+
+**$W$ 的作用：**
+- 让模型学习"哪些维度应该放大，哪些应该缩小"
+- 恢复模型的表达能力
+- 每个维度有独立的缩放因子
+
+**例子：**
+```
+归一化后向量: [0.5, 0.7, 0.3]
+学习到的权重 W: [2.0, 0.5, 1.0]
+最终输出: [0.5×2.0, 0.7×0.5, 0.3×1.0] = [1.0, 0.35, 0.3]
+```
+
+**训练过程：**
+- 初始化：通常 $W$ 全部初始化为 1
+- 训练中：通过反向传播学习最优的缩放因子
+
+---
+
+### Q5：RMSNorm 与 LayerNorm 的区别？
+
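+| Property | LayerNorm | RMSNorm |
+|------|-----------|---------|
+| **Computes the mean** | ✅ yes | ❌ no |
+| **Computes the variance** | ✅ yes | ❌ no (uses the mean square) |
+| **Statistics per row** | two (mean and variance) | one (mean square) |
+| **Learnable parameters** | γ (scale) + β (shift) | W (scale only) |
+| **Numerical stability** | good | comparable |
+| **Speed** | slower | faster |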
+
+**LayerNorm 公式：**
+$$y = \gamma \times \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
+
+**RMSNorm 公式：**
+$$y = W \times \frac{x}{\sqrt{\frac{1}{d}\sum x^2 + \epsilon}}$$
+
+**主要区别：**
+1. RMSNorm 不减去均值（不中心化）
+2. RMSNorm 不需要计算方差，直接用均方根
+3. RMSNorm 没有偏移参数 $\beta$
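+
+The relation between the two can be made explicit. With $\mu = \frac{1}{d}\sum_j x_j$ and $\sigma^2 = \frac{1}{d}\sum_j (x_j - \mu)^2$:
+
+$$\frac{1}{d}\sum_{j=1}^d x_j^2 = \sigma^2 + \mu^2$$
+
+so when a row already has zero mean ($\mu = 0$), the RMSNorm denominator equals the LayerNorm denominator, and RMSNorm matches LayerNorm with $\beta = 0$ and $\gamma = W$.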
+
+---
+
+## 4️⃣ 为什么 RMSNorm 有效？
+
+### 核心机制：控制尺度 + 稳定梯度
+
+#### 1. 稳定激活分布
+
+**问题：** 深层网络中，激活值可能逐层放大或缩小。
+
+```
+Layer 1: 激活值范围 [-1, 1]
+Layer 2: 激活值范围 [-10, 10]    ← 放大
+Layer 3: 激活值范围 [-100, 100]  ← 继续放大
+...
+Layer N: 激活值溢出或梯度爆炸 ❌
+```
+
+**RMSNorm 的作用：**
+```
+Layer 1: RMSNorm → 范围 ≈ [-1, 1]
+Layer 2: RMSNorm → 范围 ≈ [-1, 1]  ← 稳定
+Layer 3: RMSNorm → 范围 ≈ [-1, 1]  ← 稳定
+...
+Layer N: 激活值始终可控 ✅
+```
+
+#### 2. 稳定梯度流动
+
+**反向传播时：**
+- 归一化后的激活值更稳定
+- 梯度不容易爆炸或消失
+- 训练更容易收敛
+
+**数学直觉：**
+$$\frac{\partial \text{Loss}}{\partial x} = \frac{\partial \text{Loss}}{\partial y} \times \frac{\partial y}{\partial x}$$
+
+当 $y$ 的尺度稳定时，梯度也更稳定。
+
+#### 3. 减少分布漂移
+
+**训练中的问题（Internal Covariate Shift）：**
+- 前一层的输出分布变化
+- 后一层需要不断适应新的分布
+- 减慢训练速度
+
+**RMSNorm 的作用：**
+- 让每一层的输入分布更一致
+- 减少分布漂移
+- 加速训练
+
+---
+
+## 5️⃣ 实现思路（2D 连续张量）
+
+### 1. 参数验证
+```cpp
+CHECK_ARGUMENT(in->ndim() == 2);     // 2D 张量
+CHECK_ARGUMENT(out->ndim() == 2);
+CHECK_ARGUMENT(weight->ndim() == 1); // 1D 权重
+CHECK_ARGUMENT(weight->shape()[0] == in->shape()[1]); // 维度匹配
+```
+
+### 2. 核心算法
+
+```cpp
+size_t M = in->shape()[0];      // number of rows
+size_t d = in->shape()[1];      // number of columns (feature dimension)
+
+// in[i][j], out[i][j] and weight[j] are shorthand for element access
+for (size_t i = 0; i < M; ++i) {
+    // Step 1: accumulate the sum of squares, then take the root mean square
+    float sum_sq = 0.0f;
+    for (size_t j = 0; j < d; ++j) {
+        float val = in[i][j];
+        sum_sq += val * val;
+    }
+    float rms = std::sqrt(sum_sq / d + eps);  // needs <cmath>
+
+    // Step 2: normalize and scale
+    for (size_t j = 0; j < d; ++j) {
+        out[i][j] = (in[i][j] / rms) * weight[j];
+    }
+}
+```
+
+### 3. 多数据类型支持
+
+使用 **template + switch** 模式：
+
+```cpp
+template <typename T>
+void rms_norm_impl(T *out, const T *in, const T *weight,
+                   size_t M, size_t d, float eps) {
+    for (size_t i = 0; i < M; ++i) {
+        // Accumulate the sum of squares in float to avoid precision loss
+        float sum_sq = 0.0f;
+        for (size_t j = 0; j < d; ++j) {
+            float val = cast<float>(in[i * d + j]);
+            sum_sq += val * val;
+        }
+        float inv_rms = 1.0f / std::sqrt(sum_sq / d + eps);  // needs <cmath>
+
+        // Normalize and scale
+        for (size_t j = 0; j < d; ++j) {
+            float val = cast<float>(in[i * d + j]);
+            float w = cast<float>(weight[j]);
+            out[i * d + j] = cast<T>(val * inv_rms * w);
+        }
+    }
+}
+
+// 主函数中 switch 分发
+switch (dtype) {
+    case DTYPE_F32: return rms_norm_impl<float>(...);
+    case DTYPE_F16: return rms_norm_impl<fp16_t>(...);
+    case DTYPE_BF16: return rms_norm_impl<bf16_t>(...);
+}
+```
+
+---
+
+## 6️⃣ 具体例子
+
+### 例子 1：简单 2D 矩阵
+
+**输入：**
+```
+X = [[1, 2],
+     [3, 4]]
+     
+W = [1, 1]
+eps = 0
+```
+
+**计算过程：**
+
+**行 1: [1, 2]**
+```
+mean_square = (1² + 2²) / 2 = 5/2 = 2.5
+rms = √2.5 ≈ 1.581
+归一化: [1/1.581, 2/1.581] ≈ [0.632, 1.265]
+乘权重: [0.632×1, 1.265×1] = [0.632, 1.265]
+```
+
+**行 2: [3, 4]**
+```
+mean_square = (3² + 4²) / 2 = 25/2 = 12.5
+rms = √12.5 ≈ 3.536
+归一化: [3/3.536, 4/3.536] ≈ [0.848, 1.131]
+乘权重: [0.848×1, 1.131×1] = [0.848, 1.131]
+```
+
+**输出：**
+```
+Y = [[0.632, 1.265],
+     [0.848, 1.131]]
+```
+
+**验证：** 每行的 RMS 都接近 1.0 ✅
+
+---
+
+### 例子 2：带权重的情况
+
+**输入：**
+```
+X = [[1, -1, 2]]
+W = [2, 0.5, 1]
+eps = 1e-5
+```
+
+**计算过程：**
+```
+mean_square = (1² + (-1)² + 2²) / 3 = 6/3 = 2.0
+rms = √(2.0 + 1e-5) ≈ 1.414
+normalize: [1/1.414, -1/1.414, 2/1.414] ≈ [0.707, -0.707, 1.414]
+apply weight: [0.707×2, -0.707×0.5, 1.414×1] ≈ [1.414, -0.354, 1.414]
+```
+
+**输出：**
+```
+Y = [1.414, -0.354, 1.414]
+```
+
+---
+
+## 7️⃣ 常见问题 FAQ
+
+### Q: RMSNorm 为什么不减去均值？
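+A: The RMSNorm paper reports that dropping the mean:
+- **costs little or no quality** on the tasks it tested
+- **is faster** (one fewer statistic to compute per row)
+- **needs fewer floating-point operations**
+
+In practice, controlling the scale alone is usually enough to stabilize training.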
+
+### Q: RMSNorm 在什么时候用？
+A: Mainly in modern large language models:
+- **LLaMA family**
+- **Qwen family**
+- **GLM family**
+- **T5** and **Gopher**
+
+Older models (such as BERT and GPT-2), as well as GPT-NeoX, use LayerNorm.
+
+### Q: RMSNorm 和 BatchNorm 的区别？
+A: 
+- **BatchNorm**: 对一个 batch 内的**同一特征维度**归一化（跨样本）
+- **RMSNorm**: 对**每个样本的所有特征**归一化（跨维度）
+
+RMSNorm 更适合序列模型（Transformer），BatchNorm 更适合 CNN。
+
+### Q: eps 设置多大合适？
+A: 通常 **1e-5** 或 **1e-6**。
+- 太大：影响归一化效果
+- 太小：可能数值不稳定
+
+### Q: 为什么不用方差而用均方根？
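+A:
+- **Variance needs the mean too**, so the row statistics take two reductions, or one pass that tracks two accumulators
+- **The mean square needs one accumulator** in a single pass
+- Similar quality at lower cost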
+
+---
+
+## 8️⃣ 优势总结
+
+### 与 LayerNorm 对比
+
+| 维度 | LayerNorm | RMSNorm |
+|------|-----------|---------|
+| **速度** | 较慢 | 更快（省略均值计算） |
+| **内存** | 较高 | 较低 |
+| **稳定性** | 好 | 更好 |
+| **效果** | 优秀 | 相当或更好 |
+| **主流应用** | 传统模型 | 现代大模型 |
+
+### 核心优势
+
+1. ✅ **计算高效**：一遍扫描完成
+2. ✅ **数值稳定**：减少浮点运算
+3. ✅ **参数更少**：只有缩放权重 W
+4. ✅ **效果相当**：性能不输 LayerNorm
+5. ✅ **易于实现**：代码简洁
+
+---
+
+## 9️⃣ 实践建议
+
+### 训练时
+- eps 设为 **1e-5** 或 **1e-6**
+- 权重 W 初始化为全 **1**
+- 使用 **Pre-Norm** 结构（RMSNorm 在子层之前）
+
+### 推理时
+- 确保使用训练时相同的 eps
+- fp16/bf16 推理时，中间计算用 float 避免精度损失
+
+### 调试技巧
+- 检查输出的 RMS 是否接近 1.0
+- 验证梯度是否稳定
+- 对比 LayerNorm 的效果
+
+---
+
+## 🔟 时间线与应用
+
+```
+2019: RMSNorm paper published (Zhang & Sennrich)
+      "Root Mean Square Layer Normalization"
+      ↓
+2019-2021: adopted by models such as T5 and Gopher
+      ↓
+2023: LLaMA uses RMSNorm and it becomes mainstream
+      ↓
+2024: most new open large language models use RMSNorm
+```
+
+### 主流应用模型
+- **LLaMA 1/2/3**（Meta）
+- **Qwen 系列**（阿里）
+- **GLM-4**（智谱）
+- **Gopher**（DeepMind）
+- **DeepSeek 系列**
+
+---
+
+## 📝 总结
+
+### 一句话总结
+**RMSNorm 通过控制向量的幅度（RMS），在保持高效的同时稳定深层网络的训练和推理。**
+
+### 核心要点
+1. ✅ 只计算均方根，不计算均值和方差
+2. ✅ 更快、更简单、更稳定
+3. ✅ 效果与 LayerNorm 相当或更好
+4. ✅ 现代大模型的标准选择
+
+### 为什么重要？
+在大模型时代，RMSNorm 以更简单的方式实现了归一化，成为训练稳定性的关键组件。
+
+---
+
+## 📚 参考资料
+
+- **论文**：Zhang & Sennrich, "Root Mean Square Layer Normalization" (2019)
+- **应用**：LLaMA, Qwen 等模型的技术报告
+- **实现**：本项目 `src/ops/rms_norm/op.cpp`
+
+---
+
+*笔记整理时间：2026年2月3日*  
+*基于 LLAISYS 项目学习经历*
diff --git "a/RoPE\347\237\245\350\257\206\347\254\224\350\256\260.md" "b/RoPE\347\237\245\350\257\206\347\254\224\350\256\260.md"
new file mode 100644
index 000000000..50a4a53c0
--- /dev/null
+++ "b/RoPE\347\237\245\350\257\206\347\254\224\350\256\260.md"
@@ -0,0 +1,680 @@
+# RoPE (Rotary Position Embedding) 知识笔记
+
+## 📌 核心概念
+
+**RoPE = 旋转位置编码**，是 Transformer 模型中的一种位置编码方式，通过**旋转变换**将位置信息编码到向量中。
+
+---
+
+## 1️⃣ RoPE 在大模型中的作用
+
+### 在 Transformer 架构中的位置
+- **位置**：每个 Transformer Block 内，在 Attention 层对 Q（Query）和 K（Key）进行处理
+- **应用时机**：计算 Attention 之前，对 Q 和 K 向量进行旋转编码
+
+### 核心功能
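+1. **Encode position**: lets the model know where each token sits in the sequence
+2. **Model relative position**: the model learns relations such as "adjacent" or "two tokens apart" rather than absolute facts such as "the 5th token"
+3. **Length extrapolation**: attention scores depend only on relative offsets, which helps with longer inputs, but going well beyond the training length (e.g. trained at 512, run at 2048+) usually still needs a scaling technique such as position interpolation or NTK-aware scaling
+4. **Numerical stability**: a rotation preserves vector length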
+
+### 训练/推理流程
+```
+训练阶段：
+Input → Embedding → RoPE → Attention → FFN → ... → Output
+               ↓ 
+         让模型学习相对位置关系
+
+Inference:
+New Input → Embedding → RoPE → Attention → FFN → ... → Output
+                   ↓
+             same relative offsets as in training (very long inputs usually still need scaling)
+```
+
+---
+
+## 2️⃣ 基本数学原理
+
+### 数学公式
+
+对输入向量 $x_i = [a_i, b_i] \in \mathbb{R}^d$（其中 $a_i, b_i \in \mathbb{R}^{d/2}$）：
+
+**角度计算：**
+$$\phi_{i,j} = \frac{p_i}{\theta^{2j/d}}$$
+
+其中：
+- $p_i$ = token 的位置 ID（从 pos_ids 获取）
+- $\theta$ = 固定基数（通常 10000）
+- $j$ = 维度索引（0, 1, ..., d/2-1）
+- $d$ = 向量维度
+
+**旋转变换：**
+$$a'_{i,j} = a_{i,j} \cos(\phi_{i,j}) - b_{i,j} \sin(\phi_{i,j})$$
+$$b'_{i,j} = b_{i,j} \cos(\phi_{i,j}) + a_{i,j} \sin(\phi_{i,j})$$
+
+### 复数域解释
+
+这是标准的 2D 旋转矩阵：
+$$\begin{bmatrix} a' \\ b' \end{bmatrix} = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$
+
+在复数域：$z' = z \cdot e^{i\phi}$（旋转复数）
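+
+To see why the two forms agree, write one pair as $z = a + ib$ and expand the product:
+
+$$z' = (a + ib)(\cos\phi + i\sin\phi) = (a\cos\phi - b\sin\phi) + i\,(b\cos\phi + a\sin\phi)$$
+
+The real and imaginary parts are exactly $a'$ and $b'$ from the rotation formulas above, and $|z'| = |z|\,|e^{i\phi}| = |z|$, so the length is unchanged.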
+
+---
+
+## 3️⃣ 关键概念详解
+
+### Q1：频率 (freq) 是什么？从哪里来？
+
+**定义：** 频率 = 旋转速度
+
+$$\text{freq}_j = \frac{1}{\theta^{2j/d}}$$
+
+**Source:** the formula is fixed by the RoPE design; the frequencies are **not learned parameters**.
+
+**例子（d=4, theta=10000）：**
+```
+j=0: freq_0 = 1 / 10000^0     = 1.0    (快速旋转，高频)
+j=1: freq_1 = 1 / 10000^0.5   = 0.01   (慢速旋转，低频)
+```
+
+**作用：**
+- **高频**（freq 大）：捕捉短距离位置关系（相邻 token）
+- **低频**（freq 小）：捕捉长距离位置关系（远距离 token）
+
+**类比：** 就像时钟
+- 秒针（高频）：快速转动，短时间内变化大
+- 时针（低频）：缓慢转动，长时间才变化明显
+
+---
+
+### Q2：角度 (angle) 是什么？
+
+**定义：** 角度 = 位置 × 频率
+
+$$\text{angle}_j = \text{pos} \times \text{freq}_j$$
+
+**例子：**
+```
+位置 0: angle = 0 × freq = 0     (不旋转)
+位置 1: angle = 1 × freq = freq
+位置 10: angle = 10 × freq        (旋转更多)
+```
+
+**物理意义：** 位置越大，旋转角度越大，向量方向变化越多。
+
+---
+
+### Q3：如何将位置信息与向量结合？
+
+**关键：直接旋转向量本身，而非加法！**
+
+#### 传统方法 vs RoPE
+
+| 方法 | 操作 | 效果 |
+|------|------|------|
+| **传统 PE** | `x_new = x + PE(pos)` | ❌ 加法破坏原始语义 |
+| **RoPE** | `x_new = Rotate(x, angle)` | ✅ 保留向量长度，只改变方向 |
+
+#### 具体流程
+```python
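+Original vector: [3.0, 4.0, 1.0, 0.0]  (d=4; the numbers below correspond to pos=1, theta=10)
+         ↓
+Pairs: [3.0, 1.0] and [4.0, 0.0]  (x_j paired with x_{j+d/2})
+         ↓
+Rotate the pairs by 1.0 rad and 0.316 rad respectively
+         ↓
+New vector: [0.779, 3.802, 3.065, 1.244]  (carries position information)
+```
+
+**Check that the vector length is unchanged:**
+```
+|original| = √(3²+4²+1²+0²) = √26 ≈ 5.099
+|new|      = √(0.779²+3.802²+3.065²+1.244²) ≈ 5.099  ✅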
+```
+
+---
+
+### Q4：为什么要分组？
+
+**原因：旋转是 2D 操作，需要成对的数据。**
+
+- **1D 向量**：无法旋转（没有平面）
+- **2D 向量**：可以在平面上旋转
+- **高维向量**：分解成多个 2D 对，每对独立旋转
+
+**类比：**
+```
+你无法旋转一条线（1D）
+但可以旋转一个平面上的点（2D）
+3D 物体可以分解为多个 2D 切面分别旋转
+```
+
+---
+
+### Q5：如何分组？分组规则是什么？
+
+**固定规则：前 d/2 个和后 d/2 个配对**
+
+$$\text{第 } j \text{ 对} = [x_j, x_{j+d/2}], \quad j = 0, 1, ..., d/2-1$$
+
+#### 例子 1：d=4
+```
+向量: [x₀, x₁, x₂, x₃]
+
+第0对: [x₀, x₂]  → 用 freq_0 旋转
+第1对: [x₁, x₃]  → 用 freq_1 旋转
+```
+
+#### 例子 2：d=8
+```
+向量: [x₀, x₁, x₂, x₃, x₄, x₅, x₆, x₇]
+
+第0对: [x₀, x₄]  → 用 freq_0 旋转 (高频)
+第1对: [x₁, x₅]  → 用 freq_1 旋转
+第2对: [x₂, x₆]  → 用 freq_2 旋转
+第3对: [x₃, x₇]  → 用 freq_3 旋转 (低频)
+```
+
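+**Why pair this way?**
+- Each pair gets its own frequency
+- This gives a multi-scale position encoding
+- This half-split pairing is the convention used by this project and by GPT-NeoX-style implementations; the RoPE paper pairs adjacent dimensions $[x_{2j}, x_{2j+1}]$, which is the same scheme up to a reordering of dimensions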
+
+---
+
+### Q6：分组后如何旋转？
+
+**每一对使用自己的频率和角度独立旋转**
+
+#### 标准旋转公式（第 j 对）
+
+对 $[x_j, x_{j+d/2}]$：
+
+$$x'_j = x_j \cos(\phi_j) - x_{j+d/2} \sin(\phi_j)$$
+$$x'_{j+d/2} = x_{j+d/2} \cos(\phi_j) + x_j \sin(\phi_j)$$
+
+其中：$\phi_j = \text{pos} \times \text{freq}_j$
+
+#### 完整计算例子（d=8, pos=2, theta=10）
+
+**步骤 1：计算频率**
+```
+freq_0 = 1.0
+freq_1 = 0.562
+freq_2 = 0.316
+freq_3 = 0.178
+```
+
+**步骤 2：计算角度**
+```
+angle_0 = 2 × 1.0 = 2.0 rad
+angle_1 = 2 × 0.562 = 1.124 rad
+angle_2 = 2 × 0.316 = 0.632 rad
+angle_3 = 2 × 0.178 = 0.356 rad
+```
+
+**步骤 3：分组旋转**
+```
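+Input: [1, 2, 3, 4, 5, 6, 7, 8]
+
+Pair 0 [1, 5] rotated by 2.0 rad   → [-4.963, -1.171]
+Pair 1 [2, 6] rotated by 1.124 rad → [-4.550, 4.393]
+Pair 2 [3, 7] rotated by 0.632 rad → [-1.718, 7.419]
+Pair 3 [4, 8] rotated by 0.356 rad → [0.964, 8.892]
+(computed from unrounded frequencies; each pair keeps its length, e.g. 1² + 5² = 26 ≈ 4.963² + 1.171²)
+```
+
+**Step 4: concatenate the results**
+```
+Output: [-4.963, -4.550, -1.718, 0.964, -1.171, 4.393, 7.419, 8.892]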
+```
+
+---
+
+## 4️⃣ 为什么 RoPE 有效？
+
+### 信息保留机制：旋转为什么能保存原始信息？
+
+#### 向量的两个独立属性
+
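+A vector can be described by two properties:
+- **Magnitude (length)**: how strong the feature is
+- **Direction (angle)**: where it points; this carries content and, after RoPE, position as well
+
+**What RoPE does:** it writes position into the direction only, using an angle that depends on position and not on content, and leaves the magnitude untouched.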
+
+#### 旋转变换的数学性质
+
+对任意 2D 向量 $[a, b]$ 进行旋转变换：
+
+$$\begin{bmatrix} a' \\ b' \end{bmatrix} = \begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}$$
+
+**关键性质：**
+- ✅ **幅度（长度）完全不变**
+- ⚠️ **方向改变了 $\phi$ 角度**
+
+#### 数学证明
+
+旋转后向量的长度：
+$$|v'| = \sqrt{(a')^2 + (b')^2}$$
+
+展开：
+$$= \sqrt{(a\cos\phi - b\sin\phi)^2 + (b\cos\phi + a\sin\phi)^2}$$
+
+$$= \sqrt{a^2\cos^2\phi - 2ab\cos\phi\sin\phi + b^2\sin^2\phi + b^2\cos^2\phi + 2ab\sin\phi\cos\phi + a^2\sin^2\phi}$$
+
+$$= \sqrt{a^2(\cos^2\phi + \sin^2\phi) + b^2(\sin^2\phi + \cos^2\phi)}$$
+
+利用三角恒等式 $\cos^2\phi + \sin^2\phi = 1$：
+$$= \sqrt{a^2 + b^2} = |v|$$
+
+**结论：旋转前后长度完全相同！** ✅
+
+#### 直观例子
+
+**例1：简单向量**
+```
+Original vector [3, 4]:
+  magnitude = √(3² + 4²) = √25 = 5
+  direction = arctan(4/3) ≈ 53.13°
+
+Rotated by 90° it becomes [-4, 3]:
+  magnitude = √((-4)² + 3²) = √25 = 5  ✅ unchanged
+  direction = atan2(3, -4) ≈ 143.13°   (increased by 90°)
+```
+
+**例2：向量 [1, 0]**
+```
+原始：幅度 = 1，方向 = 0°
+
+旋转 45° → [0.707, 0.707]:
+  幅度 = √(0.707² + 0.707²) ≈ 1  ✅
+  
+旋转 90° → [0, 1]:
+  幅度 = √(0² + 1²) = 1  ✅
+  
+旋转 180° → [-1, 0]:
+  幅度 = √((-1)² + 0²) = 1  ✅
+```
+
+**无论旋转多少角度，长度始终是 1！**
+
+#### 信息分离的意义
+
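+| Property | What it encodes | How it changes | Preserved? |
+|------|---------|---------|---------|
+| **Magnitude** | feature strength | unchanged | 100% ✅ |
+| **Direction** | content plus position | shifted by a position-dependent angle | position appears as a relative offset ✅ |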
+
+**对比传统加法编码：**
+```
+Additive PE: x_new = x + PE
+  original vector: [3, 4], length = 5
+  position encoding: [0.5, 0.5]
+  result: [3.5, 4.5], length = √(3.5² + 4.5²) ≈ 5.7  ❌ length changes
+
+RoPE: x_new = Rotate(x, φ)
+  original vector: [3, 4], length = 5
+  rotate by 30°
+  result: [0.60, 4.96], length = 5  ✅ length preserved
+```
+
+#### 为什么这很重要？
+
+在深度学习中：
+- **向量长度**通常表示特征的"强度"或"置信度"
+- **Attention 机制**中，内积计算既需要语义相似度，也需要位置关系
+- **RoPE** 让模型同时获得两种信息，且互不干扰
+
+**类比：指南针**
+```
+指针长度 = 指示强度（信号强弱）
+指针方向 = 指示方位（东南西北）
+
+旋转指针时：
+- 指针长度不变 → 信号强度不变
+- 指针方向改变 → 方位信息改变
+
+两种信息完美分离！
+```
+
+---
+
+### 核心机制：相对位置自动编码
+
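+When computing attention, for a single 2D pair:
+$$Q_{pos=i} \cdot K_{pos=j} = |q|\,|k| \cos\big(\alpha + \phi_i - \phi_j\big)$$
+
+where $q, k$ are the unrotated pair and $\alpha$ is the angle from $k$ to $q$; for $q = k$ this reduces to $|q|^2\cos(\phi_i - \phi_j)$. The full score sums this over all $d/2$ pairs.
+
+**Key properties:**
+- The inner product contains the angle difference $\Delta\phi = \phi_i - \phi_j$
+- The angle difference depends only on the **relative position** $\Delta pos = i - j$
+- The score depends on relative offsets, not on absolute positions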
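+
+The relative-position property can also be written in matrix form. A short derivation for one pair, with $R_\phi$ the 2D rotation matrix defined earlier and $q, k$ the unrotated pair:
+
+$$(R_{\phi_i} q)^\top (R_{\phi_j} k) = q^\top R_{\phi_i}^\top R_{\phi_j} k = q^\top R_{\phi_j - \phi_i} k$$
+
+because $R_\phi^\top = R_{-\phi}$ and rotations compose by adding angles. Since $\phi_j - \phi_i = (p_j - p_i)\,\text{freq}$, only the offset $p_j - p_i$ enters the score.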
+
+### 例子：Attention 矩阵
+
+3 个 token，原始向量都是 `[1, 0]`（语义相同）
+
+```
+旋转后：
+pos=0: [1.0, 0.0]           (不旋转)
+pos=1: [0.54, 0.84]         (旋转 1 rad)
+pos=2: [-0.42, 0.91]        (旋转 2 rad)
+
+Attention 内积矩阵 (Q·K^T)：
+       k0     k1     k2
+q0    1.00   0.54  -0.42
+q1    0.54   1.00   0.54   ← 注意：相同相对位置 = 相同值
+q2   -0.42   0.54   1.00
+```
+
+**观察：**
+- 对角线全是 1.0（自己和自己）
+- 相邻位置（±1）的值都是 0.54
+- **相同的相对位置 → 相同的 attention 权重**
+
+---
+
+## 5️⃣ RoPE 的设计思路
+
+### 背景：传统位置编码的问题
+
+| 方法 | 问题 |
+|------|------|
+| **Sinusoidal PE** | 加法破坏语义，位置信息分散 |
+| **Learned PE (BERT)** | 无法外推到训练长度之外 |
+| **ALiBi** | 只能在 Attention 处理，不够通用 |
+
+### 创新点
+
+#### 1. 旋转替代加法
+- **之前**：`x_new = x + PE(pos)` ❌
+- **RoPE**：`x_new = Rotate(x, angle)` ✅
+
+#### 2. 复数域的优雅
+利用复数乘法的旋转性质：$z' = z \cdot e^{i\phi}$
+
+#### 3. 自动的相对位置
+Q·K^T 自动包含相对位置信息，无需显式计算
+
+#### 4. 长度外推能力
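+The rotation-angle difference for a given relative offset is the same at any absolute position. This helps with longer inputs, although positions far beyond the training length usually still need a scaling technique.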
+
+### 灵感来源
+
+1. **复数旋转**：数学上旋转保留向量长度
+2. **傅里叶级数**：多频率分解信号（多尺度）
+3. **相对位置编码**：只关心相对关系，不关心绝对位置
+
+---
+
+## 6️⃣ RoPE 的优势
+
+### 与其他位置编码对比
+
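+| Property | Traditional PE | Learned PE | RoPE |
+|------|---------|-----------|------|
+| **Norm preserved** | ❌ additive | ❌ additive | ✅ rotation keeps the norm |
+| **Length extrapolation** | ⚠️ limited | ❌ none | ⚠️ better, usually with scaling |
+| **Relative encoding** | ❌ indirect | ❌ absolute | ✅ direct |
+| **Extra parameters** | ✅ none | ❌ required | ✅ none |
+| **Multi-scale** | ✅ yes | ❌ no | ✅ yes |
+| **Mathematical elegance** | ⚠️ heuristic | ❌ engineering-driven | ✅ principled |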
+
+### 实际效果
+
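+1. **Stable training**: the rotation does not change vector norms
+2. **Longer contexts**: extending from the training length (e.g. 512) to 2048+ works well when combined with a scaling technique such as position interpolation
+3. **Efficient**: only needs sines and cosines, and parallelizes easily
+4. **Widely used**: LLaMA, Qwen, GLM and other mainstream models use RoPE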
+
+---
+
+## 7️⃣ 实现伪代码
+
+```python
+import math
+
+def rope(x, pos, theta=10000):
+    """
+    Rotate x in place and return it.
+    x: list of floats, length d (d must be even)
+    pos: position ID
+    theta: base
+    """
+    d = len(x)
+    half = d // 2
+
+    for j in range(half):
+        # Frequency and angle for pair j
+        freq_j = 1.0 / (theta ** (2 * j / d))
+        angle_j = pos * freq_j
+
+        # Pair j: element j of the first half with element j of the second half
+        a = x[j]
+        b = x[j + half]
+
+        # 2D rotation
+        cos_val = math.cos(angle_j)
+        sin_val = math.sin(angle_j)
+
+        x[j] = a * cos_val - b * sin_val
+        x[j + half] = b * cos_val + a * sin_val
+
+    return x
+```
+
+---
+
+## 8️⃣ 关键参数
+
+### out, in, pos_ids
+
+- **in**：输入张量，shape `[seqlen, nhead, d]`
+- **out**：输出张量，shape `[seqlen, nhead, d]`
+- **pos_ids**：位置 ID，shape `[seqlen]`，dtype `int64`
+- **theta**：基数，通常 10000
+
+### 三重循环结构
+
+```cpp
+for (seq = 0; seq < seqlen; ++seq) {
+    int64_t pos = pos_ids[seq];  // 获取位置
+    
+    for (head = 0; head < nhead; ++head) {
+        
+        for (j = 0; j < d/2; ++j) {
+            // 计算频率和角度
+            float freq = 1.0f / powf(theta, 2.0f * j / d);
+            float angle = pos * freq;
+            
+            // 旋转第 j 对
+            float a = in[seq][head][j];
+            float b = in[seq][head][j + d/2];
+            
+            out[seq][head][j]       = a * cos(angle) - b * sin(angle);
+            out[seq][head][j + d/2] = b * cos(angle) + a * sin(angle);
+        }
+    }
+}
+```
+
+---
+
+## 9️⃣ 常见问题 FAQ
+
+### Q: 为什么叫"旋转"位置编码？
+A: 因为数学上是 2D 旋转矩阵，在复数域就是 $z \cdot e^{i\phi}$。
+
+### Q: 旋转为什么可以保存原始信息？向量不是改变了吗？
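+A: Rotation changes only the **direction** of each 2D pair, not its **magnitude (length)**. It is also invertible, so no information about the original vector is lost; the position is encoded in how far each pair has been turned.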
+
+**验证公式：** 旋转后的向量长度 $|v'| = \sqrt{(a')^2 + (b')^2} = \sqrt{a^2+b^2} = |v|$
+
+**例子：**
+- 原始：`[3, 4]` → 长度 = 5（语义：重要性）
+- 旋转90°：`[-4, 3]` → 长度 = 5（语义保留✅），方向改变（位置信息）
+
+### Q: RoPE 如何知道 Q 和 K 的相对位置？
+A: 通过**内积运算自动提取角度差**。
+
+**数学原理：**
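+For a single 2D pair where $Q$ and $K$ start from the same unit vector:
+$$Q_i \cdot K_j = \cos(\phi_i - \phi_j) = \cos((i-j) \times \text{freq})$$
+
+In general the product is $|q||k|\cos(\alpha_q - \alpha_k + (i-j) \times \text{freq})$, where $\alpha_q$ and $\alpha_k$ are the angles of the unrotated pair. Either way, position enters only through the **relative offset $(i-j)$**, never through the absolute positions.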
+
+**例子：**
+- Q在位置1，K在位置3：内积 = $\cos(-2 \times \text{freq})$
+- Q在位置10，K在位置12：内积 = $\cos(-2 \times \text{freq})$（相同！）
+
+模型学到的是"相距2个位置的token应该有多大的attention"，而非"位置1和3"这样的绝对关系。
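+
+The claim can be checked numerically. The sketch below assumes the half-split pairing used in the pseudocode of these notes and two arbitrary example vectors `q` and `k` (made up for illustration); it compares the score at positions (1, 3), (10, 12), and, for contrast, (1, 4):
+
+```python
+import math
+
+def rope(x, pos, theta=10000.0):
+    """Half-split RoPE on a list of even length; returns a new list."""
+    d = len(x)
+    half = d // 2
+    out = list(x)
+    for j in range(half):
+        angle = pos * (1.0 / theta ** (2 * j / d))
+        c, s = math.cos(angle), math.sin(angle)
+        out[j] = x[j] * c - x[j + half] * s
+        out[j + half] = x[j + half] * c + x[j] * s
+    return out
+
+def dot(u, v):
+    return sum(a * b for a, b in zip(u, v))
+
+q = [0.3, -1.2, 0.5, 0.8]   # arbitrary example query
+k = [1.0, 0.4, -0.7, 0.2]   # arbitrary example key
+
+near = dot(rope(q, 1), rope(k, 3))     # offset -2 near the start
+far = dot(rope(q, 10), rope(k, 12))    # offset -2 further along
+other = dot(rope(q, 1), rope(k, 4))    # offset -3, for contrast
+print(near, far, other)
+```
+
+`near` and `far` agree up to floating-point error, while `other` differs because the offset changed.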
+
+### Q: RoPE 是如何一开始就知道 Q 和 K 在什么位置的？
+A: **通过显式传入的 `pos_ids` 参数**。
+
+RoPE 本身**不会自动知道位置**，而是需要外部告诉它每个 token 的位置信息。
+
+**具体流程：**
+1. **输入数据**：有一个序列 `["我", "爱", "中国"]`
+2. **位置标记**：系统给每个 token 分配位置 ID：`[0, 1, 2]`
+3. **传入 RoPE**：
+   ```python
+   rope(in, pos_ids=[0, 1, 2], ...)
+   ```
+4. **RoPE 处理**：
+   - 对位置 0 的 token：旋转角度 = 0 × freq = 0
+   - 对位置 1 的 token：旋转角度 = 1 × freq
+   - 对位置 2 的 token：旋转角度 = 2 × freq
+
+**pos_ids 的来源：**
+- **训练/推理时**：通常是简单的序列 `[0, 1, 2, ..., seqlen-1]`
+- **特殊情况**：可以自定义（如处理填充、缓存等场景）
+
+**示例代码：**
+```python
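+# Inside a Transformer layer (sketch; self.w_q and self.w_k are hypothetical projection layers)
+import torch
+
+def forward(self, input_ids):
+    # 1. Sequence length
+    seqlen = input_ids.shape[1]
+
+    # 2. Position IDs: [0, 1, 2, ..., seqlen-1]
+    pos_ids = torch.arange(seqlen)
+
+    # 3. Embedding
+    x = self.embedding(input_ids)
+
+    # 4. Project to Q and K, then apply RoPE with explicit positions
+    q = rope(self.w_q(x), pos_ids)
+    k = rope(self.w_k(x), pos_ids)
+
+    # 5. Attention scores
+    attn = q @ k.transpose(-2, -1)
+    return attn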
+```
+
+**关键点：**
+- pos_ids 是**外部输入**，不是 RoPE 内部计算的
+- RoPE 只负责"根据给定的位置 ID 进行旋转"
+- 位置信息的获取由调用者（Transformer 框架）负责
+
+### Q: 为什么能长度外推？
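+A: Because attention scores depend on relative offsets: a distance of Δ=1 produces the same rotation-angle difference at any absolute position. This helps, but it is not a guarantee; offsets larger than any seen in training are still out of distribution, which is why long-context models usually add a scaling method such as position interpolation.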
+
+### Q: 为什么多个频率？
+A: 不同频率捕捉不同尺度的位置关系（短距离 vs 长距离）。
+
+### Q: theta=10000 是怎么定的？
+A: 经验值，论文设定。太小会导致长序列时角度重复，太大会导致短序列时分辨率不够。
+
+### Q: 能用于其他模态（图像/音频）吗？
+A: 可以，但需要调整（如 2D RoPE for Vision Transformer）。
+
+### Q: 如果向量维度是奇数怎么办？
+A: **RoPE 要求维度必须是偶数**，这是设计上的硬性约束。
+
+**原因：**
+- 旋转是 2D 操作，必须成对进行
+- 每对需要 2 个元素：$[x_j, x_{j+d/2}]$
+- 如果 $d$ 是奇数，无法完美配对
+
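+**In practice:**
+RoPE is applied per attention head, so the dimension that must be even is the head dimension `d` (for example 128), not the hidden size. Models that use RoPE pick even values for both:
+- **LLaMA**: hidden sizes 4096, 5120, 6656 (all even)
+- **Qwen**: hidden sizes 4096, 8192 (all even)
+
+GPT-2 and GPT-3 use learned absolute position embeddings rather than RoPE, although their hidden sizes (768, 1024, ...) are even as well.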
+
+**如果真的遇到奇数维度：**
+1. **填充到偶数**（最简单）
+   ```
+   Original dimension d=127
+   Append one 0 → d'=128
+   Apply RoPE to all 128 dimensions (64 pairs) and keep the 128-dim result
+   ```
+
+2. **降维到偶数**（不推荐）
+   ```
+   d=127 → 丢弃最后一维 → d'=126
+   ```
+
+3. **只对前 d-1 维应用 RoPE**
+   ```
+   d=127
+   对前 126 维应用 RoPE（63对）
+   最后 1 维保持不变
+   ```
+
+**为什么模型设计者总选偶数维度？**
+- 便于 RoPE 等位置编码
+- 便于并行计算（2的幂次更优）
+- 便于矩阵分块（多头注意力）
+
+**验证代码：**
+```python
+# 在 RoPE 实现中通常有这样的检查
+assert d % 2 == 0, f"RoPE requires even dimension, got {d}"
+```
+
+**结论：** RoPE 的设计就是为偶数维度优化的，实践中不会遇到奇数维度的情况。
+
+---
+
+## 🔟 时间线与应用
+
+```
+2017: Transformer 原论文（Sinusoidal PE）
+      ↓
+2021: RoPE 论文发布（RoFormer）
+      ↓
+2023: LLaMA、Qwen 等大模型广泛采用
+      ↓
+2024: 成为事实标准位置编码
+```
+
+### 主流应用模型
+- **LLaMA 系列**（Meta）
+- **Qwen 系列**（阿里）
+- **GLM 系列**（智谱）
+- **DeepSeek 系列**
+
+---
+
+## 📝 总结
+
+### 一句话总结
+**RoPE 通过复数旋转，将位置信息优雅地编码到向量方向中，实现了相对位置建模和长度外推。**
+
+### 核心优势
+1. ✅ 保留向量语义（旋转不改变幅度）
+2. ✅ 自动相对位置编码
+3. ✅ 长度外推能力强
+4. ✅ 数学优雅、实现简单
+5. ✅ 无额外参数、计算高效
+
+### 为什么重要？
+在大模型时代，RoPE 解决了位置编码的核心痛点（长度外推），成为了 Transformer 架构的标准组件。
+
+---
+
+## 📚 参考资料
+
+- **论文**：Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021)
+- **应用**：LLaMA, Qwen, GLM 等模型的技术报告
+- **实现**：本项目 `src/ops/rope/op.cpp`
+
+---
+
+*笔记整理时间：2026年2月3日*  
+*基于 LLAISYS 项目学习经历*
diff --git "a/Self-Attention\347\237\245\350\257\206\347\254\224\350\256\260.md" "b/Self-Attention\347\237\245\350\257\206\347\254\224\350\256\260.md"
new file mode 100644
index 000000000..8a49c9b3b
--- /dev/null
+++ "b/Self-Attention\347\237\245\350\257\206\347\254\224\350\256\260.md"
@@ -0,0 +1,909 @@
+# Self-Attention 知识笔记
+
+> 从 RoPE + RMSNorm 之后，Transformer 如何通过 Self-Attention 学习前文信息
+
+---
+
+## 一、核心概念与位置
+
+### Self-Attention 在 Transformer 中的角色
+
+```
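+Transformer block, end to end:
+
+Input x [seqlen, d_model]
+  ↓
+RMSNorm(x)          ← [covered: normalizes the input, stabilizes training]
+  ↓
+Linear projections  ← Q = x·W_q, K = x·W_k, V = x·W_v
+  ↓
+RoPE(Q), RoPE(K)    ← [covered: rotary position encoding, applied to Q and K only]
+  ↓
+Self-Attention      ← [this chapter: each token gathers the key information from its context]
+  ↓
+Residual connection + second RMSNorm
+  ↓
+FFN (Feed-Forward Network)
+  ↓
+Output y [seqlen, d_model]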
+```
+
+**核心作用**：给定前文，通过注意力机制聚合相关的历史信息，为生成下一个词提供"上下文"。
+
+---
+
+## 二、Q、K、V 的来源与含义
+
+### 2.1 线性投影（来自学习的权重矩阵）
+
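+The input to Self-Attention is the normalized vector `x`. Three **learnable linear transforms** produce Q, K, and V, and RoPE is then applied to Q and K (not to `x`, and not to V) to inject position information:
+
+```
+x [seqlen, d_model=4096]   (d = dv = 128)
+  ├─→ times W_q [4096, nhead×d]     → Q [seqlen, nhead, d]         → RoPE
+  ├─→ times W_k [4096, nkvhead×d]   → K [total_len, nkvhead, d]    → RoPE
+  └─→ times W_v [4096, nkvhead×dv]  → V [total_len, nkvhead, dv]
+```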
+
+**关键点**：虽然 Q、K、V 都来自同一个 x，但它们经过**不同的权重矩阵**，因此突出了 x 的不同方面。
+
+### 2.2 参数从何而来？
+
+**W_q、W_k、W_v 都是神经网络的可学习参数！**
+
+#### 初始化阶段
+```
+模型创建时：
+  W_q ← 随机初始化（正态分布）
+  W_k ← 随机初始化
+  W_v ← 随机初始化
+```
+
+#### 训练阶段（反向传播）
+
+```
+每次训练迭代：
+
+前向传播：
+  Q = x @ W_q
+  K = x @ W_k
+  V = x @ W_v  ← 注意：这里也会被优化
+  attn_output = softmax(Q @ K.T / sqrt(d)) @ V
+  pred = FFN(attn_output)
+
+计算损失：
+  loss = CrossEntropy(pred, 真实标签)
+
+反向传播梯度链（非常重要）：
+  loss
+    ↓ (从后续层反向传播回来)
+  attn_output的梯度
+    ↓ (softmax_weights @ V 的梯度分解为两部分)
+  ├─→ softmax_weights的梯度
+  │     ↓
+  │   Q @ K^T 的梯度  ← 传给W_q、W_k
+  │
+  └─→ V的梯度
+        ↓
+      x @ W_v 的梯度  ← 传给W_v！
+      
+所以，W_v也有梯度和更新！
+
+Parameter update (gradient descent); since Q = x @ W_q, the chain rule puts x.T on the left:
+  ∂loss/∂W_q = x.T @ ∂loss/∂Q
+  ∂loss/∂W_k = x.T @ ∂loss/∂K
+  ∂loss/∂W_v = x.T @ ∂loss/∂V  ← W_v is updated too
+  
+  W_q_new = W_q_old - learning_rate × ∂loss/∂W_q
+  W_k_new = W_k_old - learning_rate × ∂loss/∂W_k
+  W_v_new = W_v_old - learning_rate × ∂loss/∂W_v
+```
+
+**三个权重矩阵都被同时训练！**
+
+- **W_q** 学会：如何在"查询空间"中突出需要什么信息
+- **W_k** 学会：如何在"键空间"中突出是什么信息（与Q匹配）
+- **W_v** 学会：如何提取什么样的"值向量"来被聚合
+
+**经过数百万个迭代后**，W_q、W_k、W_v 逐渐学会了提取和匹配语义特征。
+
+### 2.3 Q、K、V 的语义含义
+
+```
+Q（Query，查询）：
+  ├─ 含义：当前词"我需要什么信息？"
+  ├─ 作用：在"查询特征空间"中的投影
+  ├─ 形状：[seqlen, nhead, d]
+  └─ 用途：用来与其他词的K比对
+
+K（Key，键）：
+  ├─ 含义：前文词"我是什么？"
+  ├─ 作用：在"键特征空间"中的投影
+  ├─ 形状：[total_len, nkvhead, d]  (可能包含KV Cache)
+  └─ 用途：被Q查询匹配
+
+V（Value，值）：
+  ├─ 含义：前文词"我的信息是什么？"
+  ├─ 作用：在"值特征空间"中的投影
+  ├─ 形状：[total_len, nkvhead, dv]
+  └─ 用途：被加权聚合
+```
+
+---
+
+## 三、Self-Attention 的数学原理
+
+### 3.1 核心公式
+
+$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$
+
+### 3.2 各步骤详解
+
+#### Step 1：计算相似度矩阵
+
+$$A = \frac{QK^T}{\sqrt{d}}$$
+
+```
+形状计算：
+Q [seqlen, d] × K^T [d, total_len]  = A [seqlen, total_len]
+
+含义：
+A[i, j] = (Q[i] · K[j]) / sqrt(d)
+        = 第i个词的查询与第j个词的键的相似度（归一化）
+
+例子（seqlen=3, total_len=3）：
+A = [[Q0·K0, Q0·K1, Q0·K2],   ← 词0的查询与所有词的键相似度
+     [Q1·K0, Q1·K1, Q1·K2],   ← 词1的查询与所有词的键相似度
+     [Q2·K0, Q2·K1, Q2·K2]]   ← 词2的查询与所有词的键相似度
+```
+
+**为什么点积能衡量相似度？**
+
+```
+两个向量的点积反映它们的"对齐程度"：
+
+v1 = [1, 0]（指向东），v2 = [1, 0]（也指向东）
+v1 · v2 = 1  ← 完全相同，相似度最大
+
+v3 = [0, 1]（指向北）
+v1 · v3 = 0  ← 垂直，毫无关联
+
+在Self-Attention中：
+  训练时，W_q和W_k会学到在"查询空间"和"键空间"中
+  都强调语义关键特征的投影方式
+  
+  → Q·K^T 自动发现语义相关的词对
+  → 相关词的相似度高，不相关词的相似度低
+```
+
+#### Step 2：应用 Causal Mask（因果掩码）
+
+```
+生成任务必须是因果的：第i个词只能看到前i个词（包括自己）
+
+Causal Mask 矩阵：
+     词0  词1  词2
+词0   ✓    ✗    ✗     可以看   不能看（未来）
+词1   ✓    ✓    ✗
+词2   ✓    ✓    ✓
+
+实现：将未来位置的分数设为 -∞
+
+A_masked = A + Mask
+         
+如果 Mask[i, j] = 0（i能看j），保持原值
+如果 Mask[i, j] = -∞（i不能看j），设为 -∞
+
+例如：
+A_masked[0] = [0.5, -∞, -∞]  ← 词0只关注自己
+A_masked[2] = [0.3, 0.2, 0.4]  ← 词2能看全部
+```
+
+#### Step 3：Softmax 归一化
+
+$$\text{Attention Weights} = \text{softmax}(A_{masked})$$
+
+```
+对每一行独立做softmax（将分数转换为概率）：
+
+weights[i] = exp(A_masked[i]) / sum(exp(A_masked[i]))
+
+例子：
+A_masked[2] = [0.3, 0.2, 0.4]
+
+exp(values) = [e^0.3, e^0.2, e^0.4] = [1.35, 1.22, 1.49]
+sum = 4.06
+
+weights[2] = [1.35/4.06, 1.22/4.06, 1.49/4.06]
+           = [0.33, 0.30, 0.37]
+
+性质：
+  ✓ 所有元素都是非负数
+  ✓ 每一行的和为 1.0（概率分布）
+  ✓ 最高的值对应"最需要关注"的词
+```
+
+#### Step 4：加权求和聚合信息
+
+$$\text{Output} = \text{Attention Weights} \times V$$
+
+```
+形状计算：
+Attention Weights [seqlen, total_len] × V [total_len, dv]
+  = Output [seqlen, dv]
+
+含义：
+output[i] = sum(weights[i, j] * V[j] for j in range(total_len))
+          = 用注意力权重加权求和所有词的value向量
+
+例子：
+output[2] = 0.33 * V[0] + 0.30 * V[1] + 0.37 * V[2]
+
+              词0的值        词1的值        词2的值
+           (权重0.33)    (权重0.30)    (权重0.37)
+
+结果：output[2] 是一个融合了三个词信息的向量
+      其中词2的信息最多（权重0.37），词0次之（0.33）
+```
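+
+Steps 2 to 4 can be sketched for a single head in plain Python. This is an illustration under simplifying assumptions (no KV Cache, so `seqlen == total_len`, and scores already scaled); the function names are not taken from the project:
+
+```python
+import math
+
+def masked_softmax(row):
+    """Softmax over one row of scores; entries equal to -inf get weight 0."""
+    m = max(row)                                # subtract the max for stability
+    exps = [math.exp(s - m) for s in row]       # exp(-inf) evaluates to 0.0
+    total = sum(exps)
+    return [e / total for e in exps]
+
+def causal_attention(scores, V):
+    """scores: [seqlen][seqlen] scaled Q.K^T; V: [seqlen][dv]. Returns [seqlen][dv]."""
+    out = []
+    for i, row in enumerate(scores):
+        # Causal mask: query i may only see keys j <= i.
+        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
+        w = masked_softmax(masked)
+        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
+                    for c in range(len(V[0]))])
+    return out
+
+weights = masked_softmax([0.3, 0.2, 0.4])
+print([round(w, 2) for w in weights])   # [0.33, 0.3, 0.37]
+```
+
+Subtracting the row maximum before `exp` does not change the result but avoids overflow for large scores.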
+
+---
+
+## 四、完整工作流程示例
+
+### Scenario: a language model generating "鱼" (fish)
+
+**输入**："猫 喜欢 吃 ___"  
+**目标**：生成"鱼"（或其他食物）
+
+### 前向推理过程
+
+```
+词序列：  猫    喜欢   吃
+位置ID：  0     1      2
+
+Step 1: Embedding (followed by RMSNorm)
+  x[0] = embedding("猫")    [d_model=4096]
+  x[1] = embedding("喜欢")  [d_model=4096]
+  x[2] = embedding("吃")    [d_model=4096]
+
+Step 2: Project to Q, K, V, then apply RoPE to Q and K
+  Q = RoPE(x @ W_q, pos)  [3, 32 heads, 128 dims]
+  K = RoPE(x @ W_k, pos)  [3, 8 heads, 128 dims]   (GQA: 8 shared heads)
+  V = x @ W_v             [3, 8 heads, 128 dims]
+  
+  详细值（简化为3维）：
+  Q[0]"猫" = [0.3, 0.1, 0.5]
+  Q[1]"喜欢" = [0.2, 0.4, 0.1]
+  Q[2]"吃" = [0.8, 0.3, 0.2]
+  
+  K[0]"猫" = [0.25, 0.15, 0.5]
+  K[1]"喜欢" = [0.22, 0.38, 0.15]
+  K[2]"吃" = [0.75, 0.35, 0.25]
+  
+  V[0]"猫" = [值向量...]
+  V[1]"喜欢" = [值向量...]
+  V[2]"吃" = [值向量...]
+
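+Step 3: Compute similarity
+  A = Q @ K^T / sqrt(d)   (d = 128 in the real model)
+
+  For the 3-dim toy vectors above, the raw dot products (scaling omitted for readability) are:
+  A[0] = [0.340, 0.179, 0.385]  ← Q[0] against every K
+  A[1] = [0.160, 0.211, 0.315]
+  A[2] = [0.345, 0.320, 0.755]  ← Q[2] against every K
+
+  Observation:
+    A[2, 2] = 0.755 is the largest score for "吃", followed by A[2, 0] = 0.345 for "猫".
+    (In a trained model, W_q and W_k shape these scores, for example so that an action scores highly against its subject.)
+
+Step 4: Apply the causal mask
+  A_masked[0] = [0.340, -∞, -∞]
+  A_masked[1] = [0.160, 0.211, -∞]
+  A_masked[2] = [0.345, 0.320, 0.755]
+
+Step 5: Softmax
+  weights[0] = softmax([0.340, -∞, -∞]) = [1.0, 0, 0]
+  weights[1] = softmax([0.160, 0.211, -∞]) ≈ [0.49, 0.51, 0]
+  weights[2] = softmax([0.345, 0.320, 0.755]) ≈ [0.29, 0.28, 0.43]
+
+  → weight of "吃" on "猫": 0.29
+  → weight of "吃" on "喜欢": 0.28
+  → weight of "吃" on itself: 0.43 (highest)
+
+Step 6: Weighted aggregation
+  output[2] = 0.29 * V[0]"猫" + 0.28 * V[1]"喜欢" + 0.43 * V[2]"吃"
+            = a blended vector carrying information from all three tokens
+
+  This vector now encodes:
+    - the subject "猫" (weight 0.29)
+    - that the cat likes doing something (weight 0.28)
+    - the action "吃" itself (weight 0.43)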
+
+Step 7：后续层处理
+  output[2] 传给 FFN 和其他Transformer层
+  最终输出 logits，解码得到 "鱼"
+```
+
+**这就是模型"知道"前文信息的原理！**
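+
+Toy numbers like these are easy to recompute from the Q and K vectors listed in Step 2. A sketch in plain Python, assuming raw dot products (no `1/sqrt(d)` scaling) and applying the causal mask by truncating each row:
+
+```python
+import math
+
+Q = [[0.3, 0.1, 0.5], [0.2, 0.4, 0.1], [0.8, 0.3, 0.2]]          # 猫, 喜欢, 吃
+K = [[0.25, 0.15, 0.5], [0.22, 0.38, 0.15], [0.75, 0.35, 0.25]]
+
+def softmax(row):
+    exps = [math.exp(s) for s in row]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+# Raw dot products: A[i][j] = Q[i] . K[j]
+A = [[sum(q * k for q, k in zip(Q[i], K[j])) for j in range(3)] for i in range(3)]
+
+# Causal mask by truncation: row i keeps keys 0..i, later keys get weight 0.
+weights = [softmax(A[i][: i + 1]) + [0.0] * (2 - i) for i in range(3)]
+
+for i in range(3):
+    print([round(a, 3) for a in A[i]], [round(w, 2) for w in weights[i]])
+```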
+
+---
+
+## 五、GQA（Grouped Query Attention）与 KV Cache
+
+### 5.1 为什么要用 GQA？
+
+```
+标准 Multi-Head Attention（MHA）：
+  Q: [seqlen, nhead=32, d]
+  K: [seqlen, nhead=32, d]
+  V: [seqlen, nhead=32, d]
+  
+  内存占用：大！每个Q头都有独立的K、V头
+
+Grouped Query Attention（GQA）：
+  Q: [seqlen, nhead=32, d]       ← 仍然32个查询头
+  K: [seqlen, nkvhead=8, d]      ← 只有8个键头（共享）
+  V: [seqlen, nkvhead=8, d]      ← 只有8个值头（共享）
+  
+  映射关系（重要）：
+  多个Q头共享一个K/V头
+  
+  K/V_head_idx = Q_head_idx // 4  (每4个Q头共享1个K/V头)
+  
+  具体映射：
+  ┌─────────── Q头0 ┐
+  ├─────────── Q头1 ├──→ K/V头0  (共享同一个K/V)
+  ├─────────── Q头2 │
+  ├─────────── Q头3 ┘
+  ├─────────── Q头4 ┐
+  ├─────────── Q头5 ├──→ K/V头1  (共享同一个K/V)
+  ├─────────── Q头6 │
+  ├─────────── Q头7 ┘
+  ...
+  
+  效果分析：
+  ✓ Q头0-3都使用K/V头0，但用不同的查询方式
+  ✓ 这4个头从同一个"值向量"中提取不同的特征
+  ✓ KV Cache 内存减少 75%（32→8）
+  ✓ 计算量没有显著增加，但推理速度更快
+```
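+
+The head mapping above is a single integer division. A small sketch, assuming `nhead` is a multiple of `nkvhead` (the function name `kv_head_for` is illustrative):
+
+```python
+def kv_head_for(q_head, nhead=32, nkvhead=8):
+    """Index of the K/V head that query head q_head reads from under GQA."""
+    assert nhead % nkvhead == 0, "nhead must be a multiple of nkvhead"
+    group = nhead // nkvhead          # query heads per K/V head (4 here)
+    return q_head // group
+
+mapping = {}
+for h in range(32):
+    mapping.setdefault(kv_head_for(h), []).append(h)
+
+print(mapping[0])     # [0, 1, 2, 3]
+print(mapping[1])     # [4, 5, 6, 7]
+print(len(mapping))   # 8
+```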
+
+#### GQA 与特征学习的关系
+
+```
+不同Q头学习不同的查询模式：
+
+Q头0: "词性匹配" 查询空间
+Q头1: "语义相似" 查询空间
+Q头2: "距离关系" 查询空间
+Q头3: "实体指代" 查询空间
+↓ (都使用) K/V头0
+
+K/V头0: 提供某种"基础特征值"
+  ├─ 被Q头0用"词性匹配"的视角查询
+  ├─ 被Q头1用"语义相似"的视角查询
+  ├─ 被Q头2用"距离关系"的视角查询
+  └─ 被Q头3用"实体指代"的视角查询
+
+结果：
+  同一个V向量被4种不同的"眼光"审视
+  → 4个头得到4种不同的解释
+  → 最终聚合时融合这4种视角
+  
+这就是GQA的巧妙之处：
+  用较少的值向量(8个)，通过多个查询视角(32个)，
+  实现和MHA相当的特征丰富度，但内存占用更少！
+```
+
+#### 完整计算流程详解（重要！）
+
+**以处理第3个词"吃"为例**：
+
+```
+输入：词"吃"的向量 x[2] [d_model=4096]
+
+Step 1: 线性投影生成Q、K、V
+  Q[2] = x[2] @ W_q  → [32头, 128维]
+  K[2] = x[2] @ W_k  → [8头, 128维]
+  V[2] = x[2] @ W_v  → [8头, 128维]
+
+  展开Q（32个头）：
+  Q[2] = [Q头0, Q头1, Q头2, ..., Q头31]  每个头128维
+
+  展开K（8个头）：
+  K[2] = [K头0, K头1, ..., K头7]  每个头128维
+
+  展开V（8个头）：
+  V[2] = [V头0, V头1, ..., V头7]  每个头128维
+
+Step 2: 对每个Q头分别计算注意力（这里是关键！）
+
+  Q头0（第0组）：
+    ├─ 找到对应的K/V头：K/V头0（因为 0 // 4 = 0）
+    ├─ 计算相似度：
+    │   scores_0 = Q头0 @ [K头0[词0], K头0[词1], K头0[词2]]^T
+    │            = [score_00, score_01, score_02]
+    │            ← 与前文所有词的K头0比对
+    ├─ Causal Softmax：
+    │   weights_0 = softmax([score_00, score_01, score_02])
+    │             = [0.3, 0.3, 0.4]
+    └─ 加权求和V：
+        output_0 = 0.3*V头0[词0] + 0.3*V头0[词1] + 0.4*V头0[词2]
+                 = [128维融合向量]
+
+  Q头1（第0组）：
+    ├─ 对应K/V头：K/V头0（因为 1 // 4 = 0，共享同一个K/V）
+    ├─ 计算相似度：
+    │   scores_1 = Q头1 @ [K头0[词0], K头0[词1], K头0[词2]]^T
+    │            = [score_10, score_11, score_12]
+    │            ← 注意：用的是同一个K头0，但Q不同！
+    ├─ Causal Softmax：
+    │   weights_1 = softmax([score_10, score_11, score_12])
+    │             = [0.5, 0.2, 0.3]  ← 与Q头0不同的权重！
+    └─ 加权求和V：
+        output_1 = 0.5*V头0[词0] + 0.2*V头0[词1] + 0.3*V头0[词2]
+                 = [128维融合向量，但权重不同！]
+
+  ... (Q头2、Q头3 也都用K/V头0)
+
+  Q头4（第1组）：
+    ├─ 对应K/V头：K/V头1（因为 4 // 4 = 1）
+    ├─ 计算相似度：
+    │   scores_4 = Q头4 @ [K头1[词0], K头1[词1], K头1[词2]]^T
+    │            ← 用的是K头1，不同的K特征！
+    └─ ...
+  
+  ... (重复32次，每个Q头都独立计算)
+
+Step 3: 拼接所有头的输出
+  final_output = concat(output_0, output_1, ..., output_31)
+               = [32 × 128 = 4096维]
+               ← 融合了32个头的信息
+
+Step 4: Output projection (W_out, part of the standard attention block)
+  result = final_output @ W_out
+         = [d_model=4096维]
+         ← 最终的包含前文信息的向量！
+```
+
+**关键理解**：
+
+```
+✓ 每个Q头只与1个K/V头配对（不是8个）
+  Q头0-3 → K/V头0
+  Q头4-7 → K/V头1
+  ...
+
+✗ 不是"每个头与8个K/V运算"
+  而是"32个Q头分成8组，每组用一个K/V"
+
+✓ 虽然Q头0-3共享K/V头0，但因为Q不同：
+  → Q头0·K头0 得到的相似度 ≠ Q头1·K头0 得到的相似度
+  → 所以attention权重不同
+  → 聚合的V向量权重不同
+  → 4个输出虽然用同一个V，但加权不同，结果不同！
+
+✓ 最终32个头输出拼接：
+  [output_0 | output_1 | ... | output_31]
+  = 包含32种不同视角的信息
+  = 对"吃"这个词的完整上下文理解
+```
+
+**直观比喻**：
+
+```
+书店里有8本书（K/V头0-7）
+你有32副不同的眼镜（Q头0-31）
+
+戴眼镜0-3看书0：
+  ├─ 眼镜0（词性视角）：看到"名词、动词、介词..."
+  ├─ 眼镜1（语义视角）：看到"动物、食物、地点..."
+  ├─ 眼镜2（距离视角）：看到"远、近、相邻..."
+  └─ 眼镜3（指代视角）：看到"主体、客体、代词..."
+
+戴眼镜4-7看书1：
+  （又是4种不同的视角解读另一本书）
+
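+In total: 32 pairs of glasses, each reading only its own group's book = 32 different readings
+Concatenated = the full contextual representation of the current token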
+```
+
+### 5.2 KV Cache 加速推理
+
+```
+场景：生成一个1000词的文本
+
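+Without KV Cache (inefficient):
+  Token 1: compute K, V for the whole sequence [1, nkvhead, d]
+  Token 2: recompute [2, nkvhead, d] ← token 1 is projected again!
+  Token 3: recompute [3, nkvhead, d] ← tokens 1 and 2 are projected again!
+  ...
+  Total K/V projections ∝ 1 + 2 + ... + 1000 ≈ 1000²/2
+
+With KV Cache (efficient):
+  Token 1: compute K1, V1, store in the cache
+  Token 2: compute only K2, V2, append to the cache → K = [K1, K2]
+  Token 3: compute only K3, V3, append to the cache → K = [K1, K2, K3]
+  ...
+  Total K/V projections ∝ 1000  ← linear instead of quadratic
+  (Each step still attends over all cached keys, so step i costs O(i) for Q·K^T.)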
+  
+实现：
+  K_cache = [K1, K2, ..., K_{i-1}]  [已生成词数, nkvhead, d]
+  V_cache = [V1, V2, ..., V_{i-1}]
+  
+  生成第i个词时：
+    K_new = concat(K_cache, K_i)  [i, nkvhead, d]
+    V_new = concat(V_cache, V_i)
+    
+    Attention(Q_i, K_new, V_new)  ← 完整的注意力机制
+```
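+
+A minimal sketch of the bookkeeping, assuming one layer and plain Python lists in place of tensors. `KVCache` and `count_projections` are hypothetical names used only for this illustration:
+
+```python
+class KVCache:
+    """Append-only cache of per-token K and V rows for one layer."""
+
+    def __init__(self):
+        self.keys = []
+        self.values = []
+
+    def append(self, k_row, v_row):
+        """Store the new token's K/V and return everything cached so far."""
+        self.keys.append(k_row)
+        self.values.append(v_row)
+        return self.keys, self.values
+
+def count_projections(n_tokens, use_cache):
+    """Number of per-token K/V projections needed to generate n_tokens."""
+    if use_cache:
+        return n_tokens                       # one new row per step
+    return n_tokens * (n_tokens + 1) // 2     # step i re-projects i rows
+
+cache = KVCache()
+for step in range(3):
+    k_all, v_all = cache.append([float(step)], [float(step) * 10])
+
+print(len(k_all))                        # 3
+print(count_projections(1000, True))     # 1000
+print(count_projections(1000, False))    # 500500
+```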
+
+### 5.3 GQA 的性能-成本权衡
+
+#### GQA 是否以牺牲性能换成本？
+
+**简答**：有轻微性能损失，但权衡**极其划算**！
+
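+The figures below are rough, illustrative estimates recorded in these notes and have not been checked against the GQA paper's tables; treat them as orders of magnitude only:
+
+```
+Compared with standard MHA (32 K/V heads):
+
+GQA-8 (8 K/V heads):
+  ├─ Quality loss (perplexity increase): roughly 0-2%  ← barely noticeable
+  ├─ Inference speed: roughly 2-3× faster
+  ├─ KV Cache memory: 75% smaller
+  └─ Cost-effectiveness: ★★★★★ very high
+
+GQA-16 (16 K/V heads):
+  ├─ Quality loss: roughly 0% (essentially none)
+  └─ KV Cache memory: 50% smaller
+
+For comparison: simply cutting to 8 attention heads
+  ├─ Quality loss: roughly 5-10%  ← clearly worse
+  └─ This is the real quality loss
+
+Takeaway: GQA's ~0-2% loss << the ~5-10% loss from simply removing heads
+```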
+
+#### 为什么 GQA 损失这么小？
+
+```
+关键认识：K/V具有高冗余度
+
+标准MHA的问题：
+  32个完全独立的K/V投影
+  ├─ 参数很多
+  ├─ 但在语义上有重复
+  └─ 许多K在特征空间中其实很相似
+
+GQA的巧妙设计：
+  ├─ 只用8个K/V头（减少参数）
+  ├─ 配合32个多样的Q头来查询
+  └─ 效果：用32种不同的"视角"看同一组信息
+  
+类比：
+  MHA = 32本书 + 32副眼镜  （冗余）
+  GQA = 8本书 + 32副眼镜   （高效！）
+  
+  关键信息全在，只是共享了"书"的内容
+  但32副眼镜保证了解读的多样性
+```
+
+#### 生成长文本时的成本对比
+
+```
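+Example: generating 1000 tokens (fp32, one layer, K cache only; V doubles each figure):
+
+Standard MHA (nhead=32):
+  KV Cache size:
+    1000(seqlen) × 32(nhead) × 128(d) × 4 bytes
+    = 16.4 MB
+  Attention compute: step i costs O(i), so O(1000²) over the whole generation
+
+GQA-8 (nkvhead=8):
+  KV Cache size:
+    1000 × 8 × 128 × 4
+    = 4.1 MB  ← 75% smaller
+  Attention compute: same number of Q·K^T products (still 32 query heads), but 4× less K/V data to read
+
+  What is saved:
+  ├─ Memory: 12.3 MB
+  ├─ Inference speed: faster, mainly from reduced memory traffic (the 3× figure is a rough estimate)
+  └─ Cost: roughly 0-2% quality loss
+
+Return on investment:
+  Cost: roughly 0-2% quality loss
+  Gain: 75% less KV Cache memory + faster inference
+
+  A very good trade.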
+```
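+
+The cache sizes above follow from a single product. A sketch, where `tensors` and `layers` are extra knobs added here for illustration (the figures above correspond to one tensor and one layer in fp32):
+
+```python
+def kv_cache_bytes(seqlen, nkvhead, head_dim, bytes_per_elem=4, tensors=1, layers=1):
+    """Bytes held by the cache: tensors=2 counts both K and V."""
+    return seqlen * nkvhead * head_dim * bytes_per_elem * tensors * layers
+
+mha = kv_cache_bytes(1000, 32, 128)
+gqa = kv_cache_bytes(1000, 8, 128)
+print(mha / 1e6, gqa / 1e6)   # 16.384 4.096
+print(1 - gqa / mha)          # 0.75
+
+# K and V together, across a hypothetical 32-layer model in fp16:
+print(kv_cache_bytes(1000, 8, 128, bytes_per_elem=2, tensors=2, layers=32) / 1e6)
+```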
+
+#### 训练 vs 推理对 GQA 的影响不同
+
+```
+推理时（生成模式）：
+  优势最大：
+  ├─ KV Cache占用大量内存（线性增长）
+  ├─ GQA能显著降低内存压力
+  ├─ 允许生成更长的序列
+  └─ 批量大小可以更大
+
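+Training (whole sequence computed at once):
+  Smaller advantage:
+  ├─ K/V are not cached (computed in full each step)
+  ├─ Memory savings are less pronounced than at inference
+  ├─ Fewer parameters in W_k and W_v (with d_model=4096, 32 query heads, 8 K/V heads, head dim 128: about 67M → 42M attention parameters per layer)
+  └─ Quality impact depends on hyperparameters such as the learning rate
+
+Common practice:
+  ├─ Smaller models: often keep standard MHA
+  │  (the KV Cache is small enough; for example LLaMA 2 7B and 13B)
+  │
+  ├─ Larger models: GQA for both training and inference
+  │  (for example LLaMA 2 70B, where KV Cache size dominates serving cost)
+  │
+  └─ Edge/mobile deployment: GQA inference + distillation
+     (tight memory and latency budgets)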
+```
+
+#### 现实应用中的采用情况
+
+```
+DeepSeek、Qwen等前沿模型使用 GQA 的原因：
+
+✓ 推理加速 2-3 倍
+  ├─ 降低推理成本（降低GPU时间）
+  └─ 用户获得更快响应
+
+✓ KV Cache 减少 75%
+  ├─ 允许更大的批量推理
+  ├─ 更长的序列处理
+  └─ 多用户并发推理
+
+✓ 困惑度损失仅 0-2%
+  ├─ 对最终应用体验无明显影响
+  ├─ 通过更好的数据质量补偿
+  └─ 整体收益远超成本
+
+如果性能损失是 5-10%，就不值得了
+但 0-2% 的损失换来 75% 内存 + 3倍速度？
+这是现代LLM设计的必然选择！
+```
+
+---
+
+## 六、与前面学过的内容的关联
+
+### 6.1 RoPE + Self-Attention
+
+```
+RoPE 的作用（位置编码）：
+  ├─ 在向量中嵌入相对位置信息
+  ├─ 通过旋转保留向量长度
+  ├─ 相对位置通过内积 Q·K^T 自动编码
+  
+Self-Attention 中的应用：
+  q_rotated = rotate(q, pos_q)
+  k_rotated = rotate(k, pos_k)
+  
+  A = q_rotated · k_rotated^T
+    = |q| |k| cos(φ_q - φ_k)  ← 包含相对位置差 φ_q - φ_k
+  
+  结果：不同位置的词自动产生不同的注意力！
+```
+
+### 6.2 RMSNorm + Self-Attention
+
+```
+RMSNorm 的作用（归一化）：
+  ├─ 稳定每个维度的大小
+  ├─ 防止某维度过大导致梯度消失
+  ├─ 让线性层(W_q、W_k、W_v)输入分布一致
+  
+Self-Attention 中的效果：
+  x_normalized = rms_norm(x)  ← 分布稳定
+  Q = x_normalized @ W_q      ← 投影更稳定
+  K = x_normalized @ W_k
+  V = x_normalized @ W_v
+  
+  A = Q·K^T / sqrt(d)          ← softmax输入范围稳定
+  
+  好处：梯度流更稳定，训练收敛更快
+```
+
+### 6.3 完整Transformer Block
+
+```
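+x_input [seqlen, d_model]
+  ↓
+RMSNorm(x_input)  ← stabilizes the input distribution
+  ↓
+Q, K, V = x·W_q, x·W_k, x·W_v  ← linear projections
+  ↓
+RoPE(Q), RoPE(K)  ← adds position information
+  ↓
+Self-Attention(Q, K, V)  ← gathers context
+  attn_output = softmax(Q·K^T / sqrt(d)) @ V
+  ↓
+Residual connection + RMSNorm
+  ↓
+FFN  ← non-linear transform
+  ↓
+Residual connection
+  ↓
+Output y_output [seqlen, d_model]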
+```
+
+---
+
+## 七、Self-Attention 的参数数量
+
+```
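+Self-Attention part of one Transformer layer (d_model=4096, nhead=32, d=dv=128):
+
+W_q: [d_model, nhead×d] = [4096, 4096] = 16,777,216 parameters
+W_k: [d_model, nhead×d] = [4096, 4096] = 16,777,216 parameters
+W_v: [d_model, nhead×dv] = [4096, 4096] = 16,777,216 parameters
+Output projection: [4096, 4096] = 16,777,216 parameters
+
+Subtotal (MHA): about 67M parameters
+With GQA (nkvhead=8), W_k and W_v shrink to [4096, 1024] = 4,194,304 each: about 42M parameters
+
+One full Transformer block (including FFN): possibly 100M+ parameters
+
+The whole model: possibly 1B to 100B+ parameters, all trained by backpropagation!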
+```
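+
+When all heads are counted (not a single head), the projection sizes can be computed directly. A sketch assuming bias-free linear layers; `attention_params` is an illustrative helper:
+
+```python
+def attention_params(d_model, nhead, nkvhead, head_dim):
+    """Weights in W_q, W_k, W_v and the output projection (no biases)."""
+    w_q = d_model * nhead * head_dim
+    w_k = d_model * nkvhead * head_dim
+    w_v = d_model * nkvhead * head_dim
+    w_o = nhead * head_dim * d_model
+    return w_q + w_k + w_v + w_o
+
+print(attention_params(4096, 32, 32, 128))  # 67108864  (MHA)
+print(attention_params(4096, 32, 8, 128))   # 41943040  (GQA with 8 K/V heads)
+```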
+
+---
+
+## 八、常见问题
+
+### Q1: 为什么要除以 sqrt(d)？
+
+```
+A = Q·K^T 的分布分析：
+
+不除以sqrt(d)：
+  如果d很大（比如128）
+  Q、K向量的每一维都是[-1, 1]范围
+  点积会累积，导致 A 的值变得很大
+  
+  例：d=128时，A 可能在 [-100, 100] 范围
+  softmax([100, -100, 50]) 会导致：
+    exp(100) ≈ 10^43  ← 爆炸！
+    梯度接近0 ← 梯度消失！
+
+除以sqrt(d)：
+  A' = A / sqrt(128) ≈ A / 11.3
+  A' 在 [-9, 9] 范围
+  softmax 梯度正常，训练稳定！
+
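+Rule of thumb:
+  scale = 1 / sqrt(d)
+  Why: if the components of Q and K are independent with mean 0 and variance 1, Q·K has variance d; dividing by sqrt(d) brings the variance back to 1 and keeps softmax out of saturation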
+```
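+
+The variance argument can be checked with a small simulation. The sketch below assumes standard-normal components for Q and K, which is an idealization of real activations, and uses a fixed seed so runs are repeatable:
+
+```python
+import math
+import random
+
+def dot_variance(d, samples=2000, scale=1.0, seed=0):
+    """Empirical variance of scale * (q . k) for random d-dim q and k."""
+    rng = random.Random(seed)
+    vals = []
+    for _ in range(samples):
+        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
+        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
+        vals.append(scale * sum(a * b for a, b in zip(q, k)))
+    mean = sum(vals) / samples
+    return sum((v - mean) ** 2 for v in vals) / samples
+
+d = 128
+raw = dot_variance(d)                              # close to d
+scaled = dot_variance(d, scale=1 / math.sqrt(d))   # close to 1
+print(raw, scaled)
+```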
+
+### Q2: 多头是否都学相同的模式？
+
+```
+不同的头学习不同的关系模式：
+
+Head 0 可能学到："词性匹配"
+  "名词"关注"名词"，"动词"关注"动词"
+
+Head 1 可能学到："语义相似性"
+  "猫"关注"狗"、"动物"等语义相近词
+
+Head 2 可能学到："距离感"
+  关注距离较近的词（局部上下文）
+
+...
+
+最终输出：融合所有头的信息
+attn_output = concat(head0_output, head1_output, ..., headN_output)
+            = 包含多种关系的丰富表示
+```
+
+### Q3: 为什么 Self-Attention 能"学到"语义？
+
+```
+关键在于反向传播（训练时）：
+
+初始状态：W_q、W_k 是随机的
+→ A = Q·K^T 的相似度毫无意义
+→ softmax后attention weights几乎随机
+→ 模型生成的词错误
+
+训练信号（梯度）：
+  "为什么预测错了？"
+  → 因为关注了错误的词
+  → 我需要调整W_q和W_k
+
+反向传播后：
+  ∂loss/∂W_q 和 ∂loss/∂W_k 会指向让"相关词"相似的方向
+
+多次迭代后：
+  W_q 学会：在查询空间中突出"需要什么信息"
+  W_k 学会：在键空间中也突出"是什么信息"
+  → 语义相关的词自动产生高点积
+  → attention weights 自动正确分配
+
+这就是深度学习的魔法！
+```
+
+---
+
+## 九、实现注意事项
+
+### Self-Attention 函数签名
+
+```cpp
+void self_attention(
+  tensor_t attn_val,    // 输出：[seqlen, nhead, dv]
+  tensor_t q,           // 查询：[seqlen, nhead, d]
+  tensor_t k,           // 键：[total_len, nkvhead, d]
+  tensor_t v,           // 值：[total_len, nkvhead, dv]
+  float scale           // 缩放因子：通常 1/sqrt(d)
+);
+```
+
+### 计算步骤
+
+1. **计算注意力分数**：`scores = Q @ K^T * scale`
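+2. **Apply the causal mask**: set the score to -∞ wherever the key position j is later than the query position i (j > i); with a KV Cache, query row i sits at absolute position `total_len - seqlen + i`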
+3. **Softmax**：行方向归一化
+4. **加权求和**：`output = softmax_scores @ V`
+
+### 多头处理
+
+```
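+For each query head h:
+  Take Q[h] and the shared K[h // group], V[h // group], where group = nhead / nkvhead
+  Run the steps above
+  Finally concatenate the outputs of all heads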
+```
+
+### 内存索引
+
+```
+对于 3D 张量 [dim0, dim1, dim2]：
+  linear_idx = i0 * (dim1 * dim2) + i1 * dim2 + i2
+```
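+
+The indexing formula, together with the causal rule for the case where K/V include cached tokens, can be sketched as follows. The offset `total_len - seqlen` is an assumption about how query rows line up with cached keys (new tokens sit at the end of the sequence); both function names are illustrative:
+
+```python
+def flat_index(i0, i1, i2, dim1, dim2):
+    """Row-major offset of element [i0, i1, i2] in a contiguous [dim0, dim1, dim2] tensor."""
+    return i0 * (dim1 * dim2) + i1 * dim2 + i2
+
+def is_visible(i, j, seqlen, total_len):
+    """True if query row i may attend to key j (query i is at position total_len - seqlen + i)."""
+    return j <= total_len - seqlen + i
+
+# q has shape [seqlen=2, nhead=4, d=8]; offset of q[1][2][5]:
+print(flat_index(1, 2, 5, dim1=4, dim2=8))          # 53
+
+# One new token (seqlen=1) over 5 keys sees all of them:
+print([is_visible(0, j, 1, 5) for j in range(5)])
+
+# Full prefill (seqlen == total_len == 3) gives the lower-triangular mask:
+print([[is_visible(i, j, 3, 3) for j in range(3)] for i in range(3)])
+```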
+
+---
+
+## 十、学习路线总结
+
+```
+Transformer 核心算子学习进度：
+
+✅ RMSNorm
+   └─ 作用：稳定激活，为后续操作创建一致的输入分布
+
+✅ RoPE
+   └─ 作用：编码相对位置，旋转不改变向量长度
+
+🚧 Self-Attention  ← 当前
+   └─ 作用：通过 Q·K^T 的相似度聚合前文信息
+
+❌ Rearrange（张量重组）
+❌ SwiGLU（激活函数）
+❌ ...其他算子
+
+完整推理流程（Assignment #3）：
+  在这些算子基础上组装完整的 Transformer 推理
+```
+
+---
+
+## 参考：公式速查表
+
+| 公式 | 含义 |
+|------|------|
+| $Q = xW_q$ | 查询投影 |
+| $K = xW_k$ | 键投影 |
+| $V = xW_v$ | 值投影 |
+| $A = \frac{QK^T}{\sqrt{d}}$ | 相似度矩阵（缩放） |
+| $A_{masked} = A + \text{Mask}$ | 应用因果掩码 |
+| $W = \text{softmax}(A_{masked})$ | 注意力权重 |
+| $\text{output} = W \times V$ | 加权聚合 |
+
+---
+
+**下一步**：实现 self_attention 函数，验证与 PyTorch 参考实现一致。
+
diff --git "a/SwiGLU\347\237\245\350\257\206\347\254\224\350\256\260.md" "b/SwiGLU\347\237\245\350\257\206\347\254\224\350\256\260.md"
new file mode 100644
index 000000000..e1b637dac
--- /dev/null
+++ "b/SwiGLU\347\237\245\350\257\206\347\254\224\350\256\260.md"
@@ -0,0 +1,774 @@
+# SwiGLU 激活函数完整学习笔记
+
+## 一、概述与地位
+
+### 1.1 什么是 SwiGLU？
+
+**定义**：SwiGLU 是一种用于 Transformer 前馈网络（FFN）的非线性激活函数。
+
+**完整公式**：
+$$out_i = up_i \cdot \frac{gate_i}{1 + e^{-gate_i}}$$
+
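+where:
+- $up_i$: feature produced by the value branch
+- $gate_i$: control signal produced by the gate branch
+- $\frac{gate_i}{1 + e^{-gate_i}} = gate_i \cdot \sigma(gate_i)$: the Swish (SiLU) function, which acts as a dynamic gate
+
+**Mathematical properties**:
+- Domain: $(-\infty, +\infty)$
+- Range of $out_i$: $(-\infty, +\infty)$; the gate factor $\frac{gate_i}{1+e^{-gate_i}}$ alone is bounded below by about $-0.278$ and unbounded above, so it is not a $[0, 1]$ mask
+- As $gate_i \to +\infty$, $\frac{gate_i}{1+e^{-gate_i}} \to gate_i$
+- As $gate_i \to -\infty$, $\frac{gate_i}{1+e^{-gate_i}} \to 0$
+- At $gate_i = 0$, $\frac{gate_i}{1+e^{-gate_i}} = 0$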
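+
+The formula translates directly into code. A scalar sketch in plain Python, assuming `up` and `gate` are equal-length lists; for very negative inputs `math.exp(-g)` can overflow, which a production kernel would guard against:
+
+```python
+import math
+
+def silu(g):
+    """Swish / SiLU: g * sigmoid(g), written as g / (1 + e^(-g))."""
+    return g / (1.0 + math.exp(-g))
+
+def swiglu(up, gate):
+    """Element-wise SwiGLU: out_i = up_i * silu(gate_i)."""
+    return [u * silu(g) for u, g in zip(up, gate)]
+
+print(silu(0.0))                                   # 0.0
+print(round(silu(8.0), 3), round(silu(-10.0), 5))
+print(swiglu([5.2, -0.3, 2.8], [8.0, -10.0, 2.0]))
+```
+
+For large positive `gate` the factor approaches `gate` itself, so the output can be much larger than `up`; for very negative `gate` it approaches 0.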
+
+### 1.2 在 Transformer 推理流程中的位置
+
+```
+┌─────────────────────────────────────────────────────┐
+│             Transformer Block 结构                    │
+├─────────────────────────────────────────────────────┤
+│                                                       │
+│  输入: x [seqlen, d_model]                           │
+│    ↓                                                  │
+│  ┌─────────────────────────────────────────────┐    │
+│  │  Multi-Head Self-Attention                  │    │
+│  │  - 计算上下文关系                            │    │
+│  │  - 输出保留所有位置的上下文信息              │    │
+│  │  - 输出: [seqlen, d_model]                  │    │
+│  └──────────────┬──────────────────────────────┘    │
+│                 ↓                                    │
+│  LayerNorm(归一化，稳定训练)                        │
+│                 ↓                                    │
+│  ┌─────────────────────────────────────────────┐    │
+│  │  Feed-Forward Network (FFN)                 │    │
+│  │                                              │    │
+│  │  中间层扩展: Linear(d_model → 4×d_model)    │    │
+│  │  输出: [seqlen, 4×d_model]                  │    │
+│  │       ↓                                     │    │
+│  │  ┌────────────────────────────────────┐    │    │
+│  │  │  ⭐ SwiGLU 激活函数                │    │    │
+│  │  │  - 对 4×d_model 个特征逐元素激活   │    │    │
+│  │  │  - 根据上下文动态选择特征           │    │    │
+│  │  │  输入: [seqlen, 4×d_model]         │    │    │
+│  │  │  输出: [seqlen, 4×d_model]         │    │    │
+│  │  └────────────────────────────────────┘    │    │
+│  │       ↓                                     │    │
+│  │  投影回原维度: Linear(4×d_model → d_model) │    │
+│  │  输出: [seqlen, d_model]                   │    │
+│  └─────────────────────────────────────────────┘    │
+│                 ↓                                    │
+│  残差连接 + LayerNorm                               │
+│                 ↓                                    │
+│  输出: y [seqlen, d_model]                         │
+│                                                       │
+└─────────────────────────────────────────────────────┘
+```
+
+---
+
+## 二、作用机理深度解析
+
+### 2.1 为什么需要激活函数？
+
+#### 线性变换的局限性
+
+没有激活函数的神经网络：
+```
+x → Linear(W₁) → Linear(W₂) → Linear(W₃) → ... → y
+```
+
+**问题**：无论有多少层，最终计算仍然是：
+$$y = W_n \cdot ... \cdot W_2 \cdot W_1 \cdot x = W_{合并} \cdot x$$
+
+仍然是线性变换！**无法学习复杂的非线性关系**。
+
+#### 激活函数的作用
+
+激活函数引入**非线性**，使得：
+```
+x → Linear → 激活 → Linear → 激活 → ... → y
+```
+
+每一层都能学习不同的特征表示，形成更复杂的决策边界。
+
+### 2.2 SwiGLU 相比其他激活函数的本质区别
+
+#### 2.2.1 常见激活函数对比
+
+**ReLU**（Rectified Linear Unit）：
+$$ReLU(x) = \max(0, x)$$
+
+- 优点：计算简单快速
+- 缺点：当 $x < 0$ 时，梯度为 0（**死神经元问题**）
+- 性质：**固定激活曲线**，所有位置相同方式
+
+**GELU**（Gaussian Error Linear Unit）：
+$$GELU(x) = x \cdot \Phi(x)$$
+
+其中 $\Phi(x)$ 是标准正态分布的累积分布函数。
+
+- 优点：平滑，梯度不为 0，效果更好
+- 缺点：仍然是**固定激活策略**
+
+**SwiGLU**：
+$$SwiGLU(up, gate) = up \cdot \frac{gate}{1+e^{-gate}}$$
+
+- **动态激活策略**：根据输入内容动态决定激活强度
+- **双分支设计**：up 生成值，gate 学习控制逻辑
+- **自适应选择**：不同位置的特征可以有不同的激活方式
+
+#### 2.2.2 直观对比
+
+假设中间层某维度处理"水果"语义，有 3 个样本位置：
+
+```python
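+Position 1: "苹果" (context: red, sweet)
+Position 2: "科学" (context: technology, research)
+Position 3: "树" (context: nature, environment)
+
+Intermediate values:
+position = [up1=5.2,  up2=-0.3,  up3=2.8]
+
+GELU activation (fixed curve):
+output = GELU([5.2, -0.3, 2.8])
+       = [5.2×Φ(5.2), -0.3×Φ(-0.3), 2.8×Φ(2.8)]
+       ≈ [5.20, -0.115, 2.79]  ← the same curve is applied at every position
+
+SwiGLU activation (input-dependent):
+gate = [gate1=8, gate2=-10, gate3=2]
+
+output = up * silu(gate), with silu(g) = g / (1 + e^(-g))
+       ≈ [5.2×8.00,  -0.3×(-0.0005),  2.8×1.76]
+       ≈ [41.6, 0.0001, 4.93]  ← strength depends on the gate value at each position
+
+Reading the result:
+- Position 1, "苹果": gate=8 is large → the feature matters for recognizing fruit → passed through and amplified
+- Position 2, "科学": gate=-10 is very negative → the feature is irrelevant here → almost fully blocked
+- Position 3, "树": gate=2 is moderate → the feature is partly relevant (a tree is a plant, related to fruit) → passed with a smaller factor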
+```
+
+### 2.3 双分支设计的深层含义
+
+#### FFN 的完整计算流程
+
+```
+输入: x [seqlen, d_model]
+      ↓
+┌─────────────────────────────────────────────────────┐
+│             第一层 Linear（扩展层）                  │
+│         输出维度：4 × d_model（通常）              │
+└──────┬──────────────────────────────────────────────┘
+       ↓
+   [seqlen, 4×d_model]
+       ↓ 分裂成两个分支
+   ┌───────────────────────────────────────┐
+   │                                       │
+   ├→ gate 分支                            │  ← 学习什么
+   │  - 对每个特征维度学习一个控制信号      │    应该被激活
+   │  - 范围：(-∞, +∞)                     │
+   │  - 语义："这个特征对当前语境重要吗？" │
+   │                                       │
+   ├→ up 分支                              │  ← 生成什么
+   │  - 对每个特征维度生成原始特征值        │    样的值
+   │  - 范围：(-∞, +∞)                     │
+   │  - 语义："这个特征的具体值是多少？"   │
+   │                                       │
+   └───────────────────────────────────────┘
+       ↓
+   Element-wise product: out = up * silu(gate), where silu(g) = g * sigmoid(g)
+       ↓
+   [seqlen, 4×d_model]（特征被选择性激活）
+       ↓
+   第二层 Linear（压缩层）
+   输出维度：d_model
+       ↓
+   [seqlen, d_model]
+```
+
+#### 为什么要分成两个分支？
+
+**原因 1：参数效率**
+```
+单分支（普通激活）：
+  Linear(d → 4d) + ReLU/GELU + Linear(4d → d)
+  参数量：d×4d + 4d×d = 8d²
+
+双分支（SwiGLU）：
+  Linear(d → 4d) [gate] + Linear(d → 4d) [up] + SwiGLU + Linear(4d → d)
+  参数量：d×4d + d×4d + 4d×d = 12d²
+```
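+The gated version has more parameters at the same hidden width. In practice the hidden width is usually shrunk to about 2/3 of 4×d_model (roughly 8/3×d_model), which brings the count back to about 8d² while still performing better.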
+
+**原因 2：独立学习空间**
+- `gate` 分支学会：**"鉴别"** - 判断当前特征对这个位置有多重要
+- `up` 分支学会：**"生成"** - 产生对应特征的最好表示
+
+两个独立的线性变换允许网络同时优化这两个不同的目标。
+
+**原因 3：信息分解**
+```
+假设输入 x 包含多个语义：[位置信息, 语义信息, 语法信息, ...]
+
+不使用门控：
+  Linear(x) → 单一压缩 → 无法区分哪些信息对当前特征重要
+  
+使用双分支：
+  gate_branch: 学会对不同输入特征的"敏感度"权重
+  up_branch: 生成具体的特征值
+  
+  结果：自动学会"对位置信息敏感的特征就高激活，对语法信息敏感的就低激活"
+```
+
+---
+
+## 三、研究背景与发现
+
+### 3.1 为什么研究人员会发现 SwiGLU？
+
+#### 问题的提出
+
+**背景**：2020-2021 年间，Transformer 应用越来越广泛，但 FFN 的设计（特别是激活函数）还很简陋。
+
+**核心问题**：
+1. **ReLU/GELU 的局限**：固定激活策略，不能根据语境调整
+2. **参数冗余**：简单的 Linear → 激活 → Linear，并不能充分发挥参数的作用
+3. **特征浪费**：某些特征在某个位置根本不需要，但仍然被激活处理
+
+#### 3.2 "为什么"的深层思考
+
+**类比：人类的注意力机制**
+
+```
+场景1：看到苹果
+  大脑激活的特征：颜色、圆形、甜味、...
+  抑制的特征：数学公式、编程语法、...
+
+场景2：看到数学题
+  大脑激活的特征：逻辑、符号、推理、...
+  抑制的特征：颜色、口味、...
+
+关键：同一个"苹果"特征在不同语境下的激活强度不同！
+```
+
+SwiGLU 就是对这一直觉的形式化表达。
+
+#### 3.3 论文发现
+
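+**Paper**: "GLU Variants Improve Transformer" (Shazeer, 2020)
+
+**Main findings**:
+1. **Gated linear units (GLU) outperform ungated activations** in the Transformer FFN
+2. **Several variants were compared**, differing only in the function applied to the gate branch: GLU (sigmoid), ReGLU (ReLU), GEGLU (GELU), SwiGLU (Swish), and Bilinear (none)
+3. **SwiGLU (Swish + GLU)**: combines the Swish activation ($x \cdot \sigma(x)$) with the gating mechanism
+
+**Results**:
+- At matched parameter count, the gated variants, including SwiGLU, gave lower perplexity and better downstream scores than ReLU or GELU FFNs (see the paper's tables for exact numbers)
+- SwiGLU has become the standard choice in modern large models (Qwen, LLaMA, and others)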
+
+---
+
+## 四、在完整推理系统中的作用流程
+
+### 4.1 从词嵌入到输出的完整链路
+
+```
+┌──────────────────────────────────────────────────────────┐
+│ Step 1: 词向量嵌入与位置编码                             │
+│ Input: "苹果很红"                                        │
+│ ↓                                                          │
+│ Token Embedding + Position Encoding                      │
+│ 词向量: [0.1, -0.2, 0.5, ...]  [768维]                 │
+│ ↓                                                          │
+│ Block 0                                                   │
+│ ├─ Self-Attention: 融合上文信息                          │
+│ │  "苹果" + "很" + "红" = [体现"红苹果"关系]           │
+│ ├─ FFN + SwiGLU: 特征精化与选择                         │
+│ │  中间: [4×768=3072维]                                 │
+│ │  激活: 动态选择哪些特征关键 → "红色"特征高激活       │
+│ └─ 输出: [精化后的向量]                                 │
+│ ↓                                                          │
+│ Block 1                                                   │
+│ ├─ Self-Attention: 融合更多上下文                       │
+│ │  结合前面的结果和自身语境                              │
+│ ├─ FFN + SwiGLU: 进一步精化                             │
+│ │  gate 学到："水果"语境下的特征应该这样激活            │
+│ └─ 输出: [再次精化的向量]                                │
+│ ↓                                                          │
+│ ... Block 2, 3, ..., N-1 ...                            │
+│ ↓                                                          │
+│ Block N-1（最后一层）                                    │
+│ ├─ Self-Attention: 综合全局语义                         │
+│ ├─ FFN + SwiGLU: 最后的特征选择                         │
+│ └─ 输出: [最终向量表示]                                 │
+│ ↓                                                          │
+│ Step 2: 输出层                                           │
+│ Output Linear: [768] → [vocab_size]                     │
+│ Softmax: 概率分布                                        │
+│ Output: ["苹果"的下一个词概率分布]                      │
+└──────────────────────────────────────────────────────────┘
+```
+
+### 4.2 SwiGLU 在每个 Block 中的具体作用
+
+#### Block i 的完整计算
+
+```python
+# 伪代码表示
+def transformer_block(x):
+    # Self-Attention：获取上下文
+    attn_output = self_attention(x)  # 包含周围词的信息
+    x = layernorm(x + attn_output)   # 残差连接
+    
+    # FFN with SwiGLU：特征精化
+    # 第一阶段：扩展
+    up = linear_up(x)      # [seqlen, 4×d]
+    gate = linear_gate(x)  # [seqlen, 4×d]
+    
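+    # Stage 2: SwiGLU activation
+    # Key step: the content of x decides how strongly each intermediate feature passes
+    activated = up * silu(gate)  # element-wise product, silu(g) = g * sigmoid(g)
+    
+    # Why do it this way?
+    # - up produces 4×d candidate features
+    # - gate learns which features are relevant for this input
+    # - silu(gate) is about 0 for very negative gate and about gate for large
+    #   positive gate (range roughly [-0.28, +inf)), so it is a soft gate, not a 0/1 mask
+    # - features with large positive gate are amplified; very negative gate suppresses them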
+    
+    # 第三阶段：压缩
+    output = linear_out(activated)  # [seqlen, d]
+    
+    x = layernorm(x + output)  # 残差连接
+    
+    return x
+```
+
+#### 具体数值例子
+
+假设处理位置 i（词"苹果"）：
+
+```
+输入 x_i: [-0.1, 0.3, 0.5, 0.2, ...]  (d_model=768)
+
+Self-Attention 后（融合上下文）:
+x_attn_i: [0.2, 0.4, -0.3, 0.6, ...]
+          ↑
+       现在包含了"很"和"红"的信息
+
+FFN 扩展层：
+up_i = Linear_up(x_attn_i)
+     = [5.2, -0.3, 0.8, 10.1, -2.5, 3.1, ..., ...](3072维)
+
+gate_i = Linear_gate(x_attn_i)
+       = [8.0, -10.0, 2.0, 15.0, -0.5, 1.5, ..., ...](3072维)
+
+Compute silu(gate_i) = gate_i × sigmoid(gate_i):
+Feature 0:  gate[0]=8.0   → silu≈8.00     (strong pass-through)
+Feature 1:  gate[1]=-10.0 → silu≈-0.0005  (suppressed)
+Feature 2:  gate[2]=2.0   → silu≈1.76     (moderate)
+Feature 3:  gate[3]=15.0  → silu≈15.00    (strong pass-through)
+Feature 4:  gate[4]=-0.5  → silu≈-0.19    (weak, sign flipped)
+Feature 5:  gate[5]=1.5   → silu≈1.23     (moderate)
+...
+
+Activated result (element-wise product):
+activated_i[0] = 5.2 × 8.00 ≈ 41.6          ✓ amplified
+activated_i[1] = -0.3 × (-0.0005) ≈ 0.0001  ✓ suppressed
+activated_i[2] = 0.8 × 1.76 ≈ 1.41          ✓ kept
+activated_i[3] = 10.1 × 15.00 ≈ 151.5       ✓ amplified
+activated_i[4] = -2.5 × (-0.19) ≈ 0.47      ✓ weak
+activated_i[5] = 3.1 × 1.23 ≈ 3.80          ✓ kept
+...
+
+Project back to the model dimension:
+output_i = Linear_out([41.6, 0.0001, 1.41, 151.5, 0.47, 3.80, ...])
+         = [new 768-dim vector]
+
+作用：
+Original: [-0.1, 0.3, 0.5, 0.2, ...]  (只有Self-Attention的直接结果)
+After SwiGLU: [经过精化的向量]  (特征被选择性地强化或抑制)
+```
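The example's gate and up values can be pushed directly through the definition `sigmoid_glu(x) = x / (1 + exp(-x))`. The following is a minimal standard-library sketch. The helper name `sigmoid_glu` only mirrors the notation of these notes and is not an LLAISYS API.

```python
import math


def sigmoid_glu(g: float) -> float:
    """gate * sigmoid(gate), i.e. SiLU/Swish, written as g / (1 + exp(-g))."""
    return g / (1.0 + math.exp(-g))


# First six features of the worked example
up = [5.2, -0.3, 0.8, 10.1, -2.5, 3.1]
gate = [8.0, -10.0, 2.0, 15.0, -0.5, 1.5]

glu = [sigmoid_glu(g) for g in gate]
activated = [u * s for u, s in zip(up, glu)]

for i, (g, s, a) in enumerate(zip(gate, glu, activated)):
    print(f"feature {i}: gate={g:6.1f}  sigmoid_glu={s:9.4f}  activated={a:9.4f}")
```

For large positive gates the factor approaches the gate value itself rather than 1, so SwiGLU rescales features instead of applying a 0-to-1 mask.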
+
+---
+
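+## 5. Purpose of the activation and its downstream effects
+
+### 5.1 Three key purposes of the SwiGLU activation
+
+#### Purpose 1: **Feature filtering**
+
+```
+The gate acts as a learned filter:
+- large positive gate: the feature matters in this context → amplified
+- strongly negative gate: the feature is irrelevant → attenuated toward 0
+- gate near 0: the feature is partly relevant → partly kept
+
+Benefit: less noise in the representation
+```
+
+#### Purpose 2: **Non-linear transformation**
+
+```
+Stacking purely linear layers: Linear₂(Linear₁(x)) is still linear
+Adding a non-linear activation: the model can learn complex decision boundaries
+
+Example: deciding "which fruit is this?"
+A linear model can only learn: [red, round, sweet] → apple (a linear boundary)
+A non-linear model can learn:
+  if (red AND round) → apple
+  if (red AND elongated) → chili pepper
+  if (round AND sweet) AND NOT red → white grape
+  ...
+```
+
+#### Purpose 3: **Context adaptation**
+
+```
+Key property: the gate value depends on the input x
+- the same feature can be activated with very different strength at different positions
+- the model learns to process information differently in different contexts
+
+Example:
+at the "red apple" position: gate[red feature]=10 → strongly activated
+at the "science" position:   gate[red feature]=-5 → suppressed
+Same dimension, different activation!
+```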
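The claim that stacked linear layers collapse into a single linear layer can be checked numerically. The sketch below uses two arbitrary 2×2 matrices (illustrative values, not model weights) and the SiLU non-linearity that SwiGLU is built on.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def silu(v):
    """Element-wise x / (1 + exp(-x))."""
    return [x / (1.0 + math.exp(-x)) for x in v]


W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [3.0, 0.0]]
x = [2.0, -3.0]

two_linear = matvec(W2, matvec(W1, x))       # Linear2(Linear1(x))
one_linear = matvec(matmul(W2, W1), x)       # a single equivalent linear map
with_silu = matvec(W2, silu(matvec(W1, x)))  # non-linearity in between

print(two_linear, one_linear, with_silu)
```

The first two results coincide, while the third cannot be reproduced by the merged matrix `W2 @ W1`.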
+
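+### 5.2 How the activation influences later layers
+
+#### How the information flow evolves
+
+```
+Layer 0:
+  input embedding: [apple] = [0.1, -0.2, ...]
+  Self-Attn: mixes in surrounding tokens = [0.2, 0.3, ...]
+  FFN+SwiGLU: feature refinement = [strengthens the "fruit" semantics]
+  output: [0.15, 0.5, ...]  ← the "fruit" semantics are now stronger
+
+Layer 1:
+  input: [0.15, 0.5, ...]  (inherits the strengthened fruit semantics)
+  Self-Attn: interacts with the other tokens again,
+           now using the stronger "fruit" semantics;
+           may activate related concepts such as "food" or "color"
+  FFN+SwiGLU: feature selection in the food context
+  output: [higher-level semantics: this is something edible]
+
+Layer 2:
+  input: [higher-level semantics]
+  Self-Attn: operates at a higher level of abstraction
+  FFN+SwiGLU: another round of feature selection
+  output: [may finally activate: produce, nature, health, ...]
+
+...
+
+Finally:
+Building on this sequence of progressive feature activation and selection,
+the model has a full semantic picture by the time it predicts the next token
+```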
+
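+### 5.3 Backpropagation and training effects
+
+#### Forward pass (inference)
+
+```
+x ──→ Linear_up ──→ up
+ \                    ↘
+  → Linear_gate → gate ──→ SwiGLU → activated ──→ Linear_out → output
+```
+
+#### Backward pass (training)
+
+```
+Loss L
+  ↓
+∂L/∂output (gradient at the FFN output)
+  ↓
+∂L/∂(Linear_out weights)  ← Linear_out learns how to compress the activated features
+  ↓
+∂L/∂activated           (gradient of the activation result)
+  ├──────────┬─────────────────┐
+  ↓          ↓                 ↓
+∂L/∂up      ∂L/∂gate    ∂L/∂(sigmoid_glu)
+  ↓          ↓                 ↓
+scaled by    scaled by up     scaled by up
+sigmoid_glu  and the slope
+(gate)       of sigmoid_glu
+
+Chain rule, with activated = up × sigmoid_glu(gate):
+∂L/∂up = ∂L/∂activated · sigmoid_glu(gate)
+         ← the larger the gate activation, the stronger the gradient for up
+
+∂L/∂gate = ∂L/∂activated · up · ∂sigmoid_glu/∂gate
+           ← the larger |up|, the stronger the learning signal for gate
+```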
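The two chain-rule expressions can be sanity-checked against finite differences. This is a scalar sketch under the assumption that `activated = up * sigmoid_glu(gate)` with `sigmoid_glu(g) = g * sigmoid(g)`. The function names and the sample values are illustrative only.

```python
import math


def sigmoid(g: float) -> float:
    return 1.0 / (1.0 + math.exp(-g))


def sigmoid_glu(g: float) -> float:
    return g * sigmoid(g)


def d_sigmoid_glu(g: float) -> float:
    """Derivative of g * sigmoid(g): s + g * s * (1 - s)."""
    s = sigmoid(g)
    return s * (1.0 + g * (1.0 - s))


def swiglu(up: float, gate: float) -> float:
    return up * sigmoid_glu(gate)


up, gate = 3.1, 1.5
grad_activated = 1.0  # stands in for the upstream gradient dL/d(activated)

# Analytic gradients from the chain rule
grad_up = grad_activated * sigmoid_glu(gate)
grad_gate = grad_activated * up * d_sigmoid_glu(gate)

# Central finite differences as an independent check
eps = 1e-6
num_up = (swiglu(up + eps, gate) - swiglu(up - eps, gate)) / (2 * eps)
num_gate = (swiglu(up, gate + eps) - swiglu(up, gate - eps)) / (2 * eps)

print(grad_up, num_up)
print(grad_gate, num_gate)
```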
+
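+#### What gets learned?
+
+```
+Linear_up weights: learn the mapping x → feature values
+  training objective: produce useful candidate features
+
+Linear_gate weights: learn the mapping x → feature selection
+  training objective: learn "does this feature matter in this context?"
+
+Overall effect:
+as training progresses,
+- the up branch gets better at producing relevant features
+- the gate branch gets better at recognizing which features should pass
+- the two branches cooperate more tightly and efficiently
+```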
+
+---
+
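+## 6. Cooperation with other components
+
+### 6.1 Interaction with self-attention
+
+```
+Self-Attention:
+  input: x (the original token vectors)
+  output: context information weighted by the Q·K^T attention scores
+
+  role: decides "where to look"
+
+FFN + SwiGLU:
+  input: the attention output
+  output: features selected according to the context
+  
+  role: decides "how to process what was seen"
+
+Flow:
+[input] → Self-Attention → [context is known] 
+                           ↓
+                         SwiGLU
+                           ↓
+                       [features adjusted to the context]
+                           ↓
+                       [passed on to the next layer]
+```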
+
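+### 6.2 Interaction with residual connections
+
+```
+x → SelfAttn → y₁
+    ↓           ↓
+    + ←────────┘
+    ↓
+  LayerNorm
+    ↓ x'
+    → FFN+SwiGLU → y₂
+    ↓               ↓
+    + ←──────────┘
+    ↓
+  LayerNorm → output
+
+What the residual connections do:
+1. Gradient flow: even if the activation has gradient problems, gradients can flow back through the skip path
+2. Identity mapping: the network can learn to simply keep the existing information
+3. Stability: training of deep networks is more stable
+
+How SwiGLU relates to the residual:
+- the residual ensures the original information survives even if SwiGLU has learned poorly
+- SwiGLU learns an incremental transformation (amplifying or suppressing specific features)
+- final output = original information + dynamically selected, enhanced features
+```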
+
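+### 6.3 Interaction with layer normalization
+
+```
+Normalization order (Post-LN):
+  SelfAttn → Linear + Bias → Add (residual) → LayerNorm → FFN+SwiGLU → Add → LayerNorm
+
+Effects:
+1. Keeps activation values in a stable range
+2. Keeps both the input and the output of SwiGLU in a stable range
+3. Reduces numerical instability during training
+```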
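The Post-LN wiring of a residual connection followed by normalization can be sketched in a few lines. This is an illustrative standard-library version, assuming a LayerNorm without learned scale or shift and `eps = 1e-5`. `zero_ffn` and `big_ffn` are hypothetical stand-ins for an FFN sublayer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layernorm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_sublayer(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN wiring: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layernorm([a + b for a, b in zip(x, y)])


def zero_ffn(x: Vector) -> Vector:
    """A sublayer that contributes nothing."""
    return [0.0 for _ in x]


def big_ffn(x: Vector) -> Vector:
    """A sublayer with very large outputs."""
    return [100.0 * v for v in x]


x = [0.2, 0.4, -0.3, 0.6]
kept = post_ln_sublayer(x, zero_ffn)    # residual path preserves the input
scaled = post_ln_sublayer(x, big_ffn)   # normalization bounds the range

print(kept)
print(scaled)
```

With `zero_ffn` the block reduces to `layernorm(x)`, so the input survives through the skip path. With `big_ffn` the output still has bounded magnitude.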
+
+---
+
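+## 7. Numerically stable SwiGLU implementation
+
+### 7.1 Why numerical stability matters
+
+```
+sigmoid_glu(x) = x / (1 + exp(-x))
+
+Problem:
+- x = 100: exp(-100) ≈ 0, so the computation is fine
+- x = -100: exp(-(-100)) = exp(100) ≈ 2.7e43, far above the float32 maximum (≈ 3.4e38),
+  so 1 + exp(100) overflows to inf
+
+x / inf still evaluates to 0 under IEEE 754, but that relies on inf arithmetic
+being available (it is not when the compiler assumes finite math), so the
+implementation avoids the overflow explicitly.
+```
+
+### 7.2 Our implementation
+
+```cpp
+float glu_val;
+if (gate_val >= 50.0f) {
+    // exp(-50) ≈ 1.9e-22 is negligible, so gate / (1 + exp(-gate)) ≈ gate
+    glu_val = gate_val;
+} else if (gate_val <= -50.0f) {
+    // exp(50) ≈ 5.2e21 is huge,
+    // so gate_val / (1 + exp(50)) ≈ 0
+    glu_val = 0.0f;
+} else {
+    // regular computation
+    glu_val = gate_val / (1.0f + std::exp(-gate_val));
+}
+```
+
+**Why this approach**:
+- exp() is never called with arguments that produce extremely large or small values
+- the essential behavior of sigmoid_glu is preserved
+- the computation is carried out in float to keep precision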
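For comparison, here is a minimal Python sketch of two overflow-free variants. `sigmoid_glu_clamped` mirrors the ±50 thresholds of the C++ snippet above. `sigmoid_glu_stable` is an alternative technique and not the method used in this project: it branches on the sign so that `exp` is only ever called on non-positive arguments. Python floats are doubles, so the naive formula overflows near 709 rather than near 88 as in float32, and `math.exp` raises `OverflowError` instead of returning inf.

```python
import math


def sigmoid_glu_naive(g: float) -> float:
    return g / (1.0 + math.exp(-g))


def sigmoid_glu_clamped(g: float, limit: float = 50.0) -> float:
    """Same thresholding idea as the C++ snippet."""
    if g >= limit:
        return g
    if g <= -limit:
        return 0.0
    return g / (1.0 + math.exp(-g))


def sigmoid_glu_stable(g: float) -> float:
    """Alternative: never exponentiate a positive number."""
    if g >= 0.0:
        return g / (1.0 + math.exp(-g))
    e = math.exp(g)  # g < 0, so 0 <= e < 1
    return g * e / (1.0 + e)


try:
    sigmoid_glu_naive(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

samples = [-1000.0, -60.0, -3.0, 0.0, 3.0, 60.0, 1000.0]
max_diff = max(abs(sigmoid_glu_clamped(g) - sigmoid_glu_stable(g)) for g in samples)
print(naive_overflowed, max_diff)
```

The sign-branch form needs no magic threshold, while the clamped form skips the `exp` call entirely for extreme inputs. Both agree to within rounding on the sampled range.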
+
+---
+
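+## 8. Summary of SwiGLU's advantages
+
+| Property | Advantage |
+|------|------|
+| **Dynamic gating** | Activation adapts to the input context, unlike the fixed element-wise rule of ReLU/GELU |
+| **Two branches** | up produces candidate features, gate learns the selection policy; a clear division of labor |
+| **Non-linearity** | Goes beyond linear maps and can learn complex feature relationships |
+| **Gradient flow** | No dead-neuron problem as with ReLU; compared with a plain activation, the gate adds a learned control signal |
+| **Parameter efficiency** | Needs one extra linear layer (gate), but these notes cite a 7-15% overall quality gain |
+| **Interpretability** | The gate value reflects how important a feature is and can be analyzed and visualized |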
+
+---
+
+## 9. Implementation details revisited
+
+### 9.1 The complete computation
+
+```cpp
+// Simplified float-only version
+#include <cmath>
+#include <cstddef>
+
+void swiglu_impl(float* out, const float* gate, const float* up, size_t n) {
+    for (size_t i = 0; i < n; ++i) {
+        float gate_val = gate[i];
+        float up_val = up[i];
+        
+        // Compute sigmoid_glu(gate) = gate / (1 + exp(-gate))
+        float glu_val;
+        if (gate_val >= 50.0f) {
+            glu_val = gate_val;  // approaches gate_val
+        } else if (gate_val <= -50.0f) {
+            glu_val = 0.0f;      // approaches 0
+        } else {
+            glu_val = gate_val / (1.0f + std::exp(-gate_val));
+        }
+        
+        // output = up * glu
+        out[i] = up_val * glu_val;
+    }
+}
+```
+
+### 9.2 Argument validation
+
+```
+✓ Shape check: out, gate and up must have the same shape
+✓ Contiguity check: all tensors must be contiguous in memory
+✓ Data types: F32, F16 and BF16 are supported
+✓ Device consistency: all tensors must be on the same device
+```
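The four checks above could be expressed roughly as follows. This is a sketch only: `TensorDesc` and `check_swiglu_args` are hypothetical names introduced for illustration and do not correspond to the LLAISYS tensor class or its C++ validation macros.

```python
from dataclasses import dataclass
from typing import Tuple

SUPPORTED_DTYPES = {"f32", "f16", "bf16"}


@dataclass(frozen=True)
class TensorDesc:
    """Hypothetical stand-in for a tensor's metadata."""
    shape: Tuple[int, ...]
    dtype: str
    device: str
    contiguous: bool = True


def check_swiglu_args(out: TensorDesc, gate: TensorDesc, up: TensorDesc) -> None:
    """Raise if the arguments violate any of the four checks."""
    tensors = {"out": out, "gate": gate, "up": up}
    # Shape check
    if not (out.shape == gate.shape == up.shape):
        raise ValueError("out, gate and up must have the same shape")
    for name, t in tensors.items():
        # Contiguity check
        if not t.contiguous:
            raise ValueError(f"{name} must be contiguous")
        # Data type check
        if t.dtype not in SUPPORTED_DTYPES:
            raise TypeError(f"{name} has unsupported dtype {t.dtype}")
    # Device consistency check
    if len({t.device for t in tensors.values()}) != 1:
        raise ValueError("all tensors must be on the same device")


ok = TensorDesc((2, 3072), "f32", "cpu")
check_swiglu_args(ok, ok, ok)  # passes silently
```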
+
+---
+
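+## 10. Summary: the full picture of SwiGLU
+
+### 10.1 Core insight
+
+```
+Question: how can the activation in a Transformer FFN adapt to the context?
+
+Answer: SwiGLU
+  = two parallel linear layers (candidates up, selection signal gate)
+  + an element-wise product (applies the selection to the candidates)
+  
+Result:
+  - an efficient feature-selection mechanism
+  - activation adapts to the output of self-attention (the context)
+  - better gradient flow and parameter utilization
+```
+
+### 10.2 Role in the complete system
+
+```
+Inference flow:
+  token embeddings
+    ↓
+  Block 0: Self-Attention (mix context) → SwiGLU (select features)
+    ↓
+  Block 1: Self-Attention (higher-level mixing) → SwiGLU (higher-level selection)
+    ↓
+  ...
+    ↓
+  Block N: Self-Attention → SwiGLU → final output
+
+Where SwiGLU sits:
+- upstream: receives the context information mixed in by self-attention
+- downstream: produces the feature-selected representation for the next layer
+- in between: acts as a translator from context to features
+```
+
+### 10.3 Why it works
+
+1. **Biological inspiration**: the brain also selects and activates specific neural pathways through attention
+2. **Information theory**: maximizes useful information and minimizes noise
+3. **Optimization**: dynamic selection gives stronger gradient signals
+4. **Empirical evidence**: these notes cite a 7-15% gain at equal parameter count, and mainstream models have adopted it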
+
+---
+
+## Reference implementation
+
+### Full code (C++)
+
+```cpp
+// Requires <cmath> and the project's llaisys::utils::cast helper.
+template <typename T>
+void swiglu_impl(T *out_ptr, const T *gate_ptr, const T *up_ptr, size_t total_size) {
+    for (size_t i = 0; i < total_size; ++i) {
+        // Compute in float regardless of the storage type T
+        float gate_val = llaisys::utils::cast<float>(gate_ptr[i]);
+        float up_val = llaisys::utils::cast<float>(up_ptr[i]);
+        
+        // Numerically stable sigmoid_glu
+        float glu_val;
+        if (gate_val >= 50.0f) {
+            glu_val = gate_val;
+        } else if (gate_val <= -50.0f) {
+            glu_val = 0.0f;
+        } else {
+            glu_val = gate_val / (1.0f + std::exp(-gate_val));
+        }
+        
+        out_ptr[i] = llaisys::utils::cast<T>(up_val * glu_val);
+    }
+}
+```
+
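+### Quick check (Python)
+
+```python
+import torch
+import torch.nn.functional as F
+
+def swiglu(up, gate):
+    """SwiGLU activation: up * SiLU(gate)."""
+    # SiLU(gate) = gate / (1 + exp(-gate))
+    return up * (gate / (1.0 + torch.exp(-gate)))
+
+# Test: compare against PyTorch's built-in SiLU
+up = torch.randn(2, 3072)
+gate = torch.randn(2, 3072)
+output = swiglu(up, gate)
+print(output.shape)  # torch.Size([2, 3072])
+assert torch.allclose(output, up * F.silu(gate), atol=1e-6)
+```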
+
+---
+
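+**Study suggestions**:
+1. Understand why an activation is needed → non-linearity
+2. Understand why SwiGLU works better → dynamic selection
+3. Understand what the gate learns → analyze the gradients from backpropagation
+4. Trace how features change inside the inference system → visualize activation patterns across layers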
diff --git a/include/llaisys/models/qwen2.h b/include/llaisys/models/qwen2.h
index 7054626d4..938df1f4c 100644
--- a/include/llaisys/models/qwen2.h
+++ b/include/llaisys/models/qwen2.h
@@ -4,13 +4,29 @@
 #include "../tensor.h"
 
 __C {
+    // Qwen2 model meta info
     struct LlaisysQwen2Meta {
+        // Data type of the model weights (the Python wrapper in this change always passes F32).
         llaisysDataType_t dtype;
+        // Model hyperparameters
+        // nlayer: number of layers
+        // hs: hidden size
+        // nh: number of attention (query) heads
+        // nkvh: number of key/value heads
+        // dh: head dimension
+        // di: intermediate dimension
+        // maxseq: maximum sequence length
+        // voc: vocabulary size
         size_t nlayer, hs, nh, nkvh, dh, di, maxseq, voc;
+        // Numeric parameters
+        // epsilon: RMSNorm epsilon (rms_norm_eps in config.json)
+        // theta: RoPE base frequency (rope_theta in config.json)
+        // end_token: end-of-sequence token ID
         float epsilon, theta;
         int64_t end_token;
     };
 
+    // Model weight tensors; per-layer weights are arrays indexed by layer.
     struct LlaisysQwen2Weights {
         llaisysTensor_t in_embed;
         llaisysTensor_t out_embed;
@@ -31,12 +47,37 @@ __C {
 
     struct LlaisysQwen2Model;
 
+    enum LlaisysQwen2WeightKind {
+        LLAISYS_QWEN2_WEIGHT_IN_EMBED = 0,
+        LLAISYS_QWEN2_WEIGHT_OUT_EMBED = 1,
+        LLAISYS_QWEN2_WEIGHT_OUT_NORM = 2,
+        LLAISYS_QWEN2_WEIGHT_ATTN_NORM = 3,
+        LLAISYS_QWEN2_WEIGHT_ATTN_Q_W = 4,
+        LLAISYS_QWEN2_WEIGHT_ATTN_Q_B = 5,
+        LLAISYS_QWEN2_WEIGHT_ATTN_K_W = 6,
+        LLAISYS_QWEN2_WEIGHT_ATTN_K_B = 7,
+        LLAISYS_QWEN2_WEIGHT_ATTN_V_W = 8,
+        LLAISYS_QWEN2_WEIGHT_ATTN_V_B = 9,
+        LLAISYS_QWEN2_WEIGHT_ATTN_O_W = 10,
+        LLAISYS_QWEN2_WEIGHT_MLP_NORM = 11,
+        LLAISYS_QWEN2_WEIGHT_MLP_GATE_W = 12,
+        LLAISYS_QWEN2_WEIGHT_MLP_UP_W = 13,
+        LLAISYS_QWEN2_WEIGHT_MLP_DOWN_W = 14
+    };
+
     __export struct LlaisysQwen2Model *llaisysQwen2ModelCreate(const LlaisysQwen2Meta *meta, llaisysDeviceType_t device, int *device_ids, int ndevice);
 
     __export void llaisysQwen2ModelDestroy(struct LlaisysQwen2Model * model);
 
     __export struct LlaisysQwen2Weights *llaisysQwen2ModelWeights(struct LlaisysQwen2Model * model);
 
-    __export int64_t llaisysQwen2ModelInfer(struct LlaisysQwen2Model * model, int64_t * token_ids, size_t ntoken);
+    __export llaisysTensor_t llaisysQwen2ModelGetWeight(struct LlaisysQwen2Model * model, int kind, size_t layer);
+
+    __export int64_t llaisysQwen2ModelInfer(struct LlaisysQwen2Model * model,
+                                            int64_t *token_ids,
+                                            size_t ntoken,
+                                            int top_k,
+                                            float top_p,
+                                            float temperature);
 }
 #endif // LLAISYS_MODELS_QWEN2_H
diff --git a/include/llaisys/ops.h b/include/llaisys/ops.h
index ddb3be246..c631f62d7 100644
--- a/include/llaisys/ops.h
+++ b/include/llaisys/ops.h
@@ -13,6 +13,7 @@ __C {
     __export void llaisysROPE(llaisysTensor_t out, llaisysTensor_t in, llaisysTensor_t pos_ids, float theta);
     __export void llaisysSelfAttention(llaisysTensor_t attn_val, llaisysTensor_t q, llaisysTensor_t k, llaisysTensor_t v, float scale);
     __export void llaisysSwiGLU(llaisysTensor_t out, llaisysTensor_t gate, llaisysTensor_t up);
+    __export void llaisysSample(llaisysTensor_t out_idx, llaisysTensor_t logits, float temperature, int top_k, float top_p);
 }
 
 #endif
diff --git a/python/llaisys/chat_cli.py b/python/llaisys/chat_cli.py
new file mode 100644
index 000000000..40772c7e9
--- /dev/null
+++ b/python/llaisys/chat_cli.py
@@ -0,0 +1,84 @@
+import argparse
+import json
+
+try:
+    import requests
+except Exception as exc:  # pragma: no cover
+    raise RuntimeError("requests is required. Install with: pip install requests") from exc
+
+
+def stream_chat(base_url: str, payload: dict) -> str:
+    url = f"{base_url.rstrip('/')}/v1/chat/completions"
+    with requests.post(url, json=payload, stream=True, timeout=3600) as resp:
+        resp.raise_for_status()
+        full_text = ""
+        for raw in resp.iter_lines(decode_unicode=True):
+            if not raw or not raw.startswith("data: "):
+                continue
+            data = raw[len("data: "):]
+            if data == "[DONE]":
+                break
+            obj = json.loads(data)
+            delta = obj["choices"][0].get("delta", {})
+            content = delta.get("content", "")
+            if content:
+                print(content, end="", flush=True)
+                full_text += content
+        print()
+        return full_text
+
+
+def once_chat(base_url: str, payload: dict) -> str:
+    url = f"{base_url.rstrip('/')}/v1/chat/completions"
+    resp = requests.post(url, json=payload, timeout=3600)
+    resp.raise_for_status()
+    data = resp.json()
+    text = data["choices"][0]["message"]["content"]
+    print(text)
+    return text
+
+
+def main():
+    parser = argparse.ArgumentParser(description="CLI chat UI for LLAISYS chat server")
+    parser.add_argument("--server", default="http://127.0.0.1:8000")
+    parser.add_argument("--model", default="llaisys-qwen2")
+    parser.add_argument("--stream", action="store_true")
+    parser.add_argument("--temperature", type=float, default=0.8)
+    parser.add_argument("--top_p", type=float, default=0.9)
+    parser.add_argument("--top_k", type=int, default=40)
+    parser.add_argument("--max_tokens", type=int, default=128)
+    parser.add_argument("--system", type=str, default="You are a helpful assistant.")
+    args = parser.parse_args()
+
+    messages = [{"role": "system", "content": args.system}]
+
+    print("LLAISYS Chat CLI. Type /exit to quit.")
+    while True:
+        user_text = input("You: ").strip()
+        if not user_text:
+            continue
+        if user_text in {"/exit", "/quit"}:
+            break
+
+        messages.append({"role": "user", "content": user_text})
+        payload = {
+            "model": args.model,
+            "messages": messages,
+            "max_tokens": args.max_tokens,
+            "temperature": args.temperature,
+            "top_p": args.top_p,
+            "top_k": args.top_k,
+            "stream": args.stream,
+        }
+
+        print("Assistant: ", end="", flush=True)
+        if args.stream:
+            assistant_text = stream_chat(args.server, payload)
+        else:
+            assistant_text = once_chat(args.server, payload)
+
+        messages.append({"role": "assistant", "content": assistant_text})
+
+
+if __name__ == "__main__":
+    main()
diff --git a/python/llaisys/chat_server.py b/python/llaisys/chat_server.py
new file mode 100644
index 000000000..be39b46e3
--- /dev/null
+++ b/python/llaisys/chat_server.py
@@ -0,0 +1,146 @@
+import argparse
+import json
+import threading
+import time
+import uuid
+from typing import List, Literal, Optional
+
+import llaisys
+from llaisys.models import Qwen2
+
+try:
+    from fastapi import FastAPI, HTTPException
+    from fastapi.responses import JSONResponse, StreamingResponse
+    from pydantic import BaseModel, Field
+except Exception as exc:  # pragma: no cover
+    raise RuntimeError(
+        "FastAPI dependencies are missing. Install with: pip install fastapi uvicorn"
+    ) from exc
+
+try:
+    from transformers import AutoTokenizer
+except Exception as exc:  # pragma: no cover
+    raise RuntimeError(
+        "transformers is required for chat server. Install with: pip install transformers"
+    ) from exc
+
+
+class ChatMessage(BaseModel):
+    role: Literal["system", "user", "assistant"]
+    content: str
+
+
+class ChatCompletionRequest(BaseModel):
+    model: str = "llaisys-qwen2"
+    messages: List[ChatMessage]
+    max_tokens: int = Field(default=128, ge=1, le=2048)
+    temperature: float = Field(default=0.8, ge=0.0, le=2.0)
+    top_p: float = Field(default=0.9, ge=0.0, le=1.0)
+    top_k: int = Field(default=40, ge=0)
+    stream: bool = False
+
+
+class ChatService:
+    def __init__(self, model_path: str, device: str):
+        self.model_path = model_path
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        self.model = Qwen2(
+            model_path,
+            llaisys.DeviceType.NVIDIA if device == "nvidia" else llaisys.DeviceType.CPU,
+        )
+        self._lock = threading.Lock()
+
+    def _build_input_ids(self, messages: List[ChatMessage]) -> List[int]:
+        conversation = [{"role": m.role, "content": m.content} for m in messages]
+        prompt = self.tokenizer.apply_chat_template(
+            conversation=conversation,
+            add_generation_prompt=True,
+            tokenize=False,
+        )
+        return self.tokenizer.encode(prompt)
+
+    def generate(self, req: ChatCompletionRequest):
+        with self._lock:
+            input_ids = self._build_input_ids(req.messages)
+            output_ids = self.model.generate(
+                input_ids,
+                max_new_tokens=req.max_tokens,
+                top_k=req.top_k,
+                top_p=req.top_p,
+                temperature=req.temperature,
+            )
+            new_ids = output_ids[len(input_ids):]
+            text = self.tokenizer.decode(new_ids, skip_special_tokens=True)
+            return text
+
+
+def create_app(model_path: str, device: str = "cpu") -> FastAPI:
+    app = FastAPI(title="LLAISYS Chat API", version="0.1.0")
+    svc = ChatService(model_path, device)
+
+    @app.get("/v1/models")
+    def list_models():
+        return {
+            "object": "list",
+            "data": [{"id": "llaisys-qwen2", "object": "model", "owned_by": "llaisys"}],
+        }
+
+    @app.post("/v1/chat/completions")
+    def chat_completions(req: ChatCompletionRequest):
+        if not req.messages:
+            raise HTTPException(status_code=400, detail="messages must not be empty")
+
+        completion_id = f"chatcmpl-{uuid.uuid4().hex[:12]}"
+        created = int(time.time())
+
+        if not req.stream:
+            text = svc.generate(req)
+            return JSONResponse(
+                {
+                    "id": completion_id,
+                    "object": "chat.completion",
+                    "created": created,
+                    "model": req.model,
+                    "choices": [
+                        {
+                            "index": 0,
+                            "finish_reason": "stop",
+                            "message": {"role": "assistant", "content": text},
+                        }
+                    ],
+                }
+            )
+
+        def event_stream():
+            text = svc.generate(req)
+            chunk = {
+                "id": completion_id,
+                "object": "chat.completion.chunk",
+                "created": created,
+                "model": req.model,
+                "choices": [{"index": 0, "delta": {"role": "assistant", "content": text}, "finish_reason": "stop"}],
+            }
+            yield f"data: {json.dumps(chunk, ensure_ascii=False)}\n\n"
+            yield "data: [DONE]\n\n"
+
+        return StreamingResponse(event_stream(), media_type="text/event-stream")
+
+    return app
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Run LLAISYS chat completion server")
+    parser.add_argument("--model", required=True, type=str, help="Local model directory")
+    parser.add_argument("--device", choices=["cpu", "nvidia"], default="cpu")
+    parser.add_argument("--host", default="127.0.0.1")
+    parser.add_argument("--port", default=8000, type=int)
+    args = parser.parse_args()
+
+    import uvicorn
+
+    app = create_app(args.model, args.device)
+    uvicorn.run(app, host=args.host, port=args.port)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/python/llaisys/libllaisys/__init__.py b/python/llaisys/libllaisys/__init__.py
index f536fb527..aa68f3e90 100644
--- a/python/llaisys/libllaisys/__init__.py
+++ b/python/llaisys/libllaisys/__init__.py
@@ -12,6 +12,8 @@
 from .tensor import llaisysTensor_t
 from .tensor import load_tensor
 from .ops import load_ops
+from .models import load_models
+from .models import LlaisysQwen2Meta, LlaisysQwen2Weights
 
 
 def load_shared_library():
@@ -38,6 +40,7 @@ def load_shared_library():
 load_runtime(LIB_LLAISYS)
 load_tensor(LIB_LLAISYS)
 load_ops(LIB_LLAISYS)
+load_models(LIB_LLAISYS)
 
 
 __all__ = [
@@ -52,4 +55,6 @@ def load_shared_library():
     "llaisysMemcpyKind_t",
     "MemcpyKind",
     "llaisysStream_t",
+    "LlaisysQwen2Meta",
+    "LlaisysQwen2Weights",
 ]
diff --git a/python/llaisys/libllaisys/models.py b/python/llaisys/libllaisys/models.py
new file mode 100644
index 000000000..231f4fe96
--- /dev/null
+++ b/python/llaisys/libllaisys/models.py
@@ -0,0 +1,71 @@
+import ctypes
+from ctypes import c_size_t, c_int, c_int64, c_float, POINTER
+
+from .llaisys_types import llaisysDataType_t, llaisysDeviceType_t
+from .tensor import llaisysTensor_t
+
+
+class LlaisysQwen2Meta(ctypes.Structure):
+    _fields_ = [
+        ("dtype", llaisysDataType_t),
+        ("nlayer", c_size_t),
+        ("hs", c_size_t),
+        ("nh", c_size_t),
+        ("nkvh", c_size_t),
+        ("dh", c_size_t),
+        ("di", c_size_t),
+        ("maxseq", c_size_t),
+        ("voc", c_size_t),
+        ("epsilon", c_float),
+        ("theta", c_float),
+        ("end_token", c_int64),
+    ]
+
+
+class LlaisysQwen2Weights(ctypes.Structure):
+    _fields_ = [
+        ("in_embed", llaisysTensor_t),
+        ("out_embed", llaisysTensor_t),
+        ("out_norm_w", llaisysTensor_t),
+        ("attn_norm_w", POINTER(llaisysTensor_t)),
+        ("attn_q_w", POINTER(llaisysTensor_t)),
+        ("attn_q_b", POINTER(llaisysTensor_t)),
+        ("attn_k_w", POINTER(llaisysTensor_t)),
+        ("attn_k_b", POINTER(llaisysTensor_t)),
+        ("attn_v_w", POINTER(llaisysTensor_t)),
+        ("attn_v_b", POINTER(llaisysTensor_t)),
+        ("attn_o_w", POINTER(llaisysTensor_t)),
+        ("mlp_norm_w", POINTER(llaisysTensor_t)),
+        ("mlp_gate_w", POINTER(llaisysTensor_t)),
+        ("mlp_up_w", POINTER(llaisysTensor_t)),
+        ("mlp_down_w", POINTER(llaisysTensor_t)),
+    ]
+
+
+def load_models(lib):
+    lib.llaisysQwen2ModelCreate.argtypes = [
+        ctypes.POINTER(LlaisysQwen2Meta),
+        llaisysDeviceType_t,
+        ctypes.POINTER(c_int),
+        c_int,
+    ]
+    lib.llaisysQwen2ModelCreate.restype = ctypes.c_void_p
+
+    lib.llaisysQwen2ModelDestroy.argtypes = [ctypes.c_void_p]
+    lib.llaisysQwen2ModelDestroy.restype = None
+
+    lib.llaisysQwen2ModelWeights.argtypes = [ctypes.c_void_p]
+    lib.llaisysQwen2ModelWeights.restype = ctypes.POINTER(LlaisysQwen2Weights)
+
+    lib.llaisysQwen2ModelGetWeight.argtypes = [ctypes.c_void_p, c_int, c_size_t]
+    lib.llaisysQwen2ModelGetWeight.restype = llaisysTensor_t
+
+    lib.llaisysQwen2ModelInfer.argtypes = [ctypes.c_void_p, ctypes.POINTER(c_int64), c_size_t, c_int, c_float, c_float]
+    lib.llaisysQwen2ModelInfer.restype = c_int64
+
+
+__all__ = [
+    "LlaisysQwen2Meta",
+    "LlaisysQwen2Weights",
+    "load_models",
+]
diff --git a/python/llaisys/libllaisys/ops.py b/python/llaisys/libllaisys/ops.py
index 5be095eff..2d195dc18 100644
--- a/python/llaisys/libllaisys/ops.py
+++ b/python/llaisys/libllaisys/ops.py
@@ -1,5 +1,5 @@
 from .tensor import llaisysTensor_t
-from ctypes import c_float
+from ctypes import c_float, c_int
 
 def load_ops(lib):
     lib.llaisysAdd.argtypes = [llaisysTensor_t, llaisysTensor_t, llaisysTensor_t]
@@ -34,3 +34,6 @@ def load_ops(lib):
 
     lib.llaisysSwiGLU.argtypes = [llaisysTensor_t, llaisysTensor_t, llaisysTensor_t]
     lib.llaisysSwiGLU.restype = None
+
+    lib.llaisysSample.argtypes = [llaisysTensor_t, llaisysTensor_t, c_float, c_int, c_float]
+    lib.llaisysSample.restype = None
diff --git a/python/llaisys/models/qwen2.py b/python/llaisys/models/qwen2.py
index 0d07b0b21..cc4eb2845 100644
--- a/python/llaisys/models/qwen2.py
+++ b/python/llaisys/models/qwen2.py
@@ -1,23 +1,192 @@
 from typing import Sequence
-from ..libllaisys import LIB_LLAISYS
-from ..libllaisys import DeviceType
-
+import ctypes
+import json
 from pathlib import Path
+
+import numpy as np
 import safetensors
+import torch
+
+from ..libllaisys import LIB_LLAISYS
+from ..libllaisys import DeviceType, DataType
+from ..libllaisys import LlaisysQwen2Meta, LlaisysQwen2Weights
+from ..tensor import Tensor
 
 
-class Qwen2:
+def _np_dtype(dtype: DataType):
+    if dtype == DataType.BF16:
+        return np.dtype("bfloat16")
+    if dtype == DataType.F16:
+        return np.float16
+    if dtype == DataType.F32:
+        return np.float32
+    raise ValueError(f"Unsupported dtype: {dtype}")
 
-    def __init__(self, model_path, device: DeviceType = DeviceType.CPU):
-        # TODO: Implement model constructor
 
+def _llaisys_dtype(dtype_str: str) -> DataType:
+    if dtype_str in ("bfloat16", "bf16"):
+        return DataType.BF16
+    if dtype_str in ("float16", "fp16", "f16"):
+        return DataType.F16
+    if dtype_str in ("float32", "fp32", "f32"):
+        return DataType.F32
+    return DataType.F32
+
+
+class Qwen2:
+    def __init__(self, model_path, device: DeviceType = DeviceType.CPU):
         model_path = Path(model_path)
+        config_path = model_path / "config.json"
+        if not config_path.exists():
+            raise FileNotFoundError(f"config.json not found in {model_path}")
+
+        config = json.loads(config_path.read_text(encoding="utf-8"))
+        self.config = config  # keep the config for later use
+
+        hs = int(config["hidden_size"])
+        nlayer = int(config["num_hidden_layers"])
+        nh = int(config["num_attention_heads"])
+        nkvh = int(config.get("num_key_value_heads", nh))
+        di = int(config["intermediate_size"])
+        maxseq = int(config.get("max_position_embeddings", 4096))
+        voc = int(config["vocab_size"])
+        epsilon = float(config.get("rms_norm_eps", 1e-5))
+        theta = float(config.get("rope_theta", 10000.0))
+
+        eos_token = config.get("eos_token_id", None)
+        if isinstance(eos_token, list):
+            end_token = int(eos_token[0])
+        elif eos_token is None:
+            end_token = -1
+        else:
+            end_token = int(eos_token)
+
+        dtype = _llaisys_dtype(str(config.get("torch_dtype", "float32")))
+        # Force F32: all weights are converted to F32 at load time
+        dtype = DataType.F32
+        try:
+            np.dtype("bfloat16")
+            self._np_bf16 = True
+        except TypeError:
+            self._np_bf16 = False
+        dh = hs // nh
+
+        meta = LlaisysQwen2Meta(
+            dtype,
+            nlayer,
+            hs,
+            nh,
+            nkvh,
+            dh,
+            di,
+            maxseq,
+            voc,
+            epsilon,
+            theta,
+            end_token,
+        )
+
+        device_ids = (ctypes.c_int * 1)(0)
+        self._model = LIB_LLAISYS.llaisysQwen2ModelCreate(
+            ctypes.byref(meta), ctypes.c_int(device), device_ids, ctypes.c_int(1)
+        )
+        if not self._model:
+            raise RuntimeError("Failed to create Qwen2 model")
+        self._end_token = end_token
+        self._dtype = dtype
+        self._nlayer = nlayer
+
+        self._load_weights(model_path)
+
+    def __del__(self):
+        if hasattr(self, "_model") and self._model:
+            LIB_LLAISYS.llaisysQwen2ModelDestroy(self._model)
+            self._model = None
+
+    def _load_tensor(self, tensor_handle, array: np.ndarray):
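+        # Make sure the array is contiguous in memory
+        array = np.ascontiguousarray(array)
+        # Call the tensorLoad C API directly instead of creating a temporary Tensor object (its destructor would cause a double free)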
+        LIB_LLAISYS.tensorLoad(tensor_handle, array.ctypes.data_as(ctypes.c_void_p))
+
+    def _load_weights(self, model_path: Path):
+        found_out_embed = False
+
+        def get_weight(kind: int, layer: int = 0):
+            return LIB_LLAISYS.llaisysQwen2ModelGetWeight(self._model, kind, layer)
 
         for file in sorted(model_path.glob("*.safetensors")):
-            data_ = safetensors.safe_open(file, framework="numpy", device="cpu")
+            try:
+                # Try numpy first (faster for most dtypes)
+                data_ = safetensors.safe_open(file, framework="numpy", device="cpu")
+            except Exception:
+                # Fall back to torch for BF16 support
+                data_ = safetensors.safe_open(file, framework="pt", device="cpu")
+                
             for name_ in data_.keys():
-                ## TODO: load the model weights
-                pass
+                try:
+                    arr = data_.get_tensor(name_)
+                except Exception:
+                    # If get_tensor fails with numpy framework, reload with torch
+                    data_ = safetensors.safe_open(file, framework="pt", device="cpu")
+                    arr = data_.get_tensor(name_)
+                
+                # Convert to numpy as F32
+                if isinstance(arr, torch.Tensor):
+                    # Convert torch tensor to numpy - always use float32 for consistency
+                    arr = arr.float().cpu().numpy()
+                else:
+                    # numpy array
+                    arr = arr.astype(np.float32)
+
+                if name_ == "model.embed_tokens.weight":
+                    self._load_tensor(get_weight(0), arr)
+                    if not found_out_embed:
+                        self._load_tensor(get_weight(1), arr)
+                    continue
+
+                if name_ == "lm_head.weight":
+                    self._load_tensor(get_weight(1), arr)
+                    found_out_embed = True
+                    continue
+
+                if name_ == "model.norm.weight":
+                    self._load_tensor(get_weight(2), arr)
+                    continue
+
+                if not name_.startswith("model.layers."):
+                    continue
+
+                parts = name_.split(".")
+                if len(parts) < 4:
+                    continue
+                layer_id = int(parts[2])
+                rest = ".".join(parts[3:])
+
+                if rest == "input_layernorm.weight":
+                    self._load_tensor(get_weight(3, layer_id), arr)
+                elif rest == "self_attn.q_proj.weight":
+                    self._load_tensor(get_weight(4, layer_id), arr)
+                elif rest == "self_attn.q_proj.bias":
+                    self._load_tensor(get_weight(5, layer_id), arr)
+                elif rest == "self_attn.k_proj.weight":
+                    self._load_tensor(get_weight(6, layer_id), arr)
+                elif rest == "self_attn.k_proj.bias":
+                    self._load_tensor(get_weight(7, layer_id), arr)
+                elif rest == "self_attn.v_proj.weight":
+                    self._load_tensor(get_weight(8, layer_id), arr)
+                elif rest == "self_attn.v_proj.bias":
+                    self._load_tensor(get_weight(9, layer_id), arr)
+                elif rest == "self_attn.o_proj.weight":
+                    self._load_tensor(get_weight(10, layer_id), arr)
+                elif rest == "post_attention_layernorm.weight":
+                    self._load_tensor(get_weight(11, layer_id), arr)
+                elif rest == "mlp.gate_proj.weight":
+                    self._load_tensor(get_weight(12, layer_id), arr)
+                elif rest == "mlp.up_proj.weight":
+                    self._load_tensor(get_weight(13, layer_id), arr)
+                elif rest == "mlp.down_proj.weight":
+                    self._load_tensor(get_weight(14, layer_id), arr)
 
     def generate(
         self,
@@ -27,7 +196,28 @@ def generate(
         top_p: float = 0.8,
         temperature: float = 0.8,
     ):
+        # top_k, top_p and temperature are forwarded to llaisysQwen2ModelInfer, which selects the next token
+        tokens = list(inputs)
+        max_new_tokens = 128 if max_new_tokens is None else int(max_new_tokens)
+
+        top_k = int(top_k)
+        top_p = float(top_p)
+        temperature = float(temperature)
 
-        # TODO: Implement generate function
+        for _ in range(max_new_tokens):
+            arr = (ctypes.c_int64 * len(tokens))(*tokens)
+            next_token = int(
+                LIB_LLAISYS.llaisysQwen2ModelInfer(
+                    self._model,
+                    arr,
+                    len(tokens),
+                    ctypes.c_int(top_k),
+                    ctypes.c_float(top_p),
+                    ctypes.c_float(temperature),
+                )
+            )
+            tokens.append(next_token)
+            if self._end_token >= 0 and next_token == self._end_token:
+                break
 
-        return []
+        return tokens
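
Reviewer note: the `generate` loop above re-sends the full token list on every step and stops once the end token appears. Below is a minimal sketch of that control flow, not the project's actual binding. `generate_tokens` and `fake_infer` are hypothetical names, and `fake_infer` merely stands in for `llaisysQwen2ModelInfer` with made-up outputs.

```python
from typing import Callable, List, Sequence


def generate_tokens(
    infer: Callable[[Sequence[int]], int],
    inputs: Sequence[int],
    max_new_tokens: int = 128,
    end_token: int = -1,
) -> List[int]:
    """Append one token per backend call until end_token or the budget is hit."""
    tokens = list(inputs)
    for _ in range(max_new_tokens):
        next_token = int(infer(tokens))
        tokens.append(next_token)
        # A negative end_token disables early stopping, mirroring the diff.
        if end_token >= 0 and next_token == end_token:
            break
    return tokens


def fake_infer(tokens: Sequence[int]) -> int:
    """Hypothetical backend stub: returns the current sequence length."""
    return len(tokens)


result = generate_tokens(fake_infer, [7, 8], max_new_tokens=10, end_token=4)
```

As in the diff, the returned list contains the prompt followed by the generated tokens. The C entry point returns `-1` on failure, and the loop appends whatever it receives, so a caller may want to check for that value.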
diff --git a/python/llaisys/ops.py b/python/llaisys/ops.py
index ed0180bc8..3a0643c22 100644
--- a/python/llaisys/ops.py
+++ b/python/llaisys/ops.py
@@ -53,3 +53,9 @@ def self_attention(attn_val: Tensor, q: Tensor, k: Tensor, v: Tensor, scale: flo
     @staticmethod
     def swiglu(out: Tensor, gate: Tensor, up: Tensor):
         LIB_LLAISYS.llaisysSwiGLU(out.lib_tensor(), gate.lib_tensor(), up.lib_tensor())
+
+    @staticmethod
+    def sample(out_idx: Tensor, logits: Tensor, temperature: float = 1.0, top_k: int = 0, top_p: float = 1.0):
+        LIB_LLAISYS.llaisysSample(
+            out_idx.lib_tensor(), logits.lib_tensor(), c_float(float(temperature)), c_int(int(top_k)), c_float(float(top_p))
+        )
diff --git a/python/setup.cfg b/python/setup.cfg
index b35fc65f7..3906af12c 100644
--- a/python/setup.cfg
+++ b/python/setup.cfg
@@ -19,3 +19,9 @@ llaisys =
     libllaisys/*.so
     libllaisys/*.dll
     libllaisys/*.dylib
+
+[options.extras_require]
+chat =
+    fastapi
+    uvicorn
+    requests
diff --git a/src/device/nvidia/nvidia_resource.cu b/src/device/nvidia/nvidia_resource.cpp
similarity index 86%
rename from src/device/nvidia/nvidia_resource.cu
rename to src/device/nvidia/nvidia_resource.cpp
index 2e63647e5..01ecfd4a1 100644
--- a/src/device/nvidia/nvidia_resource.cu
+++ b/src/device/nvidia/nvidia_resource.cpp
@@ -4,4 +4,6 @@ namespace llaisys::device::nvidia {
 
 Resource::Resource(int device_id) : llaisys::device::DeviceResource(LLAISYS_DEVICE_NVIDIA, device_id) {}
 
+Resource::~Resource() = default;
+
 } // namespace llaisys::device::nvidia
diff --git a/src/device/nvidia/nvidia_runtime_api.cpp b/src/device/nvidia/nvidia_runtime_api.cpp
new file mode 100644
index 000000000..14a9ad7b3
--- /dev/null
+++ b/src/device/nvidia/nvidia_runtime_api.cpp
@@ -0,0 +1,161 @@
+#include "../runtime_api.hpp"
+
+#include <cuda_runtime.h>
+
+#include <sstream>
+#include <stdexcept>
+#include <vector>
+
+namespace llaisys::device::nvidia {
+
+namespace runtime_api {
+namespace {
+void check_cuda(cudaError_t err, const char *msg);
+
+std::vector<int> &available_devices() {
+    static std::vector<int> devices;
+    static bool initialized = false;
+    if (initialized) {
+        return devices;
+    }
+    initialized = true;
+
+    int ndev = 0;
+    cudaError_t err = cudaGetDeviceCount(&ndev);
+    if (err == cudaErrorNoDevice) {
+        return devices;
+    }
+    check_cuda(err, "cudaGetDeviceCount");
+
+    for (int dev = 0; dev < ndev; ++dev) {
+        if (cudaSetDevice(dev) != cudaSuccess) {
+            (void)cudaGetLastError();
+            continue;
+        }
+        // Warm up context creation to filter out temporarily unavailable devices.
+        if (cudaFree(nullptr) != cudaSuccess) {
+            (void)cudaGetLastError();
+            continue;
+        }
+        devices.push_back(dev);
+    }
+    return devices;
+}
+
+void check_cuda(cudaError_t err, const char *msg) {
+    if (err != cudaSuccess) {
+        std::ostringstream oss;
+        oss << "[CUDA] " << msg << " failed: " << cudaGetErrorString(err);
+        throw std::runtime_error(oss.str());
+    }
+}
+
+cudaMemcpyKind to_cuda_memcpy_kind(llaisysMemcpyKind_t kind) {
+    switch (kind) {
+    case LLAISYS_MEMCPY_H2H:
+        return cudaMemcpyHostToHost;
+    case LLAISYS_MEMCPY_H2D:
+        return cudaMemcpyHostToDevice;
+    case LLAISYS_MEMCPY_D2H:
+        return cudaMemcpyDeviceToHost;
+    case LLAISYS_MEMCPY_D2D:
+        return cudaMemcpyDeviceToDevice;
+    default:
+        throw std::invalid_argument("Unsupported memcpy kind");
+    }
+}
+} // namespace
+
+int getDeviceCount() {
+    return static_cast<int>(available_devices().size());
+}
+
+void setDevice(int device_id) {
+    auto &devices = available_devices();
+    if (device_id < 0 || static_cast<size_t>(device_id) >= devices.size()) {
+        throw std::invalid_argument("invalid nvidia device id");
+    }
+    check_cuda(cudaSetDevice(devices[static_cast<size_t>(device_id)]), "cudaSetDevice");
+    check_cuda(cudaFree(nullptr), "cudaFree(warmup)");
+}
+
+void deviceSynchronize() {
+    check_cuda(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
+}
+
+llaisysStream_t createStream() {
+    cudaStream_t stream = nullptr;
+    check_cuda(cudaStreamCreate(&stream), "cudaStreamCreate");
+    return reinterpret_cast<llaisysStream_t>(stream);
+}
+
+void destroyStream(llaisysStream_t stream) {
+    if (!stream) {
+        return;
+    }
+    check_cuda(cudaStreamDestroy(reinterpret_cast<cudaStream_t>(stream)), "cudaStreamDestroy");
+}
+
+void streamSynchronize(llaisysStream_t stream) {
+    check_cuda(cudaStreamSynchronize(reinterpret_cast<cudaStream_t>(stream)), "cudaStreamSynchronize");
+}
+
+void *mallocDevice(size_t size) {
+    void *ptr = nullptr;
+    check_cuda(cudaMalloc(&ptr, size), "cudaMalloc");
+    return ptr;
+}
+
+void freeDevice(void *ptr) {
+    if (!ptr) {
+        return;
+    }
+    check_cuda(cudaFree(ptr), "cudaFree");
+}
+
+void *mallocHost(size_t size) {
+    void *ptr = nullptr;
+    check_cuda(cudaMallocHost(&ptr, size), "cudaMallocHost");
+    return ptr;
+}
+
+void freeHost(void *ptr) {
+    if (!ptr) {
+        return;
+    }
+    check_cuda(cudaFreeHost(ptr), "cudaFreeHost");
+}
+
+void memcpySync(void *dst, const void *src, size_t size, llaisysMemcpyKind_t kind) {
+    check_cuda(cudaMemcpy(dst, src, size, to_cuda_memcpy_kind(kind)), "cudaMemcpy");
+}
+
+void memcpyAsync(void *dst, const void *src, size_t size, llaisysMemcpyKind_t kind, llaisysStream_t stream) {
+    check_cuda(cudaMemcpyAsync(dst,
+                               src,
+                               size,
+                               to_cuda_memcpy_kind(kind),
+                               reinterpret_cast<cudaStream_t>(stream)),
+               "cudaMemcpyAsync");
+}
+
+static const LlaisysRuntimeAPI RUNTIME_API = {
+    &getDeviceCount,
+    &setDevice,
+    &deviceSynchronize,
+    &createStream,
+    &destroyStream,
+    &streamSynchronize,
+    &mallocDevice,
+    &freeDevice,
+    &mallocHost,
+    &freeHost,
+    &memcpySync,
+    &memcpyAsync};
+
+} // namespace runtime_api
+
+const LlaisysRuntimeAPI *getRuntimeAPI() {
+    return &runtime_api::RUNTIME_API;
+}
+} // namespace llaisys::device::nvidia
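
Reviewer note: `available_devices()` keeps only the devices that pass the warm-up probe, and `setDevice` indexes into that filtered list. The `device_id` callers pass is therefore a logical index, which can differ from the CUDA ordinal. A minimal sketch of that mapping follows. `enumerate_usable`, `to_physical` and the probe are hypothetical stand-ins and make no CUDA calls.

```python
from typing import Callable, List


def enumerate_usable(ndev: int, probe: Callable[[int], bool]) -> List[int]:
    """Return the physical ordinals for which the probe succeeds, in order."""
    return [dev for dev in range(ndev) if probe(dev)]


def to_physical(devices: List[int], logical_id: int) -> int:
    """Map a logical id (index into the filtered list) to a physical ordinal."""
    if logical_id < 0 or logical_id >= len(devices):
        raise ValueError("invalid device id")
    return devices[logical_id]


# Hypothetical machine with 4 devices where ordinal 1 fails the probe.
devices = enumerate_usable(4, lambda dev: dev != 1)
physical = to_physical(devices, 1)
```

In this made-up case, logical id 1 resolves to physical ordinal 2 because ordinal 1 was filtered out.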
diff --git a/src/device/nvidia/nvidia_runtime_api.cu b/src/device/nvidia/nvidia_runtime_api.cu
deleted file mode 100644
index cab928261..000000000
--- a/src/device/nvidia/nvidia_runtime_api.cu
+++ /dev/null
@@ -1,75 +0,0 @@
-#include "../runtime_api.hpp"
-
-#include <cstdlib>
-#include <cstring>
-
-namespace llaisys::device::nvidia {
-
-namespace runtime_api {
-int getDeviceCount() {
-    TO_BE_IMPLEMENTED();
-}
-
-void setDevice(int) {
-    TO_BE_IMPLEMENTED();
-}
-
-void deviceSynchronize() {
-    TO_BE_IMPLEMENTED();
-}
-
-llaisysStream_t createStream() {
-    TO_BE_IMPLEMENTED();
-}
-
-void destroyStream(llaisysStream_t stream) {
-    TO_BE_IMPLEMENTED();
-}
-void streamSynchronize(llaisysStream_t stream) {
-    TO_BE_IMPLEMENTED();
-}
-
-void *mallocDevice(size_t size) {
-    TO_BE_IMPLEMENTED();
-}
-
-void freeDevice(void *ptr) {
-    TO_BE_IMPLEMENTED();
-}
-
-void *mallocHost(size_t size) {
-    TO_BE_IMPLEMENTED();
-}
-
-void freeHost(void *ptr) {
-    TO_BE_IMPLEMENTED();
-}
-
-void memcpySync(void *dst, const void *src, size_t size, llaisysMemcpyKind_t kind) {
-    TO_BE_IMPLEMENTED();
-}
-
-void memcpyAsync(void *dst, const void *src, size_t size, llaisysMemcpyKind_t kind) {
-    TO_BE_IMPLEMENTED();
-}
-
-static const LlaisysRuntimeAPI RUNTIME_API = {
-    &getDeviceCount,
-    &setDevice,
-    &deviceSynchronize,
-    &createStream,
-    &destroyStream,
-    &streamSynchronize,
-    &mallocDevice,
-    &freeDevice,
-    &mallocHost,
-    &freeHost,
-    &memcpySync,
-    &memcpyAsync};
-
-} // namespace runtime_api
-
-const LlaisysRuntimeAPI *getRuntimeAPI() {
-    return &runtime_api::RUNTIME_API;
-}
-} // namespace llaisys::device::nvidia
diff --git a/src/llaisys/ops.cc b/src/llaisys/ops.cc
index c99fbc32f..b84285c5d 100644
--- a/src/llaisys/ops.cc
+++ b/src/llaisys/ops.cc
@@ -9,6 +9,7 @@
 #include "../ops/rearrange/op.hpp"
 #include "../ops/rms_norm/op.hpp"
 #include "../ops/rope/op.hpp"
+#include "../ops/sampling/op.hpp"
 #include "../ops/self_attention/op.hpp"
 #include "../ops/swiglu/op.hpp"
 
@@ -40,4 +41,7 @@ __C {
     void llaisysSwiGLU(llaisysTensor_t out, llaisysTensor_t gate, llaisysTensor_t up) {
         llaisys::ops::swiglu(out->tensor, gate->tensor, up->tensor);
     }
+    void llaisysSample(llaisysTensor_t out_idx, llaisysTensor_t logits, float temperature, int top_k, float top_p) {
+        llaisys::ops::sample(out_idx->tensor, logits->tensor, temperature, top_k, top_p);
+    }
 }
diff --git a/src/llaisys/qwen2.cc b/src/llaisys/qwen2.cc
new file mode 100644
index 000000000..2b4c8986b
--- /dev/null
+++ b/src/llaisys/qwen2.cc
@@ -0,0 +1,449 @@
+#include "llaisys/models/qwen2.h"
+
+#include "llaisys_tensor.hpp"
+
+#include "../core/llaisys_core.hpp"
+
+#include "../ops/add/op.hpp"
+#include "../ops/argmax/op.hpp"
+#include "../ops/embedding/op.hpp"
+#include "../ops/linear/op.hpp"
+#include "../ops/rms_norm/op.hpp"
+#include "../ops/rope/op.hpp"
+#include "../ops/sampling/op.hpp"
+#include "../ops/self_attention/op.hpp"
+#include "../ops/swiglu/op.hpp"
+#include "../utils.hpp"
+
+#include <cmath>
+#include <cstring>
+#include <vector>
+
+namespace {
+using llaisys::tensor_t;
+
+llaisysTensor_t create_tensor_handle(const std::vector<size_t> &shape,
+                                     llaisysDataType_t dtype,
+                                     llaisysDeviceType_t device,
+                                     int device_id) {
+    auto t = llaisys::Tensor::create(shape, dtype, device, device_id);
+    return new LlaisysTensor{t};
+}
+
+void zero_tensor(llaisysTensor_t t) {
+    const size_t bytes = t->tensor->numel() * t->tensor->elementSize();
+    if (t->tensor->deviceType() == LLAISYS_DEVICE_CPU) {
+        std::memset(t->tensor->data(), 0, bytes);
+        return;
+    }
+
+    auto &ctx = llaisys::core::context();
+    ctx.setDevice(t->tensor->deviceType(), t->tensor->deviceId());
+    const auto *api = ctx.runtime().api();
+    std::vector<std::byte> zeros(bytes, std::byte{0});
+    api->memcpy_sync(t->tensor->data(), zeros.data(), bytes, LLAISYS_MEMCPY_H2D);
+}
+
+void tensor_write_i64(tensor_t t, int64_t value) {
+    if (t->deviceType() == LLAISYS_DEVICE_CPU) {
+        *reinterpret_cast<int64_t *>(t->data()) = value;
+        return;
+    }
+
+    auto &ctx = llaisys::core::context();
+    ctx.setDevice(t->deviceType(), t->deviceId());
+    const auto *api = ctx.runtime().api();
+    api->memcpy_sync(t->data(), &value, sizeof(int64_t), LLAISYS_MEMCPY_H2D);
+}
+
+int64_t tensor_read_i64(tensor_t t) {
+    if (t->deviceType() == LLAISYS_DEVICE_CPU) {
+        return *reinterpret_cast<int64_t *>(t->data());
+    }
+
+    auto &ctx = llaisys::core::context();
+    ctx.setDevice(t->deviceType(), t->deviceId());
+    const auto *api = ctx.runtime().api();
+    int64_t value = 0;
+    api->memcpy_sync(&value, t->data(), sizeof(int64_t), LLAISYS_MEMCPY_D2H);
+    return value;
+}
+
+void tensor_copy_bytes(tensor_t dst, size_t dst_offset, tensor_t src, size_t src_offset, size_t bytes) {
+    auto *dst_ptr = dst->data() + dst_offset;
+    auto *src_ptr = src->data() + src_offset;
+    if (dst->deviceType() == LLAISYS_DEVICE_CPU && src->deviceType() == LLAISYS_DEVICE_CPU) {
+        std::memcpy(dst_ptr, src_ptr, bytes);
+        return;
+    }
+
+    auto &ctx = llaisys::core::context();
+    ctx.setDevice(dst->deviceType(), dst->deviceId());
+    const auto *api = ctx.runtime().api();
+    api->memcpy_sync(dst_ptr, src_ptr, bytes, LLAISYS_MEMCPY_D2D);
+}
+
+struct Qwen2ModelImpl {
+    LlaisysQwen2Meta meta{};
+    llaisysDeviceType_t device = LLAISYS_DEVICE_CPU;
+    int device_id = 0;
+
+    LlaisysQwen2Weights weights{};
+
+    // Zero biases (not in weights struct)
+    llaisysTensor_t attn_o_b = nullptr;
+    llaisysTensor_t mlp_gate_b = nullptr;
+    llaisysTensor_t mlp_up_b = nullptr;
+    llaisysTensor_t mlp_down_b = nullptr;
+    llaisysTensor_t out_b = nullptr;
+
+    // KV cache: per layer
+    std::vector<tensor_t> k_cache;
+    std::vector<tensor_t> v_cache;
+
+    size_t cur_pos = 0;
+
+    Qwen2ModelImpl(const LlaisysQwen2Meta &m, llaisysDeviceType_t dev, int dev_id)
+        : meta(m), device(dev), device_id(dev_id) {}
+
+    void init_weights() {
+        // Global weights
+        weights.in_embed = create_tensor_handle({meta.voc, meta.hs}, meta.dtype, device, device_id);
+        weights.out_embed = create_tensor_handle({meta.voc, meta.hs}, meta.dtype, device, device_id);
+        weights.out_norm_w = create_tensor_handle({meta.hs}, meta.dtype, device, device_id);
+
+        // Per-layer weights
+        weights.attn_norm_w = new llaisysTensor_t[meta.nlayer];
+        weights.attn_q_w = new llaisysTensor_t[meta.nlayer];
+        weights.attn_q_b = new llaisysTensor_t[meta.nlayer];
+        weights.attn_k_w = new llaisysTensor_t[meta.nlayer];
+        weights.attn_k_b = new llaisysTensor_t[meta.nlayer];
+        weights.attn_v_w = new llaisysTensor_t[meta.nlayer];
+        weights.attn_v_b = new llaisysTensor_t[meta.nlayer];
+        weights.attn_o_w = new llaisysTensor_t[meta.nlayer];
+        weights.mlp_norm_w = new llaisysTensor_t[meta.nlayer];
+        weights.mlp_gate_w = new llaisysTensor_t[meta.nlayer];
+        weights.mlp_up_w = new llaisysTensor_t[meta.nlayer];
+        weights.mlp_down_w = new llaisysTensor_t[meta.nlayer];
+
+        for (size_t i = 0; i < meta.nlayer; ++i) {
+            weights.attn_norm_w[i] = create_tensor_handle({meta.hs}, meta.dtype, device, device_id);
+            weights.attn_q_w[i] = create_tensor_handle({meta.nh * meta.dh, meta.hs}, meta.dtype, device, device_id);
+            weights.attn_q_b[i] = create_tensor_handle({meta.nh * meta.dh}, meta.dtype, device, device_id);
+            weights.attn_k_w[i] = create_tensor_handle({meta.nkvh * meta.dh, meta.hs}, meta.dtype, device, device_id);
+            weights.attn_k_b[i] = create_tensor_handle({meta.nkvh * meta.dh}, meta.dtype, device, device_id);
+            weights.attn_v_w[i] = create_tensor_handle({meta.nkvh * meta.dh, meta.hs}, meta.dtype, device, device_id);
+            weights.attn_v_b[i] = create_tensor_handle({meta.nkvh * meta.dh}, meta.dtype, device, device_id);
+            weights.attn_o_w[i] = create_tensor_handle({meta.hs, meta.nh * meta.dh}, meta.dtype, device, device_id);
+            weights.mlp_norm_w[i] = create_tensor_handle({meta.hs}, meta.dtype, device, device_id);
+            weights.mlp_gate_w[i] = create_tensor_handle({meta.di, meta.hs}, meta.dtype, device, device_id);
+            weights.mlp_up_w[i] = create_tensor_handle({meta.di, meta.hs}, meta.dtype, device, device_id);
+            weights.mlp_down_w[i] = create_tensor_handle({meta.hs, meta.di}, meta.dtype, device, device_id);
+
+            zero_tensor(weights.attn_q_b[i]);
+            zero_tensor(weights.attn_k_b[i]);
+            zero_tensor(weights.attn_v_b[i]);
+        }
+
+        // Extra zero biases
+        attn_o_b = create_tensor_handle({meta.hs}, meta.dtype, device, device_id);
+        mlp_gate_b = create_tensor_handle({meta.di}, meta.dtype, device, device_id);
+        mlp_up_b = create_tensor_handle({meta.di}, meta.dtype, device, device_id);
+        mlp_down_b = create_tensor_handle({meta.hs}, meta.dtype, device, device_id);
+        out_b = create_tensor_handle({meta.voc}, meta.dtype, device, device_id);
+
+        zero_tensor(attn_o_b);
+        zero_tensor(mlp_gate_b);
+        zero_tensor(mlp_up_b);
+        zero_tensor(mlp_down_b);
+        zero_tensor(out_b);
+
+        // KV cache
+        k_cache.resize(meta.nlayer);
+        v_cache.resize(meta.nlayer);
+        for (size_t i = 0; i < meta.nlayer; ++i) {
+            k_cache[i] = llaisys::Tensor::create({meta.maxseq, meta.nkvh, meta.dh}, meta.dtype, device, device_id);
+            v_cache[i] = llaisys::Tensor::create({meta.maxseq, meta.nkvh, meta.dh}, meta.dtype, device, device_id);
+            if (device == LLAISYS_DEVICE_CPU) {
+                std::memset(k_cache[i]->data(), 0, k_cache[i]->numel() * k_cache[i]->elementSize());
+                std::memset(v_cache[i]->data(), 0, v_cache[i]->numel() * v_cache[i]->elementSize());
+            } else {
+                auto k_handle = new LlaisysTensor{k_cache[i]};
+                auto v_handle = new LlaisysTensor{v_cache[i]};
+                zero_tensor(k_handle);
+                zero_tensor(v_handle);
+                delete k_handle;
+                delete v_handle;
+            }
+        }
+    }
+
+    void destroy_weights() {
+        delete weights.in_embed;
+        delete weights.out_embed;
+        delete weights.out_norm_w;
+
+        for (size_t i = 0; i < meta.nlayer; ++i) {
+            delete weights.attn_norm_w[i];
+            delete weights.attn_q_w[i];
+            delete weights.attn_q_b[i];
+            delete weights.attn_k_w[i];
+            delete weights.attn_k_b[i];
+            delete weights.attn_v_w[i];
+            delete weights.attn_v_b[i];
+            delete weights.attn_o_w[i];
+            delete weights.mlp_norm_w[i];
+            delete weights.mlp_gate_w[i];
+            delete weights.mlp_up_w[i];
+            delete weights.mlp_down_w[i];
+        }
+
+        delete[] weights.attn_norm_w;
+        delete[] weights.attn_q_w;
+        delete[] weights.attn_q_b;
+        delete[] weights.attn_k_w;
+        delete[] weights.attn_k_b;
+        delete[] weights.attn_v_w;
+        delete[] weights.attn_v_b;
+        delete[] weights.attn_o_w;
+        delete[] weights.mlp_norm_w;
+        delete[] weights.mlp_gate_w;
+        delete[] weights.mlp_up_w;
+        delete[] weights.mlp_down_w;
+
+        delete attn_o_b;
+        delete mlp_gate_b;
+        delete mlp_up_b;
+        delete mlp_down_b;
+        delete out_b;
+    }
+
+    int64_t infer_next(const int64_t *token_ids, size_t ntoken, int top_k, float top_p, float temperature) {
+        if (ntoken == 0) {
+            return meta.end_token;
+        }
+
+        // Reset cache if input sequence is shorter than cached position
+        if (ntoken < cur_pos) {
+            cur_pos = 0;
+        }
+
+        int64_t next_token = meta.end_token;
+
+        // Process tokens from cur_pos onwards (KV-Cache optimization)
+        for (size_t i = cur_pos; i < ntoken; ++i) {
+            int64_t token_id = token_ids[i];
+
+            // Token embedding
+            auto token_tensor = llaisys::Tensor::create({1}, LLAISYS_DTYPE_I64, device, device_id);
+            tensor_write_i64(token_tensor, token_id);
+
+            auto x = llaisys::Tensor::create({1, meta.hs}, meta.dtype, device, device_id);
+            llaisys::ops::embedding(x, token_tensor, weights.in_embed->tensor);
+
+            // Transformer layers
+            for (size_t l = 0; l < meta.nlayer; ++l) {
+                // Self-Attention block
+                auto x_norm = llaisys::Tensor::create({1, meta.hs}, meta.dtype, device, device_id);
+                llaisys::ops::rms_norm(x_norm, x, weights.attn_norm_w[l]->tensor, meta.epsilon);
+
+                // Q, K, V projections
+                auto q_lin = llaisys::Tensor::create({1, meta.nh * meta.dh}, meta.dtype, device, device_id);
+                auto k_lin = llaisys::Tensor::create({1, meta.nkvh * meta.dh}, meta.dtype, device, device_id);
+                auto v_lin = llaisys::Tensor::create({1, meta.nkvh * meta.dh}, meta.dtype, device, device_id);
+
+                llaisys::ops::linear(q_lin, x_norm, weights.attn_q_w[l]->tensor, weights.attn_q_b[l]->tensor);
+                llaisys::ops::linear(k_lin, x_norm, weights.attn_k_w[l]->tensor, weights.attn_k_b[l]->tensor);
+                llaisys::ops::linear(v_lin, x_norm, weights.attn_v_w[l]->tensor, weights.attn_v_b[l]->tensor);
+
+                auto q = q_lin->view({1, meta.nh, meta.dh});
+                auto k = k_lin->view({1, meta.nkvh, meta.dh});
+                auto v = v_lin->view({1, meta.nkvh, meta.dh});
+
+                // RoPE
+                auto pos_ids = llaisys::Tensor::create({1}, LLAISYS_DTYPE_I64, device, device_id);
+                tensor_write_i64(pos_ids, static_cast<int64_t>(i));
+
+                auto q_rot = llaisys::Tensor::create({1, meta.nh, meta.dh}, meta.dtype, device, device_id);
+                auto k_rot = llaisys::Tensor::create({1, meta.nkvh, meta.dh}, meta.dtype, device, device_id);
+
+                llaisys::ops::rope(q_rot, q, pos_ids, meta.theta);
+                llaisys::ops::rope(k_rot, k, pos_ids, meta.theta);
+
+                // Update KV cache
+                size_t bytes_per_token = meta.nkvh * meta.dh * x->elementSize();
+                tensor_copy_bytes(k_cache[l], i * bytes_per_token, k_rot, 0, bytes_per_token);
+                tensor_copy_bytes(v_cache[l], i * bytes_per_token, v, 0, bytes_per_token);
+
+                // Get all cached K, V for attention
+                auto k_all = k_cache[l]->slice(0, 0, i + 1);
+                auto v_all = v_cache[l]->slice(0, 0, i + 1);
+
+                // Self-attention
+                auto attn_out = llaisys::Tensor::create({1, meta.nh, meta.dh}, meta.dtype, device, device_id);
+                float scale = 1.0f / std::sqrt(static_cast<float>(meta.dh));
+                llaisys::ops::self_attention(attn_out, q_rot, k_all, v_all, scale);
+
+                // Output projection
+                auto attn_out_2d = attn_out->view({1, meta.nh * meta.dh});
+                auto attn_proj = llaisys::Tensor::create({1, meta.hs}, meta.dtype, device, device_id);
+                llaisys::ops::linear(attn_proj, attn_out_2d, weights.attn_o_w[l]->tensor, attn_o_b->tensor);
+
+                // Residual connection
+                llaisys::ops::add(x, x, attn_proj);
+
+                // MLP block
+                auto x_norm2 = llaisys::Tensor::create({1, meta.hs}, meta.dtype, device, device_id);
+                llaisys::ops::rms_norm(x_norm2, x, weights.mlp_norm_w[l]->tensor, meta.epsilon);
+
+                auto gate = llaisys::Tensor::create({1, meta.di}, meta.dtype, device, device_id);
+                auto up = llaisys::Tensor::create({1, meta.di}, meta.dtype, device, device_id);
+                llaisys::ops::linear(gate, x_norm2, weights.mlp_gate_w[l]->tensor, mlp_gate_b->tensor);
+                llaisys::ops::linear(up, x_norm2, weights.mlp_up_w[l]->tensor, mlp_up_b->tensor);
+
+                auto swiglu_out = llaisys::Tensor::create({1, meta.di}, meta.dtype, device, device_id);
+                llaisys::ops::swiglu(swiglu_out, gate, up);
+
+                auto mlp_out = llaisys::Tensor::create({1, meta.hs}, meta.dtype, device, device_id);
+                llaisys::ops::linear(mlp_out, swiglu_out, weights.mlp_down_w[l]->tensor, mlp_down_b->tensor);
+
+                // Residual connection
+                llaisys::ops::add(x, x, mlp_out);
+            }
+
+            // Final layer norm
+            auto x_norm_final = llaisys::Tensor::create({1, meta.hs}, meta.dtype, device, device_id);
+            llaisys::ops::rms_norm(x_norm_final, x, weights.out_norm_w->tensor, meta.epsilon);
+
+            // Output logits
+            auto logits = llaisys::Tensor::create({1, meta.voc}, meta.dtype, device, device_id);
+            llaisys::ops::linear(logits, x_norm_final, weights.out_embed->tensor, out_b->tensor);
+
+            // Sampling for next token (argmax when top_k<=1 or temperature<=0)
+            auto logits_1d = logits->view({meta.voc});
+            auto sampled_idx = llaisys::Tensor::create({1}, LLAISYS_DTYPE_I64, device, device_id);
+            llaisys::ops::sample(sampled_idx, logits_1d, temperature, top_k, top_p);
+            next_token = tensor_read_i64(sampled_idx);
+        }
+
+        cur_pos = ntoken;
+        return next_token;
+    }
+};
+} // namespace
+
+struct LlaisysQwen2Model {
+    Qwen2ModelImpl *impl;
+};
+
+__C {
+
+    struct LlaisysQwen2Model *llaisysQwen2ModelCreate(const LlaisysQwen2Meta *meta,
+                                                      llaisysDeviceType_t device,
+                                                      int *device_ids,
+                                                      int ndevice) {
+        try {
+            CHECK_ARGUMENT(meta != nullptr, "Qwen2: meta must not be null.");
+            CHECK_ARGUMENT(ndevice >= 1, "Qwen2: must have at least one device.");
+            int device_id = device_ids ? device_ids[0] : 0;
+            auto *model = new LlaisysQwen2Model();
+            model->impl = new Qwen2ModelImpl(*meta, device, device_id);
+            model->impl->init_weights();
+            return model;
+        } catch (...) {
+            return nullptr;
+        }
+    }
+
+    void llaisysQwen2ModelDestroy(struct LlaisysQwen2Model * model) {
+        try {
+            if (!model) {
+                return;
+            }
+            model->impl->destroy_weights();
+            delete model->impl;
+            delete model;
+        } catch (...) {
+            return;
+        }
+    }
+
+    struct LlaisysQwen2Weights *llaisysQwen2ModelWeights(struct LlaisysQwen2Model * model) {
+        try {
+            CHECK_ARGUMENT(model != nullptr, "Qwen2: model must not be null.");
+            return &model->impl->weights;
+        } catch (...) {
+            return nullptr;
+        }
+    }
+
+    llaisysTensor_t llaisysQwen2ModelGetWeight(struct LlaisysQwen2Model * model, int kind, size_t layer) {
+        try {
+            CHECK_ARGUMENT(model != nullptr, "Qwen2: model must not be null.");
+            auto &w = model->impl->weights;
+            switch (kind) {
+            case LLAISYS_QWEN2_WEIGHT_IN_EMBED:
+                return w.in_embed;
+            case LLAISYS_QWEN2_WEIGHT_OUT_EMBED:
+                return w.out_embed;
+            case LLAISYS_QWEN2_WEIGHT_OUT_NORM:
+                return w.out_norm_w;
+            case LLAISYS_QWEN2_WEIGHT_ATTN_NORM:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_norm_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_Q_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_q_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_Q_B:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_q_b[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_K_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_k_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_K_B:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_k_b[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_V_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_v_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_V_B:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_v_b[layer];
+            case LLAISYS_QWEN2_WEIGHT_ATTN_O_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.attn_o_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_MLP_NORM:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.mlp_norm_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_MLP_GATE_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.mlp_gate_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_MLP_UP_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.mlp_up_w[layer];
+            case LLAISYS_QWEN2_WEIGHT_MLP_DOWN_W:
+                CHECK_ARGUMENT(layer < model->impl->meta.nlayer, "Qwen2: layer out of range.");
+                return w.mlp_down_w[layer];
+            default:
+                return nullptr;
+            }
+        } catch (...) {
+            return nullptr;
+        }
+    }
+
+    int64_t llaisysQwen2ModelInfer(struct LlaisysQwen2Model * model,
+                                   int64_t *token_ids,
+                                   size_t ntoken,
+                                   int top_k,
+                                   float top_p,
+                                   float temperature) {
+        try {
+            CHECK_ARGUMENT(model != nullptr, "Qwen2: model must not be null.");
+            CHECK_ARGUMENT(token_ids != nullptr, "Qwen2: token_ids must not be null.");
+            return model->impl->infer_next(token_ids, ntoken, top_k, top_p, temperature);
+        } catch (...) {
+            return -1;
+        }
+    }
+}
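
Reviewer note: `infer_next` uses `cur_pos` to decide which token positions still need a forward pass. A minimal sketch of just that bookkeeping follows, with no model or tensors. `PositionTracker` and `plan` are hypothetical names introduced for illustration.

```python
from typing import Tuple


class PositionTracker:
    """Tracks how many leading tokens are already held in the KV cache."""

    def __init__(self) -> None:
        self.cur_pos = 0

    def plan(self, ntoken: int) -> Tuple[int, int]:
        """Return the half-open range [start, stop) of positions to compute."""
        if ntoken < self.cur_pos:
            # Input shorter than the cached prefix: treat it as a new sequence.
            self.cur_pos = 0
        start, stop = self.cur_pos, ntoken
        self.cur_pos = ntoken
        return start, stop


tracker = PositionTracker()
first = tracker.plan(5)    # 5-token prompt: every position is computed
second = tracker.plan(6)   # one appended token: only the last position runs
repeat = tracker.plan(6)   # same length again: empty range
shorter = tracker.plan(3)  # shorter input: the cache position resets
```

Two consequences are visible in the diff. An empty range means the loop body never runs, so `infer_next` returns `meta.end_token` without touching the model. The reset check also compares lengths only, so a different prompt of equal or greater length would appear to reuse the existing cache entries.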
diff --git a/src/ops/add/cpu/add_cpu.cpp b/src/ops/add/cpu/add_cpu.cpp
index 47f6a3d49..766d47f57 100644
--- a/src/ops/add/cpu/add_cpu.cpp
+++ b/src/ops/add/cpu/add_cpu.cpp
@@ -5,8 +5,9 @@
 #include <cmath>
 
 template <typename T>
-void add_(T *c, const T *a, const T *b, size_t numel) {
+void add_(T *c, const T *a, const T *b, size_t numel) { //  a + b -> c
     for (size_t i = 0; i < numel; i++) {
+        // For bf16_t and fp16_t, cast to float before adding to avoid overflow/underflow.
         if constexpr (std::is_same_v<T, llaisys::bf16_t> || std::is_same_v<T, llaisys::fp16_t>) {
             c[i] = llaisys::utils::cast<T>(llaisys::utils::cast<float>(a[i]) + llaisys::utils::cast<float>(b[i]));
         } else {
diff --git a/src/ops/add/op.cpp b/src/ops/add/op.cpp
index a057330d7..0f79a0a23 100644
--- a/src/ops/add/op.cpp
+++ b/src/ops/add/op.cpp
@@ -3,6 +3,8 @@
 #include "../../core/llaisys_core.hpp"
 #include "../../utils.hpp"
 
+#include <vector>
+
 #include "cpu/add_cpu.hpp"
 
 namespace llaisys::ops {
@@ -18,15 +20,28 @@ void add(tensor_t c, tensor_t a, tensor_t b) {
         return cpu::add(c->data(), a->data(), b->data(), c->dtype(), c->numel());
     }
 
+    // set device context
     llaisys::core::context().setDevice(c->deviceType(), c->deviceId());
 
     switch (c->deviceType()) {
     case LLAISYS_DEVICE_CPU:
         return cpu::add(c->data(), a->data(), b->data(), c->dtype(), c->numel());
 #ifdef ENABLE_NVIDIA_API
-    case LLAISYS_DEVICE_NVIDIA:
-        TO_BE_IMPLEMENTED();
+    case LLAISYS_DEVICE_NVIDIA: {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(c->deviceType(), c->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t bytes = c->numel() * c->elementSize();
+        std::vector<std::byte> hc(bytes), ha(bytes), hb(bytes);
+        api->memcpy_sync(ha.data(), a->data(), bytes, LLAISYS_MEMCPY_D2H);
+        api->memcpy_sync(hb.data(), b->data(), bytes, LLAISYS_MEMCPY_D2H);
+
+        cpu::add(hc.data(), ha.data(), hb.data(), c->dtype(), c->numel());
+
+        api->memcpy_sync(c->data(), hc.data(), bytes, LLAISYS_MEMCPY_H2D);
         return;
+    }
 #endif
     default:
         EXCEPTION_UNSUPPORTED_DEVICE;
diff --git a/src/ops/argmax/op.cpp b/src/ops/argmax/op.cpp
index 6dc37d426..18214235c 100644
--- a/src/ops/argmax/op.cpp
+++ b/src/ops/argmax/op.cpp
@@ -1,7 +1,117 @@
 #include "op.hpp"
 
+#include "../../core/llaisys_core.hpp"
+#include "../../utils.hpp"
+
+#include <vector>
+
+namespace {
+template <typename T>
+void argmax_impl(int64_t *out_idx, T *out_val, const T *vals, size_t numel) {
+    size_t max_i = 0;
+    float max_v = llaisys::utils::cast<float>(vals[0]);
+    for (size_t i = 1; i < numel; ++i) {
+        float v = llaisys::utils::cast<float>(vals[i]);
+        if (v > max_v) {
+            max_v = v;
+            max_i = i;
+        }
+    }
+    // store results
+    // out_idx is of type int64_t*
+    // out_val is of type T*
+    *out_idx = static_cast<int64_t>(max_i);
+    out_val[0] = llaisys::utils::cast<T>(max_v);
+}
+} // namespace
+
 namespace llaisys::ops {
 void argmax(tensor_t max_idx, tensor_t max_val, tensor_t vals) {
-    TO_BE_IMPLEMENTED();
+    // Find the maximum of vals and its index; store them in max_val and max_idx.
+    // All three tensors must be on the same device.
+    CHECK_SAME_DEVICE(max_idx, max_val, vals);
+    // max_val must have the same dtype as vals.
+    CHECK_SAME_DTYPE(max_val->dtype(), vals->dtype());
+    // All tensors must be contiguous.
+    ASSERT(max_idx->isContiguous() && max_val->isContiguous() && vals->isContiguous(),
+           "Argmax: all tensors must be contiguous.");
+    CHECK_ARGUMENT(max_idx->dtype() == LLAISYS_DTYPE_I64, "Argmax: max_idx must be of dtype int64.");
+    // Validate shapes.
+    CHECK_SAME_SHAPE(max_idx->shape(), max_val->shape());
+    CHECK_ARGUMENT(vals->ndim() == 1, "Argmax: vals must be 1D.");
+    CHECK_ARGUMENT(max_idx->numel() == 1 && max_val->numel() == 1,
+                   "Argmax: max_idx and max_val must have exactly one element.");
+    CHECK_ARGUMENT(vals->numel() > 0, "Argmax: vals must not be empty.");
+
+    CHECK_ARGUMENT(vals->deviceType() == LLAISYS_DEVICE_CPU || vals->deviceType() == LLAISYS_DEVICE_NVIDIA,
+                   "Argmax: only cpu/nvidia are supported.");
+
+    auto numel = vals->numel();
+    auto type = vals->dtype();
+
+    if (vals->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(vals->deviceType(), vals->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t vals_bytes = vals->numel() * vals->elementSize();
+        const size_t val_bytes = max_val->elementSize();
+
+        std::vector<std::byte> h_vals(vals_bytes);
+        std::vector<std::byte> h_max_val(val_bytes);
+        int64_t h_idx = 0;
+
+        api->memcpy_sync(h_vals.data(), vals->data(), vals_bytes, LLAISYS_MEMCPY_D2H);
+
+        switch (type) {
+        case LLAISYS_DTYPE_F32:
+            argmax_impl(&h_idx,
+                        reinterpret_cast<float *>(h_max_val.data()),
+                        reinterpret_cast<const float *>(h_vals.data()),
+                        numel);
+            break;
+        case LLAISYS_DTYPE_F16:
+            argmax_impl(&h_idx,
+                        reinterpret_cast<llaisys::fp16_t *>(h_max_val.data()),
+                        reinterpret_cast<const llaisys::fp16_t *>(h_vals.data()),
+                        numel);
+            break;
+        case LLAISYS_DTYPE_BF16:
+            argmax_impl(&h_idx,
+                        reinterpret_cast<llaisys::bf16_t *>(h_max_val.data()),
+                        reinterpret_cast<const llaisys::bf16_t *>(h_vals.data()),
+                        numel);
+            break;
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        api->memcpy_sync(max_idx->data(), &h_idx, sizeof(int64_t), LLAISYS_MEMCPY_H2D);
+        api->memcpy_sync(max_val->data(), h_max_val.data(), val_bytes, LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    // CPU path: reinterpret max_idx->data() as int64_t* and dispatch on the value dtype.
+    auto idx_ptr = reinterpret_cast<int64_t *>(max_idx->data());
+
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        return argmax_impl(idx_ptr,
+                           reinterpret_cast<float *>(max_val->data()),
+                           reinterpret_cast<const float *>(vals->data()),
+                           numel);
+    case LLAISYS_DTYPE_F16:
+        return argmax_impl(idx_ptr,
+                           reinterpret_cast<llaisys::fp16_t *>(max_val->data()),
+                           reinterpret_cast<const llaisys::fp16_t *>(vals->data()),
+                           numel);
+    case LLAISYS_DTYPE_BF16:
+        return argmax_impl(idx_ptr,
+                           reinterpret_cast<llaisys::bf16_t *>(max_val->data()),
+                           reinterpret_cast<const llaisys::bf16_t *>(vals->data()),
+                           numel);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
 }
 } // namespace llaisys::ops
diff --git a/src/ops/embedding/op.cpp b/src/ops/embedding/op.cpp
index 84b9a5d06..d9361c4d0 100644
--- a/src/ops/embedding/op.cpp
+++ b/src/ops/embedding/op.cpp
@@ -1,7 +1,81 @@
 #include "op.hpp"
 
+#include "../../core/llaisys_core.hpp"
+
+#include <cstring>
+#include <vector>
+
 namespace llaisys::ops {
 void embedding(tensor_t out, tensor_t index, tensor_t weight) {
-    TO_BE_IMPLEMENTED();
+    CHECK_SAME_DEVICE(out, index, weight);
+    CHECK_SAME_DTYPE(out->dtype(), weight->dtype());
+    ASSERT(out->isContiguous() && index->isContiguous() && weight->isContiguous(),
+           "Embedding: all tensors must be contiguous.");
+    CHECK_ARGUMENT(index->dtype() == LLAISYS_DTYPE_I64, "Embedding: index must be of dtype int64.");
+    CHECK_ARGUMENT(index->ndim() == 1, "Embedding: index must be 1D.");
+    CHECK_ARGUMENT(out->ndim() == 2, "Embedding: out must be 2D.");
+    CHECK_ARGUMENT(weight->ndim() == 2, "Embedding: weight must be 2D.");
+    CHECK_ARGUMENT(out->shape()[0] == index->shape()[0], "Embedding: out.shape[0] must equal index.shape[0].");
+    CHECK_ARGUMENT(out->shape()[1] == weight->shape()[1], "Embedding: out.shape[1] must equal weight.shape[1].");
+    CHECK_ARGUMENT(weight->shape()[0] > 0, "Embedding: weight.shape[0] must be greater than 0.");
+    CHECK_ARGUMENT(index->numel() > 0, "Embedding: index must not be empty.");
+    if (out->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(out->deviceType(), out->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t out_bytes = out->numel() * out->elementSize();
+        const size_t idx_bytes = index->numel() * index->elementSize();
+        const size_t w_bytes = weight->numel() * weight->elementSize();
+
+        std::vector<std::byte> h_out(out_bytes);
+        std::vector<std::byte> h_index(idx_bytes);
+        std::vector<std::byte> h_weight(w_bytes);
+
+        api->memcpy_sync(h_index.data(), index->data(), idx_bytes, LLAISYS_MEMCPY_D2H);
+        api->memcpy_sync(h_weight.data(), weight->data(), w_bytes, LLAISYS_MEMCPY_D2H);
+
+        auto index_ptr = reinterpret_cast<const int64_t *>(h_index.data());
+        auto out_ptr = h_out.data();
+        auto weight_ptr = h_weight.data();
+        size_t embed_dim = weight->shape()[1];
+        size_t dtype_size = out->elementSize();
+        for (size_t i = 0; i < index->numel(); ++i) {
+            int64_t idx = index_ptr[i];
+            CHECK_ARGUMENT(idx >= 0 && static_cast<size_t>(idx) < weight->shape()[0],
+                           "Embedding: index value out of range.");
+            std::memcpy(out_ptr + i * embed_dim * dtype_size,
+                        weight_ptr + static_cast<size_t>(idx) * embed_dim * dtype_size,
+                        embed_dim * dtype_size);
+        }
+
+        api->memcpy_sync(out->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    if (out->deviceType() != LLAISYS_DEVICE_CPU) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+
+    // perform embedding lookup
+    // get data pointers
+    auto index_ptr = reinterpret_cast<const int64_t *>(index->data());
+    auto out_ptr = out->data();
+    auto weight_ptr = weight->data();
+    size_t embed_dim = weight->shape()[1];
+    size_t dtype_size = out->elementSize();
+    for (size_t i = 0; i < index->numel(); ++i) {
+        int64_t idx = index_ptr[i];
+        CHECK_ARGUMENT(idx >= 0 && static_cast<size_t>(idx) < weight->shape()[0],
+                       "Embedding: index value out of range.");
+        // copy embedding vector
+        // memcpy(destination, source, byte count)
+        // out_ptr + i * embed_dim * dtype_size: destination, the start of the i-th embedding vector in out
+        // weight_ptr + static_cast<size_t>(idx) * embed_dim * dtype_size: source, the start of row idx in weight
+        // embed_dim * dtype_size: number of bytes to copy, i.e. the size of one embedding vector
+        std::memcpy(out_ptr + i * embed_dim * dtype_size,
+                    weight_ptr + static_cast<size_t>(idx) * embed_dim * dtype_size,
+                    embed_dim * dtype_size);
+    }
 }
 } // namespace llaisys::ops
diff --git a/src/ops/linear/op.cpp b/src/ops/linear/op.cpp
index 97d1f8655..f6302282c 100644
--- a/src/ops/linear/op.cpp
+++ b/src/ops/linear/op.cpp
@@ -1,7 +1,358 @@
+#if defined(__AVX2__)
+#include <immintrin.h>
+#endif
+
 #include "op.hpp"
 
+#include "../../core/llaisys_core.hpp"
+#include "../../utils.hpp"
+
+#include <vector>
+
+#if defined(_OPENMP)
+#include <omp.h>
+#endif
+
+#if defined(ENABLE_OPENBLAS)
+#include <cblas.h>
+#endif
+
+namespace {
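+// Anonymous namespace: file-private implementation; avoids symbol clashes with other translation units.
+#if defined(__AVX2__)
+// AVX2 dot-product kernel: processes 8 floats per iteration; leftover elements go through a scalar tail loop.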
+inline float dot_f32_avx2(const float *a, const float *b, size_t k) {
+    __m256 acc = _mm256_setzero_ps();
+    size_t i = 0;
+    for (; i + 8 <= k; i += 8) {
+        __m256 va = _mm256_loadu_ps(a + i);
+        __m256 vb = _mm256_loadu_ps(b + i);
+        acc = _mm256_fmadd_ps(va, vb, acc);
+    }
+
+    alignas(32) float lanes[8];
+    _mm256_store_ps(lanes, acc);
+    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3] + lanes[4] + lanes[5] + lanes[6] + lanes[7];
+
+    for (; i < k; ++i) {
+        sum += a[i] * b[i];
+    }
+    return sum;
+}
+#endif
+
+inline float dot_f32(const float *a, const float *b, size_t k) {
+#if defined(__AVX2__)
+    // Compile-time selection: take the SIMD fast path when AVX2 is available.
+    return dot_f32_avx2(a, b, k);
+#else
+    float sum = 0.0f;
+#if defined(_OPENMP) && !defined(_MSC_VER)
+#pragma omp simd reduction(+ : sum)
+#endif
+    for (size_t i = 0; i < k; ++i) {
+        sum += a[i] * b[i];
+    }
+    return sum;
+#endif
+}
+
+void linear_impl_f32(float *out_ptr, const float *in_ptr, const float *weight_ptr, const float *bias_ptr,
+                     size_t M, size_t K, size_t N) {
+#if defined(ENABLE_OPENBLAS)
+    // out = in * weight^T; weight is laid out as [N, K], so the GEMM uses TransB.
+    cblas_sgemm(CblasRowMajor,
+                CblasNoTrans,
+                CblasTrans,
+                static_cast<int>(M),
+                static_cast<int>(N),
+                static_cast<int>(K),
+                1.0f,
+                in_ptr,
+                static_cast<int>(K),
+                weight_ptr,
+                static_cast<int>(K),
+                0.0f,
+                out_ptr,
+                static_cast<int>(N));
+
+    if (bias_ptr) {
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+        for (ptrdiff_t m = 0; m < static_cast<ptrdiff_t>(M); ++m) {
+            float *out_row = out_ptr + static_cast<size_t>(m) * N;
+            for (size_t n = 0; n < N; ++n) {
+                out_row[n] += bias_ptr[n];
+            }
+        }
+    }
+    return;
+#endif
+
+#if defined(_OPENMP)
+// Threads split the work by row (m dimension): each thread writes a distinct out_row, so there are no write conflicts.
+#pragma omp parallel for schedule(static)
+#endif
+    for (ptrdiff_t m = 0; m < static_cast<ptrdiff_t>(M); ++m) {
+        const float *in_row = in_ptr + static_cast<size_t>(m) * K;
+        float *out_row = out_ptr + static_cast<size_t>(m) * N;
+        for (size_t n = 0; n < N; ++n) {
+            const float *w_row = weight_ptr + n * K;
+            float sum = dot_f32(in_row, w_row, K);
+            if (bias_ptr) {
+                sum += bias_ptr[n];
+            }
+            out_row[n] = sum;
+        }
+    }
+}
+
+template <typename T>
+void linear_impl(T *out_ptr, const T *in_ptr, const T *weight_ptr, const T *bias_ptr,
+                 size_t M, size_t K, size_t N) {
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+    for (ptrdiff_t m = 0; m < static_cast<ptrdiff_t>(M); ++m) {
+        for (size_t n = 0; n < N; ++n) {
+            float sum = 0.0f;
+#if defined(_OPENMP) && !defined(_MSC_VER)
+#pragma omp simd reduction(+ : sum)
+#endif
+            for (size_t k = 0; k < K; ++k) {
+                // Accumulate in float to limit precision loss.
+                sum += llaisys::utils::cast<float>(in_ptr[static_cast<size_t>(m) * K + k]) * llaisys::utils::cast<float>(weight_ptr[n * K + k]);
+            }
+            if (bias_ptr) {
+                sum += llaisys::utils::cast<float>(bias_ptr[n]);
+            }
+            // Cast back to the target type.
+            out_ptr[static_cast<size_t>(m) * N + n] = llaisys::utils::cast<T>(sum);
+        }
+    }
+}
+
+template <typename LowpT>
+void linear_impl_lowp_fast(LowpT *out_ptr, const LowpT *in_ptr, const LowpT *weight_ptr, const LowpT *bias_ptr,
+                           size_t M, size_t K, size_t N) {
+    // Low-precision path: convert the low-precision tensors to float in bulk, avoiding repeated casts in the innermost loop.
+    std::vector<float> weight_f(N * K);
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+    for (ptrdiff_t idx = 0; idx < static_cast<ptrdiff_t>(N * K); ++idx) {
+        weight_f[static_cast<size_t>(idx)] = llaisys::utils::cast<float>(weight_ptr[static_cast<size_t>(idx)]);
+    }
+
+    std::vector<float> bias_f;
+    if (bias_ptr) {
+        bias_f.resize(N);
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+        for (ptrdiff_t n = 0; n < static_cast<ptrdiff_t>(N); ++n) {
+            bias_f[static_cast<size_t>(n)] = llaisys::utils::cast<float>(bias_ptr[static_cast<size_t>(n)]);
+        }
+    }
+
+#if defined(ENABLE_OPENBLAS)
+    std::vector<float> in_f(M * K);
+    std::vector<float> out_f(M * N);
+
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+    for (ptrdiff_t idx = 0; idx < static_cast<ptrdiff_t>(M * K); ++idx) {
+        in_f[static_cast<size_t>(idx)] = llaisys::utils::cast<float>(in_ptr[static_cast<size_t>(idx)]);
+    }
+
+    cblas_sgemm(CblasRowMajor,
+                CblasNoTrans,
+                CblasTrans,
+                static_cast<int>(M),
+                static_cast<int>(N),
+                static_cast<int>(K),
+                1.0f,
+                in_f.data(),
+                static_cast<int>(K),
+                weight_f.data(),
+                static_cast<int>(K),
+                0.0f,
+                out_f.data(),
+                static_cast<int>(N));
+
+    if (bias_ptr) {
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+        for (ptrdiff_t m = 0; m < static_cast<ptrdiff_t>(M); ++m) {
+            float *out_row = out_f.data() + static_cast<size_t>(m) * N;
+            for (size_t n = 0; n < N; ++n) {
+                out_row[n] += bias_f[n];
+            }
+        }
+    }
+
+#if defined(_OPENMP)
+#pragma omp parallel for schedule(static)
+#endif
+    for (ptrdiff_t idx = 0; idx < static_cast<ptrdiff_t>(M * N); ++idx) {
+        out_ptr[static_cast<size_t>(idx)] = llaisys::utils::cast<LowpT>(out_f[static_cast<size_t>(idx)]);
+    }
+    return;
+#endif
+
+    // Without OpenBLAS, reuse the SIMD/scalar dot-product kernel and still accumulate in float.
+#if defined(_OPENMP)
+#pragma omp parallel
+#endif
+    {
+        std::vector<float> in_row_f(K);
+#if defined(_OPENMP)
+#pragma omp for schedule(static)
+#endif
+        for (ptrdiff_t m = 0; m < static_cast<ptrdiff_t>(M); ++m) {
+            const size_t m_u = static_cast<size_t>(m);
+            const LowpT *in_row_lowp = in_ptr + m_u * K;
+            LowpT *out_row = out_ptr + m_u * N;
+
+            for (size_t k = 0; k < K; ++k) {
+                in_row_f[k] = llaisys::utils::cast<float>(in_row_lowp[k]);
+            }
+
+            for (size_t n = 0; n < N; ++n) {
+                const float *w_row = weight_f.data() + n * K;
+                float sum = dot_f32(in_row_f.data(), w_row, K);
+                if (bias_ptr) {
+                    sum += bias_f[n];
+                }
+                out_row[n] = llaisys::utils::cast<LowpT>(sum);
+            }
+        }
+    }
+}
+
+void validate_linear_args(llaisys::tensor_t out, llaisys::tensor_t in, llaisys::tensor_t weight, llaisys::tensor_t bias) {
+    // out, in and weight are required: they must share one device and one dtype.
+    CHECK_SAME_DEVICE(out, in, weight);
+    CHECK_SAME_DTYPE(out->dtype(), in->dtype(), weight->dtype());
+    // The required tensors must be contiguous.
+    ASSERT(out->isContiguous() && in->isContiguous() && weight->isContiguous(),
+           "Linear: out, in and weight must be contiguous.");
+    CHECK_ARGUMENT(out->ndim() == 2, "Linear: out must be 2D.");
+    CHECK_ARGUMENT(in->ndim() == 2, "Linear: in must be 2D.");
+    CHECK_ARGUMENT(weight->ndim() == 2, "Linear: weight must be 2D.");
+    // weight is stored untransposed as [N, K], so in.shape[1] must equal weight.shape[1].
+    CHECK_ARGUMENT(in->shape()[1] == weight->shape()[1], "Linear: in.shape[1] must equal weight.shape[1].");
+    // out has as many rows as in.
+    CHECK_ARGUMENT(out->shape()[0] == in->shape()[0], "Linear: out.shape[0] must equal in.shape[0].");
+    // out has as many columns as weight has rows.
+    CHECK_ARGUMENT(out->shape()[1] == weight->shape()[0], "Linear: out.shape[1] must equal weight.shape[0].");
+    // bias is optional; dereference and validate it only when it is provided.
+    if (bias) {
+        CHECK_SAME_DEVICE(out, bias);
+        CHECK_SAME_DTYPE(out->dtype(), bias->dtype());
+        ASSERT(bias->isContiguous(), "Linear: bias must be contiguous.");
+        CHECK_ARGUMENT(bias->ndim() == 1, "Linear: bias must be 1D.");
+        CHECK_ARGUMENT(bias->shape()[0] == out->shape()[1], "Linear: bias.shape[0] must equal out.shape[1].");
+    }
+    // Only CPU and NVIDIA devices are supported for now.
+    if (out->deviceType() != LLAISYS_DEVICE_CPU && out->deviceType() != LLAISYS_DEVICE_NVIDIA) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+}
+
+void dispatch_linear_kernel(llaisys::tensor_t out, llaisys::tensor_t in, llaisys::tensor_t weight, llaisys::tensor_t bias) {
+    const size_t M = in->shape()[0];     // rows of in
+    const size_t K = in->shape()[1];     // columns of in
+    const size_t N = weight->shape()[0]; // rows of weight
+    const auto type = in->dtype();
+
+    if (out->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(out->deviceType(), out->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t out_bytes = out->numel() * out->elementSize();
+        const size_t in_bytes = in->numel() * in->elementSize();
+        const size_t w_bytes = weight->numel() * weight->elementSize();
+        const size_t b_bytes = bias ? (bias->numel() * bias->elementSize()) : 0;
+
+        std::vector<std::byte> h_out(out_bytes);
+        std::vector<std::byte> h_in(in_bytes);
+        std::vector<std::byte> h_w(w_bytes);
+        std::vector<std::byte> h_b;
+        if (bias) {
+            h_b.resize(b_bytes);
+        }
+
+        api->memcpy_sync(h_in.data(), in->data(), in_bytes, LLAISYS_MEMCPY_D2H);
+        api->memcpy_sync(h_w.data(), weight->data(), w_bytes, LLAISYS_MEMCPY_D2H);
+        if (bias) {
+            api->memcpy_sync(h_b.data(), bias->data(), b_bytes, LLAISYS_MEMCPY_D2H);
+        }
+
+        switch (type) {
+        case LLAISYS_DTYPE_F32:
+            linear_impl_f32(reinterpret_cast<float *>(h_out.data()),
+                            reinterpret_cast<const float *>(h_in.data()),
+                            reinterpret_cast<const float *>(h_w.data()),
+                            bias ? reinterpret_cast<const float *>(h_b.data()) : nullptr,
+                            M, K, N);
+            break;
+        case LLAISYS_DTYPE_F16:
+            linear_impl_lowp_fast(reinterpret_cast<llaisys::fp16_t *>(h_out.data()),
+                                  reinterpret_cast<const llaisys::fp16_t *>(h_in.data()),
+                                  reinterpret_cast<const llaisys::fp16_t *>(h_w.data()),
+                                  bias ? reinterpret_cast<const llaisys::fp16_t *>(h_b.data()) : nullptr,
+                                  M, K, N);
+            break;
+        case LLAISYS_DTYPE_BF16:
+            linear_impl_lowp_fast(reinterpret_cast<llaisys::bf16_t *>(h_out.data()),
+                                  reinterpret_cast<const llaisys::bf16_t *>(h_in.data()),
+                                  reinterpret_cast<const llaisys::bf16_t *>(h_w.data()),
+                                  bias ? reinterpret_cast<const llaisys::bf16_t *>(h_b.data()) : nullptr,
+                                  M, K, N);
+            break;
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        api->memcpy_sync(out->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    // Runtime dtype dispatch: f32 takes the specialized fast path; f16/bf16 go through linear_impl_lowp_fast.
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        return linear_impl_f32(reinterpret_cast<float *>(out->data()),
+                               reinterpret_cast<const float *>(in->data()),
+                               reinterpret_cast<const float *>(weight->data()),
+                               bias ? reinterpret_cast<const float *>(bias->data()) : nullptr,
+                               M, K, N);
+    case LLAISYS_DTYPE_F16:
+        return linear_impl_lowp_fast(reinterpret_cast<llaisys::fp16_t *>(out->data()),
+                                     reinterpret_cast<const llaisys::fp16_t *>(in->data()),
+                                     reinterpret_cast<const llaisys::fp16_t *>(weight->data()),
+                                     bias ? reinterpret_cast<const llaisys::fp16_t *>(bias->data()) : nullptr,
+                                     M, K, N);
+    case LLAISYS_DTYPE_BF16:
+        return linear_impl_lowp_fast(reinterpret_cast<llaisys::bf16_t *>(out->data()),
+                                     reinterpret_cast<const llaisys::bf16_t *>(in->data()),
+                                     reinterpret_cast<const llaisys::bf16_t *>(weight->data()),
+                                     bias ? reinterpret_cast<const llaisys::bf16_t *>(bias->data()) : nullptr,
+                                     M, K, N);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
+}
+} // namespace
+
 namespace llaisys::ops {
 void linear(tensor_t out, tensor_t in, tensor_t weight, tensor_t bias) {
-    TO_BE_IMPLEMENTED();
+    // Interface layer: validate the arguments first, then dispatch to a kernel.
+    validate_linear_args(out, in, weight, bias);
+    dispatch_linear_kernel(out, in, weight, bias);
 }
 } // namespace llaisys::ops
diff --git a/src/ops/rearrange/op.cpp b/src/ops/rearrange/op.cpp
index 017a6ae59..4778514d0 100644
--- a/src/ops/rearrange/op.cpp
+++ b/src/ops/rearrange/op.cpp
@@ -1,7 +1,135 @@
 #include "op.hpp"
 
+#include "../../core/llaisys_core.hpp"
+#include "../../utils.hpp"
+
+#include <vector>
+
+namespace {
+template <typename T>
+void rearrange_impl(T *out_ptr, const T *in_ptr, const std::vector<size_t> &shape,
+                    const std::vector<ptrdiff_t> &in_strides, size_t ndim) {
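+    // Iterative traversal over every element; handles tensors of arbitrary shape.
+    // Writes to out_ptr are always contiguous (stride = product of all trailing dimensions).
+    // Reads from in_ptr follow in_strides.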
+
+    if (ndim == 0) {
+        return;
+    }
+
+    // Per-dimension index buffer, used to iterate instead of recursing (avoids deep call stacks).
+    std::vector<size_t> indices(ndim, 0);
+
+    // Compute the contiguous output strides (for each axis, the product of its trailing dimensions).
+    std::vector<size_t> out_strides(ndim);
+    out_strides[ndim - 1] = 1;
+    for (int i = static_cast<int>(ndim) - 2; i >= 0; --i) {
+        out_strides[i] = out_strides[i + 1] * shape[i + 1];
+    }
+
+    // Total number of elements.
+    size_t total_elements = 1;
+    for (size_t i = 0; i < ndim; ++i) {
+        total_elements *= shape[i];
+    }
+
+    // Visit every element.
+    for (size_t elem_idx = 0; elem_idx < total_elements; ++elem_idx) {
+        // Convert the linear index into a multi-dimensional index.
+        size_t temp = elem_idx;
+        for (int i = static_cast<int>(ndim) - 1; i >= 0; --i) {
+            indices[i] = temp % shape[i];
+            temp /= shape[i];
+        }
+
+        // Compute both pointer offsets from indices.
+        ptrdiff_t in_offset = 0;
+        size_t out_offset = 0;
+        for (size_t i = 0; i < ndim; ++i) {
+            in_offset += static_cast<ptrdiff_t>(indices[i]) * in_strides[i];
+            out_offset += indices[i] * out_strides[i];
+        }
+
+        // Copy a single element.
+        out_ptr[out_offset] = in_ptr[in_offset];
+    }
+}
+} // namespace
+
 namespace llaisys::ops {
 void rearrange(tensor_t out, tensor_t in) {
-    TO_BE_IMPLEMENTED();
+    // Basic validation.
+    CHECK_SAME_DEVICE(out, in);
+    CHECK_SAME_DTYPE(out->dtype(), in->dtype());
+
+    // Shape check: out and in must match.
+    CHECK_ARGUMENT(out->ndim() == in->ndim(),
+                   "rearrange: out and in must have the same number of dimensions.");
+    CHECK_ARGUMENT(out->shape() == in->shape(),
+                   "rearrange: out and in must have the same shape.");
+
+    size_t ndim = out->ndim();
+    auto type = out->dtype();
+
+    // Fetch the input strides and the shape.
+    const auto &in_strides = in->strides();
+    const auto &shape = out->shape();
+
+    if (out->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(out->deviceType(), out->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t out_bytes = out->numel() * out->elementSize();
+        const size_t in_bytes = in->numel() * in->elementSize();
+        std::vector<std::byte> h_out(out_bytes), h_in(in_bytes);
+        api->memcpy_sync(h_in.data(), in->data(), in_bytes, LLAISYS_MEMCPY_D2H);
+
+        switch (type) {
+        case LLAISYS_DTYPE_F32:
+            rearrange_impl(reinterpret_cast<float *>(h_out.data()),
+                           reinterpret_cast<const float *>(h_in.data()),
+                           shape, in_strides, ndim);
+            break;
+        case LLAISYS_DTYPE_F16:
+            rearrange_impl(reinterpret_cast<llaisys::fp16_t *>(h_out.data()),
+                           reinterpret_cast<const llaisys::fp16_t *>(h_in.data()),
+                           shape, in_strides, ndim);
+            break;
+        case LLAISYS_DTYPE_BF16:
+            rearrange_impl(reinterpret_cast<llaisys::bf16_t *>(h_out.data()),
+                           reinterpret_cast<const llaisys::bf16_t *>(h_in.data()),
+                           shape, in_strides, ndim);
+            break;
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        api->memcpy_sync(out->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    // Device check.
+    if (out->deviceType() != LLAISYS_DEVICE_CPU) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+
+    // Dispatch on dtype.
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        return rearrange_impl(reinterpret_cast<float *>(out->data()),
+                              reinterpret_cast<const float *>(in->data()),
+                              shape, in_strides, ndim);
+    case LLAISYS_DTYPE_F16:
+        return rearrange_impl(reinterpret_cast<llaisys::fp16_t *>(out->data()),
+                              reinterpret_cast<const llaisys::fp16_t *>(in->data()),
+                              shape, in_strides, ndim);
+    case LLAISYS_DTYPE_BF16:
+        return rearrange_impl(reinterpret_cast<llaisys::bf16_t *>(out->data()),
+                              reinterpret_cast<const llaisys::bf16_t *>(in->data()),
+                              shape, in_strides, ndim);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
 }
 } // namespace llaisys::ops
diff --git a/src/ops/rms_norm/op.cpp b/src/ops/rms_norm/op.cpp
index 529553d9d..c881c16c5 100644
--- a/src/ops/rms_norm/op.cpp
+++ b/src/ops/rms_norm/op.cpp
@@ -1,7 +1,115 @@
 #include "op.hpp"
 
+#include "../../core/llaisys_core.hpp"
+#include <cmath>
+
+#include "../../utils.hpp"
+
+#include <vector>
+
+namespace {
+template <typename T>
+void rms_norm_impl(T *out_ptr, const T *in_ptr, const T *weight_ptr,
+                   size_t outer_size, size_t norm_dim, float eps) {
+    for (size_t i = 0; i < outer_size; ++i) {
+        // Compute the RMS, accumulating in float to limit precision loss.
+        float rms = 0.0f;
+        for (size_t j = 0; j < norm_dim; ++j) {
+            float val = llaisys::utils::cast<float>(in_ptr[i * norm_dim + j]);
+            rms += val * val;
+        }
+        rms = std::sqrt(rms / static_cast<float>(norm_dim) + eps);
+        // normalize and scale
+        for (size_t j = 0; j < norm_dim; ++j) {
+            float val = llaisys::utils::cast<float>(in_ptr[i * norm_dim + j]);
+            float weight_val = llaisys::utils::cast<float>(weight_ptr[j]);
+            out_ptr[i * norm_dim + j] = llaisys::utils::cast<T>((val / rms) * weight_val);
+        }
+    }
+}
+} // namespace
+
 namespace llaisys::ops {
 void rms_norm(tensor_t out, tensor_t in, tensor_t weight, float eps) {
-    TO_BE_IMPLEMENTED();
+    CHECK_SAME_DEVICE(out, in, weight);
+    CHECK_SAME_DTYPE(out->dtype(), in->dtype(), weight->dtype());
+    ASSERT(out->isContiguous() && in->isContiguous() && weight->isContiguous(),
+           "RMSNorm: all tensors must be contiguous.");
+    CHECK_ARGUMENT(out->ndim() == in->ndim(), "RMSNorm: out and in must have the same number of dimensions.");
+    for (size_t i = 0; i < out->ndim(); ++i) {
+        CHECK_ARGUMENT(out->shape()[i] == in->shape()[i],
+                       "RMSNorm: out and in must have the same shape.");
+    }
+    size_t norm_dim = out->shape().back();
+    CHECK_ARGUMENT(weight->ndim() == 1 && weight->shape()[0] == norm_dim,
+                   "RMSNorm: weight must be 1D with size equal to the last dimension of in/out.");
+    // outer_size: number of RMSNorm operations to perform
+    // norm_dim: size of each RMSNorm operation
+    size_t outer_size = out->numel() / norm_dim;
+    auto type = out->dtype();
+
+    if (out->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(out->deviceType(), out->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t out_bytes = out->numel() * out->elementSize();
+        const size_t in_bytes = in->numel() * in->elementSize();
+        const size_t w_bytes = weight->numel() * weight->elementSize();
+
+        std::vector<std::byte> h_out(out_bytes), h_in(in_bytes), h_w(w_bytes);
+        api->memcpy_sync(h_in.data(), in->data(), in_bytes, LLAISYS_MEMCPY_D2H);
+        api->memcpy_sync(h_w.data(), weight->data(), w_bytes, LLAISYS_MEMCPY_D2H);
+
+        switch (type) {
+        case LLAISYS_DTYPE_F32:
+            rms_norm_impl(reinterpret_cast<float *>(h_out.data()),
+                          reinterpret_cast<const float *>(h_in.data()),
+                          reinterpret_cast<const float *>(h_w.data()),
+                          outer_size, norm_dim, eps);
+            break;
+        case LLAISYS_DTYPE_F16:
+            rms_norm_impl(reinterpret_cast<llaisys::fp16_t *>(h_out.data()),
+                          reinterpret_cast<const llaisys::fp16_t *>(h_in.data()),
+                          reinterpret_cast<const llaisys::fp16_t *>(h_w.data()),
+                          outer_size, norm_dim, eps);
+            break;
+        case LLAISYS_DTYPE_BF16:
+            rms_norm_impl(reinterpret_cast<llaisys::bf16_t *>(h_out.data()),
+                          reinterpret_cast<const llaisys::bf16_t *>(h_in.data()),
+                          reinterpret_cast<const llaisys::bf16_t *>(h_w.data()),
+                          outer_size, norm_dim, eps);
+            break;
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        api->memcpy_sync(out->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    if (out->deviceType() != LLAISYS_DEVICE_CPU) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        return rms_norm_impl(reinterpret_cast<float *>(out->data()),
+                             reinterpret_cast<const float *>(in->data()),
+                             reinterpret_cast<const float *>(weight->data()),
+                             outer_size, norm_dim, eps);
+    case LLAISYS_DTYPE_F16:
+        return rms_norm_impl(reinterpret_cast<llaisys::fp16_t *>(out->data()),
+                             reinterpret_cast<const llaisys::fp16_t *>(in->data()),
+                             reinterpret_cast<const llaisys::fp16_t *>(weight->data()),
+                             outer_size, norm_dim, eps);
+    case LLAISYS_DTYPE_BF16:
+        return rms_norm_impl(reinterpret_cast<llaisys::bf16_t *>(out->data()),
+                             reinterpret_cast<const llaisys::bf16_t *>(in->data()),
+                             reinterpret_cast<const llaisys::bf16_t *>(weight->data()),
+                             outer_size, norm_dim, eps);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
 }
 } // namespace llaisys::ops
diff --git a/src/ops/rope/op.cpp b/src/ops/rope/op.cpp
index d60dbe64e..7f26b06b8 100644
--- a/src/ops/rope/op.cpp
+++ b/src/ops/rope/op.cpp
@@ -1,7 +1,133 @@
 #include "op.hpp"
+#include "../../core/llaisys_core.hpp"
+#include <cmath>
+
+#include <vector>
+
+namespace {
+template <typename T>
+void rope_impl(T *out_ptr, const T *in_ptr, const int64_t *pos_ids,
+               size_t batch_size, size_t seq_len, size_t dim, float theta) {
+    size_t half_dim = dim / 2;
+
+    for (size_t s = 0; s < batch_size; ++s) {
+        // Position ID of the current token
+        int64_t pos_id = pos_ids[s];
+        for (size_t h = 0; h < seq_len; ++h) {
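+            // Loop over rotation pairs.
+            // s indexes the token (dimension 0 of the input).
+            // h indexes the attention head (dimension 1 of the input).
+            // Despite their names, batch_size is the number of tokens and seq_len is the number of heads, e.g. hidden=512 with 8 heads gives seq_len=8 and dim=64.
+            // Each (s, h) pair owns a contiguous vector of length dim (the per-head size); base is the index of its first element.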
+            size_t base = (s * seq_len + h) * dim;
+            for (size_t d = 0; d < half_dim; ++d) {
+                const double exponent = (2.0 * static_cast<double>(d)) / static_cast<double>(dim);
+                const double angle = static_cast<double>(pos_id) / std::pow(static_cast<double>(theta), exponent);
+                const double cos_angle = std::cos(angle);
+                const double sin_angle = std::sin(angle);
+                // original values (split half/half)
+                const double x1 = static_cast<double>(llaisys::utils::cast<float>(in_ptr[base + d]));
+                const double x2 = static_cast<double>(llaisys::utils::cast<float>(in_ptr[base + half_dim + d]));
+                // apply rotation
+                out_ptr[base + d] = llaisys::utils::cast<T>(x1 * cos_angle - x2 * sin_angle);
+                out_ptr[base + half_dim + d] = llaisys::utils::cast<T>(x2 * cos_angle + x1 * sin_angle);
+            }
+        }
+    }
+}
+} // namespace
 
 namespace llaisys::ops {
 void rope(tensor_t out, tensor_t in, tensor_t pos_ids, float theta) {
-    TO_BE_IMPLEMENTED();
+    // Check devices and data types
+    CHECK_SAME_DEVICE(out, in, pos_ids);
+    CHECK_SAME_DTYPE(out->dtype(), in->dtype());
+
+    // All tensors must be stored contiguously
+    ASSERT(out->isContiguous() && in->isContiguous() && pos_ids->isContiguous(),
+           "Rope: all tensors must be contiguous.");
+
+    // Check ranks and the pos_ids dtype
+    CHECK_ARGUMENT(in->ndim() == 3, "Rope: in must be 3D.");
+    CHECK_ARGUMENT(out->ndim() == in->ndim(), "Rope: out and in must have the same number of dimensions.");
+    CHECK_ARGUMENT(pos_ids->dtype() == LLAISYS_DTYPE_I64, "Rope: pos_ids must be of dtype int64.");
+    CHECK_ARGUMENT(pos_ids->ndim() == 1, "Rope: pos_ids must be 1D.");
+
+    // Check that shapes match
+    CHECK_ARGUMENT(out->shape()[0] == in->shape()[0], "Rope: out.shape[0] must equal in.shape[0].");
+    CHECK_ARGUMENT(out->shape()[1] == in->shape()[1], "Rope: out.shape[1] must equal in.shape[1].");
+    CHECK_ARGUMENT(out->shape()[2] == in->shape()[2], "Rope: out.shape[2] must equal in.shape[2].");
+    CHECK_ARGUMENT(in->shape()[0] == pos_ids->shape()[0], "Rope: in.shape[0] must equal pos_ids.shape[0].");
+
+    // The last dimension must be even
+    size_t dim = in->shape()[2];
+    CHECK_ARGUMENT(dim % 2 == 0, "Rope: the last dimension must be even.");
+
+    // Data type used for dispatch
+    auto type = in->dtype();
+
+    if (out->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(out->deviceType(), out->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t p_bytes = pos_ids->numel() * pos_ids->elementSize();
+        std::vector<int64_t> h_pos(pos_ids->numel());
+        api->memcpy_sync(h_pos.data(), pos_ids->data(), p_bytes, LLAISYS_MEMCPY_D2H);
+
+        switch (type) {
+        case LLAISYS_DTYPE_F32: {
+            std::vector<float> h_in(in->numel()), h_out(out->numel());
+            api->memcpy_sync(h_in.data(), in->data(), in->numel() * sizeof(float), LLAISYS_MEMCPY_D2H);
+            rope_impl(h_out.data(), h_in.data(), h_pos.data(), in->shape()[0], in->shape()[1], dim, theta);
+            api->memcpy_sync(out->data(), h_out.data(), out->numel() * sizeof(float), LLAISYS_MEMCPY_H2D);
+            break;
+        }
+        case LLAISYS_DTYPE_F16: {
+            std::vector<llaisys::fp16_t> h_in(in->numel()), h_out(out->numel());
+            api->memcpy_sync(h_in.data(), in->data(), in->numel() * sizeof(llaisys::fp16_t), LLAISYS_MEMCPY_D2H);
+            rope_impl(h_out.data(), h_in.data(), h_pos.data(), in->shape()[0], in->shape()[1], dim, theta);
+            api->memcpy_sync(out->data(), h_out.data(), out->numel() * sizeof(llaisys::fp16_t), LLAISYS_MEMCPY_H2D);
+            break;
+        }
+        case LLAISYS_DTYPE_BF16: {
+            std::vector<llaisys::bf16_t> h_in(in->numel()), h_out(out->numel());
+            api->memcpy_sync(h_in.data(), in->data(), in->numel() * sizeof(llaisys::bf16_t), LLAISYS_MEMCPY_D2H);
+            rope_impl(h_out.data(), h_in.data(), h_pos.data(), in->shape()[0], in->shape()[1], dim, theta);
+            api->memcpy_sync(out->data(), h_out.data(), out->numel() * sizeof(llaisys::bf16_t), LLAISYS_MEMCPY_H2D);
+            break;
+        }
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        return;
+    }
+
+    // Any device other than NVIDIA must be CPU
+    if (out->deviceType() != LLAISYS_DEVICE_CPU) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+
+    // Dispatch on dtype with typed data pointers
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        return rope_impl(reinterpret_cast<float *>(out->data()),
+                         reinterpret_cast<const float *>(in->data()),
+                         reinterpret_cast<const int64_t *>(pos_ids->data()),
+                         in->shape()[0], in->shape()[1], dim, theta);
+    case LLAISYS_DTYPE_F16:
+        return rope_impl(reinterpret_cast<llaisys::fp16_t *>(out->data()),
+                         reinterpret_cast<const llaisys::fp16_t *>(in->data()),
+                         reinterpret_cast<const int64_t *>(pos_ids->data()),
+                         in->shape()[0], in->shape()[1], dim, theta);
+    case LLAISYS_DTYPE_BF16:
+        return rope_impl(reinterpret_cast<llaisys::bf16_t *>(out->data()),
+                         reinterpret_cast<const llaisys::bf16_t *>(in->data()),
+                         reinterpret_cast<const int64_t *>(pos_ids->data()),
+                         in->shape()[0], in->shape()[1], dim, theta);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
 }
 } // namespace llaisys::ops
diff --git a/src/ops/sampling/op.cpp b/src/ops/sampling/op.cpp
new file mode 100644
index 000000000..af8037926
--- /dev/null
+++ b/src/ops/sampling/op.cpp
@@ -0,0 +1,190 @@
+#include "op.hpp"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdlib>
+#include <ctime>
+#include <numeric>
+#include <utility>
+#include <vector>
+
+#include "../../core/llaisys_core.hpp"
+#include "../../utils.hpp"
+
+namespace {
+
+template <typename T>
+void load_logits(std::vector<float> &dst, const T *src, size_t n) {
+    dst.resize(n);
+    for (size_t i = 0; i < n; ++i) {
+        dst[i] = llaisys::utils::cast<float>(src[i]);
+    }
+}
+
+int64_t argmax_index(const std::vector<float> &vals) {
+    return static_cast<int64_t>(
+        std::distance(vals.begin(), std::max_element(vals.begin(), vals.end())));
+}
+
+int64_t sample_from_logits(std::vector<float> logits, float temperature, int top_k, float top_p) {
+    if (logits.empty()) {
+        return 0;
+    }
+
+    if (temperature <= 0.0f) {
+        return argmax_index(logits);
+    }
+
+    for (auto &v : logits) {
+        v /= temperature;
+    }
+
+    float max_logit = *std::max_element(logits.begin(), logits.end());
+    std::vector<float> probs(logits.size(), 0.0f);
+    float sum = 0.0f;
+    for (size_t i = 0; i < logits.size(); ++i) {
+        probs[i] = std::exp(logits[i] - max_logit);
+        sum += probs[i];
+    }
+    if (sum <= 0.0f) {
+        return argmax_index(logits);
+    }
+    for (auto &p : probs) {
+        p /= sum;
+    }
+
+    std::vector<size_t> keep_idx(probs.size());
+    std::iota(keep_idx.begin(), keep_idx.end(), 0);
+    std::sort(keep_idx.begin(), keep_idx.end(), [&probs](size_t a, size_t b) {
+        return probs[a] > probs[b];
+    });
+
+    if (top_k > 0 && static_cast<size_t>(top_k) < keep_idx.size()) {
+        keep_idx.resize(static_cast<size_t>(top_k));
+    }
+
+    if (top_p > 0.0f && top_p < 1.0f && !keep_idx.empty()) {
+        std::vector<size_t> nucleus;
+        nucleus.reserve(keep_idx.size());
+        float cumulative = 0.0f;
+        for (size_t idx : keep_idx) {
+            nucleus.push_back(idx);
+            cumulative += probs[idx];
+            if (cumulative >= top_p) {
+                break;
+            }
+        }
+        keep_idx.swap(nucleus);
+    }
+
+    if (keep_idx.empty()) {
+        return argmax_index(logits);
+    }
+
+    std::vector<float> filtered;
+    filtered.reserve(keep_idx.size());
+    for (size_t idx : keep_idx) {
+        filtered.push_back(probs[idx]);
+    }
+
+    float filtered_sum = std::accumulate(filtered.begin(), filtered.end(), 0.0f);
+    if (filtered_sum <= 0.0f) {
+        return static_cast<int64_t>(keep_idx[0]);
+    }
+    for (auto &v : filtered) {
+        v /= filtered_sum;
+    }
+
+    static bool seeded = false;
+    if (!seeded) {
+        std::srand(static_cast<unsigned int>(std::time(nullptr)));
+        seeded = true;
+    }
+
+    float r = static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);
+    float cdf = 0.0f;
+    size_t chosen = filtered.size() - 1;
+    for (size_t i = 0; i < filtered.size(); ++i) {
+        cdf += filtered[i];
+        if (r <= cdf) {
+            chosen = i;
+            break;
+        }
+    }
+    return static_cast<int64_t>(keep_idx[chosen]);
+}
+
+template <typename T>
+int64_t sample_impl(const T *logits, size_t n, float temperature, int top_k, float top_p) {
+    std::vector<float> host_logits;
+    load_logits(host_logits, logits, n);
+    return sample_from_logits(std::move(host_logits), temperature, top_k, top_p);
+}
+
+} // namespace
+
+namespace llaisys::ops {
+
+void sample(tensor_t out_idx, tensor_t logits, float temperature, int top_k, float top_p) {
+    CHECK_SAME_DEVICE(out_idx, logits);
+    ASSERT(out_idx->isContiguous() && logits->isContiguous(),
+           "Sample: out_idx and logits must be contiguous.");
+    CHECK_ARGUMENT(out_idx->dtype() == LLAISYS_DTYPE_I64, "Sample: out_idx must be int64.");
+    CHECK_ARGUMENT(out_idx->numel() == 1, "Sample: out_idx must contain exactly one element.");
+    CHECK_ARGUMENT(logits->ndim() == 1, "Sample: logits must be 1D.");
+    CHECK_ARGUMENT(logits->numel() > 0, "Sample: logits must not be empty.");
+    CHECK_ARGUMENT(top_k >= 0, "Sample: top_k must be >= 0.");
+    CHECK_ARGUMENT(top_p >= 0.0f && top_p <= 1.0f, "Sample: top_p must be in [0, 1].");
+
+    auto type = logits->dtype();
+    size_t n = logits->numel();
+
+    if (logits->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(logits->deviceType(), logits->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t logits_bytes = logits->numel() * logits->elementSize();
+        std::vector<std::byte> h_logits(logits_bytes);
+        api->memcpy_sync(h_logits.data(), logits->data(), logits_bytes, LLAISYS_MEMCPY_D2H);
+
+        int64_t sampled = 0;
+        switch (type) {
+        case LLAISYS_DTYPE_F32:
+            sampled = sample_impl(reinterpret_cast<const float *>(h_logits.data()), n, temperature, top_k, top_p);
+            break;
+        case LLAISYS_DTYPE_F16:
+            sampled = sample_impl(reinterpret_cast<const llaisys::fp16_t *>(h_logits.data()), n, temperature, top_k, top_p);
+            break;
+        case LLAISYS_DTYPE_BF16:
+            sampled = sample_impl(reinterpret_cast<const llaisys::bf16_t *>(h_logits.data()), n, temperature, top_k, top_p);
+            break;
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        api->memcpy_sync(out_idx->data(), &sampled, sizeof(int64_t), LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    if (logits->deviceType() != LLAISYS_DEVICE_CPU) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+
+    auto *idx_ptr = reinterpret_cast<int64_t *>(out_idx->data());
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        *idx_ptr = sample_impl(reinterpret_cast<const float *>(logits->data()), n, temperature, top_k, top_p);
+        return;
+    case LLAISYS_DTYPE_F16:
+        *idx_ptr = sample_impl(reinterpret_cast<const llaisys::fp16_t *>(logits->data()), n, temperature, top_k, top_p);
+        return;
+    case LLAISYS_DTYPE_BF16:
+        *idx_ptr = sample_impl(reinterpret_cast<const llaisys::bf16_t *>(logits->data()), n, temperature, top_k, top_p);
+        return;
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
+}
+
+} // namespace llaisys::ops
diff --git a/src/ops/sampling/op.hpp b/src/ops/sampling/op.hpp
new file mode 100644
index 000000000..e815ff784
--- /dev/null
+++ b/src/ops/sampling/op.hpp
@@ -0,0 +1,7 @@
+#pragma once
+
+#include "../../tensor/tensor.hpp"
+
+namespace llaisys::ops {
+void sample(tensor_t out_idx, tensor_t logits, float temperature, int top_k, float top_p);
+}
diff --git a/src/ops/self_attention/op.cpp b/src/ops/self_attention/op.cpp
index 43d620142..051ea8db6 100644
--- a/src/ops/self_attention/op.cpp
+++ b/src/ops/self_attention/op.cpp
@@ -1,7 +1,231 @@
 #include "op.hpp"
+#include "../../core/llaisys_core.hpp"
+#include <cmath>
+#include <limits>
+#include <algorithm>
+#include <vector>
+
+namespace {
+template <typename T>
+void self_attention_impl(T *attn_val_ptr, const T *q_ptr, const T *k_ptr, const T *v_ptr,
+                         size_t seqlen, size_t nhead, size_t d,
+                         size_t total_len, size_t nkvhead, size_t dv,
+                         float scale) {
+    size_t group_size = nhead / nkvhead; // GQA: each KV head serves group_size Q heads
+
+    // Compute attention for each query position
+    for (size_t s = 0; s < seqlen; ++s) {
+        // Compute attention for each Q head
+        for (size_t h = 0; h < nhead; ++h) {
+            size_t kv_head = h / group_size; // GQA mapping: every group_size consecutive Q heads share one KV head
+
+            // Step 1: attention scores for every position: Q[s,h] @ K[:,kv_head]^T
+            std::vector<float> scores(total_len);                      // scores of the current Q head against every K position
+            float max_score = -std::numeric_limits<float>::infinity(); // running maximum used to stabilize the softmax
+
+            for (size_t t = 0; t < total_len; ++t) {
+                float dot = 0.0f;
+                for (size_t i = 0; i < d; ++i) {
+                    float q_val = llaisys::utils::cast<float>(q_ptr[(s * nhead + h) * d + i]);
+                    float k_val = llaisys::utils::cast<float>(k_ptr[(t * nkvhead + kv_head) * d + i]);
+                    dot += q_val * k_val;
+                }
+                scores[t] = dot * scale;
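+                if (t <= s + (total_len - seqlen)) { max_score = std::max(max_score, scores[t]); } // max over causally visible positions only, so a large masked score cannot underflow the softmax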
+            }
+
+            // Step 2: apply the causal mask
+            size_t visible_len = s + (total_len - seqlen) + 1;
+            for (size_t t = visible_len; t < total_len; ++t) {
+                scores[t] = -std::numeric_limits<float>::infinity();
+            }
+
+            // Step 3: softmax (numerically stable version)
+            float sum_exp = 0.0f;
+            for (size_t t = 0; t < total_len; ++t) {
+                if (std::isinf(scores[t]) && scores[t] < 0) {
+                    scores[t] = 0.0f; // exp(-inf) is 0
+                } else {
+                    scores[t] = std::exp(scores[t] - max_score);
+                    sum_exp += scores[t];
+                }
+            }
+
+            // Guard against division by zero
+            if (sum_exp == 0.0f) {
+                sum_exp = 1.0f; // avoids NaN
+            }
+
+            // Normalize
+            for (size_t t = 0; t < total_len; ++t) {
+                scores[t] /= sum_exp;
+            }
+
+            // Step 4: weighted sum over V
+            for (size_t j = 0; j < dv; ++j) {
+                float output_val = 0.0f;
+                for (size_t t = 0; t < total_len; ++t) {
+                    float v_val = llaisys::utils::cast<float>(v_ptr[(t * nkvhead + kv_head) * dv + j]);
+                    output_val += scores[t] * v_val;
+                }
+                attn_val_ptr[(s * nhead + h) * dv + j] = llaisys::utils::cast<T>(output_val);
+            }
+        }
+    }
+}
+} // namespace
 
 namespace llaisys::ops {
 void self_attention(tensor_t attn_val, tensor_t q, tensor_t k, tensor_t v, float scale) {
-    TO_BE_IMPLEMENTED();
+    // Check devices and data types
+    CHECK_SAME_DEVICE(attn_val, q, k, v);
+
+    auto dtype = q->dtype();
+    ASSERT(attn_val->dtype() == dtype && k->dtype() == dtype && v->dtype() == dtype,
+           "SelfAttention: all tensors must have the same dtype.");
+    ASSERT(attn_val->isContiguous() && q->isContiguous() && k->isContiguous() && v->isContiguous(),
+           "SelfAttention: all tensors must be contiguous.");
+
+    // Check ranks
+    CHECK_ARGUMENT(attn_val->ndim() == 3, "SelfAttention: attn_val must be 3D.");
+    CHECK_ARGUMENT(q->ndim() == 3, "SelfAttention: q must be 3D.");
+    CHECK_ARGUMENT(k->ndim() == 3, "SelfAttention: k must be 3D.");
+    CHECK_ARGUMENT(v->ndim() == 3, "SelfAttention: v must be 3D.");
+
+    // Extract dimensions
+    size_t seqlen = q->shape()[0];
+    size_t nhead = q->shape()[1];
+    size_t d = q->shape()[2];
+    size_t total_len = k->shape()[0];
+    size_t nkvhead = k->shape()[1];
+    size_t dv = v->shape()[2];
+
+    // Check that shapes match
+    // Sequence length: Q's length must be <= the total K/V length (K/V may include cached positions)
+    CHECK_ARGUMENT(seqlen <= total_len,
+                   "SelfAttention: q.shape[0] must be <= k.shape[0].");
+
+    // K must share Q's head dimension; V must match K's length and KV head count
+    CHECK_ARGUMENT(k->shape()[2] == d,
+                   "SelfAttention: k.shape[2] must equal q.shape[2].");
+    CHECK_ARGUMENT(v->shape()[0] == total_len,
+                   "SelfAttention: v.shape[0] must equal k.shape[0].");
+    CHECK_ARGUMENT(v->shape()[1] == nkvhead,
+                   "SelfAttention: v.shape[1] must equal k.shape[1].");
+
+    // Validate the output tensor shape
+    CHECK_ARGUMENT(attn_val->shape()[0] == seqlen,
+                   "SelfAttention: attn_val.shape[0] must equal q.shape[0].");
+    CHECK_ARGUMENT(attn_val->shape()[1] == nhead,
+                   "SelfAttention: attn_val.shape[1] must equal q.shape[1].");
+    CHECK_ARGUMENT(attn_val->shape()[2] == dv,
+                   "SelfAttention: attn_val.shape[2] must equal v.shape[2].");
+
+    // GQA check: the number of Q heads must be divisible by the number of KV heads
+    CHECK_ARGUMENT(nhead % nkvhead == 0,
+                   "SelfAttention: q.shape[1] (nhead) must be divisible by k.shape[1] (nkvhead).");
+
+    switch (dtype) {
+    case LLAISYS_DTYPE_F32:
+        if (attn_val->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+            auto &ctx = llaisys::core::context();
+            ctx.setDevice(attn_val->deviceType(), attn_val->deviceId());
+            const auto *api = ctx.runtime().api();
+
+            const size_t out_bytes = attn_val->numel() * attn_val->elementSize();
+            const size_t q_bytes = q->numel() * q->elementSize();
+            const size_t k_bytes = k->numel() * k->elementSize();
+            const size_t v_bytes = v->numel() * v->elementSize();
+            std::vector<std::byte> h_out(out_bytes), h_q(q_bytes), h_k(k_bytes), h_v(v_bytes);
+
+            api->memcpy_sync(h_q.data(), q->data(), q_bytes, LLAISYS_MEMCPY_D2H);
+            api->memcpy_sync(h_k.data(), k->data(), k_bytes, LLAISYS_MEMCPY_D2H);
+            api->memcpy_sync(h_v.data(), v->data(), v_bytes, LLAISYS_MEMCPY_D2H);
+
+            self_attention_impl(reinterpret_cast<float *>(h_out.data()),
+                                reinterpret_cast<const float *>(h_q.data()),
+                                reinterpret_cast<const float *>(h_k.data()),
+                                reinterpret_cast<const float *>(h_v.data()),
+                                seqlen, nhead, d, total_len, nkvhead, dv, scale);
+            api->memcpy_sync(attn_val->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+            return;
+        }
+        if (attn_val->deviceType() != LLAISYS_DEVICE_CPU) {
+            EXCEPTION_UNSUPPORTED_DEVICE;
+        }
+        return self_attention_impl(reinterpret_cast<float *>(attn_val->data()),
+                                   reinterpret_cast<const float *>(q->data()),
+                                   reinterpret_cast<const float *>(k->data()),
+                                   reinterpret_cast<const float *>(v->data()),
+                                   seqlen, nhead, d, total_len, nkvhead, dv, scale);
+
+    case LLAISYS_DTYPE_F16:
+        if (attn_val->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+            auto &ctx = llaisys::core::context();
+            ctx.setDevice(attn_val->deviceType(), attn_val->deviceId());
+            const auto *api = ctx.runtime().api();
+
+            const size_t out_bytes = attn_val->numel() * attn_val->elementSize();
+            const size_t q_bytes = q->numel() * q->elementSize();
+            const size_t k_bytes = k->numel() * k->elementSize();
+            const size_t v_bytes = v->numel() * v->elementSize();
+            std::vector<std::byte> h_out(out_bytes), h_q(q_bytes), h_k(k_bytes), h_v(v_bytes);
+
+            api->memcpy_sync(h_q.data(), q->data(), q_bytes, LLAISYS_MEMCPY_D2H);
+            api->memcpy_sync(h_k.data(), k->data(), k_bytes, LLAISYS_MEMCPY_D2H);
+            api->memcpy_sync(h_v.data(), v->data(), v_bytes, LLAISYS_MEMCPY_D2H);
+
+            self_attention_impl(reinterpret_cast<llaisys::fp16_t *>(h_out.data()),
+                                reinterpret_cast<const llaisys::fp16_t *>(h_q.data()),
+                                reinterpret_cast<const llaisys::fp16_t *>(h_k.data()),
+                                reinterpret_cast<const llaisys::fp16_t *>(h_v.data()),
+                                seqlen, nhead, d, total_len, nkvhead, dv, scale);
+            api->memcpy_sync(attn_val->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+            return;
+        }
+        if (attn_val->deviceType() != LLAISYS_DEVICE_CPU) {
+            EXCEPTION_UNSUPPORTED_DEVICE;
+        }
+        return self_attention_impl(reinterpret_cast<llaisys::fp16_t *>(attn_val->data()),
+                                   reinterpret_cast<const llaisys::fp16_t *>(q->data()),
+                                   reinterpret_cast<const llaisys::fp16_t *>(k->data()),
+                                   reinterpret_cast<const llaisys::fp16_t *>(v->data()),
+                                   seqlen, nhead, d, total_len, nkvhead, dv, scale);
+
+    case LLAISYS_DTYPE_BF16:
+        if (attn_val->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+            auto &ctx = llaisys::core::context();
+            ctx.setDevice(attn_val->deviceType(), attn_val->deviceId());
+            const auto *api = ctx.runtime().api();
+
+            const size_t out_bytes = attn_val->numel() * attn_val->elementSize();
+            const size_t q_bytes = q->numel() * q->elementSize();
+            const size_t k_bytes = k->numel() * k->elementSize();
+            const size_t v_bytes = v->numel() * v->elementSize();
+            std::vector<std::byte> h_out(out_bytes), h_q(q_bytes), h_k(k_bytes), h_v(v_bytes);
+
+            api->memcpy_sync(h_q.data(), q->data(), q_bytes, LLAISYS_MEMCPY_D2H);
+            api->memcpy_sync(h_k.data(), k->data(), k_bytes, LLAISYS_MEMCPY_D2H);
+            api->memcpy_sync(h_v.data(), v->data(), v_bytes, LLAISYS_MEMCPY_D2H);
+
+            self_attention_impl(reinterpret_cast<llaisys::bf16_t *>(h_out.data()),
+                                reinterpret_cast<const llaisys::bf16_t *>(h_q.data()),
+                                reinterpret_cast<const llaisys::bf16_t *>(h_k.data()),
+                                reinterpret_cast<const llaisys::bf16_t *>(h_v.data()),
+                                seqlen, nhead, d, total_len, nkvhead, dv, scale);
+            api->memcpy_sync(attn_val->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+            return;
+        }
+        if (attn_val->deviceType() != LLAISYS_DEVICE_CPU) {
+            EXCEPTION_UNSUPPORTED_DEVICE;
+        }
+        return self_attention_impl(reinterpret_cast<llaisys::bf16_t *>(attn_val->data()),
+                                   reinterpret_cast<const llaisys::bf16_t *>(q->data()),
+                                   reinterpret_cast<const llaisys::bf16_t *>(k->data()),
+                                   reinterpret_cast<const llaisys::bf16_t *>(v->data()),
+                                   seqlen, nhead, d, total_len, nkvhead, dv, scale);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(dtype);
+    }
 }
-} // namespace llaisys::ops
+} // namespace llaisys::ops
\ No newline at end of file
diff --git a/src/ops/swiglu/op.cpp b/src/ops/swiglu/op.cpp
index 47edbcc97..06b7d3624 100644
--- a/src/ops/swiglu/op.cpp
+++ b/src/ops/swiglu/op.cpp
@@ -1,7 +1,121 @@
 #include "op.hpp"
 
+#include "../../core/llaisys_core.hpp"
+#include "../../utils.hpp"
+#include <cmath>
+
+#include <vector>
+
+namespace {
+template <typename T>
+void swiglu_impl(T *out_ptr, const T *gate_ptr, const T *up_ptr, size_t total_size) {
+    // out[i] = up[i] * gate[i] / (1 + exp(-gate[i]))
+    // This is the SiLU-gated variant of the gated linear unit (SwiGLU)
+    for (size_t i = 0; i < total_size; ++i) {
+        // Compute in float to preserve precision
+        float gate_val = llaisys::utils::cast<float>(gate_ptr[i]);
+        float up_val = llaisys::utils::cast<float>(up_ptr[i]);
+
+        // Compute gate / (1 + exp(-gate))
+        // Numerical stability: avoid overflow in exp(-gate_val)
+        float glu_val;
+        if (gate_val >= 50.0f) {
+            // For gate_val >= 50, exp(-gate_val) ≈ 0, so gate / (1 + 0) ≈ gate
+            glu_val = gate_val;
+        } else if (gate_val <= -50.0f) {
+            // For gate_val <= -50, exp(-gate_val) is huge, so gate / (1 + exp(-gate)) ≈ 0
+            glu_val = 0.0f;
+        } else {
+            // General case: gate / (1 + exp(-gate))
+            glu_val = gate_val / (1.0f + std::exp(-gate_val));
+        }
+
+        // Output: up * glu
+        out_ptr[i] = llaisys::utils::cast<T>(up_val * glu_val);
+    }
+}
+} // namespace
+
 namespace llaisys::ops {
 void swiglu(tensor_t out, tensor_t gate, tensor_t up) {
-    TO_BE_IMPLEMENTED();
+    // Basic validation: device, dtype, contiguity
+    CHECK_SAME_DEVICE(out, gate, up);
+    CHECK_SAME_DTYPE(out->dtype(), gate->dtype(), up->dtype());
+    ASSERT(out->isContiguous() && gate->isContiguous() && up->isContiguous(),
+           "SwiGLU: all tensors must be contiguous.");
+
+    // Rank and shape checks
+    CHECK_ARGUMENT(out->ndim() == gate->ndim() && gate->ndim() == up->ndim(),
+                   "SwiGLU: out, gate, and up must have the same number of dimensions.");
+    CHECK_ARGUMENT(out->shape() == gate->shape() && gate->shape() == up->shape(),
+                   "SwiGLU: out, gate, and up must have the same shape.");
+
+    size_t total_size = out->numel();
+    auto type = out->dtype();
+
+    if (out->deviceType() == LLAISYS_DEVICE_NVIDIA) {
+        auto &ctx = llaisys::core::context();
+        ctx.setDevice(out->deviceType(), out->deviceId());
+        const auto *api = ctx.runtime().api();
+
+        const size_t out_bytes = out->numel() * out->elementSize();
+        const size_t gate_bytes = gate->numel() * gate->elementSize();
+        const size_t up_bytes = up->numel() * up->elementSize();
+        std::vector<std::byte> h_out(out_bytes), h_gate(gate_bytes), h_up(up_bytes);
+        api->memcpy_sync(h_gate.data(), gate->data(), gate_bytes, LLAISYS_MEMCPY_D2H);
+        api->memcpy_sync(h_up.data(), up->data(), up_bytes, LLAISYS_MEMCPY_D2H);
+
+        switch (type) {
+        case LLAISYS_DTYPE_F32:
+            swiglu_impl(reinterpret_cast<float *>(h_out.data()),
+                        reinterpret_cast<const float *>(h_gate.data()),
+                        reinterpret_cast<const float *>(h_up.data()),
+                        total_size);
+            break;
+        case LLAISYS_DTYPE_F16:
+            swiglu_impl(reinterpret_cast<llaisys::fp16_t *>(h_out.data()),
+                        reinterpret_cast<const llaisys::fp16_t *>(h_gate.data()),
+                        reinterpret_cast<const llaisys::fp16_t *>(h_up.data()),
+                        total_size);
+            break;
+        case LLAISYS_DTYPE_BF16:
+            swiglu_impl(reinterpret_cast<llaisys::bf16_t *>(h_out.data()),
+                        reinterpret_cast<const llaisys::bf16_t *>(h_gate.data()),
+                        reinterpret_cast<const llaisys::bf16_t *>(h_up.data()),
+                        total_size);
+            break;
+        default:
+            EXCEPTION_UNSUPPORTED_DATATYPE(type);
+        }
+
+        api->memcpy_sync(out->data(), h_out.data(), out_bytes, LLAISYS_MEMCPY_H2D);
+        return;
+    }
+
+    // Any device other than NVIDIA must be CPU
+    if (out->deviceType() != LLAISYS_DEVICE_CPU) {
+        EXCEPTION_UNSUPPORTED_DEVICE;
+    }
+
+    // dtype dispatch: F32, F16, and BF16 are supported
+    switch (type) {
+    case LLAISYS_DTYPE_F32:
+        return swiglu_impl(reinterpret_cast<float *>(out->data()),
+                           reinterpret_cast<const float *>(gate->data()),
+                           reinterpret_cast<const float *>(up->data()),
+                           total_size);
+    case LLAISYS_DTYPE_F16:
+        return swiglu_impl(reinterpret_cast<llaisys::fp16_t *>(out->data()),
+                           reinterpret_cast<const llaisys::fp16_t *>(gate->data()),
+                           reinterpret_cast<const llaisys::fp16_t *>(up->data()),
+                           total_size);
+    case LLAISYS_DTYPE_BF16:
+        return swiglu_impl(reinterpret_cast<llaisys::bf16_t *>(out->data()),
+                           reinterpret_cast<const llaisys::bf16_t *>(gate->data()),
+                           reinterpret_cast<const llaisys::bf16_t *>(up->data()),
+                           total_size);
+    default:
+        EXCEPTION_UNSUPPORTED_DATATYPE(type);
+    }
 }
 } // namespace llaisys::ops
diff --git a/src/tensor/tensor.cpp b/src/tensor/tensor.cpp
index 2f594bb65..84fc0eb34 100644
--- a/src/tensor/tensor.cpp
+++ b/src/tensor/tensor.cpp
@@ -164,27 +164,82 @@ void Tensor::debug() const {
 }
 
 bool Tensor::isContiguous() const {
-    TO_BE_IMPLEMENTED();
+    ptrdiff_t expected = 1;
+    for (int i = static_cast<int>(this->ndim()) - 1; i >= 0; i--) {
+        if (this->strides()[i] != expected) {
+            return false;
+        }
+        expected *= static_cast<ptrdiff_t>(this->shape()[i]);
+    }
     return true;
 }
 
 tensor_t Tensor::permute(const std::vector<size_t> &order) const {
-    TO_BE_IMPLEMENTED();
-    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
+    auto ndim_ = this->ndim();
+    CHECK_ARGUMENT(order.size() == ndim_, "permute: order size must equal ndim");
+
+    std::vector<bool> seen(ndim_, false);
+    for (size_t i = 0; i < order.size(); ++i) {
+        CHECK_ARGUMENT(order[i] < ndim_, "permute: order index out of range");
+        CHECK_ARGUMENT(!seen[order[i]], "permute: order contains duplicate dims");
+        seen[order[i]] = true;
+    }
+
+    std::vector<size_t> new_shape(ndim_);
+    std::vector<ptrdiff_t> new_strides(ndim_);
+    for (size_t i = 0; i < ndim_; ++i) {
+        new_shape[i] = _meta.shape[order[i]];
+        new_strides[i] = _meta.strides[order[i]];
+    }
+
+    TensorMeta new_meta{_meta.dtype, std::move(new_shape), std::move(new_strides)};
+    return std::shared_ptr<Tensor>(new Tensor(std::move(new_meta), _storage, _offset));
 }
 
 tensor_t Tensor::view(const std::vector<size_t> &shape) const {
-    TO_BE_IMPLEMENTED();
-    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
+    auto elems = std::accumulate(shape.begin(), shape.end(), size_t(1), std::multiplies<size_t>());
+    CHECK_ARGUMENT(this->numel() == elems, "view: total elements mismatch");
+    CHECK_ARGUMENT(this->isContiguous(), "view requires contiguous tensor; call contiguous() first");
+
+    std::vector<ptrdiff_t> new_strides(shape.size());
+    size_t stride = 1;
+    for (int i = static_cast<int>(shape.size()) - 1; i >= 0; --i) {
+        new_strides[i] = static_cast<ptrdiff_t>(stride);
+        stride *= shape[i];
+    }
+
+    TensorMeta new_meta{_meta.dtype, shape, std::move(new_strides)};
+    return std::shared_ptr<Tensor>(new Tensor(std::move(new_meta), _storage, _offset));
 }
 
 tensor_t Tensor::slice(size_t dim, size_t start, size_t end) const {
-    TO_BE_IMPLEMENTED();
-    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
+    auto ndim_ = this->ndim();
+    CHECK_ARGUMENT(dim < ndim_, "slice: dim out of range");
+    CHECK_ARGUMENT(start <= end, "slice: start must be <= end");
+    CHECK_ARGUMENT(end <= _meta.shape[dim], "slice: end exceeds dimension size");
+
+    //_meta is of type TensorMeta defined in tensor.hpp
+    auto new_shape = _meta.shape; // _meta.shape is std::vector<size_t>
+    new_shape[dim] = end - start;
+    auto new_strides = _meta.strides; // stride layout stays the same
+
+    auto byte_offset = _offset + start * static_cast<size_t>(_meta.strides[dim]) * this->elementSize();
+    TensorMeta new_meta{_meta.dtype, std::move(new_shape), std::move(new_strides)};
+    return std::shared_ptr<Tensor>(new Tensor(std::move(new_meta), _storage, byte_offset));
 }
 
 void Tensor::load(const void *src_) {
-    TO_BE_IMPLEMENTED();
+    core::context().setDevice(this->deviceType(), this->deviceId());
+    auto bytes = this->numel() * this->elementSize();
+    if (_storage->isHost()) {
+        std::memcpy(this->data(), src_, bytes);
+    } else {
+        core::context().runtime().api()->memcpy_sync(
+            this->data(),
+            src_,
+            bytes,
+            LLAISYS_MEMCPY_H2D);
+    }
 }
 
 tensor_t Tensor::contiguous() const {
diff --git a/src/tensor/tensor.hpp b/src/tensor/tensor.hpp
index 35e340922..4a206afe4 100644
--- a/src/tensor/tensor.hpp
+++ b/src/tensor/tensor.hpp
@@ -27,20 +27,23 @@ class Tensor {
         int device = 0);
     ~Tensor() = default;
     // Info
-    std::byte *data();
-    const std::byte *data() const;
-    size_t ndim() const;
+    std::byte *data();             // pointer to the tensor data
+    const std::byte *data() const; // pointer to the tensor data (read-only)
+    size_t ndim() const;           // number of dimensions
     const std::vector<size_t> &shape() const;
     const std::vector<ptrdiff_t> &strides() const;
-    llaisysDataType_t dtype() const;
-    llaisysDeviceType_t deviceType() const;
-    int deviceId() const;
-    size_t numel() const;
-    size_t elementSize() const;
+    llaisysDataType_t dtype() const;        // data type
+    llaisysDeviceType_t deviceType() const; // device type
+    int deviceId() const;                   // device ID
+    size_t numel() const;                   // total number of elements
+    size_t elementSize() const;             // size of each element, in bytes
 
     std::string info() const;
     void debug() const;
 
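+    // shape and strides together determine the tensor's memory layout.
+    // isContiguous() returns true when the tensor is stored contiguously in row-major order:
+    // the last dimension has stride 1, the second-to-last has stride equal to the last dimension's size, and so on.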
     bool isContiguous() const;
 
     // Meta Transform
@@ -50,6 +53,8 @@ class Tensor {
 
     // Load data from host memory
     void load(const void *src);
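+    // src: pointer to the source data, which must reside in host memory.
+    // Copies the data into the tensor: a host-to-device copy if the tensor lives on a device, otherwise a plain memcpy.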
 
     // Challenging features
     tensor_t contiguous() const;
diff --git a/test/ops/sampling.py b/test/ops/sampling.py
new file mode 100644
index 000000000..110e02c52
--- /dev/null
+++ b/test/ops/sampling.py
@@ -0,0 +1,61 @@
+import sys
+import os
+
+parent_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+sys.path.insert(0, parent_dir)
+
+import torch
+import llaisys
+
+
+def _make_logits_tensor(logits: torch.Tensor, device_name: str):
+    t = llaisys.Tensor(
+        logits.shape,
+        dtype=llaisys.DataType.F32,
+        device=llaisys.DeviceType.NVIDIA if device_name == "nvidia" else llaisys.DeviceType.CPU,
+        device_id=0,
+    )
+    api = llaisys.RuntimeAPI(t.device_type())
+    api.memcpy_sync(t.data_ptr(), logits.data_ptr(), logits.numel() * logits.element_size(), llaisys.MemcpyKind.D2D)
+    return t
+
+
+def _read_i64_scalar(t: llaisys.Tensor) -> int:
+    out = torch.zeros((1,), dtype=torch.int64, device=torch.device("cuda" if t.device_type() == llaisys.DeviceType.NVIDIA else "cpu"))
+    api = llaisys.RuntimeAPI(t.device_type())
+    api.memcpy_sync(out.data_ptr(), t.data_ptr(), out.numel() * out.element_size(), llaisys.MemcpyKind.D2D)
+    return int(out.item())
+
+
+def test_sampling(device_name: str):
+    print(f"Testing Ops.sample on {device_name}")
+    device = torch.device("cuda" if device_name == "nvidia" else "cpu")
+    logits = torch.tensor([0.1, 2.0, 0.5, 1.0], dtype=torch.float32, device=device)
+
+    logits_t = _make_logits_tensor(logits, device_name)
+    out_idx = llaisys.Tensor((1,), dtype=llaisys.DataType.I64,
+                             device=llaisys.DeviceType.NVIDIA if device_name == "nvidia" else llaisys.DeviceType.CPU,
+                             device_id=0)
+
+    # top_k=1 should behave like argmax.
+    llaisys.Ops.sample(out_idx, logits_t, temperature=1.0, top_k=1, top_p=1.0)
+    idx = _read_i64_scalar(out_idx)
+    assert idx == 1, f"Expected argmax index 1, got {idx}"
+
+    # For top_k=2, sampled index should always be one of top-2 entries.
+    allowed = {1, 3}
+    for _ in range(64):
+        llaisys.Ops.sample(out_idx, logits_t, temperature=0.9, top_k=2, top_p=1.0)
+        idx = _read_i64_scalar(out_idx)
+        assert idx in allowed, f"Expected sampled idx in {allowed}, got {idx}"
+
+    print("\033[92mTest passed!\033[0m\n")
+
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--device", default="cpu", choices=["cpu", "nvidia"], type=str)
+    args = parser.parse_args()
+    test_sampling(args.device)
diff --git a/test/ops/self_attention.py b/test/ops/self_attention.py
index a042b51be..12bb239b5 100644
--- a/test/ops/self_attention.py
+++ b/test/ops/self_attention.py
@@ -15,9 +15,8 @@ def torch_self_attention(attn_val, query, key, value, scale):
     L, S = query.size(-2), key.size(-2)
     attn_bias = torch.zeros(L, S, dtype=query.dtype, device=query.device)
 
-    temp_mask = torch.ones(L, S, dtype=torch.bool).tril(diagonal=S-L)
+    temp_mask = torch.ones(L, S, dtype=torch.bool, device=query.device).tril(diagonal=S-L)
     attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
-    attn_bias.to(query.dtype)
 
     key = key.repeat_interleave(query.size(-3) // key.size(-3), -3)
     value = value.repeat_interleave(query.size(-3) // value.size(-3), -3)
diff --git a/xmake.lua b/xmake.lua
index 1f65f7a95..52f7e5985 100644
--- a/xmake.lua
+++ b/xmake.lua
@@ -13,6 +13,28 @@ option("nv-gpu")
     set_description("Whether to compile implementations for Nvidia GPU")
 option_end()
 
+option("openmp")
+    set_default(true)
+    set_showmenu(true)
+    set_description("Whether to enable OpenMP for CPU operators")
+option_end()
+
+option("cpu-avx2")
+    set_default(true)
+    set_showmenu(true)
+    set_description("Whether to enable AVX2/FMA for CPU operators")
+option_end()
+
+option("openblas")
+    set_default(false)
+    set_showmenu(true)
+    set_description("Whether to enable OpenBLAS backend for CPU linear f32")
+option_end()
+
+if has_config("openblas") then
+    add_requires("openblas", {optional = true})
+end
+
 if has_config("nv-gpu") then
     add_defines("ENABLE_NVIDIA_API")
     includes("xmake/nvidia.lua")
@@ -26,6 +48,21 @@ target("llaisys-utils")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
 
     add_files("src/utils/*.cpp")
 
@@ -37,12 +74,30 @@ target("llaisys-device")
     set_kind("static")
     add_deps("llaisys-utils")
     add_deps("llaisys-device-cpu")
+    if has_config("nv-gpu") then
+        add_deps("llaisys-device-nvidia")
+    end
 
     set_languages("cxx17")
     set_warnings("all", "error")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
 
     add_files("src/device/*.cpp")
 
@@ -59,6 +114,21 @@ target("llaisys-core")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
 
     add_files("src/core/*/*.cpp")
 
@@ -74,6 +144,21 @@ target("llaisys-tensor")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
 
     add_files("src/tensor/*.cpp")
 
@@ -89,6 +174,21 @@ target("llaisys-ops")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
     
     add_files("src/ops/*/*.cpp")
 
@@ -105,6 +205,22 @@ target("llaisys")
 
     set_languages("cxx17")
     set_warnings("all", "error")
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+            add_syslinks("gomp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
     add_files("src/llaisys/*.cc")
     set_installdir(".")
 
diff --git a/xmake/cpu.lua b/xmake/cpu.lua
index 101d894e6..632841f5d 100644
--- a/xmake/cpu.lua
+++ b/xmake/cpu.lua
@@ -5,6 +5,21 @@ target("llaisys-device-cpu")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
 
     add_files("../src/device/cpu/*.cpp")
 
@@ -19,6 +34,25 @@ target("llaisys-ops-cpu")
     if not is_plat("windows") then
         add_cxflags("-fPIC", "-Wno-unknown-pragmas")
     end
+    if has_config("openmp") then
+        if is_plat("windows") then
+            add_cxflags("/openmp")
+        else
+            add_cxflags("-fopenmp")
+            add_ldflags("-fopenmp")
+        end
+    end
+    if has_config("cpu-avx2") and is_arch("x64", "x86_64") then
+        if is_plat("windows") then
+            add_cxflags("/arch:AVX2")
+        else
+            add_cxflags("-mavx2", "-mfma")
+        end
+    end
+    if has_config("openblas") then
+        add_defines("ENABLE_OPENBLAS")
+        add_packages("openblas")
+    end
 
     add_files("../src/ops/*/cpu/*.cpp")
 
diff --git a/xmake/nvidia.lua b/xmake/nvidia.lua
new file mode 100644
index 000000000..2509a8b5d
--- /dev/null
+++ b/xmake/nvidia.lua
@@ -0,0 +1,20 @@
+target("llaisys-device-nvidia")
+    set_kind("static")
+    set_languages("cxx17")
+    set_warnings("all", "error")
+
+    if not is_plat("windows") then
+        add_cxflags("-fPIC")
+        add_includedirs("/usr/local/cuda/include")
+        add_linkdirs("/usr/local/cuda/lib64")
+        add_links("cudart")
+    else
+        add_includedirs("$(env CUDA_PATH)/include")
+        add_linkdirs("$(env CUDA_PATH)/lib/x64")
+        add_links("cudart")
+    end
+
+    add_files("../src/device/nvidia/*.cpp")
+
+    on_install(function (target) end)
+target_end()
diff --git "a/\344\273\273\345\212\241\344\270\211-\346\216\245\345\217\243\345\261\202\346\227\266\345\272\217\345\233\276\344\270\216\345\217\243\350\277\260\347\250\277.md" "b/\344\273\273\345\212\241\344\270\211-\346\216\245\345\217\243\345\261\202\346\227\266\345\272\217\345\233\276\344\270\216\345\217\243\350\277\260\347\250\277.md"
new file mode 100644
index 000000000..2134835e5
--- /dev/null
+++ "b/\344\273\273\345\212\241\344\270\211-\346\216\245\345\217\243\345\261\202\346\227\266\345\272\217\345\233\276\344\270\216\345\217\243\350\277\260\347\250\277.md"
@@ -0,0 +1,114 @@
+# Task 3: Interface-Layer Sequence Diagram and Talking Points (ready to memorize)
+
+## 1. The call chain in one diagram
+
+```mermaid
+sequenceDiagram
+    participant U as User code
+    participant P as Python Qwen2 class
+    participant C as ctypes binding layer
+    participant L as libllaisys C API
+    participant M as C++ Qwen2ModelImpl
+    participant O as Operator layer (ops)
+
+    U->>P: Qwen2(model_path)
+    P->>P: Read config.json, build Meta
+    P->>C: Pass Meta and arguments to llaisysQwen2ModelCreate
+    C->>L: llaisysQwen2ModelCreate(meta,...)
+    L->>M: new Qwen2ModelImpl + init_weights()
+    M-->>L: Return model handle
+    L-->>C: void* model
+    C-->>P: self._model
+
+    P->>P: _load_weights() iterates over safetensors
+    loop For each weight tensor
+        P->>C: llaisysQwen2ModelGetWeight(kind, layer)
+        C->>L: llaisysQwen2ModelGetWeight(...)
+        L->>M: Look up target tensor handle
+        M-->>L: llaisysTensor_t
+        L-->>C: tensor handle
+        C-->>P: tensor handle
+        P->>L: tensorLoad(handle, numpy_ptr)
+    end
+
+    U->>P: generate(inputs)
+    loop Each generation step
+        P->>C: llaisysQwen2ModelInfer(model, token_ids, ntoken)
+        C->>L: llaisysQwen2ModelInfer(...)
+        L->>M: infer_next(...)
+        M->>O: embedding/rms_norm/linear/rope/self_attention/swiglu/argmax
+        O-->>M: next_token
+        M-->>L: int64 next_token
+        L-->>C: next_token
+        C-->>P: next_token
+        P->>P: append token, check for eos
+    end
+
+    U->>P: Object released
+    P->>C: llaisysQwen2ModelDestroy(model)
+    C->>L: llaisysQwen2ModelDestroy(...)
+    L->>M: destroy_weights + delete
+```
+
+---
+
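+## 2. Why each step is designed this way
+
+### 2.1 Create and Destroy must always be paired
+
+- `Create` allocates the model object, the per-layer weight handles, and the KV cache.
+- `Destroy` releases them; without it, that memory leaks.
+- Python has a garbage collector, but it does not know how to free the C++ object safely, so the C API must expose an explicit destroy function.
+
+### 2.2 GetWeight(kind, layer) as an "integer protocol"
+
+- When loading weights, Python first parses each weight name into a `kind + layer` pair.
+- The C++ side returns the target tensor handle according to that protocol.
+- This avoids string matching across the language boundary, which keeps the interface more robust and cheaper to call.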
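+
+A minimal sketch of the name-to-`(kind, layer)` parsing described in 2.2. The `KIND_IDS` table, its numbering, and the helper `parse_weight_name` are hypothetical and only illustrate the idea; the real IDs live in the project's C header. The name pattern assumes Hugging Face-style tensor names of the form `model.layers.<n>.<suffix>`.
+
+```python
+import re
+
+# Hypothetical kind IDs for illustration; the real values are defined
+# by the WeightKind enum on the C side.
+KIND_IDS = {
+    "input_layernorm.weight": 0,
+    "self_attn.q_proj.weight": 1,
+    "self_attn.k_proj.weight": 2,
+    "self_attn.v_proj.weight": 3,
+    "self_attn.o_proj.weight": 4,
+}
+
+# Assumed naming scheme: model.layers.<layer index>.<suffix>
+_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)$")
+
+
+def parse_weight_name(name):
+    """Map a checkpoint tensor name to (kind, layer), or None if unknown."""
+    match = _LAYER_RE.match(name)
+    if match is None:
+        return None
+    layer, suffix = int(match.group(1)), match.group(2)
+    kind = KIND_IDS.get(suffix)
+    if kind is None:
+        return None
+    return kind, layer
+```
+
+Only two integers cross the language boundary per tensor; all string handling stays on the Python side.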
+
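+### 2.3 Infer returns a single next_token
+
+- Autoregressive generation predicts one step at a time by nature.
+- Returning one token per step makes incremental computation with the KV cache straightforward.
+- Your `generate()` produces the full text by calling Infer in a loop.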
+
+---
+
+## 3. Interface map for your code
+
+- C header: include/llaisys/models/qwen2.h
+- C++ implementation of the interface: src/llaisys/qwen2.cc
+- ctypes declarations: python/llaisys/libllaisys/models.py
+- Python wrapper entry point: python/llaisys/models/qwen2.py
+
+Key function correspondence:
+- Python `Qwen2.__init__` -> C `llaisysQwen2ModelCreate`
+- Python `_load_weights` -> C `llaisysQwen2ModelGetWeight` + `tensorLoad`
+- Python `generate` -> C `llaisysQwen2ModelInfer`
+- Python `__del__` -> C `llaisysQwen2ModelDestroy`
+
+---
+
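+## 4. Interview talking script (60 seconds)
+
+For Task 3, I first designed a stable interface layer that decouples the model implementation from its callers.
+First, `LlaisysQwen2Meta` carries all model structure parameters in one struct, and `llaisysQwen2ModelCreate` allocates the weights and the KV cache on the C++ side.
+Second, `llaisysQwen2ModelGetWeight(kind, layer)` defines a cross-language protocol for injecting weights: Python maps each safetensors entry to its slot and writes it into the backend tensor with `tensorLoad`.
+Third, during generation Python calls `llaisysQwen2ModelInfer` in a loop; the C++ side runs the full operator chain, uses the KV cache for incremental inference, and returns one next token per step.
+Finally, `llaisysQwen2ModelDestroy` releases all resources, which completes the lifecycle.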
+
+---
+
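+## 5. Quick answers to common follow-up questions
+
+1) Why use an opaque handle (`struct LlaisysQwen2Model;`)?
+- To hide implementation details, keep the ABI stable, and make it easy to replace the internals later.
+
+2) Why does `Meta` contain so many fields?
+- Because the backend needs them to allocate every weight shape and cache shape in one pass.
+
+3) Why is a `Weights` struct still needed?
+- It is the backend's master table of parameter slots; `GetWeight` simply looks up the target handle in that table.
+
+4) Why not run the forward pass directly in Python?
+- The task requires the core inference to live in the backend; Python only wraps and orchestrates.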
diff --git "a/\344\273\273\345\212\241\344\270\211-\346\216\250\347\220\206\345\261\202\346\227\266\345\272\217\345\233\276\344\270\216\345\274\240\351\207\217\346\265\201.md" "b/\344\273\273\345\212\241\344\270\211-\346\216\250\347\220\206\345\261\202\346\227\266\345\272\217\345\233\276\344\270\216\345\274\240\351\207\217\346\265\201.md"
new file mode 100644
index 000000000..6514c591a
--- /dev/null
+++ "b/\344\273\273\345\212\241\344\270\211-\346\216\250\347\220\206\345\261\202\346\227\266\345\272\217\345\233\276\344\270\216\345\274\240\351\207\217\346\265\201.md"
@@ -0,0 +1,163 @@
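+# Task 3: Inference-Layer Sequence Diagram and Tensor Flow (infer_next, explainable line by line)
+
+## 1. Inference layer overview (single-step next token)
+
+The diagram below corresponds to the main flow of the C++ `infer_next` function, from the input tokens to the output next_token.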
+
+```mermaid
+flowchart TD
+    A["Input token_ids and ntoken"] --> B{"ntoken == 0?"}
+    B -->|Yes| Z["Return end_token"]
+    B -->|No| C{"ntoken < cur_pos?"}
+    C -->|Yes| C1["cur_pos = 0"]
+    C -->|No| D["Start processing at i = cur_pos"]
+    C1 --> D
+
+    D --> E["Take token_ids[i] -> token_tensor[1]"]
+    E --> F["embedding -> x[1, hs]"]
+
+    F --> G["Loop over layers l = 0..nlayer-1"]
+
+    G --> H1["x_norm = RMSNorm x"]
+    H1 --> H2["q_lin = Linear x_norm -> 1 x nh*dh"]
+    H1 --> H3["k_lin = Linear x_norm -> 1 x nkvh*dh"]
+    H1 --> H4["v_lin = Linear x_norm -> 1 x nkvh*dh"]
+
+    H2 --> I1["view q -> 1 x nh x dh"]
+    H3 --> I2["view k -> 1 x nkvh x dh"]
+    H4 --> I3["view v -> 1 x nkvh x dh"]
+
+    I1 --> J1["rope q_rot"]
+    I2 --> J2["rope k_rot"]
+
+    J2 --> K1["Write to k_cache l,i"]
+    I3 --> K2["Write to v_cache l,i"]
+
+    K1 --> L1["k_all = slice 0..i"]
+    K2 --> L2["v_all = slice 0..i"]
+
+    J1 --> M["self_attention q_rot,k_all,v_all"]
+    L1 --> M
+    L2 --> M
+
+    M --> N1["attn_out view -> 1 x nh*dh"]
+    N1 --> N2["attn_proj = Linear -> 1 x hs"]
+    N2 --> N3["Residual add: x = x + attn_proj"]
+
+    N3 --> P1["x_norm2 = RMSNorm x"]
+    P1 --> P2["gate = Linear -> 1 x di"]
+    P1 --> P3["up = Linear -> 1 x di"]
+    P2 --> P4["swiglu_out = SwiGLU gate,up"]
+    P3 --> P4
+    P4 --> P5["mlp_out = Linear -> 1 x hs"]
+    P5 --> P6["Residual add: x = x + mlp_out"]
+
+    P6 --> G
+
+    G --> Q["final_norm = RMSNorm x"]
+    Q --> R["logits = Linear -> 1 x voc"]
+    R --> S["argmax logits -> max_idx"]
+    S --> T["next_token = max_idx"]
+
+    T --> U["cur_pos = ntoken"]
+    U --> V["Return next_token"]
+```
+
+---
+
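+## 2. Mapping to the code, section by section (you can walk through it while reading)
+
+Core file: src/llaisys/qwen2.cc
+
+- Entry function: `infer_next(...)`
+  - Signature: `int64_t infer_next(const int64_t *token_ids, size_t ntoken)`
+- Cache control: `cur_pos` and the reset logic
+- Per-layer operator chain: RMSNorm -> QKV Linear -> RoPE -> Attention -> MLP
+- Output head: FinalNorm -> Linear(vocab) -> Argmax
+
+You can present it with the following talking template:
+
+1) Handle the edge case first: an empty input returns `end_token` immediately.  
+2) If the input is shorter than `cur_pos`, the context has changed, so `cur_pos` is reset to zero to keep stale cache entries from contaminating the result.  
+3) Only the new positions `[cur_pos, ntoken)` are processed; this is the core of incremental inference.  
+4) Every new token passes through the full layer stack, but attention only reads K/V up to and including the current position `i`.  
+5) Each step produces `logits[1,voc]`, and `argmax` over them yields one `next_token`.  
+6) Finally, `cur_pos = ntoken` is updated to prepare for the next incremental call.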
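+
+The bookkeeping in steps 1) to 6) can be sketched in Python as follows. This is a toy stand-in, not the C++ implementation: the class `IncrementalDecoder`, its list-based `cache`, and the placeholder "model" (sum of the context modulo 100) are all hypothetical and exist only to make the `cur_pos` logic executable.
+
+```python
+class IncrementalDecoder:
+    """Toy stand-in for the cache bookkeeping around infer_next."""
+
+    def __init__(self, end_token=0):
+        self.end_token = end_token
+        self.cur_pos = 0   # number of positions already in the cache
+        self.cache = []    # one cached entry per processed position
+
+    def infer_next(self, token_ids):
+        ntoken = len(token_ids)
+        if ntoken == 0:
+            return self.end_token
+        if ntoken < self.cur_pos:
+            # Shorter input: treat it as a new context and start over.
+            self.cur_pos = 0
+        # Drop entries past the cursor (a no-op unless the cursor was reset).
+        del self.cache[self.cur_pos:]
+        for i in range(self.cur_pos, ntoken):
+            # A real model would run the layer stack here and store K/V.
+            self.cache.append(token_ids[i])
+        self.cur_pos = ntoken
+        # Placeholder "model": the next token is the context sum modulo 100.
+        return sum(self.cache) % 100
+```
+
+Like the check described above, this sketch only detects a shorter input. An input of the same or greater length whose prefix differs from the cached one would still reuse the stale entries.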
+
+---
+
+## 3. Key tensor shapes (the most frequently asked part)
+
+Let:
+- `hs = hidden_size`
+- `nh = num_attention_heads`
+- `nkvh = num_key_value_heads`
+- `dh = hs / nh`
+- `di = intermediate_size`
+- `voc = vocab_size`
+
+Key per-layer tensors:
+
+- `x`: `[1, hs]`
+- `q_lin`: `[1, nh*dh]` -> `q`: `[1, nh, dh]`
+- `k_lin`: `[1, nkvh*dh]` -> `k`: `[1, nkvh, dh]`
+- `v_lin`: `[1, nkvh*dh]` -> `v`: `[1, nkvh, dh]`
+- `k_cache[l]`: `[maxseq, nkvh, dh]`
+- `v_cache[l]`: `[maxseq, nkvh, dh]`
+- `k_all/v_all`: `[i+1, nkvh, dh]`
+- `attn_out`: `[1, nh, dh]`
+- `attn_proj`: `[1, hs]`
+- `gate/up/swiglu_out`: `[1, di]`
+- `mlp_out`: `[1, hs]`
+- `logits`: `[1, voc]`
+
+---
+
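+## 4. Why it is implemented this way (design rationale)
+
+### 4.1 Why "token by token, returning one next_token"
+
+Because a decoder-only model is autoregressive:
+- Step t only predicts token t+1.
+- Python's `generate()` calls Infer in a loop and appends each result to the sequence.
+
+### 4.2 Why a KV cache
+
+Without a cache, every step has to recompute K/V for all previous tokens, so the per-step cost grows linearly with the sequence length and the total K/V work over a whole generation grows quadratically.  
+With a cache:
+- K/V for past tokens are computed once and stored;
+- each new step computes K/V only for the current token and uses them together with the cached history.
+
+### 4.3 Why each layer has two residual connections
+
+This is the standard Transformer block:
+- one residual after the attention sublayer;
+- one residual after the MLP sublayer;
+which keeps training and inference stable.
+
+### 4.4 Why argmax at the end
+
+Your current implementation is the MVP path:
+- simple, deterministic, and easy to compare against reference results;
+- top-k / top-p / temperature sampling can be added later.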
+
+---
+
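+## 5. 90-second interview script (inference layer)
+
+In the backend, I designed `infer_next` as an incremental autoregressive flow. It first uses `cur_pos` to process only the new tokens, which avoids redundant computation. Each new token goes through embedding, and then every layer runs RMSNorm, the QKV linear projections, RoPE, self-attention, the attention output projection, and a residual add, followed by the MLP part: RMSNorm, the gate/up linear layers, SwiGLU, the down projection, and another residual add. Along the way, each layer writes its new `k_rot` and `v` into the KV cache and reads all K/V up to the current step through a slice to compute attention. After the layer stack, a final norm and the output linear layer produce the logits, and argmax selects the next token. Finally, `cur_pos` is updated so that the next call continues incrementally.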
+
+---
+
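+## 6. Common follow-up questions (inference layer)
+
+1) Why does the cache store K/V but not Q?
+- Because Q is only used to query at the current step; past Q vectors are never reused by later steps.
+
+2) Why is RoPE applied to Q/K but not V?
+- Positional relationships are modeled through the QK similarity, while V mainly carries the content being aggregated.
+
+3) Why is the scale `1/sqrt(dh)`?
+- To keep the dot products from growing with the dimension and saturating the softmax.
+
+4) Why is there reset logic for `ntoken < cur_pos`?
+- To avoid wrong answers caused by reusing a stale cache after the user switches context.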
diff --git "a/\344\273\273\345\212\241\344\270\211-\347\255\224\350\276\251\346\200\273\347\250\277\357\274\2103\345\210\206\351\222\237+8\345\210\206\351\222\237\357\274\211.md" "b/\344\273\273\345\212\241\344\270\211-\347\255\224\350\276\251\346\200\273\347\250\277\357\274\2103\345\210\206\351\222\237+8\345\210\206\351\222\237\357\274\211.md"
new file mode 100644
index 000000000..39ce19487
--- /dev/null
+++ "b/\344\273\273\345\212\241\344\270\211-\347\255\224\350\276\251\346\200\273\347\250\277\357\274\2103\345\210\206\351\222\237+8\345\210\206\351\222\237\357\274\211.md"
@@ -0,0 +1,191 @@
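+# Task 3 Defense Script (3-minute + 8-minute versions)
+
+## I. 3-minute condensed version (ready to memorize)
+
+### 1) What I built
+
+For Task 3, I completed a minimal working inference loop for Qwen2, in four parts:
+1. Defined the C interface layer (model creation, destruction, weight access, inference);
+2. Wrote the Python ctypes bindings so that Python can call into C++;
+3. Mapped the safetensors weights onto backend tensors;
+4. Chained the Task 2 operators in C++ into a complete single-step inference, with incremental KV cache optimization.
+
+### 2) How the interface layer is designed
+
+The header defines three core structures:
+- `Meta`: model hyperparameters (layer count, heads, hidden size, RoPE parameters, vocabulary size, and so on);
+- `Weights`: handles to all weights;
+- `WeightKind`: the cross-language integer protocol that assigns an ID to each weight type.
+
+It then exposes four core APIs:
+- `Create`: initializes the model and the cache from `Meta`;
+- `Destroy`: releases resources;
+- `GetWeight(kind, layer)`: lets Python fetch the target weight handle by ID and write data into it;
+- `Infer`: takes a token sequence and returns the next token.
+
+### 3) How the main inference flow runs
+
+In `infer_next`, I implemented the single-step logic of a decoder-only model:
+- embedding -> stacked blocks -> final norm -> vocab linear -> argmax.
+- Each block is: RMSNorm -> QKV linear -> RoPE -> Self-Attention -> residual -> RMSNorm -> MLP(SwiGLU) -> residual.
+
+### 4) How the KV cache works
+
+I maintain `cur_pos` and only process the new token range `[cur_pos, ntoken)`.
+- K/V for each new token are written into the cache;
+- attention reads the K/V slice for positions `0..i`;
+- the next call continues incrementally without recomputing past K/V.
+
+### 5) Where the current version stands
+
+This is a "correct first, optimize later" version:
+- model structure, weight loading, inference, and caching all work end to end;
+- sampling is currently argmax; top-k/top-p/temperature can be added later;
+- it currently runs on the CPU path; parallelism and device optimizations can follow.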
+
+---
+
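+## II. 8-minute detailed version (for in-depth follow-up questions)
+
+## 1. Design goals and scope
+
+My goal was not to chase peak performance right away, but to first deliver a stable, explainable, and verifiable inference system:
+- Scope 1: a stable interface (C ABI + ctypes);
+- Scope 2: weights land in the right place (name -> slot);
+- Scope 3: the operator chain closes the loop (reusing the Task 2 work);
+- Scope 4: inference is incremental (KV cache).
+
+## 2. Interface layer: why it is split this way
+
+### 2.1 Meta
+
+`Meta` passes all structural parameters to the backend in one go:
+- the backend can determine every tensor shape at `Create` time;
+- it avoids long, error-prone positional argument lists;
+- the Python-to-C++ protocol becomes more explicit.
+
+### 2.2 Weights + WeightKind
+
+`Weights` holds the backend weight handles, and `WeightKind` maps weight types to integer IDs.
+As a result, when Python loads the safetensors it does not need to know anything about backend memory; it only has to:
+1) parse the name into `(kind, layer)`;
+2) call `GetWeight` to obtain the handle;
+3) call `tensorLoad` to write the data.
+
+This "integer protocol" is a practical pattern in cross-language engineering.
+
+### 2.3 The four core interfaces
+
+- `Create(meta, device, ...)`: creates the model instance, allocates the weights, and allocates the cache;
+- `Destroy(model)`: the matching release call, which prevents leaks;
+- `GetWeight(model, kind, layer)`: returns the target weight handle;
+- `Infer(model, token_ids, ntoken)`: runs one next-token prediction.
+
+## 3. Python wrapper: from config to weight loading
+
+### 3.1 Construction
+
+`Qwen2.__init__` does three things:
+1. reads `config.json` to obtain `hs/nlayer/nh/nkvh/di/maxseq/voc/eps/theta/eos`;
+2. builds `Meta`;
+3. calls `Create` to obtain the backend model handle.
+
+### 3.2 Weight loading
+
+`_load_weights` iterates over the safetensors:
+- it first extracts each tensor (with both numpy and torch code paths for compatibility);
+- converts everything to float32 before writing;
+- dispatches each tensor to its `kind/layer` by name.
+
+After that, backend inference no longer depends on any Python framework.
+
+### 3.3 Generation
+
+`generate` calls `Infer` in a loop:
+- each iteration passes in the current tokens;
+- the returned `next_token` is appended;
+- the loop stops on eos or when the length limit is reached.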
+
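+## 4. C++ inference core: infer_next
+
+### 4.1 Outer logic
+
+- An empty input returns `end_token` immediately;
+- if `ntoken < cur_pos`, the context has changed, so the cache cursor is reset;
+- only new tokens are processed: `for i in [cur_pos, ntoken)`.
+
+### 4.2 Compute path for one token in one layer
+
+Each new token first goes through `embedding -> x[1,hs]`.
+
+Each layer then runs:
+1. `x_norm = RMSNorm(x)`;
+2. `q_lin/k_lin/v_lin = Linear(x_norm)`;
+3. reshape into the 3D per-head layout;
+4. apply RoPE to `q/k` (at position i);
+5. write `k_rot/v` into position i of this layer's cache;
+6. slice out `k_all/v_all = [0..i]`;
+7. `self_attention(q_rot, k_all, v_all)`;
+8. apply `attn_proj`, then the residual `x = x + attn_proj`;
+9. the second RMSNorm;
+10. MLP: `gate/up` -> `swiglu` -> `down`;
+11. the residual `x = x + mlp_out`.
+
+After the layer loop:
+- final norm;
+- the vocab linear layer produces the logits;
+- argmax yields the next token.
+
+Finally, `cur_pos = ntoken` is updated.
+
+### 4.3 Why the KV cache works
+
+Without a cache: every step recomputes K/V for all past tokens.  
+With a cache: past K/V are computed once, and later steps only compute K/V for the new token.
+
+In your implementation this shows up as:
+- writing the cache (at position i);
+- attention reading the `0..i` slice;
+- the next call resuming from `cur_pos`.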
+
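+## 5. Engineering trade-offs in this version
+
+### Done
+- The full path from interface to inference works end to end;
+- weight mapping is correct;
+- the incremental KV cache mechanism is complete.
+
+### Deliberately left for later optimization
+- sampling uses argmax for now;
+- dtype is unified to F32 for now, for compatibility;
+- the CPU path comes first, with parallelism and device support to follow.
+
+This is a reasonable "correct first, optimize later" order of implementation.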
+
+---
+
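+## III. Common follow-up questions (worth memorizing)
+
+1) Why use an opaque handle (an opaque model pointer)?
+- It hides the internals, keeps the ABI stable, and lets later refactoring proceed without affecting Python.
+
+2) Why does `GetWeight` take an integer kind rather than a string?
+- It is more stable across languages, cheaper, and makes the protocol clearer.
+
+3) Why does the cache store K/V but not Q?
+- Past Q vectors are never reused by later steps, whereas K/V are queried repeatedly.
+
+4) Why is RoPE applied only to Q/K?
+- Relative position shows up in the QK similarity, while V carries the content.
+
+5) Why use `1/sqrt(dh)`?
+- To keep the dot-product scale from growing too large and saturating the softmax.
+
+6) Why use argmax for now?
+- Correctness and reproducibility come first; the sampling strategy can be added as a pluggable improvement.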
+
+---
+
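+## IV. 20-second closing template
+
+For Task 3, I delivered a minimal end-to-end, runnable LLM inference system: the interface protocol, cross-language bindings, weight injection, operator wiring, and incremental KV cache inference all work together. This version prioritizes correctness and engineering stability, and leaves clean interfaces for later sampling strategies and performance optimization.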
\ No newline at end of file
diff --git "a/\344\273\273\345\212\241\344\272\214-Transformer\346\240\270\345\277\203\347\256\227\345\255\220\347\273\237\344\270\200\347\254\224\350\256\260.md" "b/\344\273\273\345\212\241\344\272\214-Transformer\346\240\270\345\277\203\347\256\227\345\255\220\347\273\237\344\270\200\347\254\224\350\256\260.md"
new file mode 100644
index 000000000..cc47cffbc
--- /dev/null
+++ "b/\344\273\273\345\212\241\344\272\214-Transformer\346\240\270\345\277\203\347\256\227\345\255\220\347\273\237\344\270\200\347\254\224\350\256\260.md"
@@ -0,0 +1,273 @@
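+# Task 2 Unified Notes (full version): RMSNorm, RoPE, Self-Attention, SwiGLU
+
+## 0. Goal of these notes and how to read them
+
+These notes connect the four core modules you implemented in Task 2 into one logical chain:
+
+**RMSNorm -> linear projections (Q/K/V) -> RoPE(Q,K) -> Self-Attention -> SwiGLU inside the FFN**.
+
+A suggested way to read each section is in three layers:
+1. **Start with the purpose**: what problem each operator solves;
+2. **Then the math**: the core formula and the intuition behind it;
+3. **Finally the engineering**: input constraints, numerical stability, complexity, and common pitfalls in your code.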
+
+---
+
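+## 1. Where each piece sits in a Transformer block (overview)
+
+Following the Pre-Norm layout that your current implementation matches, a block can be written as:
+
+1. Input hidden state `x`
+2. `x1 = RMSNorm(x)`
+3. `q,k,v = Linear(x1)`
+4. `q,k = RoPE(q,k,pos_ids)`
+5. `attn = SelfAttention(q,k,v,scale)` (with causal mask, GQA supported)
+6. Residual connection
+7. `x2 = RMSNorm(...)`
+8. Inside the FFN, `ffn_mid = SwiGLU(gate,up)`, followed by a linear projection back to `d_model`
+9. Residual connection, producing the output
+
+Division of labor among the four modules:
+- **RMSNorm**: stabilizes the vector scale and reduces numerical drift in training and inference;
+- **RoPE**: injects position information into the direction of Q/K;
+- **Self-Attention**: retrieves and aggregates context by relevance;
+- **SwiGLU**: performs dynamically gated, nonlinear feature selection inside the FFN.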
+
+---
+
+## 2. RMSNorm: stabilize the input scale first
+
+### 2.1 Mathematical definition (plain text)
+
+For each row vector `x_i in R^d`:
+- `rms(x_i) = sqrt((1/d) * sum_{j=1..d}(x_{i,j}^2) + eps)`
+- `y_{i,j} = w_j * x_{i,j} / rms(x_i)`
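+
+A minimal pure-Python sketch of the two formulas above, operating on a list-of-lists matrix. The function name `rms_norm` and the default `eps=1e-6` are assumptions made for illustration; this is not the C++ kernel.
+
+```python
+import math
+
+
+def rms_norm(rows, weight, eps=1e-6):
+    """Apply RMSNorm to each row of a list-of-lists matrix."""
+    d = len(weight)
+    out = []
+    for row in rows:
+        assert len(row) == d, "weight length must match the last dimension"
+        # Python floats are double precision, so the accumulation here is
+        # at least as precise as accumulating in float.
+        mean_sq = sum(v * v for v in row) / d
+        rms = math.sqrt(mean_sq + eps)
+        out.append([w * v / rms for w, v in zip(weight, row)])
+    return out
+```
+
+With `weight` set to all ones and `eps=0`, every output row has a mean square of 1, which is a convenient sanity check.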
+
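+### 2.2 Why it works
+
+- It does not subtract the mean; it only controls the scale;
+- for deep networks, a stable scale usually matters more than strong centering;
+- it computes fewer statistics than LayerNorm, which makes it lighter in practice.
+
+### 2.3 Where this lands in your implementation
+
+File: `src/ops/rms_norm/op.cpp`
+
+- Checks: same device, same dtype, contiguous;
+- Shape: `weight` must be 1D with length equal to the last dimension;
+- Path: CPU-only;
+- dtype: dispatch over `F32/F16/BF16`;
+- Numerics: compute internally in `float`, then cast back to the target type.
+
+### 2.4 Implementation steps (engineering view)
+
+1. Compute `norm_dim = last_dim` and `outer_size = numel / norm_dim`;
+2. for each row, accumulate the sum of squares, divide by `norm_dim`, add `eps`, and take the square root to get `rms`;
+3. compute `x / rms * weight` for each element;
+4. write the result to the output.
+
+### 2.5 Common pitfalls
+
+- an `eps` that is too small becomes unstable on very small inputs;
+- forgetting to match `weight` against the last dimension;
+- accumulating directly in half precision introduces noticeable error.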
+
+---
+
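+## 3. RoPE: encode "position" into direction
+
+### 3.1 Mathematical definition (plain text)
+
+For position `pos_id` and dimension-pair index `j`:
+- `phi_j = pos_id / (theta^(2j/d))`
+
+The vector is paired across its first and second halves, `[x_j, x_{j+d/2}]`, and rotated as:
+- `x'_j = x_j * cos(phi_j) - x_{j+d/2} * sin(phi_j)`
+- `x'_{j+d/2} = x_{j+d/2} * cos(phi_j) + x_j * sin(phi_j)`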
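+
+A minimal sketch of this rotation for a single head vector, in pure Python. The function name `rope` and the default `theta=10000.0` are assumptions for illustration; the real value of `theta` comes from the model config.
+
+```python
+import math
+
+
+def rope(x, pos_id, theta=10000.0):
+    """Rotate one head vector x (even length d) for position pos_id."""
+    d = len(x)
+    assert d % 2 == 0, "RoPE needs an even head dimension"
+    half = d // 2
+    out = [0.0] * d
+    for j in range(half):
+        # phi_j = pos_id / theta^(2j/d)
+        phi = pos_id / (theta ** (2.0 * j / d))
+        c, s = math.cos(phi), math.sin(phi)
+        a, b = x[j], x[j + half]  # the pair [x_j, x_{j+d/2}]
+        out[j] = a * c - b * s
+        out[j + half] = b * c + a * s
+    return out
+```
+
+Two properties are easy to check with this sketch: the rotation leaves the vector norm unchanged, and the dot product of a rotated query and key depends only on the difference of their positions.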
+
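+### 3.2 Why it works
+
+- Rotation preserves the norm and mainly changes the direction;
+- multiple frequencies (different `j`) encode positional relationships at multiple scales;
+- after rotation, the Q·K inner product depends on the relative offset between positions, which helps generalization to long contexts.
+
+### 3.3 Where this lands in your implementation
+
+File: `src/ops/rope/op.cpp`
+
+- Input is 3D: `[seqlen, nhead(or nkvhead), d]`;
+- `pos_ids`: 1D and `int64`;
+- Constraint: `d` must be even;
+- Path: CPU-only;
+- dtype: dispatch over `F32/F16/BF16`.
+
+### 3.4 Triple-loop structure (matches your code)
+
+1. Loop over `seq` (reading `pos_id`);
+2. loop over `head`;
+3. loop over `j in [0, d/2)` and rotate each pair.
+
+### 3.5 Common pitfalls
+
+- an odd `d` breaks the pairing logic outright;
+- a wrong `pos_ids` type (not int64) causes positions to be read incorrectly;
+- the most common bug is a wrong base offset when flattening the indices.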
+
+---
+
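+## 4. Self-Attention: the core of context retrieval and aggregation
+
+### 4.1 Core formulas (plain text)
+
+- `A = QK^T * scale`
+- `P = causal_softmax(A)`
+- `Y = P * V`
+
+Here `scale` is usually `1/sqrt(d)`, which keeps the dot products from growing with the dimension.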
+
+### 4.2 Shapes in your implementation
+
+File: `src/ops/self_attention/op.cpp`
+
+- `q: [seqlen, nhead, d]`
+- `k: [total_len, nkvhead, d]`
+- `v: [total_len, nkvhead, dv]`
+- `out: [seqlen, nhead, dv]`
+
+Core constraints:
+- `seqlen <= total_len` (supports the KV cache case);
+- `k.shape[2] == d`;
+- `v.shape[0..1]` matches `k.shape[0..1]`;
+- `nhead % nkvhead == 0` (GQA).
+
+### 4.3 GQA mapping logic
+
+- `group_size = nhead / nkvhead`
+- `kv_head = q_head / group_size` (integer division)
+
+Intuition: several Q heads share a smaller number of KV heads, which significantly reduces the KV cache footprint while preserving as much quality as possible.
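+
+The mapping in 4.3 is small enough to sketch directly. The helper name `kv_head_for` is hypothetical; both divisions are integer divisions.
+
+```python
+def kv_head_for(q_head, nhead, nkvhead):
+    """Return the KV head index that query head q_head reads from."""
+    assert nhead % nkvhead == 0, "nhead must be a multiple of nkvhead"
+    group_size = nhead // nkvhead
+    return q_head // group_size
+```
+
+With 8 query heads and 2 KV heads, for example, query heads 0 to 3 read KV head 0 and query heads 4 to 7 read KV head 1.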
+
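+### 4.4 Computation steps (match your implementation)
+
+1. For each `(s,h)`, compute `scores[t] = dot(Q[s,h], K[t,kv_head]) * scale`;
+2. apply the mask according to the causal visibility range (future positions are set to `-inf`);
+3. softmax (stabilized by subtracting the maximum);
+4. use the normalized `scores` as weights to sum `V[:,kv_head,:]`, producing `out[s,h,:]`.
+
+### 4.5 Numerical stability details
+
+- subtract `max_score` before calling `exp`;
+- the `-inf` branch is set to 0 directly;
+- when `sum_exp == 0`, fall back to 1 to avoid NaN;
+- all accumulation is done in `float`.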
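+
+The stability points above can be sketched for a single query row as follows. The function name `causal_softmax_row` is hypothetical. The visibility rule `t <= s + (total_len - seqlen)` is taken from the reference mask in `test/ops/self_attention.py` (`tril(diagonal=S-L)`), not from the C++ kernel itself.
+
+```python
+import math
+
+
+def causal_softmax_row(scores, s, seqlen):
+    """Softmax over one query row; key t is visible iff t <= s + total_len - seqlen."""
+    total_len = len(scores)
+    last_visible = s + (total_len - seqlen)
+    masked = [v if t <= last_visible else float("-inf")
+              for t, v in enumerate(scores)]
+    max_score = max(masked)
+    # Masked entries contribute exactly 0 and never reach exp().
+    exps = [0.0 if v == float("-inf") else math.exp(v - max_score)
+            for v in masked]
+    sum_exp = sum(exps)
+    if sum_exp == 0.0:
+        sum_exp = 1.0  # fallback so the division below cannot produce NaN
+    return [e / sum_exp for e in exps]
+```
+
+Because the maximum is subtracted first, every argument passed to `exp` is at most 0, so large scores such as 1000 cannot overflow.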
+
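+### 4.6 Complexity and performance notes
+
+- The naive implementation is roughly `O(seqlen * nhead * total_len * d)`;
+- this version favors correctness and clarity;
+- possible future optimizations: vectorization (SIMD), parallelism (OpenMP), blocked softmax, and better locality for KV cache access.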
+
+---
+
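+## 5. SwiGLU: dynamically gated nonlinearity in the FFN
+
+### 5.1 Mathematical definition (plain text)
+
+- `out_i = up_i * gate_i / (1 + exp(-gate_i))`
+- Equivalent form: `out_i = up_i * SiLU(gate_i)`
+
+### 5.2 Intuition for the mechanism
+
+- the `up` branch supplies the candidate values;
+- the `gate` branch supplies the pass-through strength;
+- the two are multiplied element-wise, which dynamically filters the features.
+
+### 5.3 Where this lands in your implementation
+
+File: `src/ops/swiglu/op.cpp`
+
+- Checks: same device, same dtype, same shape, contiguous;
+- Path: CPU-only;
+- dtype: `F32/F16/BF16`;
+- Numerical stability thresholds:
+  - `gate >= 50`: `SiLU(gate)` is approximated by `gate`;
+  - `gate <= -50`: `SiLU(gate)` is approximated by `0`;
+  - in between, `exp` is computed normally.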
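+
+A minimal pure-Python sketch of the formula in 5.1 with the three threshold branches listed above. The function names `silu` and `swiglu` are chosen for illustration; the cutoffs of 50 and -50 are the ones stated in these notes.
+
+```python
+import math
+
+
+def silu(gate):
+    """SiLU(gate) = gate / (1 + exp(-gate)), with the +/-50 cutoffs from the notes."""
+    if gate >= 50.0:
+        return gate   # 1 / (1 + exp(-gate)) is within about 2e-22 of 1
+    if gate <= -50.0:
+        return 0.0    # the exact value has magnitude below about 1e-20
+    return gate / (1.0 + math.exp(-gate))
+
+
+def swiglu(gate, up):
+    """Element-wise up_i * SiLU(gate_i) over two equal-length lists."""
+    assert len(gate) == len(up), "gate and up must have the same shape"
+    return [u * silu(g) for g, u in zip(gate, up)]
+```
+
+In Python specifically, `math.exp` raises `OverflowError` for arguments above roughly 709, so without the lower cutoff a call such as `silu(-1000.0)` would fail rather than return a value near zero.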
+
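+### 5.4 Why the threshold branches are needed
+
+- computing `exp(-gate)` directly can overflow or underflow at extreme inputs;
+- the threshold approximation preserves the shape of the function while removing unnecessary numerical risk.
+
+### 5.5 Common pitfalls
+
+- forgetting the shape consistency check;
+- extreme inputs overflowing `exp`;
+- computing directly in half precision, which amplifies error.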
+
+---
+
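+## 6. The four modules as a closed loop: why they must be read together
+
+Within a block, a token goes through:
+
+1. **RMSNorm**: stabilize the scale first, so later dot products do not blow up;
+2. **RoPE**: write position into the direction of Q/K, creating the relative-position signal;
+3. **Self-Attention**: retrieve and aggregate information from the preceding context;
+4. **SwiGLU**: dynamically filter and enhance the aggregated features inside the FFN.
+
+They complement rather than replace one another:
+- RMSNorm addresses stability;
+- RoPE addresses position modeling;
+- Attention addresses information routing;
+- SwiGLU addresses expressive power.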
+
+---
+
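+## 7. Checklist against your Task 2 code (usable directly in the defense)
+
+### 7.1 Consistent engineering conventions
+
+- Uniform pre-checks: `device / dtype / contiguous / shape`;
+- uniform dtype dispatch: `F32/F16/BF16`;
+- uniform numerical strategy: critical accumulations are done in `float`;
+- uniform device scope: the current implementation focuses on the CPU.
+
+### 7.2 Per-module highlights
+
+- RMSNorm: normalizes over the last dimension, with strict `weight` shape matching;
+- RoPE: even-dimension constraint plus rotation of first-half/second-half pairs;
+- Self-Attention: causal mask + stable softmax + GQA mapping;
+- SwiGLU: gated activation + threshold approximation at extreme values.
+
+### 7.3 Where your current version stands
+
+This is a "**correct first, optimize later**" implementation:
+- Strengths: clear logic, verifiable, easy to compare against PyTorch;
+- Next steps: operator-level performance optimization can continue without changing the interfaces.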
+
+---
+
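+## 8. Common follow-up questions and short answer templates
+
+1. **Why doesn't RMSNorm subtract the mean?**
+  - The goal is to stabilize scale; in practice that works well enough and costs less compute.
+
+2. **Why does RoPE require an even dimension?**
+  - Rotation is a 2D pairwise operation, so the `d` elements must split into `d/2` pairs.
+
+3. **Why multiply attention by `1/sqrt(d)`?**
+  - It curbs the growth of dot-product magnitude and avoids the gradient problems caused by a saturated softmax.
+
+4. **Why does GQA work?**
+  - Sharing KV heads reduces memory and bandwidth pressure, usually trading a small accuracy cost for a larger inference gain.
+
+5. **Why does SwiGLU use threshold branches?**
+  - They sidestep numerical instability of `exp` at extreme inputs and improve robustness.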
+
+---
+
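+## 9. 30-second interview summary (memorize as-is)
+
+In Task 2 I got the core Transformer operator chain working end to end:
+RMSNorm first stabilizes the activation scale, RoPE is then applied to Q/K to inject position information, self-attention with a causal mask and GQA aggregates context, and finally SwiGLU provides the dynamic gated nonlinearity in the FFN. On the engineering side I unified the shape/dtype/device/contiguous checks, the CPU path supports F32/F16/BF16, and float accumulation plus a stable softmax keep the numerics stable.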
diff --git "a/\344\273\273\345\212\241\344\272\214-\351\235\242\350\257\225\351\200\237\350\256\260\347\211\210.md" "b/\344\273\273\345\212\241\344\272\214-\351\235\242\350\257\225\351\200\237\350\256\260\347\211\210.md"
new file mode 100644
index 000000000..e1c1ecabf
--- /dev/null
+++ "b/\344\273\273\345\212\241\344\272\214-\351\235\242\350\257\225\351\200\237\350\256\260\347\211\210.md"
@@ -0,0 +1,91 @@
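+# Task 2 interview cheat sheet (deliverable in 3 minutes)
+
+## 1) 30-second overview (opening)
+
+In Task 2 I completed a CPU implementation of the core Transformer operators. The main chain is:
+**RMSNorm → RoPE(Q/K) → Self-Attention (with causal mask + GQA) → SwiGLU**.
+On the engineering side, every operator runs the same **shape / dtype / device / contiguous** checks
+and supports **F32/F16/BF16**; the core computation accumulates in `float` for numerical stability.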
+
+---
+
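+## 2) 3-minute script (follow this order)
+
+### Part 1: What problem I solved (about 30 seconds)
+
+Task 2 asks for several operators that pass the tests. I focused on getting the Transformer critical path working:
+- RMSNorm to stabilize the activation scale of each layer;
+- RoPE to inject relative position information;
+- Self-Attention to retrieve and aggregate context;
+- SwiGLU for dynamic gated activation in the FFN.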
+
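+### Part 2: Core math and implementation (about 90 seconds)
+
+1. **RMSNorm**
+   - Formula: $y_{i,j}=w_j\cdot x_{i,j}/\sqrt{\frac{1}{d}\sum_{k=1}^{d} x_{i,k}^2+\epsilon}$
+   - Implementation: normalize over the last dimension; `weight` is 1D; compute internally in `float`, then cast back to the target dtype.
+
+2. **RoPE**
+   - Formula: $\phi_j=pos/\theta^{2j/d}$; apply a 2D rotation to each pair $[x_j,x_{j+d/2}]$.
+   - Implementation: input laid out as `[seqlen, nhead, d]`; `d` must be even; `pos_ids` is `int64`; each element of the first half is paired with its counterpart in the second half for rotation.
+
+3. **Self-Attention**
+   - Formula: $Y=\text{causal\_softmax}(QK^T\cdot \text{scale})V$.
+   - Implementation: compute scores per `token/head`, apply the causal mask, run a stable softmax, then take the weighted sum of `V`.
+   - GQA: `group_size=nhead/nkvhead` maps several Q heads onto the same KV head.
+
+4. **SwiGLU**
+   - Formula: $out_i=up_i\cdot gate_i/(1+e^{-gate_i})$.
+   - Implementation: element-wise gating; threshold branches at extreme values (e.g. `gate>=50`, `gate<=-50`) avoid numerical problems in `exp`.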
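+
+The RoPE pairing in item 2 can be made concrete with a short sketch for a single head vector. It assumes the common rotation convention `x_j' = x_j cos φ_j − x_{j+d/2} sin φ_j`, `x_{j+d/2}' = x_{j+d/2} cos φ_j + x_j sin φ_j`, which the notes do not spell out; the name `rope_inplace` is hypothetical.
+
+```cpp
+#include <cassert>
+#include <cmath>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+// Hypothetical sketch: rotate one head vector x (length d) in place for position pos.
+void rope_inplace(std::vector<float>& x, std::int64_t pos, float theta) {
+    const std::size_t d = x.size();
+    assert(d % 2 == 0);  // rotation works on pairs, so d must be even
+    const std::size_t half = d / 2;
+    for (std::size_t j = 0; j < half; ++j) {
+        // phi_j = pos / theta^(2j/d)
+        const float exponent = 2.0f * static_cast<float>(j) / static_cast<float>(d);
+        const float phi = static_cast<float>(pos) / std::pow(theta, exponent);
+        const float c = std::cos(phi);
+        const float s = std::sin(phi);
+        const float a = x[j];         // element from the first half
+        const float b = x[j + half];  // its partner in the second half
+        x[j] = a * c - b * s;
+        x[j + half] = b * c + a * s;
+    }
+}
+```
+
+Because each pair undergoes a pure rotation, the norm of every `(x_j, x_{j+d/2})` pair is unchanged, and `pos = 0` leaves the vector as it was.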
+
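+### Part 3: Engineering quality and results (about 60 seconds)
+
+- Uniform checks: device, dtype, shape, contiguous;
+- Uniform dtype dispatch: F32/F16/BF16;
+- Uniform numerical policy: key accumulations use `float`;
+- Uniform edge-case handling: the softmax after masking is guarded against NaN;
+- The result meets the Task 2 operator goals and lays the groundwork for the Task 3 inference pipeline.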
+
+---
+
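+## 3) Frequent follow-up questions (memorize the answers)
+
+### Q1: Why does RMSNorm work without subtracting the mean?
+A: RMSNorm mainly controls the scale of the vector. Skipping centering saves compute and has proven stable in large models; together with residual connections and the following linear layers, expressiveness is still sufficient.
+
+### Q2: Why does RoPE suit long-context extrapolation?
+A: RoPE rotates directions at a set of frequencies, so what it encodes is a relative-position signal. Compared with an absolute position table, the same relative relationship is easier to keep consistent on longer sequences.
+
+### Q3: Why does Self-Attention multiply by $1/\sqrt{d}$?
+A: It keeps dot products from growing too large as the dimension increases, avoids premature softmax saturation, and keeps the gradients and the probability distribution more stable.
+
+### Q4: How do you handle numerical stability?
+A: Subtract `max_score` before softmax; masked `-inf` entries are set to zero instead of going through `exp`; `sum_exp==0` is guarded; half-precision inputs are accumulated in `float`.
+
+### Q5: Why convert to float when F16/BF16 are supported?
+A: F16 has a narrow dynamic range and BF16 has few mantissa bits, so accumulating directly in either builds up error; accumulating in float markedly reduces rounding error and overflow risk.
+
+### Q6: How is GQA implemented?
+A: Require `nhead % nkvhead == 0`, then use `kv_head = q_head / group_size` so that each group of Q heads shares one KV head.
+
+### Q7: What is the complexity of this implementation?
+A: Naive attention is dominated by $O(seqlen\cdot nhead\cdot total\_len\cdot d)$. Correctness and readability come first; parallelism, SIMD, and blocking come later.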
+
+---
+
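+## 4) One-sentence highlight (for a résumé or self-introduction)
+
+I took the key Transformer operator chain from math formulas to a working C++ implementation, covering multiple dtypes, complete boundary checks, and numerical-stability handling, ready to support model inference downstream.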
+
+---
+
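+## 5) On-the-spot answer template (10 seconds to organize)
+
+You can answer with this fixed pattern:
+
+1. **Goal first**: I implemented the core Transformer operator chain;
+2. **Then method**: built up step by step as RMSNorm→RoPE→Attention→SwiGLU;
+3. **Then engineering**: uniform checks + multiple dtypes + numerical stability;
+4. **Results last**: meets the task requirements and is reused in the inference stage.
+
+This keeps the answer well structured and on topic.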
\ No newline at end of file
diff --git "a/\345\244\215\350\257\225-\345\244\247\346\250\241\345\236\213\344\270\252\344\272\272\347\254\224\350\256\260.md" "b/\345\244\215\350\257\225-\345\244\247\346\250\241\345\236\213\344\270\252\344\272\272\347\254\224\350\256\260.md"
new file mode 100644
index 000000000..f62b36f9b
--- /dev/null
+++ "b/\345\244\215\350\257\225-\345\244\247\346\250\241\345\236\213\344\270\252\344\272\272\347\254\224\350\256\260.md"
@@ -0,0 +1,182 @@
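+# Graduate admission interview: personal LLM notes (condensed)
+
+> Goal: be able to explain each topic clearly out loud, expand under follow-up questions, and tie it to engineering practice.
+
+---
+
+## 0. Usage rules (just these 3)
+
+- Maintain only this one master note.
+- Review by topic; do not track progress by day count.
+- Cover every topic in a fixed order: **definition → formula → procedure → pros and cons → engineering practice**.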
+
+---
+
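+## 1. LMCC full roadmap (overview)
+
+```text
+Math and learning theory
+  ↓
+Data and tokenizer
+  ↓
+Transformer architecture (Embedding / RoPE / Attention / Norm / FFN)
+  ↓
+Pretraining (Causal LM)
+  ↓
+Instruction tuning and alignment (SFT / RLHF / DPO)
+  ↓
+Parameter-efficient fine-tuning (LoRA / QLoRA)
+  ↓
+Inference optimization and deployment (KV Cache / quantization / concurrency)
+  ↓
+RAG / Agent / tool calling
+  ↓
+Evaluation, safety, ethics
+```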
+
+---
+
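+## 2. Topic progress board
+
+| Topic | Name | Status | Notes |
+|---|---|---|---|
+| A | Math and deep learning fundamentals | Not started | Cross-entropy, backpropagation |
+| B | Tokenizer and data engineering | Not started | BPE/SentencePiece, data quality |
+| C | Transformer architecture core | Done (current stage) | Ready to present orally |
+| D | Pretraining objective and training optimization | In progress (next) | Causal LM, cross-entropy, stable training |
+| E | SFT/RLHF/DPO alignment pipeline | Not started | Division of roles and trade-offs |
+| F | LoRA/QLoRA fine-tuning practice | Not started | Low-rank updates and cost |
+| G | Inference optimization and deployment metrics | Not started | TTFT/throughput/latency |
+| H | RAG/Agent system deployment | Not started | Retrieve-generate-tool loop |
+| I | Evaluation, safety, and ethics | Not started | Hallucination and safeguards |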
+
+---
+
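+## 3. Topic C core interview card (final)
+
+### 3.1 One-sentence definition
+
+Self-Attention is a differentiable retrieval mechanism from the current token over the preceding tokens.
+
+### 3.2 Core formula
+
+$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
+
+- $Q$: what I am looking for (query)
+- $K$: how I can be matched (key)
+- $V$: what content I can provide (value)
+
+### 3.3 Five-step procedure (must know by heart)
+
+1. Token embedding (produces the hidden vectors)
+2. Linear projections produce $Q,K,V$
+3. Apply RoPE (position rotation) to $Q,K$
+4. Compute scores, add the causal mask, then softmax to get weights
+5. Take the weighted sum of $V$ with those weights, then apply the output projection
+
+### 3.4 Frequent mistakes (clarified)
+
+- Scores are the inner product of $Q$ and $K$, not of $Q$ and $V$.
+- Weights are per token position, not per vector dimension.
+- The attention softmax yields a weight distribution over preceding positions; only the output-layer softmax yields a probability distribution over the vocabulary.
+
+### 3.5 RoPE (interview version)
+
+- RoPE acts on $Q,K$, because what it affects is the score term $QK^T$.
+- Key mechanism phrase: **relative phase difference**.
+- Conclusion: RoPE expresses relative position more naturally and is friendlier to long contexts.
+
+### 3.6 Causal Mask (interview version)
+
+- Position $t$ can only see positions $\le t$.
+- In practice, the logits of future positions are set to a very negative value (approximating $-\infty$), so their softmax weights are approximately 0.
+
+### 3.7 KV Cache (interview version)
+
+- At inference time, reuse the historical $K,V$; each step only computes $k_t,v_t$ for the current token.
+- Benefit: far less repeated computation, hence faster decoding.
+- Cost: GPU memory use grows linearly with context length.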
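+
+As a rough illustration of the reuse pattern, the following sketch keeps an append-only cache for one layer. The struct name `KVCache`, its fields, and the `[len, nkvhead, d]` storage layout are assumptions made for this example only.
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Hypothetical append-only KV cache for one layer.
+// k is stored as [len, nkvhead, d], v as [len, nkvhead, dv].
+struct KVCache {
+    std::size_t nkvhead, d, dv;
+    std::vector<float> k, v;
+
+    // Number of cached token positions.
+    std::size_t len() const { return k.size() / (nkvhead * d); }
+
+    // Append k_t ([nkvhead, d]) and v_t ([nkvhead, dv]) for the current token only;
+    // earlier positions are reused as-is and never recomputed.
+    void append(const std::vector<float>& k_t, const std::vector<float>& v_t) {
+        assert(k_t.size() == nkvhead * d && v_t.size() == nkvhead * dv);
+        k.insert(k.end(), k_t.begin(), k_t.end());
+        v.insert(v.end(), v_t.begin(), v_t.end());
+    }
+};
+```
+
+The storage grows by `nkvhead * (d + dv)` floats per token per layer, which is the linear memory cost noted above.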
+
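+### 3.8 RMSNorm vs LayerNorm (interview version)
+
+- LayerNorm: subtract the mean, then normalize by the standard deviation.
+- RMSNorm: no mean subtraction, only RMS scaling.
+- In practice: RMSNorm is cheaper to compute and is common in the Pre-Norm layout of large models.
+- Typical placement: before the attention sublayer and before the FFN sublayer.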
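+
+The difference is easiest to see side by side on a single row. This is a minimal sketch with hypothetical function names; the LayerNorm bias term is left out to keep the comparison short.
+
+```cpp
+#include <cassert>
+#include <cmath>
+#include <cstddef>
+#include <vector>
+
+// RMSNorm over one row: y_j = w_j * x_j / sqrt(mean(x^2) + eps).
+std::vector<float> rms_norm(const std::vector<float>& x, const std::vector<float>& w, float eps) {
+    assert(x.size() == w.size() && !x.empty());
+    float sum_sq = 0.0f;
+    for (float xv : x) sum_sq += xv * xv;
+    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
+    std::vector<float> y(x.size());
+    for (std::size_t j = 0; j < x.size(); ++j) y[j] = w[j] * x[j] * inv_rms;
+    return y;
+}
+
+// LayerNorm over one row (bias omitted): y_j = w_j * (x_j - mean) / sqrt(var + eps).
+std::vector<float> layer_norm(const std::vector<float>& x, const std::vector<float>& w, float eps) {
+    assert(x.size() == w.size() && !x.empty());
+    const float n = static_cast<float>(x.size());
+    float mean = 0.0f;
+    for (float xv : x) mean += xv;
+    mean /= n;
+    float var = 0.0f;
+    for (float xv : x) var += (xv - mean) * (xv - mean);
+    var /= n;
+    const float inv_std = 1.0f / std::sqrt(var + eps);
+    std::vector<float> y(x.size());
+    for (std::size_t j = 0; j < x.size(); ++j) y[j] = w[j] * (x[j] - mean) * inv_std;
+    return y;
+}
+```
+
+RMSNorm needs one pass to collect statistics (the sum of squares), while LayerNorm as written needs two (mean, then variance), which is where the lighter compute comes from.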
+
+---
+
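+## 4. Topic C status summary (conclusion)
+
+### Mastered
+
+- Writing the attention formula and the meaning of each variable
+- Role boundaries of Q/K/V
+- What RoPE acts on and how it encodes relative position
+- How the causal mask works
+- How the KV cache is reused
+- Oral walkthrough of the full attention procedure
+- RMSNorm vs LayerNorm (differences and placement)
+
+### Deferred (to be filled in under Topic A)
+
+- Rigorous backpropagation derivation of why residuals stabilize gradients (remember the conclusion now, add the math later)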
+
+---
+
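+## 5. Current weak-point log (open items only)
+
+| Date | Topic | Weak point | Status | Next action |
+|---|---|---|---|---|
+| 2026-03-02 | Topic A/C | Residual gradient derivation (math details) | Deferred | Revisit the chain rule under Topic A |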
+
+---
+
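+## 6. Next: Topic D (pretraining objective and training optimization)
+
+### Goals for this round
+
+- 60-second talk: what a Causal LM is and why the next-token objective is used
+- 45-second talk: why cross-entropy suits language modeling better than MSE
+- 45-second talk: the three staples of training stability (Norm, learning-rate schedule, gradient clipping)
+
+### Kickoff question (answer it now)
+
+- Why is the language-model training objective usually "predict the next token"?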
+
+---
+
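+## 7. Universal interview answer template
+
+1) Definition (one sentence)  
+2) Key formula (one)  
+3) Procedure (3 to 5 steps)  
+4) Pros and cons (2 each)  
+5) Engineering practice (1 to 2 optimization techniques)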
+
+---
+
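+## 8. Per-topic review template
+
+### Topic: __(A~I)__
+### Date: ____
+
+#### Goals for this session
+- 
+
+#### Input for this session (what I studied)
+- 
+
+#### Output for this session (what I can explain)
+- 60-second talk:
+- 120-second talk:
+
+#### Follow-up question performance
+- Answered:
+- Stuck:
+
+#### Weak-point updates
+- 
+
+#### Next topic or subtopic
+- 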
diff --git "a/\350\257\255\346\263\225\351\224\231\350\257\257\350\246\201\347\202\271\346\200\273\347\273\223.md" "b/\350\257\255\346\263\225\351\224\231\350\257\257\350\246\201\347\202\271\346\200\273\347\273\223.md"
new file mode 100644
index 000000000..d730cc834
--- /dev/null
+++ "b/\350\257\255\346\263\225\351\224\231\350\257\257\350\246\201\347\202\271\346\200\273\347\273\223.md"
@@ -0,0 +1,59 @@
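+# Syntax error summary (syntax only, no problem-solving approach)
+
+## 1) Wrong separators in the `for` statement
+- Wrong: `for (j = 0, j < n - 1, j++)`
+- Correct: `for (j = 0; j < n - 1; j++)`
+- Note: the three parts of `for (init; condition; step)` must be separated by semicolons `;`.
+
+## 2) Function defined in the wrong place (standard C)
+- You defined `int temp(...)` inside the function `f1`.
+- Standard C does not allow a function definition nested inside another function.
+- Move `temp` to file scope (at the same level as `f1` and `main`) and define it there.
+
+## 3) Malformed function declaration/definition syntax
+- Original: `int temp (int chushi, int pianyi){ ... }` placed inside a function (see item 2).
+- Original `main`: `int main { ... }` is missing the parameter parentheses.
+- Correct forms:
+  - `int temp(int chushi, int pianyi) { ... }`
+  - `int main(void) { ... }` or `int main() { ... }`
+
+## 4) Variable used before declaration
+- `j` in `for (j = 0; ... )` is not declared.
+- Correct: `for (int j = 0; ... )` (C99 and later), or declare `int j;` at the start of the function first.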
+
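+## 5) Variable-length array: when it applies and portability
+- You wrote: `int arr[n * n];`
+- If `n` is read at run time, this is a VLA (variable-length array), which is mandatory in C99 but only an optional feature from C11 on.
+- Some toolchains reject it with a syntax or standards-conformance error; MSVC, for example, does not implement VLAs.
+- A more portable option is dynamic allocation: `int *arr = malloc(sizeof(int) * n * n);`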
+
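+## 6) Missing input code and `main` skeleton risk compile errors
+- Your comment says "read from standard input", but the code has no matching `scanf` or header.
+- At minimum you need:
+  - `#include <stdio.h>`
+  - a valid `main` function body
+
+## 7) Every return path must be complete
+- If the loop in `temp` finishes without ever taking the `else` branch, the function must still reach a `return`.
+- Otherwise you get a "control reaches end of non-void function" style error/warning.
+
+## 8) Comments that were never turned into code (syntax checklist)
+- You wrote "read from standard input", but the code lacks the actual input statements:
+  - Read `n`: `scanf("%d", &n);`
+  - Read `n*n` integers into a 1D array: `scanf("%d", &arr[i]);` inside a loop
+- You wrote "next come n lines, each with n positive integers", so the code needs a nested loop or a single `n*n` loop to read them.
+- You wrote "output temp1 temp2 temp3 temp4", so the code needs an actual output statement:
+  - `printf("%d %d %d %d", temp1, temp2, temp3, temp4);`
+- You wrote "newline", so the code should state it explicitly:
+  - `printf("\n");` or fold the newline into the previous call: `printf("%d %d %d %d\n", ...);`
+- As the program entry point, `main` needs the full flow in addition to a correct function header:
+  - read the data
+  - call `f1(i)` in a loop
+  - finish with `return 0;`
+- If the array is allocated with `malloc`, you also need, beyond the comments:
+  - Header: `#include <stdlib.h>`
+  - Free the memory: `free(arr);`
+
+---
+
+The above covers only errors in syntax and language rules; it does not address the algorithm or the problem logic.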
\ No newline at end of file