Commits
38 commits
2f844c7
fix(build): add missing macOS configuration to xmake.lua
MeurSol Jan 15, 2026
b20f581
feat1.1 load
MeurSol Jan 16, 2026
75540e6
feat1.2 isContiguous
MeurSol Jan 16, 2026
033370f
feat1.3 view
MeurSol Jan 16, 2026
af08a87
feat1.4 permute
MeurSol Jan 16, 2026
6c879d8
feat1.5 slice
MeurSol Jan 16, 2026
6ba7df6
fix llaisys python init and pass the test1
MeurSol Jan 16, 2026
adfb14e
feat2.1 argmax
MeurSol Jan 17, 2026
717ffeb
feat2.2 embedding
MeurSol Jan 17, 2026
332f197
feat2.3 linear
MeurSol Jan 18, 2026
259ef7a
Improve the implementation of linear_cpu.cpp
MeurSol Jan 18, 2026
f3ef7f5
feat2.4 RMS_norm
MeurSol Jan 18, 2026
7bd82b4
feat2.5 rope
MeurSol Jan 18, 2026
74428e0
feat2.6 self_attention
MeurSol Feb 3, 2026
0224bd5
feat2.7 SwiGLU
MeurSol Feb 3, 2026
dc8c149
fix rms_norm precision problem
MeurSol Feb 3, 2026
2b35673
test infra
MeurSol Feb 4, 2026
07fa690
fix linear and tensor bugs
MeurSol Feb 4, 2026
9395ebe
feat Assignment3: Inference
MeurSol Feb 4, 2026
098b6a9
trigger ci
MeurSol Feb 4, 2026
d2e8809
test ci
MeurSol Feb 4, 2026
8c030c0
fix windows compile error
MeurSol Feb 4, 2026
ade2094
create cuda runtime api
MeurSol Mar 14, 2026
e5704a1
fix lib link bugs and pass the nvidia runtime check
MeurSol Mar 14, 2026
dc0ee4a
test add
MeurSol Mar 14, 2026
a638ec9
fix: cast utils to device compatibility
MeurSol Mar 14, 2026
c420840
pass add test
MeurSol Mar 14, 2026
dc946c6
pass swiglu_nvidia test
MeurSol Mar 14, 2026
621b590
test argmax
MeurSol Mar 14, 2026
7ef2e7d
test: linear, embedding, rms_norm, rope, self_attention nvidia cuda op
MeurSol Mar 14, 2026
aed4679
fix: align cuda op type casting
MeurSol Mar 14, 2026
a7eaf34
feat: rerrange nvidia cuda op
MeurSol Mar 15, 2026
97187ec
pass nvidia infer test
MeurSol Mar 15, 2026
f4d6f67
create mx xmake build
MeurSol Mar 15, 2026
5255882
mx build
MeurSol Mar 15, 2026
6a9e9c7
feat: stabilize metax build and inference baseline
MeurSol Mar 15, 2026
aa1a595
feat: add explicit metax device support
MeurSol Mar 15, 2026
50b5582
add report
MeurSol Mar 16, 2026
228 changes: 228 additions & 0 deletions Report.md
@@ -0,0 +1,228 @@
## Project #2: GPU Integration

### 1. Architecture

This implementation leaves the overall LLAISYS execution framework unchanged; it only inserts GPU backends into the existing `device -> ops -> model` chain.

```text
             Python API / Test
                     |
                     v
               LLAISYS C API
                     |
                     v
         Runtime / Tensor / Model
                     |
     +---------------+--------------------+
     |               |                    |
     v               v                    v
    CPU          NVIDIA GPU           MetaX GPU
                     |                    |
                     v                    v
         src/device/nvidia/   src/device/metax/
                     |                    |
                     +----------+---------+
                                |
                                v
                 op dispatch: src/ops/<op>/op.cpp
                                |
            +-------------------+------------+
            |                                |
            v                                v
src/ops/<op>/nvidia/*.cu         src/ops/<op>/metax/*.maca
                                             |
                                             v
                 reuses kernel bodies in ../nvidia/*.cu
```

- Device layer:
  - `src/device/nvidia/` implements the NVIDIA runtime API and device resource management
  - `src/device/metax/` implements the MetaX runtime API entry point
- Operator layer:
  - `src/ops/*/nvidia/` implements the CUDA kernels
  - `src/ops/*/metax/` acts as the MetaX compilation entry point
- Build layer:
  - `xmake/nvidia.lua` manages CUDA/NVCC compilation
  - `xmake/metax.lua` manages MACA/MXCC compilation

The core design is "separate platforms, shared kernels":

- The NVIDIA path uses the native CUDA build and runtime
- The MetaX path provides its own device enumeration, build rules, and runtime dispatch
- MetaX does not rewrite the operator set; its `.maca` entry points reuse the CUDA-like kernel bodies in `nvidia/*.cu`

At the framework level there are therefore two independent GPU backends, while at the operator source level only a single primary implementation is maintained.

### 2. Implementation Steps

#### 2.1 NVIDIA Backend

The first step was completing the NVIDIA runtime API to match the CPU runtime interface, covering:

- device count / set device
- malloc / free
- memcpy
- synchronize

The NVIDIA runtime was then registered in `src/device/runtime_api.cpp`, so that the upper-level `Tensor`, `RuntimeAPI`, and model code can use the GPU device directly.

The second step was wiring up the CUDA build chain:

- `.cu` compilation and linking rules were added to `xmake/nvidia.lua`
- the `--nv-gpu=y` option controls whether GPU compilation is enabled

The third step was completing the CUDA operators, following a uniform pattern:

- each operator provides a host entry point in `src/ops/<op>/nvidia/`
- the host entry handles dtype dispatch, launch configuration, and error checking
- the compute logic lives in templated kernels

The implementation focused on two hot operators:

- `linear`
  - maps one thread to one output element
  - `fp16`/`bf16` inputs are converted to `float` before accumulation
- `self_attention`
  - uses a two-dimensional grid, mapping blocks by `(query, head)`
  - score computation, softmax, and value weighting all happen within a block
  - `scores` are kept in shared memory

The remaining operators (`add`, `rope`, `rms_norm`, `swiglu`, `embedding`, `argmax`, `rearrange`) follow the same pattern, completing the full inference execution chain.

#### 2.2 MetaX Backend

The focus of the MetaX work was not redesigning operators but wiring in a new device path.

The steps were:

1. Add `ENABLE_METAX_API`
2. Add `LLAISYS_DEVICE_METAX` and the Python-side `DeviceType.METAX`
3. Add MetaX runtime dispatch in `runtime_api.cpp`
4. Add `xmake/metax.lua`, which compiles `.maca` files with `mxcc`
5. Add a `src/ops/*/metax/*.maca` entry point for each operator
6. Reuse the kernel bodies in `../nvidia/*.cu` from the `.maca` entries

With this in place, MetaX has its own device semantics without introducing a second, duplicated set of operator implementations. That trade-off is the key design decision of this port.

### 3. Testing

Testing was done in two layers.

#### 3.1 Per-Operator Tests

Each GPU operator was verified individually first:

```bash
python test/test_runtime.py --device nvidia
python test/ops/add.py --device nvidia
python test/ops/argmax.py --device nvidia
python test/ops/embedding.py --device nvidia
python test/ops/linear.py --device nvidia
python test/ops/rms_norm.py --device nvidia
python test/ops/rope.py --device nvidia
python test/ops/self_attention.py --device nvidia
python test/ops/swiglu.py --device nvidia
```

The MetaX path uses the same commands with the device set to `metax`.
This validates the runtime, dtype dispatch, and per-operator correctness before moving on to full-model testing.

#### 3.2 End-to-End Inference Test

Finally, `test/test_infer.py --test` verifies the entire execution chain. The pass criterion is not merely that the program runs, but that:

- the generated tokens match the reference
- the text output matches
- the test passes

NVIDIA inference test results:
```
(base) machine@dsw-607126-85f54bdf75-5lzlx:~/llaisys$ python test/test_infer.py --model ../models/DeepSeek-R1-Distill-Qwen-1.5B/ --test --device nvidia
`torch_dtype` is deprecated! Use `dtype` instead!
Loading model from local path: ../models/DeepSeek-R1-Distill-Qwen-1.5B/
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:03<00:00, 95.80it/s]
The module name (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

=== Answer ===

Tokens:
[151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<|User|>Who are you?<|Assistant|><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 9.36s


=== Your Result ===

Tokens:
[151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<|User|>Who are you?<|Assistant|><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 83.64s

Test passed!
```

MetaX XiYun C500 inference test results:
```
(base) root@d3871d5ad673:/home/machine/llaisys# python test/test_infer.py --test --device metax
Loading model from Hugging Face: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 102023.61it/s]
`torch_dtype` is deprecated! Use `dtype` instead!
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5912: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:649.)
return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa)

=== Answer ===

Tokens:
[151646, 151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<|User|>Who are you?<|Assistant|><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 2.85s


=== Your Result ===

Tokens:
[151646, 151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<|User|>Who are you?<|Assistant|><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 31.70s

Test passed!
```
1 change: 1 addition & 0 deletions include/llaisys.h
@@ -24,6 +24,7 @@ typedef enum {
LLAISYS_DEVICE_CPU = 0,
//// TODO: Add more device types here. Numbers need to be consecutive.
LLAISYS_DEVICE_NVIDIA = 1,
LLAISYS_DEVICE_METAX = 2,
LLAISYS_DEVICE_TYPE_COUNT
} llaisysDeviceType_t;

62 changes: 62 additions & 0 deletions include/llaisys/qwen2.h
@@ -0,0 +1,62 @@
#ifndef LLAISYS_QWEN2_H
#define LLAISYS_QWEN2_H

#include "../llaisys.h"
#include "tensor.h"

__C {
typedef struct LlaisysQwen2Model *llaisysQwen2Model_t;

struct LlaisysQwen2Meta {
size_t nlayer; // num_hidden_layers
size_t hs; // hidden_size
size_t nh; // num_attention_heads
size_t nkvh; // num_key_value_heads
size_t dh; // head_dim = hs / nh
size_t di; // intermediate_size
size_t maxseq; // max_position_embeddings
size_t voc; // vocab_size
float epsilon; // rms_norm_eps
float theta; // rope_theta
int64_t end_token; // eos_token_id
};

struct LlaisysQwen2Weights {
llaisysTensor_t in_embed; // [voc, hs]
llaisysTensor_t out_embed; // [voc, hs]
llaisysTensor_t out_norm_w; // [hs]

// Per-layer weights (arrays of size nlayer)
llaisysTensor_t* attn_norm_w; // [nlayer][hs]
llaisysTensor_t* attn_q_w; // [nlayer][nh*dh, hs]
llaisysTensor_t* attn_q_b; // [nlayer][nh*dh]
llaisysTensor_t* attn_k_w; // [nlayer][nkvh*dh, hs]
llaisysTensor_t* attn_k_b; // [nlayer][nkvh*dh]
llaisysTensor_t* attn_v_w; // [nlayer][nkvh*dh, hs]
llaisysTensor_t* attn_v_b; // [nlayer][nkvh*dh]
llaisysTensor_t* attn_o_w; // [nlayer][hs, nh*dh]

llaisysTensor_t* mlp_norm_w; // [nlayer][hs]
llaisysTensor_t* mlp_gate_w; // [nlayer][di, hs]
llaisysTensor_t* mlp_up_w; // [nlayer][di, hs]
llaisysTensor_t* mlp_down_w; // [nlayer][hs, di]
};

__export llaisysQwen2Model_t llaisysQwen2ModelCreate(
const struct LlaisysQwen2Meta* meta,
llaisysDeviceType_t device_type,
int device_id);

__export void llaisysQwen2ModelDestroy(llaisysQwen2Model_t model);

__export struct LlaisysQwen2Weights* llaisysQwen2ModelWeights(llaisysQwen2Model_t model);

__export int64_t llaisysQwen2ModelInfer(
llaisysQwen2Model_t model,
int64_t* token_ids,
size_t ntoken);

__export void llaisysQwen2ModelResetCache(llaisysQwen2Model_t model);
}

#endif // LLAISYS_QWEN2_H
7 changes: 6 additions & 1 deletion python/llaisys/libllaisys/__init__.py
@@ -12,6 +12,8 @@
from .tensor import llaisysTensor_t
from .tensor import load_tensor
from .ops import load_ops
from .models.qwen2 import load_qwen2
from .models.qwen2 import LlaisysQwen2Meta, LlaisysQwen2Weights


def load_shared_library():
@@ -22,7 +24,7 @@ def load_shared_library():
elif sys.platform == "win32":
libname = "llaisys.dll"
elif sys.platform == "darwin":
libname = "llaisys.dylib"
libname = "libllaisys.dylib"
else:
raise RuntimeError("Unsupported platform")

@@ -38,6 +40,7 @@ def load_shared_library():
load_runtime(LIB_LLAISYS)
load_tensor(LIB_LLAISYS)
load_ops(LIB_LLAISYS)
load_qwen2(LIB_LLAISYS)


__all__ = [
@@ -52,4 +55,6 @@ def load_shared_library():
"llaisysMemcpyKind_t",
"MemcpyKind",
"llaisysStream_t",
"LlaisysQwen2Meta",
"LlaisysQwen2Weights",
]
3 changes: 2 additions & 1 deletion python/llaisys/libllaisys/llaisys_types.py
@@ -6,7 +6,8 @@
class DeviceType(IntEnum):
CPU = 0
NVIDIA = 1
COUNT = 2
METAX = 2
COUNT = 3


llaisysDeviceType_t = ctypes.c_int
7 changes: 7 additions & 0 deletions python/llaisys/libllaisys/models/__init__.py
@@ -0,0 +1,7 @@
from .qwen2 import load_qwen2, LlaisysQwen2Meta, LlaisysQwen2Weights

__all__ = [
"load_qwen2",
"LlaisysQwen2Meta",
"LlaisysQwen2Weights",
]