完成项目2:CUDA集成 by MeurSol · Pull Request #45 · InfiniTensor/llaisys

MeurSol · 2026-03-16T09:59:23Z

项目 #2：NVIDIA、沐曦集成

1. 架构设计

本次实现没有改动 LLAISYS 的整体执行框架，只在现有 device -> ops -> model 链路中插入 GPU 后端。

Python API / Test
        |
        v
  LLAISYS C API
        |
        v
  Runtime / Tensor / Model
        |
        +-------------------+-------------------+
        |                   |                   |
        v                   v                   v
       CPU              NVIDIA GPU          MetaX GPU
                           |                   |
                           v                   v
               src/device/nvidia/     src/device/metax/
                           |                   |
                           +---------+---------+
                                     |
                                     v
                           算子分发 src/ops/<op>/op.cpp
                                     |
                   +-----------------+-----------------+
                   |                                   |
                   v                                   v
        src/ops/<op>/nvidia/*.cu          src/ops/<op>/metax/*.maca
                                                    |
                                                    v
                                      复用 ../nvidia/*.cu 算子主体

设备层：
- src/device/nvidia/ 实现 NVIDIA Runtime API 与设备资源管理
- src/device/metax/ 实现 MetaX Runtime API 入口
算子层：
- src/ops/*/nvidia/ 实现 CUDA 算子
- src/ops/*/metax/ 作为 MetaX 编译入口
构建层：
- xmake/nvidia.lua 管理 CUDA/NVCC 编译
- xmake/metax.lua 管理 MACA/MXCC 编译

核心设计是“平台分离、算子复用”：

NVIDIA 路径使用原生 CUDA 构建与 runtime
MetaX 路径单独提供设备枚举、构建规则和 runtime 分发
MetaX 不重写整套算子，而是通过 .maca 入口复用 nvidia/*.cu 中的 CUDA-like 算子主体

因此，框架层面是两条独立 GPU 后端；算子源码层面只维护一套主实现。

2. 实现步骤

2.1 NVIDIA 后端

第一步是补全 NVIDIA Runtime API，对齐 CPU Runtime 接口，包括：

device count / set device
malloc / free
memcpy
synchronize

随后在 src/device/runtime_api.cpp 中注册 NVIDIA runtime，使上层 Tensor、RuntimeAPI 和模型代码可以直接使用 GPU 设备。

第二步是接入 CUDA 构建链：

在 xmake/nvidia.lua 中加入 .cu 编译与链接规则
通过 --nv-gpu=y 控制是否启用 GPU 编译

第三步是补全 CUDA 算子。实现上采用统一模式：

每个算子在 src/ops/<op>/nvidia/ 中提供 host 入口
host 入口完成 dtype 分派、launch 配置与错误检查
计算逻辑写在模板化 kernel 中

实现重点在两个热点算子：

linear
- 采用“一线程对应一个输出元素”的映射
- fp16/bf16 先转 float 再累加
self_attention
- 采用二维 grid，按 (query, head) 映射 block
- 在 block 内完成 score 计算、softmax 和 value 加权
- scores 使用 shared memory 存储

其余算子如 add、rope、rms_norm、swiglu、embedding、argmax、rearrange 按相同方式补齐，形成完整推理执行链。

2.2 MetaX 后端

MetaX 的实现重点不在重新设计算子，而在接入新的设备路径。

实现步骤如下：

新增 ENABLE_METAX_API
新增 LLAISYS_DEVICE_METAX 与 Python 侧 DeviceType.METAX
在 runtime_api.cpp 中加入 MetaX runtime 分发
新增 xmake/metax.lua，使用 mxcc 编译 .maca
为每个算子添加 src/ops/*/metax/*.maca 入口
在 .maca 中复用 ../nvidia/*.cu 算子主体

这样实现后，MetaX 具备独立设备语义，但不引入第二套重复算子实现。
这一点是本次适配的关键取舍。

3. 测试

测试分两层进行。

3.1 单算子测试

先逐个验证 GPU 算子：

python test/test_runtime.py --device nvidia
python test/ops/add.py --device nvidia
python test/ops/argmax.py --device nvidia
python test/ops/embedding.py --device nvidia
python test/ops/linear.py --device nvidia
python test/ops/rms_norm.py --device nvidia
python test/ops/rope.py --device nvidia
python test/ops/self_attention.py --device nvidia
python test/ops/swiglu.py --device nvidia

MetaX 路径使用同样方法，设备改为 metax。
这样可以先验证 Runtime、dtype 分派和单算子正确性，再进入整模型测试。

3.2 端到端推理测试

最终使用 test/test_infer.py --test 验证整条执行链。判断标准不是只看程序是否运行，而是：

生成 token 是否与参考一致
文本输出是否一致
测试是否通过

Nvidia推理测试结果如下：

(base) machine@dsw-607126-85f54bdf75-5lzlx:~/llaisys$ python test/test_infer.py --model ../models/DeepSeek-R1-Distill-Qwen-1.5B/ --test --device nvidia
`torch_dtype` is deprecated! Use `dtype` instead!
Loading model from local path: ../models/DeepSeek-R1-Distill-Qwen-1.5B/
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:03<00:00, 95.80it/s]
The module name  (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

=== Answer ===

Tokens:
[151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<｜User｜>Who are you?<｜Assistant｜><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 9.36s


=== Your Result ===

Tokens:
[151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<｜User｜>Who are you?<｜Assistant｜><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 83.64s

Test passed!

曦云 C500推理结果如下：

(base) root@d3871d5ad673:/home/machine/llaisys# python test/test_infer.py --test --device metax 
Loading model from Hugging Face: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Fetching 9 files: 100%|████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 102023.61it/s]
`torch_dtype` is deprecated! Use `dtype` instead!
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:5912: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /workspace/framework/mcPytorch/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp:649.)
  return _scaled_dot_product_attention(query, key, value, attn_mask, dropout_p, is_causal, scale = scale, enable_gqa = enable_gqa)

=== Answer ===

Tokens:
[151646, 151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<｜User｜>Who are you?<｜Assistant｜><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 2.85s


=== Your Result ===

Tokens:
[151646, 151646, 151644, 15191, 525, 498, 30, 151645, 151648, 198, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 624, 151649, 271, 91786, 0, 358, 2776, 18183, 39350, 10911, 16, 11, 458, 20443, 11229, 17847, 3465, 553, 18183, 39350, 13, 358, 2776, 518, 697, 2473, 323, 1035, 387, 33972, 311, 7789, 498, 448, 894, 43883, 476, 9079, 498, 1231, 614, 13, 151643]

Contents:
<｜User｜>Who are you?<｜Assistant｜><think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


Time elapsed: 31.70s

Test passed!

MeurSol added 30 commits January 15, 2026 19:47

fix(build): add missing macOS configuration to xmake.lua

2f844c7

feat1.1 load

b20f581

feat1.2 isContiguous

75540e6

feat1.3 view

033370f

feat1.4 permute

af08a87

feat1.5 slice

6c879d8

fix llaisys python init and pass the test1

6ba7df6

feat2.1 argmax

adfb14e

feat2.2 embedding

717ffeb

feat2.3 linear

332f197

Improve the implementation of linear_cpu.cpp

259ef7a

feat2.4 RMS_norm

f3ef7f5

feat2.5 rope

7bd82b4

feat2.6 self_attention

74428e0

feat2.7 SwiGLU

0224bd5

fix rms_norm precision problem

dc8c149

test infra

2b35673

fix linear and tensor bugs

07fa690

feat Assignment3: Inference

9395ebe

trigger ci

098b6a9

test ci

d2e8809

fix windows compile error

8c030c0

create cuda runtime api

ade2094

fix lib link bugs and pass the nvidia runtime check

e5704a1

test add

dc0ee4a

fix: cast utils to device compatibility

a638ec9

pass add test

c420840

pass swiglu_nvidia test

dc946c6

test argmax

621b590

test: linear, embedding, rms_norm, rope, self_attention nvidia cuda op

7ef2e7d

MeurSol added 8 commits March 15, 2026 06:23

fix: align cuda op type casting

aed4679

feat: rerrange nvidia cuda op

a7eaf34

pass nvidia infer test

97187ec

create mx xmake build

f4d6f67

mx build

5255882

feat: stabilize metax build and inference baseline

6a9e9c7

feat: add explicit metax device support

aa1a595

add report

50b5582

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

完成项目2:CUDA集成#45

完成项目2:CUDA集成#45
MeurSol wants to merge 38 commits intoInfiniTensor:mainfrom
MeurSol:main

MeurSol commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MeurSol commented Mar 16, 2026

项目 #2：NVIDIA、沐曦集成

1. 架构设计

2. 实现步骤

2.1 NVIDIA 后端

2.2 MetaX 后端

3. 测试

3.1 单算子测试

3.2 端到端推理测试

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant