Project #2: NVIDIA and MetaX Integration
1. Architecture Design
This project does not change LLAISYS's overall execution framework; it only inserts GPU backends into the existing device -> ops -> model chain:

- `src/device/nvidia/` — NVIDIA Runtime API and device resource management
- `src/device/metax/` — MetaX Runtime API entry point
- `src/ops/*/nvidia/` — CUDA operator implementations
- `src/ops/*/metax/` — MetaX compilation entry points
- `xmake/nvidia.lua` — CUDA/NVCC build rules
- `xmake/metax.lua` — MACA/MXCC build rules

The core design is "platform separation, operator reuse": the `.maca` entry files reuse the CUDA-like operator bodies in `nvidia/*.cu`. As a result, the framework sees two independent GPU backends, while only one primary operator implementation is maintained at the source level.
2. Implementation Steps
2.1 NVIDIA Backend
The first step is to complete the NVIDIA Runtime API so that it aligns with the CPU Runtime interface. The NVIDIA runtime is then registered in `src/device/runtime_api.cpp`, so that the upper-level `Tensor`, `RuntimeAPI`, and model code can use the GPU device directly.
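The registration step can be illustrated with a minimal sketch of a per-device runtime table. The names here (`DeviceType`, `RuntimeAPI`, `mallocDevice`, and so on) are hypothetical stand-ins, not LLAISYS's actual interface; they only show the dispatch idea.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical device enum and runtime vtable, mirroring the idea of
// registering an NVIDIA runtime alongside the CPU one in runtime_api.cpp.
enum class DeviceType { CPU, NVIDIA, METAX };

struct RuntimeAPI {
    void *(*mallocDevice)(size_t);
    void (*freeDevice)(void *);
};

// CPU implementation; a real NVIDIA backend would call cudaMalloc/cudaFree.
static void *cpuMalloc(size_t n) { return ::operator new(n); }
static void cpuFree(void *p) { ::operator delete(p); }

std::map<DeviceType, RuntimeAPI> &registry() {
    static std::map<DeviceType, RuntimeAPI> r;
    return r;
}

void registerRuntime(DeviceType dev, RuntimeAPI api) { registry()[dev] = api; }

const RuntimeAPI &getRuntime(DeviceType dev) { return registry().at(dev); }
```

With this shape, adding a new backend is a matter of registering one more `RuntimeAPI` entry; upper layers look up the table by device type and never depend on a specific vendor API.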
The second step is wiring up the CUDA build chain:

- add `.cu` compile and link rules in `xmake/nvidia.lua`
- gate GPU compilation behind the `--nv-gpu=y` option

The third step is completing the CUDA operators. A unified pattern is used: each operator provides a host entry point in `src/ops/<op>/nvidia/`. The work focuses on two hot operators:

- `linear`: fp16/bf16 values are converted to `float` before accumulation
- `self_attention`: `(query, head)` pairs are mapped to blocks, and the attention `scores` are kept in shared memory
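The accumulation pattern in `linear` can be sketched on the host. On the GPU the widening would use `__half2float` / `__bfloat162float`; here a plain cast stands in for it so the sketch compiles without CUDA headers, and the helper name is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the linear kernel's numeric pattern: inputs may be fp16/bf16,
// but each element is widened to float before multiply-accumulate, and the
// running sum itself is kept in float to avoid half-precision error buildup.
template <typename T>
float dotAccumulateFloat(const std::vector<T> &x, const std::vector<T> &w) {
    float acc = 0.0f;  // accumulator stays in float regardless of T
    for (size_t i = 0; i < x.size(); ++i)
        acc += static_cast<float>(x[i]) * static_cast<float>(w[i]);
    return acc;
}
```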
The remaining operators (`add`, `rope`, `rms_norm`, `swiglu`, `embedding`, `argmax`, `rearrange`) are filled in the same way, completing the inference execution chain.

2.2 MetaX Backend
The focus of the MetaX port is not redesigning the operators but wiring in a new device path. The steps are:
- add the `ENABLE_METAX_API` switch, the `LLAISYS_DEVICE_METAX` device type, and `DeviceType.METAX` on the Python side
- add MetaX runtime dispatch in `runtime_api.cpp`
- add `xmake/metax.lua`, which compiles `.maca` files with `mxcc`
- add `.maca` entry files under `src/ops/*/metax/`
- reuse the operator bodies from `../nvidia/*.cu` inside the `.maca` files

With this approach, MetaX gets independent device semantics without introducing a second, duplicated set of operator implementations.
This trade-off is the key design decision of the port.
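The reuse described above amounts to a thin `.maca` entry that does nothing but include the CUDA source. The file and operator names below are hypothetical; the pattern is the point:

```cpp
// src/ops/add/metax/add.maca  (illustrative sketch, not the actual file)
// mxcc compiles this translation unit for MetaX; the operator body itself
// lives once in the NVIDIA tree and is pulled in verbatim.
#include "../nvidia/add.cu"
```

Because the CUDA-like source is included rather than copied, any fix to the NVIDIA operator automatically reaches the MetaX build.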
3. Testing
Testing is done in two layers.
3.1 Single-Operator Tests
The GPU operators are verified one by one first.
The MetaX path uses the same method, with the device switched to `metax`. This validates the Runtime, dtype dispatch, and per-operator correctness before moving on to whole-model tests.
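A per-operator check of this kind typically runs the op on the device, copies the result back, and compares it against a reference within a dtype-dependent tolerance. A hypothetical sketch of that comparison (not LLAISYS's actual test helper):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise closeness check of device output vs. reference, using both an
// absolute and a relative tolerance; looser tolerances would be chosen for
// fp16/bf16 operators than for fp32 ones.
bool allClose(const std::vector<float> &got, const std::vector<float> &ref,
              float atol = 1e-5f, float rtol = 1e-3f) {
    if (got.size() != ref.size()) return false;
    for (size_t i = 0; i < got.size(); ++i)
        if (std::fabs(got[i] - ref[i]) > atol + rtol * std::fabs(ref[i]))
            return false;
    return true;
}
```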
3.2 End-to-End Inference Tests
Finally, the full execution chain is validated with `test/test_infer.py --test`. The pass criterion is not merely whether the program runs, but whether the results are correct. The NVIDIA inference test results are as follows:
The inference results on the MetaX C500 (曦云 C500) are as follows: