
Complete Project 1: CPU optimization #48

Open
usersforsomebody wants to merge 27 commits into InfiniTensor:main from usersforsomebody:main

Conversation

@usersforsomebody

See the report and the project for details.

- Add and export the llaisysQwen2LoadWeight, llaisysQwen2ModelForward, and
  llaisysQwen2Sample declarations in include/llaisys/models/qwen2.h
- On Windows only symbols marked __declspec(dllexport) are exported from the DLL;
  the missing declarations above made Assignment-3 (test_infer) fail on Windows CI with
  AttributeError: function 'llaisysQwen2ModelForward' not found
- Remove the unimplemented llaisysQwen2ModelInfer declaration, keeping the header
  consistent with src/models/qwen2.cpp
- Enable the -mavx2 -mfma compile flags globally in xmake.lua
- Add src/utils/simd.hpp: hsum256, avx2_dot, bf16x8_to_f32x8, fp16x8_to_f32x8
- Rewrite the kernels for all three data types in linear/op.cpp with AVX2+FMA,
  shrinking the code from 576 to 437 lines
- Single-thread performance: f32 ~40 GFLOPS, bf16 ~39 GFLOPS, fp16 ~9.6 GFLOPS
- Add a parallel_for template function that bypasses the OpenMP runtime entirely when M < 32
- Replace all 10 #pragma omp sites with parallel_for calls
- Model inference recovers from 104s to 28s, fixing the OpenMP-induced regression
- Add an OpenBLAS build option (xmake.lua) that links ~/openblas automatically
- The f32 path calls cblas_sgemm directly, fully replacing the hand-written AVX2
- The bf16 path uses a hybrid strategy:
  - M >= 32: bf16→f32 conversion + cblas_sgemm (weights cached to avoid repeated conversion)
  - M < 32 (including M=1 decode): hand-written AVX2 reads bf16 directly, saving memory bandwidth
- fp16 path: fp16→f32 + cblas_sgemm (weight cache)
- Found and worked around OpenBLAS cblas_sbgemm bugs: the beta parameter and precision at large K
- All 6/6 linear tests pass; model-inference tokens match exactly
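The bf16→f32 widening behind the hybrid strategy can be sketched in scalar form (the PR's simd.hpp does this 8 lanes at a time with AVX2; the function names here are illustrative, not the repository's symbols):

```cpp
#include <cstdint>
#include <cstring>

// bf16 is the upper 16 bits of an IEEE-754 float32, so widening is just a shift.
inline float bf16_to_f32(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f)); // bit-cast without strict-aliasing UB
    return f;
}

// Round-to-nearest-even narrowing, matching what a bf16 weight file stores.
inline uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);
    return static_cast<uint16_t>((bits + rounding) >> 16);
}
```

Because the widening is exact, converting weights once and caching the f32 copy (as the M >= 32 path does) loses no precision over reading bf16 directly.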
Add blas_runtime.hpp for runtime BLAS library detection (MKL preferred, OpenBLAS as
fallback), so the same .so runs in environments with or without BLAS. Refactor op.cpp
to unify the compile-time OpenBLAS and runtime BLAS call paths, and work around an
MKL Intel threading-layer bug (setenv MKL_THREADING_LAYER=GNU).
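The dlopen-based discovery can be sketched as follows; the library names, probing order, and fallback contract are illustrative assumptions, not the exact blas_runtime.hpp code (and this is Linux-only, since the PR notes dlopen is unavailable on Windows):

```cpp
#include <dlfcn.h>

// Subset of the row-major cblas_sgemm signature we need at runtime.
using sgemm_fn = void (*)(int order, int transA, int transB,
                          int M, int N, int K, float alpha,
                          const float *A, int lda, const float *B, int ldb,
                          float beta, float *C, int ldc);

// Probe MKL first, then OpenBLAS; return nullptr if neither is present,
// in which case the caller falls back to the AVX2 kernels.
inline sgemm_fn load_runtime_sgemm() {
    const char *libs[] = {"libmkl_rt.so", "libopenblas.so", "libopenblas.so.0"};
    for (const char *lib : libs) {
        if (void *h = dlopen(lib, RTLD_NOW | RTLD_GLOBAL)) {
            if (void *sym = dlsym(h, "cblas_sgemm"))
                return reinterpret_cast<sgemm_fn>(sym);
        }
    }
    return nullptr;
}
```

Keeping the handle open for the process lifetime is intentional: the symbol stays valid, and a single probe at startup amortizes the dlopen cost.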

MKL speedup on the server: 6.93s (vs. 51.99s before, 7.5x)
- bf16_linear_avx2: for M < 32, parallelize over the N dimension (omp parallel for) to fix the M=1 single-thread bottleneck
  Server with 24 threads: 36s→6.7s (5.4x); local with 6 threads: 32s→18s (1.8x)
- Add an AVX-512 bf16 linear kernel (bf16_linear_avx512) and SIMD helpers (simd.hpp)
  Testing showed Icelake frequency downclocking causes a regression, so dispatch still uses the AVX2 version
- New --native option in xmake.lua: enables -march=native to use AVX-512; default stays -mavx2 -mfma
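The small-batch bypass described above can be sketched as a minimal parallel_for; the cutoff of 32 matches the M < 32 rule, the name mirrors the commit message, and the body is an assumption rather than the repository's exact code:

```cpp
#include <cstddef>

// Run f(i) for i in [0, n). For small n (e.g. M == 1 during decode) the
// OpenMP runtime's thread wake-up cost dominates the work itself -- the PR
// reports 104s -> 28s after adding this bypass -- so stay single-threaded
// below a cutoff and only enter the OpenMP region for large batches.
template <typename F>
void parallel_for(std::size_t n, F &&f, std::size_t serial_cutoff = 32) {
    if (n < serial_cutoff) {
        for (std::size_t i = 0; i < n; ++i)
            f(i);
        return;
    }
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        f(i);
}
```

Compiled without -fopenmp the pragma is ignored and both branches run serially, so the same source works with or without the OpenMP runtime.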
Copilot AI review requested due to automatic review settings March 16, 2026 15:47

Copilot AI left a comment


Pull request overview

This PR adds a first end-to-end Qwen2 CPU inference path to the codebase by implementing missing tensor view/contiguity utilities, CPU ops (linear/attention/rope/norm/etc.), and a Python loader that streams .safetensors into the C++ model, plus build-system switches for OpenMP/SIMD/BLAS.

Changes:

  • Implement core CPU ops (linear with BLAS/runtime-BLAS + SIMD fallbacks, self-attention, RoPE, RMSNorm, SwiGLU, embedding, argmax, rearrange).
  • Add a C API + C++ implementation for a Qwen2 model (weights loading, forward, sampling) and Python bindings/loader for safetensors.
  • Update build configuration for OpenMP/SIMD/OpenBLAS options and add misc utilities/debugging assets.

Reviewed changes

Copilot reviewed 22 out of 25 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
xmake.lua Adds global OpenMP/SIMD flags and optional OpenBLAS integration; includes models in shared lib build.
test/test_tensor.py Minor formatting tweak.
test/debug.py Adds an ad-hoc local debug script for loading/running a model.
src/utils/utils_stub.cpp Adds placeholder TU to keep target non-empty.
src/utils/types.hpp Adds inline fp16/bf16 conversion helpers.
src/utils/types.cpp.old Minor formatting change.
src/utils/simd.hpp Adds AVX2/AVX-512 helper intrinsics for dot/casts/horizontal sums.
src/utils/blas_runtime.hpp Adds dlopen-based runtime BLAS backend discovery and sgemm dispatch.
src/tensor/tensor.cpp Implements Tensor contiguity checks, permute/view/slice, load, and contiguous() using rearrange.
src/ops/swiglu/op.cpp Implements SwiGLU (SiLU(gate) * up) for f16/bf16/f32.
src/ops/self_attention/op.cpp Implements self-attention with multiple supported input layout modes.
src/ops/rope/op.cpp Implements RoPE with both unit-test layout and inference layout handling.
src/ops/rms_norm/op.cpp Implements RMSNorm for f16/bf16/f32.
src/ops/rearrange/op.cpp Implements generic elementwise rearrangement based on strides.
src/ops/linear/op.cpp Large CPU linear implementation with BLAS (compile-time or runtime) + AVX2/FMA/OpenMP fallbacks and weight caching.
src/ops/embedding/op.cpp Implements embedding lookup via memcpy rows.
src/ops/argmax/op.cpp Implements argmax over last dimension.
src/models/qwen2.cpp Adds C++ Qwen2 model: weight loading, forward, KV cache handling, and sampling API.
report_linear.pdf Adds a profiling/report PDF artifact.
python/llaisys/models/qwen2.py Implements Python Qwen2 loader (config + safetensors streaming) and greedy generation loop.
python/llaisys/libllaisys/llaisys_types.py Adds ctypes definition for LlaisysQwen2Meta.
python/llaisys/libllaisys/__init__.py Wires Qwen2 API signatures and provides wrapper helpers.
include/llaisys/models/qwen2.h Extends Qwen2 C API: create(dtype), load_weight, forward, sample.
.gitignore Ignores Windows Zone.Identifier alternate-data-stream artifacts.


Comment on lines +9 to +13
add_cxflags("-fopenmp", "-O3")
add_ldflags("-fopenmp")
add_shflags("-fopenmp")
add_syslinks("gomp") -- explicitly link the GNU OpenMP library

Comment on lines +26 to +38
-- OpenBLAS integration: built from source and installed to ~/openblas
option("openblas")
set_default(true)
set_showmenu(true)
set_description("Whether to use OpenBLAS for linear algebra acceleration")
option_end()

if has_config("openblas") then
add_defines("USE_OPENBLAS")
add_includedirs(os.getenv("HOME") .. "/openblas/include")
add_linkdirs(os.getenv("HOME") .. "/openblas/lib")
add_links("openblas")
add_rpathdirs(os.getenv("HOME") .. "/openblas/lib")
@@ -0,0 +1,140 @@
#pragma once
#include <immintrin.h>
Comment on lines +56 to +69
inline BlasState& state() {
static BlasState s;
return s;
}

inline void try_load() {
#ifdef _WIN32
state().initialized = true;
return; // dlopen is not available on Windows
#else
auto& s = state();
if (s.initialized) return;
s.initialized = true;

Comment on lines +180 to +196
if (order.size() != this->ndim()) {
    throw std::runtime_error("Order Error!");
}
const auto &old_shape = this->shape();
const auto &old_strides = this->strides();
std::vector<size_t> new_shape(old_shape.size());
std::vector<ptrdiff_t> new_strides(old_strides.size());
for (size_t i = 0; i < order.size(); i++) {
    size_t order_index = order[i];
    if (order_index >= this->ndim()) {
        throw std::runtime_error("Index Error!");
    }
    new_shape[i] = old_shape[order_index];
    new_strides[i] = old_strides[order_index];
}
TensorMeta _meta{this->dtype(), std::move(new_shape), std::move(new_strides)};
return std::shared_ptr<Tensor>(new Tensor(_meta, _storage, this->_offset));
Comment on lines +226 to +228
tensor_t *heap_logits = new tensor_t(logits);
return (void *)heap_logits;

std::string layer_id_str = w_name.substr(first_dot, second_dot - first_dot);
size_t layer_id = std::stoull(layer_id_str);

if (layer_id >= 0 && layer_id < model->meta.nlayer) {
Comment on lines +14 to +18
# 1. Point directly at the absolute path of the model you already downloaded
# Note: make sure this folder actually contains the .safetensors files
real_model_path = "/home/cpp/ai-models/DeepSeek-R1-Distill-Qwen-1.5B"

print(f"1. Using local model at: {real_model_path}")
Comment on lines +64 to +72
// -----------------------------
// Mode 2: Qwen2 / DeepSeek inference path
// in : [batch, seq_len, hidden]
// pos : [1, seq_len] or [seq_len]
// -----------------------------
size_t N = in_shape[0]; // batch
size_t M = in_shape[1]; // seq_len
size_t D = in_shape[2]; // hidden

Comment on lines +19 to +55
// [Key] heuristic mode detection
// DeepSeek's hidden dim is 1536 or 256
size_t last_dim = q->shape().back();
bool is_deepseek = (last_dim == 1536 || last_dim == 256);

if (is_deepseek) {
// --------------------------------------------------------
// Mode A: DeepSeek inference mode [Batch, Seq, Hidden]
// Notes:
// - q is always [B, T, H], with H = nh * head_dim (e.g. 1536 = 12 * 128)
// - k/v come in two shapes:
//   1) [B, T, kv_dim] when computed on the fly, where kv_dim = nkvh * head_dim
//   2) [B, T_total, nkvh, head_dim] when read back from the KV cache
// --------------------------------------------------------
batch = q->shape()[0];
seqlen = q->shape()[1];
size_t hidden_q = q->shape()[2];

// Split heads: Q's head dim is always treated as 128
d = 128;
nhead = hidden_q / d; // e.g. 1536 / 128 = 12

total_len = k->shape()[1];

if (k->shape().size() == 3) {
// [B, T_total, kv_dim], the flattened form of [B, T_total, nkvh, head_dim]
size_t hidden_kv = k->shape()[2];
dv = d;
nkvhead = hidden_kv / dv; // e.g. 256 / 128 = 2
} else if (k->shape().size() == 4) {
// [B, T_total, nkvh, head_dim] -- the 4D form from the KV cache
nkvhead = k->shape()[2];
dv = k->shape()[3]; // usually 128
} else {
throw std::runtime_error("Unsupported K shape for DeepSeek mode");
}
}
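The head-count arithmetic in the branch above can be checked in isolation. `split_heads` below is an illustrative helper, not a symbol from the PR, using the DeepSeek-R1-Distill-Qwen-1.5B dimensions quoted in the comments:

```cpp
#include <cstddef>

// Mode-A shape inference: Q's last dim is nh * head_dim and K/V's flattened
// last dim is nkvh * head_dim, with head_dim fixed at 128 as in the code above.
struct HeadLayout {
    std::size_t nhead;
    std::size_t nkvhead;
    std::size_t head_dim;
};

inline HeadLayout split_heads(std::size_t hidden_q, std::size_t hidden_kv,
                              std::size_t head_dim = 128) {
    return {hidden_q / head_dim, hidden_kv / head_dim, head_dim};
}
```

For the 1.5B model this yields 12 query heads and 2 KV heads, i.e. grouped-query attention with 6 query heads sharing each KV head.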
@usersforsomebody
Author

Reproduction notes

Environment

Dependency     Notes
xmake          build system
GCC/Clang      with C++17 and AVX2 support
Python 3.9+    PyTorch, transformers, safetensors
Model          DeepSeek-R1-Distill-Qwen-1.5B

Build commands

# Without OpenBLAS (recommended; fewer dependencies)
xmake f --openblas=n
xmake && xmake install

# With OpenBLAS installed in ~/openblas/
xmake && xmake install

Test commands

pip install ./python/
python test/ops/linear.py
python test/test_infer.py --model /path/to/model --test

Notes

  1. OpenBLAS is enabled by default; the build fails without the library, so disable it with --openblas=n
  2. The CPU must support AVX2 (x86 CPUs from 2013 onward)
  3. Without BLAS the AVX2 fallback is used automatically; slightly slower but fully functional
  4. If MKL is present on the system, it is detected and loaded at runtime
