
Complete Project 1: CPU optimization #48

Open
usersforsomebody wants to merge 27 commits into InfiniTensor:main from usersforsomebody:main

Conversation

@usersforsomebody

See the report and the project for details.

- Add and export the llaisysQwen2LoadWeight, llaisysQwen2ModelForward, and
  llaisysQwen2Sample declarations in include/llaisys/models/qwen2.h
- On Windows only symbols marked __declspec(dllexport) are exported from the DLL;
  the missing declarations above made Assignment-3 (test_infer) fail on Windows CI with
  AttributeError: function 'llaisysQwen2ModelForward' not found
- Remove the unimplemented llaisysQwen2ModelInfer declaration, keeping the header
  consistent with src/models/qwen2.cpp
- Enable the -mavx2 -mfma compile flags globally in xmake.lua
- Add src/utils/simd.hpp: hsum256, avx2_dot, bf16x8_to_f32x8, fp16x8_to_f32x8
- Rewrite the kernels for all three data types in linear/op.cpp with AVX2+FMA,
  shrinking the code from 576 to 437 lines
- Single-thread performance: f32 ~40 GFLOPS, bf16 ~39 GFLOPS, fp16 ~9.6 GFLOPS
- Add a parallel_for template function that bypasses the OpenMP runtime entirely when M < 32
- Replace all 10 #pragma omp sites with parallel_for calls
- Model inference recovers from 104s to 28s, fixing the OpenMP-induced regression
- Add an OpenBLAS build option (xmake.lua) that links ~/openblas automatically
- The f32 path calls cblas_sgemm directly, fully replacing the hand-written AVX2
- The bf16 path uses a hybrid strategy:
  - M >= 32: bf16→f32 conversion + cblas_sgemm (weights cached to avoid repeated conversion)
  - M < 32 (including M=1 decode): hand-written AVX2 reads bf16 directly, saving memory bandwidth
- fp16 path: fp16→f32 + cblas_sgemm (weight cache)
- Found and worked around OpenBLAS cblas_sbgemm bugs: the beta parameter and precision at large K
- All 6/6 linear tests pass; model-inference tokens match exactly
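The bf16→f32 widening behind the hybrid strategy can be sketched in scalar form (the PR's simd.hpp does this 8 lanes at a time with AVX2; the function names here are illustrative, not the repository's symbols):

```cpp
#include <cstdint>
#include <cstring>

// bf16 is the upper 16 bits of an IEEE-754 float32, so widening is just a shift.
inline float bf16_to_f32(uint16_t b) {
    uint32_t bits = static_cast<uint32_t>(b) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f)); // bit-cast without strict-aliasing UB
    return f;
}

// Round-to-nearest-even narrowing, matching what a bf16 weight file stores.
inline uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);
    return static_cast<uint16_t>((bits + rounding) >> 16);
}
```

Because the widening is exact, converting weights once and caching the f32 copy (as the M >= 32 path does) loses no precision over reading bf16 directly.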
Add blas_runtime.hpp for runtime BLAS library detection (MKL preferred, OpenBLAS as
fallback), so the same .so runs in environments with or without BLAS. Refactor op.cpp
to unify the compile-time OpenBLAS and runtime BLAS call paths, and work around an
MKL Intel threading-layer bug (setenv MKL_THREADING_LAYER=GNU).
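The dlopen-based discovery can be sketched as follows; the library names, probing order, and fallback contract are illustrative assumptions, not the exact blas_runtime.hpp code (and this is Linux-only, since the PR notes dlopen is unavailable on Windows):

```cpp
#include <dlfcn.h>

// Subset of the row-major cblas_sgemm signature we need at runtime.
using sgemm_fn = void (*)(int order, int transA, int transB,
                          int M, int N, int K, float alpha,
                          const float *A, int lda, const float *B, int ldb,
                          float beta, float *C, int ldc);

// Probe MKL first, then OpenBLAS; return nullptr if neither is present,
// in which case the caller falls back to the AVX2 kernels.
inline sgemm_fn load_runtime_sgemm() {
    const char *libs[] = {"libmkl_rt.so", "libopenblas.so", "libopenblas.so.0"};
    for (const char *lib : libs) {
        if (void *h = dlopen(lib, RTLD_NOW | RTLD_GLOBAL)) {
            if (void *sym = dlsym(h, "cblas_sgemm"))
                return reinterpret_cast<sgemm_fn>(sym);
        }
    }
    return nullptr;
}
```

Keeping the handle open for the process lifetime is intentional: the symbol stays valid, and a single probe at startup amortizes the dlopen cost.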

MKL speedup on the server: 6.93s (vs. 51.99s before, 7.5x)
- bf16_linear_avx2: for M < 32, parallelize over the N dimension (omp parallel for) to fix the M=1 single-thread bottleneck
  Server with 24 threads: 36s→6.7s (5.4x); local with 6 threads: 32s→18s (1.8x)
- Add an AVX-512 bf16 linear kernel (bf16_linear_avx512) and SIMD helpers (simd.hpp)
  Testing showed Icelake frequency downclocking causes a regression, so dispatch still uses the AVX2 version
- New --native option in xmake.lua: enables -march=native to use AVX-512; default stays -mavx2 -mfma
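The small-batch bypass described above can be sketched as a minimal parallel_for; the cutoff of 32 matches the M < 32 rule, the name mirrors the commit message, and the body is an assumption rather than the repository's exact code:

```cpp
#include <cstddef>

// Run f(i) for i in [0, n). For small n (e.g. M == 1 during decode) the
// OpenMP runtime's thread wake-up cost dominates the work itself -- the PR
// reports 104s -> 28s after adding this bypass -- so stay single-threaded
// below a cutoff and only enter the OpenMP region for large batches.
template <typename F>
void parallel_for(std::size_t n, F &&f, std::size_t serial_cutoff = 32) {
    if (n < serial_cutoff) {
        for (std::size_t i = 0; i < n; ++i)
            f(i);
        return;
    }
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        f(i);
}
```

Compiled without -fopenmp the pragma is ignored and both branches run serially, so the same source works with or without the OpenMP runtime.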
Copilot AI review requested due to automatic review settings March 16, 2026 15:47

Copilot AI left a comment


Pull request overview

This PR adds a first end-to-end Qwen2 CPU inference path to the codebase by implementing missing tensor view/contiguity utilities, CPU ops (linear/attention/rope/norm/etc.), and a Python loader that streams .safetensors into the C++ model, plus build-system switches for OpenMP/SIMD/BLAS.

Changes:

  • Implement core CPU ops (linear with BLAS/runtime-BLAS + SIMD fallbacks, self-attention, RoPE, RMSNorm, SwiGLU, embedding, argmax, rearrange).
  • Add a C API + C++ implementation for a Qwen2 model (weights loading, forward, sampling) and Python bindings/loader for safetensors.
  • Update build configuration for OpenMP/SIMD/OpenBLAS options and add misc utilities/debugging assets.

Reviewed changes

Copilot reviewed 22 out of 25 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
xmake.lua Adds global OpenMP/SIMD flags and optional OpenBLAS integration; includes models in shared lib build.
test/test_tensor.py Minor formatting tweak.
test/debug.py Adds an ad-hoc local debug script for loading/running a model.
src/utils/utils_stub.cpp Adds placeholder TU to keep target non-empty.
src/utils/types.hpp Adds inline fp16/bf16 conversion helpers.
src/utils/types.cpp.old Minor formatting change.
src/utils/simd.hpp Adds AVX2/AVX-512 helper intrinsics for dot/casts/horizontal sums.
src/utils/blas_runtime.hpp Adds dlopen-based runtime BLAS backend discovery and sgemm dispatch.
src/tensor/tensor.cpp Implements Tensor contiguity checks, permute/view/slice, load, and contiguous() using rearrange.
src/ops/swiglu/op.cpp Implements SwiGLU (SiLU(gate) * up) for f16/bf16/f32.
src/ops/self_attention/op.cpp Implements self-attention with multiple supported input layout modes.
src/ops/rope/op.cpp Implements RoPE with both unit-test layout and inference layout handling.
src/ops/rms_norm/op.cpp Implements RMSNorm for f16/bf16/f32.
src/ops/rearrange/op.cpp Implements generic elementwise rearrangement based on strides.
src/ops/linear/op.cpp Large CPU linear implementation with BLAS (compile-time or runtime) + AVX2/FMA/OpenMP fallbacks and weight caching.
src/ops/embedding/op.cpp Implements embedding lookup via memcpy rows.
src/ops/argmax/op.cpp Implements argmax over last dimension.
src/models/qwen2.cpp Adds C++ Qwen2 model: weight loading, forward, KV cache handling, and sampling API.
report_linear.pdf Adds a profiling/report PDF artifact.
python/llaisys/models/qwen2.py Implements Python Qwen2 loader (config + safetensors streaming) and greedy generation loop.
python/llaisys/libllaisys/llaisys_types.py Adds ctypes definition for LlaisysQwen2Meta.
python/llaisys/libllaisys/__init__.py Wires Qwen2 API signatures and provides wrapper helpers.
include/llaisys/models/qwen2.h Extends Qwen2 C API: create(dtype), load_weight, forward, sample.
.gitignore Ignores Windows Zone.Identifier alternate-data-stream artifacts.


Comment on lines +9 to +13
add_cxflags("-fopenmp", "-O3")
add_ldflags("-fopenmp")
add_shflags("-fopenmp")
add_syslinks("gomp") -- explicitly link the GNU OpenMP library

Comment on lines +26 to +38
-- OpenBLAS integration: built from source and installed to ~/openblas
option("openblas")
set_default(true)
set_showmenu(true)
set_description("Whether to use OpenBLAS for linear algebra acceleration")
option_end()

if has_config("openblas") then
add_defines("USE_OPENBLAS")
add_includedirs(os.getenv("HOME") .. "/openblas/include")
add_linkdirs(os.getenv("HOME") .. "/openblas/lib")
add_links("openblas")
add_rpathdirs(os.getenv("HOME") .. "/openblas/lib")
@@ -0,0 +1,140 @@
#pragma once
#include <immintrin.h>
Comment on lines +56 to +69
inline BlasState& state() {
static BlasState s;
return s;
}

inline void try_load() {
#ifdef _WIN32
state().initialized = true;
return; // dlopen is not available on Windows
#else
auto& s = state();
if (s.initialized) return;
s.initialized = true;

Comment on lines +180 to +196
if (order.size() != this->ndim()) {
    throw std::runtime_error("Order Error!");
}
const auto &old_shape = this->shape();
const auto &old_strides = this->strides();
std::vector<size_t> new_shape(old_shape.size());
std::vector<ptrdiff_t> new_strides(old_strides.size());
for (size_t i = 0; i < order.size(); i++) {
    size_t order_index = order[i];
    if (order_index >= this->ndim()) {
        throw std::runtime_error("Index Error!");
    }
    new_shape[i] = old_shape[order_index];
    new_strides[i] = old_strides[order_index];
}
TensorMeta _meta{this->dtype(), std::move(new_shape), std::move(new_strides)};
return std::shared_ptr<Tensor>(new Tensor(_meta, _storage, this->_offset));
Comment on lines +226 to +228
tensor_t *heap_logits = new tensor_t(logits);
return (void *)heap_logits;

std::string layer_id_str = w_name.substr(first_dot, second_dot - first_dot);
size_t layer_id = std::stoull(layer_id_str);

if (layer_id >= 0 && layer_id < model->meta.nlayer) {
Comment on lines +14 to +18
# 1. Point directly at the absolute path of the model you already downloaded
# Note: make sure this folder actually contains the .safetensors files
real_model_path = "/home/cpp/ai-models/DeepSeek-R1-Distill-Qwen-1.5B"

print(f"1. Using local model at: {real_model_path}")
Comment on lines +64 to +72
// -----------------------------
// Mode 2: Qwen2 / DeepSeek inference path
// in : [batch, seq_len, hidden]
// pos : [1, seq_len] or [seq_len]
// -----------------------------
size_t N = in_shape[0]; // batch
size_t M = in_shape[1]; // seq_len
size_t D = in_shape[2]; // hidden

Comment on lines +19 to +55
// [Key] heuristic mode detection
// DeepSeek's hidden dim is 1536 or 256
size_t last_dim = q->shape().back();
bool is_deepseek = (last_dim == 1536 || last_dim == 256);

if (is_deepseek) {
// --------------------------------------------------------
// Mode A: DeepSeek inference mode [Batch, Seq, Hidden]
// Notes:
// - q is always [B, T, H], with H = nh * head_dim (e.g. 1536 = 12 * 128)
// - k/v come in two shapes:
//   1) [B, T, kv_dim] when computed on the fly, where kv_dim = nkvh * head_dim
//   2) [B, T_total, nkvh, head_dim] when read back from the KV cache
// --------------------------------------------------------
batch = q->shape()[0];
seqlen = q->shape()[1];
size_t hidden_q = q->shape()[2];

// Split heads: Q's head dim is always treated as 128
d = 128;
nhead = hidden_q / d; // e.g. 1536 / 128 = 12

total_len = k->shape()[1];

if (k->shape().size() == 3) {
// [B, T_total, kv_dim], the flattened form of [B, T_total, nkvh, head_dim]
size_t hidden_kv = k->shape()[2];
dv = d;
nkvhead = hidden_kv / dv; // e.g. 256 / 128 = 2
} else if (k->shape().size() == 4) {
// [B, T_total, nkvh, head_dim] -- the 4D form from the KV cache
nkvhead = k->shape()[2];
dv = k->shape()[3]; // usually 128
} else {
throw std::runtime_error("Unsupported K shape for DeepSeek mode");
}
}
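The head-count arithmetic in the branch above can be checked in isolation. `split_heads` below is an illustrative helper, not a symbol from the PR, using the DeepSeek-R1-Distill-Qwen-1.5B dimensions quoted in the comments:

```cpp
#include <cstddef>

// Mode-A shape inference: Q's last dim is nh * head_dim and K/V's flattened
// last dim is nkvh * head_dim, with head_dim fixed at 128 as in the code above.
struct HeadLayout {
    std::size_t nhead;
    std::size_t nkvhead;
    std::size_t head_dim;
};

inline HeadLayout split_heads(std::size_t hidden_q, std::size_t hidden_kv,
                              std::size_t head_dim = 128) {
    return {hidden_q / head_dim, hidden_kv / head_dim, head_dim};
}
```

For the 1.5B model this yields 12 query heads and 2 KV heads, i.e. grouped-query attention with 6 query heads sharing each KV head.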
@usersforsomebody
Author

Reproduction notes

Environment

Dependency     Notes
xmake          build system
GCC/Clang      with C++17 and AVX2 support
Python 3.9+    PyTorch, transformers, safetensors
Model          DeepSeek-R1-Distill-Qwen-1.5B

Build commands

# Without OpenBLAS (recommended; fewer dependencies)
xmake f --openblas=n
xmake && xmake install

# With OpenBLAS installed in ~/openblas/
xmake && xmake install

Test commands

pip install ./python/
python test/ops/linear.py
python test/test_infer.py --model /path/to/model --test

Notes

  1. OpenBLAS is enabled by default; the build fails without the library, so disable it with --openblas=n
  2. The CPU must support AVX2 (x86 CPUs from 2013 onward)
  3. Without BLAS the AVX2 fallback is used automatically; slightly slower but fully functional
  4. If MKL is present on the system, it is detected and loaded at runtime
