Complete projects 2 (Iluvatar), 3, 4, 5 #47

Open

KevinSusan wants to merge 45 commits into InfiniTensor:main from KevinSusan:main
Conversation

@KevinSusan
I. Completed projects
Project 2: Nvidia + Iluvatar
Project 3: Server + frontend + streaming output + session management + KV reuse
Project 4: Scheduler + continuous batching + shared model pool + KV-aware routing
Project 5: Communication layer + NCCL backend + tensor parallelism

See the REPORT.md document in the repository root for detailed completion status.

Add NVIDIA runtime/operators, GPU tests, server filters and sampling options,
plus frontend sampling controls and build scripts.
- Add KV cache pool with prefix matching and reference counting
- Implement multi-user inference scheduler with queue and workers
- Add packed prefill and decode batch inference (Decoder::decodePacked)
- Support session forking and editing in frontend
- Add continuous batching with PD separation
- Add segmented self-attention for packed sequences
- Include benchmark and integration tests
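The KV cache pool bullet above (prefix matching plus reference counting) can be sketched in miniature. This is an illustrative Python model only: the class and method names (KVCachePool, acquire, seal, release) are hypothetical, and the real pool manages native KV blocks rather than token lists.

```python
# Illustrative sketch: a KV cache pool that matches the longest sealed
# token prefix and pins matched entries with a reference count.
class KVCachePool:
    def __init__(self):
        # sealed token prefix (tuple) -> (entry_id, refcount)
        self._entries = {}

    def acquire(self, tokens):
        """Return (entry_id, matched_len) for the longest sealed prefix."""
        best, best_len = None, 0
        for prefix in self._entries:
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best, best_len = prefix, n
        if best is not None:
            entry_id, rc = self._entries[best]
            self._entries[best] = (entry_id, rc + 1)  # pin while in use
            return entry_id, best_len
        return None, 0

    def seal(self, tokens, entry_id):
        # a finished prefill seals its prefix for reuse by later requests
        self._entries[tuple(tokens)] = (entry_id, 0)

    def release(self, tokens):
        key = tuple(tokens)
        entry_id, rc = self._entries[key]
        self._entries[key] = (entry_id, max(0, rc - 1))
```

A request whose prompt starts with an already-sealed prefix skips recomputing that portion of the KV cache.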
…itance, logging)

- Fix InfiniTensor#1: Replace _session_worker dict with OrderedDict LRU (max_sticky_sessions=10000)
- Fix InfiniTensor#2: Add best-effort TOCTOU comment on KV-aware routing
- Fix InfiniTensor#3: Add logger.debug for tokenize failures, shallow-copy payload in submit()
- Fix InfiniTensor#4: KVCachePool(IKVCachePool), ChatService(IInferenceService) explicit inheritance
- Fix InfiniTensor#5: Merge double lock in request_stop()
- Fix InfiniTensor#6: Clean _prompt_tokens from payload after routing
- Extract SessionManager (session_manager.py): session message history + cancel events
- Extract KVRuntimeBridge (kv_runtime_bridge.py): native C++ KV context lifecycle
- ChatService slimmed from ~726 to ~506 lines, using delegation pattern
- All IInferenceService interface signatures unchanged
- HTTP API and main() parameters unchanged
- Add test/test_chatservice_split.py with 19 tests covering all split modules
Previously, packed prefill/decode only handled greedy (argmax) requests;
any request with temperature/top_k/top_p fell back to single-sequence
processing. This adds per-sequence sampling params to the batch path
via new C API bindings (PrefillPackedSampling/StepPackedSampling),
with hasattr guards for backward compatibility with older DLLs.
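The hasattr-guard pattern described above can be sketched as follows. Only the PrefillPackedSampling/StepPackedSampling names come from the text; the llaisysQwen2 prefix, the helper name, and its signature are assumptions for illustration.

```python
# Sketch: probe the loaded library for the newer packed-sampling entry
# point; fall back to the greedy packed path on older DLLs (or when the
# request carries no sampling params).  hasattr on a ctypes CDLL returns
# False when the symbol is absent, which makes this guard cheap.
def prefill_packed(lib, model, tokens, sampling=None):
    if sampling and hasattr(lib, "llaisysQwen2PrefillPackedSampling"):
        # newer DLL: per-sequence temperature/top_k/top_p in the batch path
        return lib.llaisysQwen2PrefillPackedSampling(model, tokens, sampling)
    # older DLL or greedy request: argmax-only packed prefill
    return lib.llaisysQwen2PrefillPacked(model, tokens)
```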
Delete 3 outdated docs (new.md, UPDATE_PLAN.md, QA_REPORT.md) and create
PROJECT_STATUS.md with progress summaries for all 6 project directions.
- server.py: add _wrap_completion/_wrap_chunk/_wrap_error helpers,
  generate/stream/generate_packed_non_stream return OpenAI format,
  SSE streams end with data: [DONE]
- scheduler.py: fix continuous batching worker to parse new format
  (choices[0].finish_reason), convert final chunk to chat.completion
  for non-stream path
- frontend/app.js: switch to /v1/chat/completions, max_tokens,
  parse new SSE format
- 5 test files: update mocks and assertions for OpenAI format
- PROGRESS.md, docs/PROJECT_STATUS.md: document changes
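The wrapper behaviour in the list above can be sketched as a minimal stream helper. The _wrap_chunk name follows the text; field values beyond the standard OpenAI chat.completion.chunk layout are illustrative.

```python
import json
import time

# Sketch: wrap each text delta as an OpenAI-style chunk and terminate
# the SSE stream with "data: [DONE]" as described above.
def _wrap_chunk(text, model="qwen2", finish_reason=None):
    return {
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {"content": text},
            "finish_reason": finish_reason,
        }],
    }

def sse_stream(deltas):
    for i, text in enumerate(deltas):
        last = i == len(deltas) - 1
        chunk = _wrap_chunk(text, finish_reason="stop" if last else None)
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```

The scheduler's non-stream path then only has to collect the deltas and emit a single chat.completion object with the same choices/finish_reason shape.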
Rewrite scheduler to batch-driven mode so multiple streaming requests
share the model via prepare_batch/step_batch/finalize_sequence, with
dynamic shrinking and automatic fallback to legacy iterator path.
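The batch-driven loop above can be sketched schematically. The method names prepare_batch/step_batch/finalize_sequence come from the text; the queue handling, state shape, and the omission of the legacy-iterator fallback are simplifications.

```python
# Sketch: one worker drives all active sequences through shared batch
# steps, admitting new requests between steps and shrinking the batch
# as sequences finish.
def batch_worker(model, active, incoming):
    """active: dict seq_id -> state; incoming: list of new requests."""
    while active or incoming:
        # admit newly queued requests into the shared batch
        while incoming:
            req = incoming.pop()
            active[req["id"]] = model.prepare_batch(req)
        # one decode step over every active sequence at once
        finished = model.step_batch(list(active.values()))
        # dynamic shrinking: drop finished sequences from the batch
        for seq_id in finished:
            model.finalize_sequence(active.pop(seq_id))
```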
- ChatService supports shared model_lock/kv_pool/kv_bridge across workers
- Add --shared-model CLI flag for single-model multi-worker mode
- Add IKVCachePool.memory_pressure() and --kv-memory-threshold flow control
- Optimize KV-aware routing and debug snapshot for shared pool mode
- Add test/test_shared_model.py (14 tests)
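The memory_pressure() flow control mentioned above can be sketched as a simple admission gate. Only the IKVCachePool.memory_pressure name and the --kv-memory-threshold flag come from the text; the pool internals and helper below are illustrative.

```python
# Sketch: memory_pressure() reports the fraction of the shared KV pool
# in use; the router defers new admissions above the threshold.
class SharedKVPool:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.used = 0

    def memory_pressure(self):
        return self.used / self.capacity  # 0.0 (empty) .. 1.0 (full)

def can_admit(pool, threshold=0.9):
    # corresponds to a --kv-memory-threshold style cutoff
    return pool.memory_pressure() < threshold
```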
The Iluvatar CoreX SDK is fully CUDA-compatible, so the kernels in the
nvidia:: namespace are reused unchanged. Adds a device enum, runtime
dispatch, build scripts (clang++ -x cuda --cuda-gpu-arch=ivcore10),
and --device iluvatar support across all test files.
The on_load hook runs too early: xmake injects cudadevrt after
on_load when it detects CUDA dependencies. Use before_link to
filter cudadevrt out of links, syslinks, and ldflags right
before the linker runs.
Root cause: xmake detects .cu files and auto-injects nvcc toolchain +
cudadevrt, completely ignoring our custom iluvatar_cu rule.

Solution: use on_build() to fully control compilation with clang++,
never registering .cu files via add_files(). This prevents xmake from
detecting CUDA and injecting nvcc/cudadevrt.
The linker does single-pass scanning of static libraries. Since
llaisys-ops calls nvidia:: symbols defined in llaisys-ops-iluvatar,
we need --whole-archive to force all symbols to be included.
add_ldflags was silently ignored by xmake. Use add_shflags with
full .a file paths to force whole-archive inclusion of iluvatar
static libraries into the shared library.
-lcudart was placed before the .a files by xmake, causing the
linker to skip it (single-pass scanning). Move all iluvatar
link flags into add_shflags to control exact order, and add
rpath so libcudart.so is found at runtime.
All 9 GPU operators pass on Iluvatar CoreX (ivcore10).
Runtime test detects 2 iluvatar devices and passes.
Added Iluvatar CoreX platform details: runtime, operators, build
system, and test results. Updated summary table from 50% to 90%.
…Iluvatar

test/test_infer.py --device iluvatar produces tokens identical to
PyTorch reference output. Project InfiniTensor#2 now at 100%.
…sor parallelism

- Communication layer: C API (comm.h), C++ dispatcher, NCCL backend
- commInit accepts external unique ID for multi-rank initialization
- llaisysCommGenerateUniqueId API for external ID generation
- Decoder AllReduce: after attn_o and mlp_down projections (Megatron-style)
- llaisysQwen2ModelSetTensorParallel C API
- Python weight splitting (column/row split for Megatron-style TP)
- Multi-process launcher (launch_tp.py + _tp_worker.py)
- Unit tests (test_comm_api.py) and integration tests (test_allreduce.py)
- Documentation: comm_design.md, PROGRESS.md, PROJECT_STATUS.md updated
When TP is enabled, nh is divided by world_size, so the per-rank
tp_nh*dh no longer equals hs. The attn_out3d tensor has shape
[len, tp_nh, dh] and must be viewed as [len, tp_nh*dh], not [len, hs].
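The shape fix above reduces to simple head-count arithmetic. The function below is an illustrative sketch (plain-Python shapes only; names mirror the text).

```python
# Sketch: compute the correct 2D view shape for the TP attention output
# attn_out3d [len, tp_nh, dh].  With world_size > 1 the per-rank head
# count tp_nh = nh // world_size, so tp_nh * dh != hs (= nh * dh) and
# viewing as [len, hs] would be wrong.
def attn_out_view_shape(seq_len, nh, dh, world_size):
    tp_nh = nh // world_size  # heads held by this TP rank
    return [seq_len, tp_nh * dh]
```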
Copilot AI review requested due to automatic review settings March 16, 2026 14:58
Copilot AI left a comment

Pull request overview

This PR extends LLAISYS with additional GPU/distributed and inference capabilities: it adds an Iluvatar (CoreX) GPU build and runtime, completes GPU implementations for several operators, introduces a communication layer (NCCL) with tensor-parallelism scripts/interfaces, and adds tests and Python-side wrappers for the tokenizer, the scheduler, and KV reuse.

Changes:

  • Add/refine xmake build targets, RuntimeAPI implementations, and link strategies for NVIDIA and Iluvatar GPUs
  • Complete CPU + CUDA implementations for several ops (add/argmax/embedding/linear/rearrange/rms_norm/rope/self_attention/swiglu) and add segmented self-attention C/Python bindings
  • Add a SentencePiece tokenizer (C++/C API/Python wrapper), a communication-layer API (NCCL), multi-process allreduce/TP launch and test scripts, and a static frontend page

Reviewed changes

Copilot reviewed 146 out of 148 changed files in this pull request and generated 7 comments.

File Description
xmake/nvidia.lua Build targets for the NVIDIA device and operator static libraries
xmake/iluvatar.lua Iluvatar (CoreX) targets: manual compile/archive via the clang++ CUDA frontend
xmake.lua New sentencepiece/iluvatar options; aggregate-target dependency and link-strategy adjustments
test/test_utils.py Type-annotation updates; iluvatar device mapping support
test/test_tokenizer.py SentencePiece tokenizer ctypes smoke-test script
test/test_scheduler_inmemory.py In-memory unit-test coverage for InferenceScheduler (stream/timeout/CB/packed)
test/test_runtime.py CLI device option extended to iluvatar
test/test_kv_cache_pool.py KVCachePool behavior unit tests (sealed prefixes, reference counting, rollback, etc.)
test/test_infer.py CLI device option extended to iluvatar
test/test_comm_api.py Comm API (NCCL) ctypes unit test (single GPU, nranks=1)
test/test_chat_minimal.py Minimal chat inference script (Tokenizer + Qwen2)
test/test_allreduce.py Multi-process allreduce integration-test driver script
test/ops_gpu/add.py GPU add alignment test script
test/ops_gpu/argmax.py GPU argmax alignment test script
test/ops_gpu/embedding.py GPU embedding alignment test script
test/ops_gpu/linear.py GPU linear alignment test script
test/ops_gpu/rearrange.py GPU rearrange alignment test script
test/ops_gpu/rms_norm.py GPU rms_norm alignment test script
test/ops_gpu/rope.py GPU rope alignment test script
test/ops_gpu/self_attention.py GPU self_attention alignment test script
test/ops_gpu/swiglu.py GPU swiglu alignment test script
test/ops_gpu/run_all.py One-shot runner for the GPU operator tests
test/ops_gpu/init.py ops_gpu package placeholder file
test/ops/add.py CPU/generic add test adds an iluvatar option
test/ops/argmax.py CPU/generic argmax test adds an iluvatar option
test/ops/embedding.py CPU/generic embedding test adds an iluvatar option
test/ops/linear.py CPU/generic linear test adds an iluvatar option
test/ops/rms_norm.py CPU/generic rms_norm test adds an iluvatar option
test/ops/rope.py CPU/generic rope test adds an iluvatar option
test/ops/self_attention.py CPU/generic self_attention test adds an iluvatar option
test/ops/self_attention_segmented.py Segmented self-attention reference implementation / alignment test
test/ops/swiglu.py CPU/generic swiglu test adds an iluvatar option
test/_allreduce_worker.py allreduce worker (calls NCCL directly to generate/initialize the comm)
src/utils/types.hpp Add include guard (pragma once)
src/utils/check.hpp Blank lines between macro definitions (formatting)
src/tokenizer/sentencepiece/sentencepiece.hpp SentencePieceTokenizer C++ wrapper declaration
src/tokenizer/sentencepiece/sentencepiece.cpp SentencePieceTokenizer implementation (guarded by a build macro)
src/tensor/tensor.hpp Tensor header comment and structure cleanup (Chinese comments)
src/tensor/tensor.cpp Implement isContiguous/permute/view/slice/load/contiguous etc. (replacing TO_BE_IMPLEMENTED)
src/ops/add/op.cpp add: new NVIDIA/ILUVATAR branches calling the CUDA implementation
src/ops/add/nvidia/add_nvidia.hpp add CUDA declaration
src/ops/add/nvidia/add_nvidia.cu add CUDA kernel implementation
src/ops/add/cpu/add_cpu.hpp add CPU header formatting adjustments
src/ops/add/cpu/add_cpu.cpp add CPU implementation formatting adjustments
src/ops/argmax/op.cpp argmax: complete device/dtype validation and CPU/GPU dispatch
src/ops/argmax/nvidia/argmax_nvidia.hpp argmax CUDA declaration
src/ops/argmax/nvidia/argmax_nvidia.cu argmax CUDA kernel (single-threaded implementation)
src/ops/argmax/cpu/argmax_cpu.hpp argmax CPU declaration
src/ops/argmax/cpu/argmax_cpu.cpp argmax CPU implementation
src/ops/embedding/op.cpp embedding: complete validation and CPU/GPU dispatch
src/ops/embedding/nvidia/embedding_nvidia.hpp embedding CUDA declaration
src/ops/embedding/nvidia/embedding_nvidia.cu embedding CUDA kernel
src/ops/embedding/cpu/embedding_cpu.hpp embedding CPU declaration
src/ops/embedding/cpu/embedding_cpu.cpp embedding CPU implementation
src/ops/linear/op.cpp linear: optional bias, validation, and CPU/GPU dispatch
src/ops/linear/nvidia/linear_nvidia.hpp linear CUDA declaration
src/ops/linear/nvidia/linear_nvidia.cu linear CUDA kernel
src/ops/linear/cpu/linear_cpu.hpp linear CPU declaration
src/ops/linear/cpu/linear_cpu.cpp linear CPU implementation
src/ops/rearrange/op.cpp rearrange: CPU/GPU dispatch; GPU path copies shape/stride to the device
src/ops/rearrange/nvidia/rearrange_nvidia.hpp rearrange CUDA declaration
src/ops/rearrange/nvidia/rearrange_nvidia.cu rearrange CUDA kernel
src/ops/rearrange/cpu/rearrange_cpu.hpp rearrange CPU declaration
src/ops/rearrange/cpu/rearrange_cpu.cpp rearrange recursive CPU implementation
src/ops/rms_norm/op.cpp rms_norm: complete validation and CPU/GPU dispatch
src/ops/rms_norm/nvidia/rms_norm_nvidia.hpp rms_norm CUDA declaration
src/ops/rms_norm/nvidia/rms_norm_nvidia.cu rms_norm CUDA kernel
src/ops/rms_norm/cpu/rms_norm_cpu.hpp rms_norm CPU declaration
src/ops/rms_norm/cpu/rms_norm_cpu.cpp rms_norm CPU implementation
src/ops/rope/op.cpp rope: complete validation and CPU/GPU dispatch (pos_ids int64)
src/ops/rope/nvidia/rope_nvidia.hpp rope CUDA declaration
src/ops/rope/nvidia/rope_nvidia.cu rope CUDA kernel
src/ops/rope/cpu/rope_cpu.hpp rope CPU declaration
src/ops/rope/cpu/rope_cpu.cpp rope CPU implementation
src/ops/self_attention/op.hpp self_attention API extension: new segmented declaration
src/ops/self_attention/op.cpp self_attention and segmented-path implementation and dispatch
src/ops/self_attention/nvidia/self_attention_nvidia.hpp self_attention CUDA declaration
src/ops/self_attention/nvidia/self_attention_nvidia.cu self_attention CUDA kernel (naive implementation)
src/ops/self_attention/cpu/self_attention_cpu.hpp self_attention CPU declaration (incl. segmented)
src/ops/swiglu/op.cpp swiglu: complete validation and CPU/GPU dispatch
src/ops/swiglu/nvidia/swiglu_nvidia.hpp swiglu CUDA declaration
src/ops/swiglu/nvidia/swiglu_nvidia.cu swiglu CUDA kernel
src/ops/swiglu/cpu/swiglu_cpu.hpp swiglu CPU declaration
src/ops/swiglu/cpu/swiglu_cpu.cpp swiglu CPU implementation
src/models/transformer/decoder/decoder.hpp Transformer decoder API declarations (prefill/packed/TP/KV ctx)
src/models/qwen2/qwen2.hpp Qwen2 C++ model wrapper declaration (packed, sampling, TP, KV ctx)
src/llaisys/tokenizer.cc Tokenizer C API implementation (SentencePiece)
src/llaisys/ops.cc C API: linear supports bias=null; new segmented self-attention export
src/llaisys/models/qwen2_kv_internal.hpp Qwen2 KV block/context internal structures (refcounts etc.)
src/llaisys/comm.cc comm C API glue (getCommAPI / generateUniqueId)
src/device/runtime_api.hpp RuntimeAPI: add iluvatar namespace declaration
src/device/runtime_api.cpp RuntimeAPI dispatcher: support LLAISYS_DEVICE_ILUVATAR
src/device/nvidia/nvidia_runtime_api.cu NVIDIA RuntimeAPI: complete the CUDA implementation (memcpy/stream/malloc etc.)
src/device/nvidia/nvidia_comm.cu NCCL backend comm API implementation (allreduce/bcast/send/recv etc.)
src/device/nvidia/devlink_stub.cu CUDA devlink stub (triggers device linking)
src/device/nvidia/cuda_utils.hpp CUDA error checking + ScalarOps (fp16/bf16/f32)
src/device/iluvatar/iluvatar_utils.hpp Iluvatar CUDA-like utils + ScalarOps
src/device/iluvatar/iluvatar_runtime_api.cu Iluvatar RuntimeAPI: CUDA runtime wrapper implementation
src/device/iluvatar/iluvatar_resource.cuh Iluvatar DeviceResource declaration
src/device/iluvatar/iluvatar_resource.cu Iluvatar DeviceResource constructor implementation
src/device/iluvatar/devlink_stub.cu Iluvatar devlink stub
src/device/comm_api.hpp Communication-layer abstract API declaration (incl. conditional NCCL/IXCCL declarations)
src/device/comm_api.cpp Communication-layer dispatcher + unsupported-backend default implementation
src/core/context/context.hpp Context: comments added
src/core/context/context.cpp Context lifecycle/device-switching comments added
scripts/run_gpu.ps1 Windows GPU build/test/server one-shot script
scripts/launch_tp.py TP multi-process launcher (via commGenerateUniqueId)
scripts/benchmark_chat_scheduler.py Scheduler load-test script (HTTP chat)
python/llaisys/tokenizer.py Python Tokenizer: SentencePiece + tokenizer.json (HF tokenizers)
python/llaisys/tensor_parallel.py Qwen2 weight splitting (column/row split)
python/llaisys/session_manager.py Session history and cancel-event management
python/llaisys/ops.py Python Ops: new self_attention_segmented wrapper
python/llaisys/models/init.py Export format_chat_prompt etc.
python/llaisys/libllaisys/tokenizer.py ctypes tokenizer API binding loader
python/llaisys/libllaisys/ops.py ctypes ops: optionally load segmented self-attention
python/llaisys/libllaisys/llaisys_types.py DeviceType: add ILUVATAR enum value
python/llaisys/libllaisys/init.py Library loading: new models/comm/tokenizer loaders and exports
python/llaisys/kv_runtime_bridge.py Python-side native KV ctx reuse bridge
python/llaisys/interfaces.py Abstract interface definitions for scheduler/service/KV pool
python/llaisys/init.py Top-level export of Tokenizer
include/llaisys/tokenizer.h Tokenizer C API header
include/llaisys/ops.h Ops C API: new segmented self-attention declaration
include/llaisys/models/qwen2.h Qwen2 C API extensions: sampling/packed/TP/KV block+context etc.
include/llaisys/comm.h comm C API header (backend/op/API struct)
include/llaisys.h Device enum: add LLAISYS_DEVICE_ILUVATAR
frontend/style.css Frontend styles
frontend/index.html Frontend page skeleton
Untitled New single-line command file


Comment on lines +52 to +59
    -- Archive into static library
    local targetfile = target:targetfile()
    local targetdir = path.directory(targetfile)
    if not os.isdir(targetdir) then
        os.mkdir(targetdir)
    end
    os.vrunv("ar", {"-cr", targetfile, table.unpack(objectfiles)})
end)
Comment on lines +203 to 208
    TensorMeta new_meta{dtype(), new_shape, new_strides};
    return tensor_t(new Tensor(new_meta, _storage, _offset)); // zero-copy


    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
}
Comment on lines 210 to +218
tensor_t Tensor::view(const std::vector<size_t> &shape) const {
    TO_BE_IMPLEMENTED();
    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    if (isContiguous()) {
        tensor_t tmp = create(shape, this->dtype(), this->deviceType(), this->deviceId());
        tmp->_storage = this->_storage;
        return tmp;
    } else {
        // non-contiguous storage
        return contiguous()->view(shape);
    }
Comment on lines 255 to +284
tensor_t Tensor::contiguous() const {
    TO_BE_IMPLEMENTED();
    return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    if (isContiguous()) {
        return std::shared_ptr<Tensor>(new Tensor(_meta, _storage));
    } else {
        // shape
        const auto &sh = shape();
        // number of dimensions
        const auto dim = sh.size();

        // build a new contiguous stride array
        std::vector<ptrdiff_t> c_str(dim, 1);
        for (size_t i = dim - 1; i-- > 0;) {
            c_str[i] = c_str[i + 1] * sh[i + 1];
        }

        // allocate new storage on the same device
        size_t bytes = numel() * elementSize();
        core::storage_t st = (deviceType() == LLAISYS_DEVICE_CPU)
                                 ? core::context().runtime().allocateHostStorage(bytes)
                                 : core::context().runtime().allocateDeviceStorage(bytes);

        // create the new contiguous tensor
        tensor_t dst(new Tensor(TensorMeta{dtype(), sh, c_str}, st, 0));

        // copy the data (H2H or H2D depending on device)
        core::context().setDevice(deviceType(), deviceId());
        core::context().runtime().api()->memcpy_sync(
            dst->data(), data(), bytes,
            deviceType() == LLAISYS_DEVICE_CPU ? LLAISYS_MEMCPY_H2H : LLAISYS_MEMCPY_H2D);

Comment on lines +62 to +67
    case LLAISYS_COMM_IXCCL:
#ifdef ENABLE_ILUVATAR_API
        return llaisys::device::ixccl::getCommAPI();
#else
        return getUnsupportedCommAPI();
#endif
Comment on lines +6 to +10
class Resource : public llaisys::device::DeviceResource {
public:
    Resource(int device_id);
    ~Resource();
};
Comment on lines +154 to +168
elseif has_config("iluvatar-gpu") then
    -- No .cu files in this target, no CUDA toolchain
    -- Use add_shflags to control exact link order:
    -- 1. whole-archive iluvatar static libs (defines nvidia:: symbols)
    -- 2. -lcudart AFTER the .a files (so cudart symbols are resolved)
    add_shflags(
        "-Wl,--whole-archive",
        "build/linux/x86_64/release/libllaisys-ops-iluvatar.a",
        "build/linux/x86_64/release/libllaisys-device-iluvatar.a",
        "-Wl,--no-whole-archive",
        "-L/usr/local/corex/lib64",
        "-Wl,-rpath,/usr/local/corex/lib64",
        "-lcudart",
        {force = true}
    )